TWI294569B

TWI294569B - Apparatus and method for performing fast pop operation from random access cache memory and computer-readable storage medium

Info

Publication number: TWI294569B
Application number: TW94100032A
Authority: TW
Inventors: E Hooker Rodney
Original assignee: Ip First Llc
Priority date: 2004-01-16
Filing date: 2005-01-03
Publication date: 2008-03-11
Also published as: CN100378650C; CN1627252A; TW200525351A

Description

I2945^L,doc/m 九、發明說明：【相關申請案之交互參考】本案主張美國優先權，本案之基礎案為美國專利申請案第10/759489號，申請日為2004年1月16日，名稱為「MICROPROCESSOR AND APPARATUS FOR PERFORMING FAST POP OPERATION FROM RANDOM ACCESS CACHE MEMORY 〇 • 【發明所屬之技術領域】 ® 本發明是關於一種微處理機的快取記憶體，且特別是關於一種可分辨堆疊式與非堆疊式記憶體存取的快取記憶體。 - 【先前技術】 . 微處理機是執行電腦程式指令的數位裝置。典型的電腦系統有一個微處理機，連接到系統記憶體，而系統記憶體能儲存程式指令以及程式指令要處理的資料。這種系統在執行時會遇到的一種瓶頸，從系統記憶體讀取資料到微 • 處理機，或從微處理機寫入資料到系統記憶體所需花費的時間，通常會比微處理機執行處理這些資料的指令所需花費的時間更多。這兩種時間通常有十倍，甚至百倍的差距（因此，當在等待記憶體要做讀或寫的同時，微處理機會閒置著。 b 不過，微處理機的設計者從报久以前就知道，程式都傾向於在一段長時間内只取用-小部分的資料，例的程式k數。這-類程式就是有良好的時間區域性 1294569 14394twf.doc/m (temporal ^Hty) ^^ 貝Jdocahty of reference prindple)。為了利用這個原則現代的微處理機通常有至少—個快取記憶體。快取記情體， ”取’是在電路上靠近微處理機核心的一顿記 =1來暫時儲存—小部份資料，而其他資料還是存在，罝較大，離微處理機也較遠的系統記憶體。快取記憶體是用内含的儲存單位(stGmgeelement)來儲存資料，猶後取出資料時，會比從遙遠的系統記憶體取出資料的速度還要快。匕當微處理機執行讀取記憶體的指令，例如載入(1〇ad) 指令或彈出(pop)指令時，微處理機會先檢查快取記憶體有 /又有現成的資料，也就是說，讀取的位址有沒有「擊中」 (hit)快取記憶體。如果沒有，也就是說，讀取的位址「錯過」（miss) 了快取記憶體，微處理機除了將資料載入微處理機當中受指定的暫存器之外，還會把資料也存入快取記憶體。現在，資料已經存在於快取記憶體，下一次遇到讀取同一筆資料的指令時，就可以從快取記憶體將資料載入暫存裔’而不必從系統記憶體讀取。由於資料已經存在於快取記憶體，上述的記憶體讀取指令幾乎可以立即完成。快取s己憶體是以快取線(cache line)或快取區塊(cache block)儲存資料。快取線是快取記憶體與系統記憶體之間傳輸資料的最小單位。舉例來說，快取線的大小可以是64 位元組(byte)。當一個記憶體讀取指令錯過了快取記憶體，讀取位址所在的一整個快取線都會被載入快取記憶體，而 1294569 14394twf.doc/m 不僅是載入讀取指令所要求的貧料。於是，稍後讀取同一個快取線資料的讀取指令可以很快就執行完畢，因為資料就在快取記憶體中，而不用讀取系統記憶體。此外，當執行記憶體寫入指令，例如存放(st〇re)指令或推入(push)指令時，如果寫入位址擊中快取記憶體，資料就可以直接寫入快取線，藉以延緩將資料寫入系統記憶體。稍後，快取記憶體會將快取線寫入系統記憶體，通常是為了讓出空間給新的快取線。上述的程序通常稱為寫回程序(writeback operation)。除此之外，有些快取記憶體會在記憶體寫入位址錯過快取記憶體之後更新記錄單位 (entry)。也就是說，快取記憶體會先寫回快取記憶體的一個記錄單位當巾之—㈣的快取線，紐將寫人位址所在的陕取線，；k系統記憶體載入之前舊的快取線所佔據之快取記憶體記錄單㈣巾。上賴料通常稱為寫人配置程序（write allocate operation)。冋效率的快取記憶體可以大幅增進微處理機的效月匕。衫響快取記憶體效率的兩個主要因素是快取擊中率 (cache hit rate)以及快取存取時間㈣如&⑽s ti㈣。快祕體的料率是絲擊巾:欠鎌以擊巾次數與錯過次數 =總和所得的時。而存取咖是從快取記紐讀取指定貝料J寫入指定貧料到快取記憶體，所需要的微處理機核心的時脈週期(clock cycle)數。戍 j響快韓㈣喊大因錢絲記憶_大小，也就疋說，快取記賴可齡㈣料位元數。絲記憶體越 1294569 fH / 14394twf.doc/m 大，可儲存的系統記憶體資料的子集合就越大，資料所在的快取線就越容易存在於快取記憶體。因此，快取記憶體總是有變大的趨勢。傳統上，快取記憶體的大小通常受限於微處理機晶片的空間’能分派給快取記憶體的空間有限。但隨著電路元件越做越小，這項限制也逐漸消失。然而，快取記憶體大小也會影響傳統快取記憶體的存取時間。林的是’快取記憶體越大，通常存取時間也越長。這是因為傳_快取記憶體是隨機存取記憶體(_ access memory)，也就是說，存取快取記憶體之内的任何一個快取線’所需的時間都相同。快取記憶體内可容的位置衫，電路就會越_，需要歸更㈣時間才能找到記憶體位址所指定的資料。所幸，電路元件的持續縮的、負快取存取時間，有助於彌補加大快取記憶體的· f而嘯著加快微處理機時脈頻率(—k ftequenCy) 週期來存取快取記憶體。因此現在有一以;= 理機内部的快取記㈣，_是第—小微處 L1)快取々丨产駚與如品丄弟、’及（be1-1，簡稱為㈣j 士 &體。舉例S，Pentium _的第-級快取士己 =有_相比之下，Pentiumm⑧的第= =有16KB。如崎錄取記鐘容量，並曰己以機核心時脈上能。Η峨取疏體^，儘管縮小快取記憶體會降低效 12945¾ 4twf.doc/m 容量缩3”-種方法’來增加快取記憶體的有效兩個目標。谭關，或朗時達成上述【發明内容】就-觀點來說，本發明提供_ r；：;r；rr _’以儲存列數值。當推入指令的資煜I隸、取5己憶體時’存放推入資料列的列數值會被選擇性地推入至後進先出記憶體的最頂端。當推入資料位於快取記憶體内新配置的快取線時，列數值才會被推入；否則，列數值不會被推入後進先出記憶體。此裝置是利用彈出指令通常都與前-個推入指令相關，以進行猜測彈出程序。饭设推入與彈出指令有關聯，為了回應於彈出指令，此裝置會立刻從後進先出記憶體最頂端記錄單位的列數值所才曰疋的快取線猜測提供資料，而不等待彈出來源位址的计异結果，也不等待來源位址是否擊中快取記憶體的決定結果。此裝置稍後會計算彈出來源位址，並比較它的索引部分與後進先出記憶體最頂端記錄單位的列數值，以決定猜測提供給彈出指令的資料正確與否。如果不正確，會有修正動作以提供正確資料。如果彈出指令會讓下一個彈出指令取用別的快取線，也就是說，如果彈出指令正在彈出列數值所指定的快取線之最後資料，此裝置會將上述的列I2945^L, doc/m IX. Invention Description: [Reciprocal Reference of Related Applications] This case claims US priority. The basic case of this case is US Patent Application No. 10/759489, and the filing date is January 16, 2004. The name is "MICROPROCESSOR AND APPARATUS FOR PERFORMING FAST POP OPERATION FROM RANDOM ACCESS CACHE MEMORY 〇• [Technical Field of the Invention] The present invention relates to a cache memory of a microprocessor, and more particularly to a resolvable stacked type Cache memory with non-stacked memory access - [Prior Art] A microprocessor is a digital device that executes computer program instructions. A typical computer system has a microprocessor connected to the system memory, and the system The memory can store program instructions and the data to be processed by the program instructions. A bottleneck encountered in the execution of such a system, reading data from the system memory to the microprocessor, or writing data from the microprocessor to the system memory. The time it takes to get the body is usually more time-consuming than the microprocessor executing the instructions to process the data. There are usually ten times or even a hundred times the difference between the two times (hence, the micro-processing opportunity is idle while waiting for the memory to be read or written. b However, the designer of the microprocessor knows from the long time ago, The program tends to use only a small amount of data for a long period of time, the program k number. This type of program has a good time regionality 1294569 14394twf.doc/m (temporal ^Hty) ^^ Shell Jdocahty Of reference prindple). In order to take advantage of this principle, modern microprocessors usually have at least one cache memory. The cache memory, "take" is a note on the circuit close to the core of the microprocessor. Storage - a small amount of data, while other data still exist, larger, and farther away from the microprocessor system memory. Cache memory uses the built-in storage unit (stGmgeelement) to store data. When you retrieve data, it is faster than taking data from remote system memory. When the microprocessor executes an instruction to read a memory, such as a load (1〇ad) instruction or a pop (pop) instruction, the microprocessor will first check whether the cache memory has/has ready-made data, that is, Whether the read address has "hit" cache memory. If not, that is, the read address "miss" the cache memory, the microprocessor will not only load the data into the specified scratchpad in the microprocessor, but also the data. Save to cache memory. Now, the data already exists in the cache memory. The next time you encounter an instruction to read the same data, you can load the data from the cache memory to the temporary storage instead of reading from the system memory. Since the data already exists in the cache memory, the above memory read command can be almost completed immediately. The cache memory is stored in a cache line or a cache block. The cache line is the smallest unit of data transferred between the cache memory and the system memory. For example, the size of the cache line can be 64 bytes. When a memory read instruction misses the cache memory, an entire cache line where the read address is located is loaded into the cache memory, and 1294569 14394twf.doc/m is not only required to load the read command. Poor material. Thus, the read command that reads the same cache line data later can be executed very quickly because the data is in the cache memory without reading the system memory. In addition, when a memory write command is executed, such as a st〇re command or a push command, if the write address hits the cache memory, the data can be directly written to the cache line. Delay writing data to system memory. Later, the cache memory writes the cache line to the system memory, usually to make room for the new cache line. The above program is often referred to as a writeback operation. In addition, some cache memories update the entry after the memory write address misses the cache. That is to say, the cache memory will first write back to a record unit of the cache memory as the cache line of the towel (4), and the button will be written in the Shaanxi line where the address is located; k system memory is loaded before the old The cache memory card (four) towel occupied by the cache line. The above is often referred to as a write allocate operation.冋Efficient cache memory can greatly improve the efficiency of the microprocessor. The two main factors that drive memory efficiency are the cache hit rate and the cache access time (4) such as & (10)s ti (4). The rate of the secret is the silk scarf: the time when the number of hits and the number of misses = the sum of the sums. The access coffee is the number of clock cycles of the core of the microprocessor that is required to read the specified material from the cache to the cache memory.戍 j 快快韩(4) shouting because of Qiansi’s memory _ size, also said that the quick-received Lai Keling (four) material number. The larger the silk memory is 1294569 fH / 14394twf.doc/m, the larger the subset of the system memory data that can be stored, the easier the cache line where the data is located exists in the cache memory. Therefore, cache memory always has a tendency to become larger. Traditionally, the size of the cache memory is typically limited by the space available on the microprocessor chip's space that can be allocated to the cache memory. But as circuit components get smaller and smaller, this limitation has gradually disappeared. However, the size of the cache memory also affects the access time of traditional cache memory. Lin's is that the larger the cache memory, the longer the access time. This is because the pass-memory memory is random access memory (_access memory), that is, the time required to access any of the cache lines within the cache memory is the same. Cache the location of the memory in the memory, the circuit will be more _, need to return to (4) time to find the data specified by the memory address. Fortunately, the continuous shrinking and negative cache access time of the circuit components helps to compensate for the increase in the memory of the cache memory and accelerates the clock frequency of the microprocessor (-k ftequenCy) to access the cache. Memory. Therefore, there is now one; = the internal cache of the machine (four), _ is the first - small micro-L1) cache 々丨駚如如如如如如如如如 be be be be be be be be be be be be be be be be be be be be be be be be be be be For example, S, Pentium _'s level-level Cache is already = _ compared to Pentiumm8's == has 16KB. If you want to record the clock capacity, and you can use the machine core clock. Take the sparse ^, although reducing the cache memory will reduce the effect of 129453⁄4 4twf.doc/m capacity shrink 3" - method 'to increase the effective two goals of the cache memory. Tan Guan, or Lang Shi reached the above [invention content 】 In terms of point of view, the present invention provides _r;:;r; rr _' to store column values. When pushing the resource of the instruction, the value of the column of the push data column is stored. Will be selectively pushed to the top of the last-in, first-out memory. When the push data is in the newly configured cache line in the cache memory, the column value will be pushed in; otherwise, the column value will not be Push in the first-in first-out memory. This device uses pop-up instructions that are usually associated with the previous push command to make a guess pop-up. The push is associated with the pop-up command. In response to the pop-up command, the device immediately provides the data from the cache line of the last-in-first-out memory of the top-most record unit, without waiting for the pop-up source. The result of the calculation of the address does not wait for the decision of whether the source address hits the cache memory. The device will calculate the pop-up source address later and compare its index portion with the top-most record unit of the last-in first-out memory. The column value determines whether the data supplied to the pop-up instruction is correct or not. If it is not correct, there will be a corrective action to provide the correct information. If the pop-up instruction will cause the next pop-up instruction to use another cache line, that is to say If the pop-up instruction is popping up the last data of the cache line specified by the column value, the device will list the above columns.

I294H 數值彈出後進先出記憶體。在一實施例中，上述的快取記憶體是一個集合關聯式快取C憶體(set associative cache)。在本實施例中，此骏置包括一個第二後進先出記憶體，以儲存欄數值。在推入程序中，在快取記憶體當中儲存推入資料的欄，也會根據^ 列數值相同的條件，選擇性地被推人第二後進先出記憶，。在彈出程序中，此裝置會立刻從後進先出記憶體最項鲁端，錄單位的列數值，以及第二後進先出記憶體最頂端記、 f單位的攔數值所共同指定的快取線猜測提供資料，而不等待彈出來源位址的計算結果，也不等待來源位址是否擊中快取記憶體的決定結果。在一實施例中，第一與第二後 : 進先出記憶體是同一個後進先出記憶體，其中列數值與攔 • 數值是成對儲存。、在一實施例中，此裝置會記錄後進先出記憶體最頂端記錄單=的列數值與酿值所指定的快取線當中，最近被推入的資料位移值。此位移值在每次執行彈出或推入指令 # 日夺都會更新。如果彈出指令使得此位移值指向下一個快取 $，後進先出記憶體就會被彈出。反之，如果推入指令使得位移值指向上-個快取線，推入的快取線的列與棚會被推入後進先出記憶體。此外’如果有直接修改堆疊指標的指令，此位移值也會跟著修改，例如將一數值加上堆疊指標暫存器的指令。如果上述的修改造成此位移值指向下一個快取線，後進先出記憶體就會跟著彈出。快速彈出程序提供彈出資料，可以比沒有上述裝置的 12945^2— 快取記憶體快上幾鱗脈補。更明麵說，猜測提供出資料的時間’並不包含計算彈出來源位址與轉譯實體來源位址所需的時間。此外，猜測提供彈出資料所需的時間，也不包含標籤比較所需的時間。因此，在至少一個實施例中，提供彈出資料比傳統快取記憶體快上三個時脈週期。【實施方式】本毛、月疋利用’私式通常會將系統記憶體分為兩個區域也就是，區(stack region)和非堆疊區—·伽汰 region)。非堆豐區通常也稱為堆積(heap)。堆疊與堆積的主要差別是，堆射崎齡取，鱗枝常是轉進先出 Oast-in-flm-out，縮寫為UF〇)的方式存取。堆疊積的另-個差別’是讀寫指令的位址表達方式。讀取入堆積的齡，通常是直接指定記龍位址。而讀取= 通常是用微處理機的一個特殊暫存器間接才曰疋5己丨思體位址，這個暫存器通常稱為堆疊指桿暫 (他ck _ter register)。推入指令(_)會以要推二堆最: ::以新2指標暫存器，然後將資料從微處理“ 到已更新的堆疊指標暫存器所儲存之記贿ER指令)會以要推入的資料= 遞減隹豐“暫存器（例如，假如資料是雙字組（—Μ，或douMeword)，大小就是4個位元組(byte))，貢料存放在堆疊中，已更新的堆疊指標暫存器所指㈣位址。反之，Μ指令(卿)會從堆4指標暫存时放的位址 12The I294H value pops up in the first in, first out memory. In one embodiment, the cache memory described above is a set associative cache. In this embodiment, the device includes a second last in first out memory to store the column values. In the push program, the column for storing the push data in the cache memory is selectively pushed to the second last in, first out memory according to the same condition of the column value. In the pop-up program, the device will immediately select the cache line from the last item of the memory, the column value of the unit, and the cache line of the second last-in first-out memory. Guess the data provided without waiting for the result of the pop-up source address calculation, nor for the decision of whether the source address hits the cache memory. In one embodiment, the first and second post-in and first-out memories are the same last-in, first-out memory, wherein the column values and the block values are stored in pairs. In one embodiment, the device records the column value of the last top-in memory of the last-in, first-out memory and the data displacement value of the most recently pushed data among the cache lines specified by the brew value. This displacement value is updated each time a pop-up or push-in command is executed. If the pop-up instruction causes the displacement value to point to the next cache $, the last-in first-out memory will be ejected. Conversely, if the push command causes the displacement value to point to the upper-fast line, the column and slot of the pushed cache line are pushed into the last-in first-out memory. In addition, if there is an instruction to directly modify the stack indicator, the displacement value will be modified, for example, by adding a value to the instruction of the stacked index register. If the above modification causes the displacement value to point to the next cache line, the last in first out memory will pop up. The quick pop-up program provides pop-up data, which can be faster than the 12945^2-cache memory without the above device. To be more explicit, guessing when the data is provided does not include the time required to calculate the pop-up source address and translate the entity source address. In addition, guess the time it takes to provide pop-up data, and it does not include the time required for label comparison. Thus, in at least one embodiment, the pop-up data is provided three clock cycles faster than conventional cache memory. [Embodiment] The use of the private mode of the hair and the moon is usually divided into two regions, that is, a stack region and a non-stack region. Non-heap areas are also commonly referred to as heaps. The main difference between stacking and stacking is that the stacking age is taken, and the scales are often accessed in the way of Oast-in-flm-out (abbreviated as UF〇). The other difference in stacking is the address representation of the read and write instructions. The age of reading into the stack is usually specified directly. The read = is usually indirect with a special register of the microprocessor. This register is usually called the stack cockerel (the ck _ter register). The push command (_) will push the second stack: :: with the new 2 indicator register, and then the data from the micro-processing "to the updated stack indicator register stored in the bribe ER command" will The data to be pushed = Decrement 隹 “ “Scratchpad (for example, if the data is a double word (—Μ, or douMeword), the size is 4 bytes), the tribute is stored in the stack, The updated stack indicator register refers to the (4) address. On the other hand, the Μ instruction (Qing) will put the address when it is temporarily stored from the heap 4 indicator.

1294569 … 14394twf.doc/m 讀取資料’將資料载入微處理機的出堆疊的資料大小，更新堆衫神心料& a後以弹 x86 且私軚暫存器。舉例而言，在木構巾#出扎令（例如p〇p、re 會以彈出的資料大小’遞增堆最 ^LEAVEk ) ,θ隹且柏裇暫存裔。因此，傳統成長(\η」Γ性’堆疊都是隨著資料推入而往上成長（也，n魏體恤會逐漸彈出而向下卿（也就是說，記憶體位址會逐漸因此暫存糾存放的數值稱鱗疊_端_。田堆豐疋配置記憶體空間的傳統機制。在—般程式中，堆豐的主，，途之―，是推人副程式(瞻。的參數 (rameters)與上料叫程式的返回位址。被呼叫的副程式曰將返、回位址彈入微處理機的程式計數器(program ΤΓΠ’Γ时叫程式，㈣叫程式會彈出參數，將堆，恢復原狀。這個概念有個極佳的特色是可用於遞迴副程式呼叫（nested subroutine call)。此處描述的快速彈出程序，是利用推入與彈出指令之間通常有-對-的關係。也就是說’每個彈出指令所彈出的資料，通常都是由前面一個對應的推人指令推入堆疊。在本發明說明書中中，彈出指令是將資料從記憶ς移入微處理機的齡’例如移域處理_暫存驗扣細r =當中的-轉存n ’而資料所在的記憶體位址是間接才曰疋，而非由指令直接指定。更詳細的說，彈出資料的記憶體位址是根據於堆疊指標暫存器所儲存的數值。'在X% 架構中’彈出指令的範例是POP、RET、以及LEAVE指 13 1294569 14394twf.doc/m 令，它們的來源運算元(s〇urce 〇perand)是根據於堆疊指標暫存器的内谷數值之相對位址，而它們的目標運算元 (destinahon operand)則指定微處理機當中的一個暫存器。在本發明說明書中，載入指令(load)是將資料從記憶體移入微處理機的非彈出指令。也就是說，載入指令會直接指定目標資料所在的記憶體位址，至少也會直接指定一個或一組指定目標資料來源的記憶體位址的暫存器。在 x86架構中的載入指令範例是M〇v指令，它的來源運算元指定一個記憶體位置，而它的目標運算元則指定微處= 機暫存器組的一個暫存器。1294569 ... 14394twf.doc/m Read the data 'Load the data into the microprocessor's stacking data size, update the stacker & a to play x86 and private register. For example, in the wooden frame #出出令 (for example, p〇p, re will increase the heap size by the size of the pop-up data), θ隹 and the 裇裇裇。. Therefore, the traditional growth (\η" Γ '' stacking is growing with the data push (also, n Wei compassion will gradually pop up and down (that is, the memory address will gradually be temporarily fixed) The stored value is called the scale _ end _. The traditional mechanism of the memory space of Tian Duanfeng. In the general program, the master of the heap, the way, is the subroutine of the subroutine ) and the return address of the program called the program. The called sub-program will bounce and return the address into the microprocessor's program counter (program ΤΓΠ 'Γ 程式 program, (4) called program will pop up parameters, the heap will be restored The original. This concept has an excellent feature that can be used for nested subroutine calls. The quick pop-up procedure described here uses the usual -to-to-bet relationship between push and pop instructions. That is to say, 'the data popped up by each pop-up instruction is usually pushed into the stack by the previous corresponding push command. In the present specification, the pop-up command is the age of moving data from the memory to the microprocessor'. Shift processing _ temporary check deduction fine r = among the - transfer n ' and the memory address of the data is indirect, not directly specified by the instruction. In more detail, the memory address of the pop-up data is based on the stack The value stored in the indicator register. 'Examples of 'pop-up instructions in X% architecture' are POP, RET, and LEAVE refer to 13 1294569 14394twf.doc/m, their source operands (s〇urce 〇perand) are According to the relative address of the inner valley value of the stack indicator register, their target operand (destinahon operand) specifies a register among the microprocessors. In the present specification, the load instruction (load) Is a non-pop-up instruction that moves data from the memory to the microprocessor. That is, the load instruction directly specifies the memory address of the target data, and at least directly specifies one or a set of memory addresses of the specified target data source. The scratchpad. The load instruction example in the x86 architecture is the M〇v instruction, whose source operand specifies a memory location, and its target operand specifies the microsite = a temporary temporary register group Save.

“，本發明說明書中，推入指令是將資料從微處理機移至兄憶體的指令，其中的記憶體位置是間接指定，而不是在指令中直接指定。更明確的說，推人龍的記憶體位= 是根據於微處理機的堆疊指標暫存器中所存放的數值。在 xf 6架構中’推入指令的範例為PUSH、cALL、以及ENT£R #曰々匕們的目標運异元是根據於堆疊指標暫存器的内容數，的相對位址，而它們的來源運算元則指定微處理機暫存器組的一個暫存器。在本I明5兒明書中，存放指令(store)是將資料從微處理，移入記憶_非推人指令。也就是說，存放指令會^ 接才曰疋貧料要存放的記憶體位址，至少也會直接指定一個或一組指定資料要存放的記憶體位址之暫存器。在χ86架構中的存放指令範例是MOV指令，它的目標運算元指= 個记憶體位置，而它的來源運算元則指定微處理機暫存 14 1294569 14394twf.doc/m 器組的一個暫存器。請參照圖卜圖1是根據本發明的管線(pipeline)式微處理機100的方塊圖。在一實施例中，微處理機1〇〇包括 ^個微處理機，其指令集(instruction set)大體上遵循X86 架構的指令集。更明確的說，它的指令集至少包括χ86的 POP、PUSH、CALL、RET、ENTER、以及 LEAVE 指令。此外，它的指令集也包括從記憶體載入資料，以及存放資 _ 料到記憶體的指令，例如x86的MOV指令。不過，本發月並不侷限於採用χ86架構的微處理機，亦不侷限於x86 指令集。 ' 夕微處理機1〇〇包括暫存器組112。暫存器組112包括 ' 多數，f存器，用以儲存微處理機100的運算元(operand) ：與狀態貧訊。在一實施例中，暫存器組112包括一般用途暫存器(general pUrpose registers)、位址區段暫存 segment^ . ^1^# |i (index 制暫存為（status and control registers)、以及指令指標暫存 _ 器（instruction p〇inter register)，或稱為程式計數暫存器 (program counter register)。在一實施例中，暫存器組 i 12 至少包括一個使用者可見的，x86架構微處理機的暫存器集合:更明確的說，暫存器組112包括儲存堆疊頂端位^ 的堆疊指標暫存器152。在一實施例中，堆疊指標暫存器 152很類似χ86的ESP暫存器。微處理機1〇〇包括指令快取伽也11如011(^11幻1〇2，用以儲存指令碼的快取線(cache line)。在一實施例中，指人 15 1294569 14394twf.doc/m"In the specification of the present invention, the push command is an instruction to move data from the microprocessor to the brother memory, wherein the memory location is specified indirectly, rather than directly in the instruction. More specifically, the push dragon The memory bit = is based on the value stored in the microprocessor's stack indicator register. In the xf 6 architecture, the examples of push instructions are PUSH, cALL, and ENT£R #曰々匕The different elements are relative addresses based on the number of contents of the stacked indicator registers, and their source operands specify a register of the microprocessor register group. In this book, The store instruction is to transfer the data from the micro-processing to the memory-non-pushing instruction. That is to say, the storage instruction will connect to the memory address to be stored in the poor material, at least one or a group of directly. A temporary register for the memory address to which the data is to be stored. The example of the store instruction in the χ86 architecture is the MOV instruction, whose target operand refers to = memory location, and its source operand specifies the microprocessor. Save 14 1294569 14394twf.doc/m 1 is a block diagram of a pipeline microprocessor 100 in accordance with the present invention. In one embodiment, the microprocessor 1 includes a microprocessor. Its instruction set generally follows the instruction set of the X86 architecture. More specifically, its instruction set includes at least POP86 POP, PUSH, CALL, RET, ENTER, and LEAVE instructions. In addition, its instruction set is also This includes loading data from memory and storing instructions to memory, such as the x86 MOV instruction. However, this month is not limited to microprocessors using the χ86 architecture, nor is it limited to the x86 instruction set. The processor 1 includes a register set 112. The register set 112 includes a 'majority', an operand for storing the operand of the microprocessor 100: with state lag. In the example, the register group 112 includes general pUrpose registers, address segment temporary segment^^^^^# |i (status and control registers, and instructions) Indicator temporary _ (instruction p〇inter regis Ter), or as a program counter register. In one embodiment, the register set i 12 includes at least one user-visible, x86 architecture microprocessor register set: more specific The register set 112 includes a stack indicator register 152 that stores the top bits of the stack. In one embodiment, the stacked indicator registers 152 are very similar to the ESP registers of the 86. The microprocessor 1 includes an instruction cache gamma 11 such as 011 (^11 幻1〇2, for storing a cache line of the instruction code. In one embodiment, the finger 15 1294569 14394twf.doc /m

快取102包含-個第—級（leveM =指令快取搬儲存_接於微處理機⑽料统$ ，抓取的指令’例如推人與彈出指令: 疊指標暫存器152的堆疊恤;止= 存取系統記憶體内的堆疊。 =理機100也包_接於指令快取1〇2的匯流排介面早=(bUS mterfaee unit) 118。匯流排介面單元⑽ 於微處理機匯流排(proce肅bus) 132，經由微處理機匯流排㈣cess〇rbus) m將微處理機100連接至系統記憶體。匯流排介面單元118是微處理機⑽之内的各㈣件與微處理機匯流排132的介面。舉例而言，匯流排介面單元u8 從系統記憶體抓取指令到指令快取1〇2。此外，匯流排介面單元118會在系統§己憶體讀取或寫入資料，例如位於系統記憶體中，由堆疊指標暫存器152指示頂端位址的堆疊: 微處理機100也包括輕接於指令快取102的指令抓取态（instruction fetcher) 104。指令抓取器1〇4從指令快取1〇2 抓取指令。指令抓取器104循序抓取暫存器組112當中的指令指標暫存器（instruction pointer register)所指定的下一個指令，除非遇到變更程式流程的事件，例如分支指令 (branch instruction)，此時指令抓取器1〇4會開始抓取位於分支指令的目標位址之指令，或例外事件(eXCepti〇n)，此時指令抓取器104會開始抓取對應的例外事件處理程式 (exception handler routine)的指令。微處理機100也包括耦接於指令抓取器104的微程式 16 1294^1 f.doc/m 碼記憶體(microcode memory) 128。微程式碼記憶體128存放指令抓取器104要抓取的指令。更明確的說，微程式碼記憶體128包括例外事件處理程式的指令，以處理微處理機1〇〇產生的各種例外事件。在一實施例中，微處理機1〇〇會在偵測到彈出或推入資料的猜測錯誤時，產生一個例外事件，以修正微處理機1〇〇的堆疊存取狀態，細節後述。微處理機100也包括耦接於指令抓取器1〇4的指令轉譯器（instruction translator) 106。指令轉譯器1〇6從指令抓取器104接收指令，例如推入與彈出指令，將指令解碼，然後轉譯為微處理機100的管線的其他部分所執行的微程 ^碼(miCroinstructions)。在一實施例中，微處理機丨⑻的官線的其他部分包括一個執行微程式碼的精簡指令集電腦 (red職d inStruction set c〇mpmer，縮寫為 Risc)核心。在另一實施例中，指令轉譯器1〇6會為每個指令產生一個標不(mdicator)，以指出微程式碼所據以轉譯的指令於推入、彈出、載人、或存放，亦稱為巨集程式碼轉。微處理機100也包括麵接於指令 ^ 的指令排程器108。指令排程器108從指令轉二二106接收已轉譯的微程式碼，並且發出微程式碼134 、、口執仃微，式石馬的執行單元(execud〇nun㈣ιΐ4。抵114自指令排程器108接收微程式碼134， n4 + ° 152的内谷數值，並執行微程式碼 °在—只施例中’執行單元114包括—個整數單元 17 1294569 14394twf.doc/m (integer 仙 it)、一個浮點單元(fl〇atingp〇intunit)、一個多媒體延伸單元(MMX unit)、一個串流單指令多資料延伸單元 (SSE unit)、一個分支單元(branch皿抝、一個載入單元(1⑽d unit):以及一個存放單元(st〇re unit)。載入單元執行係將資料從系統記憶體載入微處理機1〇〇的指令，包括彈出指々存放單元執行存放指令，也就是將資料從微處理機工⑻ 存放至系統記憶體的指令，包括推入指令。 •微處理機100也包括耦接於執行單元114的寫回單元 (wi^te-backstage) 116。寫回單元116接收執行單元114執行才曰々的、、、α果，並且將結果，例如彈出指令的資料，寫回暫存器組112。 ..... 微處理機1〇〇也包括資料快取(data cache) 126，資料快取126透過匯流排136耦接於匯流排介面單元118，並十透過匯流排138耦接於執行單元114。在一實施例中，貧料快取126是第一級的資料快取記憶體。資料快取126 包括堆疊快取(stack cache) 124與非堆疊快取(non_stack cac^e) 122。匯流排介面單元118將資料從系統記憶體抓取 ^貧料，取126，並且從資料快取126寫人系統記憶體。。羊、、、田的況，匯流排介面單元118會從堆疊快取124與非堆璺快取122將快取線寫㈣統記憶體，並且從系統記憶體讀取快取線，以寫入堆疊快取124與非堆疊快取122所配，的記錄單位(all〇cated entry)。更明確的說，匯流排介面單元118會在系統記憶體的堆疊與堆疊快取124之間，傳送推入與彈出指令所指定的資料。 1294569 14394twf.d〇c/n 在一實施例中，非堆叠快取122 A體上包含一個傳統 —級請快取記憶體，其設計目標是面對隨機分布的 j記憶體位址時，可有均—的存取時離⑽㈣。在 m却=中’非堆豐快取122包括一個四攔集合關聯式快 ^ .¾ -(4-way set associative cache)。不過，存放單元會二與非推人的指令資料，以決定要將資料存放在堆 =祖六24或f堆讀取122。存放單元會將推入指令的二;：放在堆*絲124，而不是非堆疊快取122，且將非料，如5放指令#料’存放在非堆疊快取同非堆宜快取122就與傳統的快取記憶體不冋。堆豐快取124在稍後會配合圖2進一步解說。 rle1r實施例中’微處理機100也包括一個第二級取二及;以支援第-級_ 快取是存放從第一級資料:： = 第二級盘换蟲w… 取2 (包括非堆疊快取122 :從ίι_4)移出的快取線’而第一級資料快取126 曰從第一級快取記憶體抓取快取線。请參照圖2,圖2緣示根據於本發明的顧-個料，出堆tL 2memory)的儲存單位(st〇rage ei~s)。雖缺 124本身是堆疊，但與系統記憶體之内的堆t 端=。it由堆疊指標暫存器152的内容數值指ΐ頂 =置。而堆$快取124是儲存來自系統記憶體的堆疊的 19 1294569 14394twf.doc/m 圖2的實施例包括16個儲存單位，或稱為記錄單位 (entry)，分別標示為〇到15。最頂端的記錄單位稱為記錄單位0，而最底端的稱為記錄單位15。不過，本發明在堆疊快取124當中，並不侷限於特定數量的記錄單位。每一個記錄單位都有空間以容納快取線2〇6的資料、快取線2〇6 的位址標籤(address tag) 204、以及快取線206的快取狀態 (cache status) 202。在一實施例中，快取狀態2〇2大體遵循廣為人知的四種表示快取一致性(c〇herenCy)的狀態，也就疋已修改(Modified)、獨有（Exclusive)、共享（Shared)、以及無效(Invalid)，合稱為MESI。在一實施例中，一個快取線206包含64位元組的資料。在另一實施例中，位址標籤 204包含快取線206的實體位址(physical address)。在一實施例中，位址標籤204包括快取線206的實體位址的高段位元(upper significant bits)，用來唯一辨識快取線206。在一實施例中，微處理機1〇〇包含一個記憶體分頁系統(memory paging system)，負責將虛擬記憶體位址 (virtual memory addresses)轉譯為實體記憶體位址，而且位址標籤204也包含快取線206的虛擬位址。在一實施例中，這個虛擬位址其實是虛擬位址位元的雜湊(hash)結果，以降低所需的儲存空間。以下會詳細說明如何利用位址標籤 204的虛擬位址部分，在堆疊快取124做猜測載入 (speculative loads) 〇堆疊快取124接收新的快取狀態，以透過 sc-write一MESI訊號212放入最頂端的記錄單位的快取狀 20 1294569 14394twf.doc/m 恶202攔位。堆疊快取124接收新的位址標籤，以透過 sc—write一tag訊號214放入最頂端的記錄單位的位址標籤 204攔位。堆豐快取124接收新的快取線，以透過 sc一write一data訊號216放入最頂端的記錄單位的資料2〇6 攔位。堆疊快取124也從圖3的控制邏輯電路(c〇ntr〇11〇gic) 302接收push—sc訊號232。當控制邏輯電路3〇2將push—sc 口孔號232 a又為真值(true value)，堆疊快取124會向下平移一個d錄單位，也就是說，最底下的記錄單位會移出堆疊快取124，其餘每一個記錄單位都會接受上一個記錄單位的儲存内容，而且 sc一writeJViESI 訊號 212、sc—write—tag 訊號214、以及sc—write一data訊號216的内容會寫入位於堆f快取124隶頂端的記錄单位。在一實施例中，快取線 206的每一個雙字組可以透過sC-Write—data訊號216個別寫入。在另一實施例中，一個雙字組包含四個位元組。本發明也包括其他實施例，其中堆疊快取124之中的快取線 206的每個字組（word，即2個位元組）或每個位元組，都可以透過sc—write—data訊號216個別寫入。堆疊快取124以scJV[ESI[15:0]訊號222提供十六個記錄單位的MESI狀態202。堆疊快取124以sc_tag[15:0] 訊號224提供十六個記錄單位的位址標籤204。堆疊快取 124以sc—data[15:0]訊號226提供十六個記錄單位的快取線資料206。最頂端的記錄單位的快取線206是以sc_data[0；| 訊號提供，第二個記錄單位的快取線206是以sc_data[lj 訊號提供，依此類推，最底下的記錄單位的快取線2〇6是 21 1294569 143 94twf.doc/m 以SC_data[15]訊號提供。位址標籤2〇4與Mesi狀態2〇2 也是以相同方式提供。堆疊快取124也接收來自圖3的控制邏輯電路302的pGp—se訊號234。#控制邏輯電路3〇2 設定P〇P_SC訊號234為真值時，堆疊快取124會往上平 f-個記錄單位，也就是說’最上方的記錄單位會移出堆 i决取124，而其餘母個έ己錄單位會接收下一個記錄單位 1内容。在β—實施例中’當一個記錄單位彈出堆疊快取i Μ 日:，也就疋當p〇p_sc訊號234為真值時，堆疊快取124 最底下的記錄單位的MESI狀態2G2會更新為無效。剛開始堆疊快取124的所有記錄單位的MESI狀態2〇2都是無效。 …、請參照圖3，圖3是根據本發明的，圖丨當中的堆疊快取124的附加元件的方塊圖。堆疊快取124包括控二輯電路302。 0控制邏輯電路302接收來自圖丨執行單元114的存放單元之push—instr訊號342。當push—instr訊號342為真值時，表示存放單元正請求將資料存入圖丨的資料快取 126，以回應來自圖〗的指令排程器1〇8的推入指令。控制邏輯電路302也接收來自執行單元114的載入單元的pop—instr訊號344。當pop一instr訊號344為真值時，表示載入單元正請求從資料快取126載入資料，以回應接收自指令排程器108的彈出指令。控制邏輯電路302也接收來自執行單元丨14的載入單元的load一instr訊號346。當load一instr訊號346為真值時， 22 1294569 14394twf.doc/m 表示載入單元正請求從資料快取126載入資料，以回應接收自指令排程器108的載入指令。控制邏輯電路302也接收來自執行單元114的存放單元的store一instr訊號348。當store一instr訊號348為真值時，表示存放單元正請求存放資料到資料快取126,以回應接收自指令排程器108的存放指令。控制邏輯電路302也接收來自執行單元114的整數單元的 add—sp—instr 訊號 352。當 Add—sp一instr 訊號 352 為真值時，表示整數單元正通知資料快取126有來自指令排程器108的堆疊指標加法指令（add t0 the stack pointer instruction)，例如χ86的ADD指令。在一實施例中，這種指令是將常數加到堆疊指標暫存器，就像ADD ESP,imm 指令。堆4：快取124也包含位址產生器generat〇r) 306。位址產生器306接收來自圖1的暫存器組122的運算元，例如基準值(base values)、位移值(0ffsets)、以及記憶體描述值(memory descriptor values)，並根據接收的數值產生虛擬位址334。虛擬位址334是存取記憶體指令的虛擬記憶體位址，例如推入、彈出、載入、或存放指令。以載入指令來說，虛擬位址334是資料來源的虛擬位址。以存放指令來說，虛擬位址334是資料目的地的虛擬位址。以彈出指令來說，虛擬位址334是彈出資料來源的虛擬位址。以推入指令來說，虛擬位址334是推入資料的目的地之虛擬位址。在一實施例中，載入與存放單元都包含位址 23 1294569 143 94twf.doc/m 產生器306。堆疊快取124也包含耦接於位址產生器306的轉譯暫存區（translation lookaside buffer，簡稱為 TLB) 308。轉譯暫存區308儲存分頁表(page tabie)的資訊，以將虛擬位址334轉譯為實體位址336。在一實施例中，轉譯暫存區 308只轉譯實體位址336的高位部分，而低位部份就是虛擬位址334的對應低位部分。在另一實施例中，一個頁至少有4KB，因此實體位址336的最低12個位元不會經過轉譯。堆疊快取124也包括兩個耦接於位址產生器306的比較器（comparator) 312。每個比較器312各自接收虛擬位址 334。比較态312的其中之一接收圖2的sc__tag[0]訊號224 的虛擬位址部分’而另一個比較器312接收sc_tag[l]訊號224的虛擬位址部分。也就是說，兩個比較器312各自接收圖2當中堆疊快取124的最頂端兩個記錄單位的位址標籤204之虛擬位址部分，並各自比較對應的虛擬sc_tag 訊號224與虛擬位址334。如果虛擬sc—tag[0]訊號224等於虛擬位址334，第一個比較器312會在VA—match[0]訊號362產生真值，而VA一match[0]訊號362會提供到控制邏輯電路302。同理，如果虛擬scjag[l]訊號224等於虛擬位址334,第二個比較器312會在VA—match[l]訊號362 產生真值，而VA-match[l]訊號362也會提供到控制邏輯電路302。控制邏輯電路302也接收圖2當中來自堆疊快取124的sc—MESI[15:0]訊號222。控制邏輯電路302會 24 1294569^, 使用 VA—matchHiO]訊號 362 與 sc—MESI[1:〇]訊號您以決定虛擬位址334是否擊中堆叠快取124最上端的兩個記錄單位其中之-，以從堆疊快取124進行猜測載入，以下會有詳細說明。也就是說’控制邏輯電路3〇2使用 VA—matchniO]訊號 362 與 sc—MEsi[1:〇]訊號 222 來決定虛擬位址334是否等於虛擬sc—tag[1:〇]訊號224的其中任何一個有效(valid)的虛擬位址部分。在一實施例中了虛擬位址標藏204是虛擬位址位元的雜湊結果，虛擬位址334 ® 是在供給比較器312之前就經過雜凑。要注意的是，雖然圖3的一個實施例中，會檢查堆疊快取124的最頂端兩個記錄單位，以決定是否要進彳^猜^ ；載入，本發明也包含檢查兩個以上頂端記錄單位的其他實 - 施例，而且在另一個實施例中，只檢查最頂端的一個記錄 ♦ 單位。檢查的資料項越多，偵測到可進行快速載入的機會也越大。因此，快取線越大，需要檢查的記錄單位就越少。圖3的實施例是檢查128個位元組。 # 堆疊快取124也包含耦接於轉譯暫存區308的十六個比較器314。這些比較器314各自接收實體位址336，也各自接收對應的scj;ag[15:0]訊號224其中之一。也就是說，比較器314接收對應sc—tag訊號224其中的位址標籤204 的實體位址部分，並且比較它和實體位址336。如果實體 sc一tag[0] sfl號224和實體位址336相同，第一個比較器 314 就會在 PA—match[0]訊號 364 產生真值，PA_match[0] 訊號364會提供到控制邏輯電路3〇2 ;如果實體sc_tag[i] 25 1294棚— 訊號364 224和實體位址336相同，第二個比較器314就奚在PA一match[l]成號364產生真值，pA」natch[l]訊號 364也會提供到控制邏輯電路3〇2 ;依此類推，直到第十六個比較态314。控制邏輯電路3〇2使用pA—match[15:〇]訊號364與sc—MESI[15:0]訊號222以決定實體位址336是否擊中堆璺快取124的任何一個記錄單位，以從堆疊快取 124載入貧料’並決定猜測彈出與猜測載入的資料是否正 _ 確，細節後述。也就是說，控制邏輯電路3〇2使用 PA-match[15:0]訊號 364 與 sc—MESI[15:0]訊號 222 決定實體位址336是否與sc—tag[15:〇]訊號224其中一個有效的實體位址部分相同。，· 控制邏輯電路302也產生sc—hit訊號389，scjiit訊號 - 389會提供到執行單元U4的載入與存放單元，以指出彈出、推入、載入、或存放指令所涉及的快取線至少在猜測上存在於堆疊快取124。就彈出指令來說，控制邏輯電路 302在sc—hit訊號389猜測產生真值，以在確認彈出的來 _ 源位址擊中堆疊快取124之前，回應pop—instr訊號344的真值，細節會在後面配合圖5詳述。就推入指令來說，當 sc—MESI[15:0]訊號 222 以及 pa—match[15:0]訊號 364 指出實體位址336等於堆疊快取124其中一個有效的實體位址標籤，或當堆疊快取124配置實體位址336牽涉的快取線時’控制邏輯電路302會在sc_hit訊號389產生真值，後面會配合圖6詳加說明。就載入指令來說，當 sc—MESI[1:0]訊號 222 以及 VA_match[l:0]訊號 362 指出 26 I294H。- 虛擬位址334等於堆疊快取124最頂端的紀錄單位的其中一個有效的虛擬位址標籤時，控制邏輯電路3〇2會在％訊號389猜測產生真值，或在sc_MESI[15:〇]訊號222以及PA一matchtKO]訊號364指出實體位址336等於堆疊快取124其中一個有效的實體位址標籤時，非猜測地在sc^hk 訊號389產生真值，後面會參考圖8解說。就存放指令而言’當 sc—MESIfao]訊號 222 以及 PA_match[15:〇]訊號 364指出實體位址336等於堆疊快取124其中一個有效的實體位址標籤時，控制邏輯電路3〇2會在訊號產生真值，後面會配合圖9進一步說明。控制邏輯電路302也自圖！的非堆疊快取122接收 n〇n-sc一hit訊號366。當實體位址336擊中非堆疊快取122 時，non-SC_hit訊號366為真值。控制邏輯電路3〇2也產生圖 2 的 push一sc 號 232 和 p〇p sc 訊號 234。堆豐快取124也包括耦接於控制邏輯電路3〇2的 φ—offset暫存器322，以儲存一個稱為fp—〇ffset的數值。暫存器322以fp一offset訊號396輸出其數值，然後提供給控制邏輯電路302。fp一offset暫存器322的數值是用來從堆疊快取124進行快速彈出程序，細節後述。從後面的圖式，尤其是圖5至圖7的流程圖可知，fp一〇ffset暫存器322 指定存放在堆疊快取124最頂端的記錄單位的快取^當中的，最近的推入指令的資料位置。也就是說，fp一〇ffset暫存器322指定一個推入指令的資料位置，而且這個資料尚未被彈出主記憶體之内的堆疊。在一實施例中，电〇饱以 27 1294569 14394twf.doc/m 暫存器322包含一個四位元數值，以指定堆疊快取124最頂端的記錄單位所儲存的快取線206其中的十六個雙字組其中之一的位移值。控制邏輯電路302監視彈出、推入、與堆疊指標加法指令(add to stack pointer instructions)，以預期堆豐指標暫存器152的改變，並且使fp一〇ffset暫存器 322的數值與堆豐指標暫存器ία的位元[5:2]保持一致。在一實施例中，控制邏輯電路3〇2會在執行單元ii4 ❿ 的載入、存放、或整數單元分別指出有彈出、推入、或堆 :£#曰&加法指令時’更新fp—〇ffset暫存器322。在另一實施例中，控制邏輯電路302更新fp一〇ffset暫存器322時^ 不會等待寫回單元116更新堆疊指標暫存器152。如此， • j推入指令、堆疊指標加法指令、或其他彈出指令之後的；，出扣々可利用堆璺指標暫存器152的預期值，而不用等待寫回單元116更新堆疊指標暫存器152之後，才取得堆疊指標暫存器152的位元[5:2]。堆豐快取124也包含輕接於fp—〇ffset暫存器，有 ❿=六個輸入端的多工器(multiplexer)318。在一實施例中，多工器318的十六個輸入端各自接收sc—data[〇]訊號226 的十六個雙字組其中之一。多工器318接收I〇胸訊號 =做為轉輸人，崎取se—data[G]職的十六個雙字 ^其中之一，在fp—data訊號398輸出，並且在快速彈出 %序中供應給彈出指令，細節後述。抑堆豐快取124也包含耦接於控制邏輯電路302的算術單元(arithmetic unit) 304。算術單元 3〇4 接收 fp_〇fftet 訊 28 1294569 14394twf.doc/m 號 396。算術單元304也自控制邏輯電路3〇2接收遞減訊號 (decrement signal) 384。如果控制邏輯電路3〇2在遞減訊號 384產生真值，算術單元304會遞減從fp_〇ffset訊號396 接收的數值，將結果提供給輸出端372。如果遞減程序造成欠位(underflow)，算術單元304會在欠位訊號388產生真值，將它提供給控制邏輯電路302。算術單元304也自控制邏輯電路3〇2接收遞增訊號 (increment signal) 386。如果控制邏輯電路302在遞增訊號 386產生真值，算術單元304會遞增從fp_〇ffset訊號396 接收的數值，將結果提供給輸出端372。如果遞增程序造成溢位(overflow)，算術單元304會在溢位訊號392產生真值，將它提供給控制邏輯電路3〇2。算術單元304也自控制邏輯電路3〇2接收加法訊號 (add signal) 382。如果控制邏輯電路302在加法訊號382 產生真值，算術單元304會將接收自fp—0ffset訊號396的數值’加上接收自add—sp—val訊號394的數值，將結果提供給輸出端372。如果相加造成溢位，算術單元304會在溢位訊號392產生真值。在一實施例中，add_sp—val訊號 394是由圖1的執行單元114的整數單元提供。add_sp_val 訊號394的數值就是將一數值加上堆疊指標暫存器I%的才曰令所指定的數值。堆豐快取124也包含一個轉接於fp—〇ffset暫存器322 的兩輸入端的多工器316。多工器316的輸出端耦接至 29 1294棚twf.doc/m φ_滿et暫存器322的輸入端。多工器316以一個輸入端 =异術單元304的輸出372，以另—個輸入端接收堆疊指標暫存器152的輸出的位元[5:2]。多工器316自控制邏輯電路302接收控制訊號368，做為選擇輸入，以選取兩個輸入的其中之一，輸出至fp_〇ffset暫存器322。堆豐快取124也包含一個耦接於控制邏輯電路3〇2，有十六個輸入端的多工器326。多工器326的每個輸入端各自接收堆豐快取124提供在sc_data[15:0]訊號226的十六個快取線206其中之一。多工器326根據控制邏輯電路 3〇2產生的Writeback_mux一sel訊號328，在十六個 sc_data[15:0]訊號226當中選取其一。多工器326的輸出端提供為寫回線緩衝區(writeback line buffer) 324的輸入端。寫回線緩衝區324的輸出端則透過匯流排136提供給圖1的匯流排介面單元118。控制邏輯電路302也產生 writeback—request訊號338，後者也提供給匯流排介面單元 118。寫回線緩衝區324和writeback—request訊號338是用來將快取線從堆疊快取124寫回系統記憶體，細節後述。控制邏輯電路302會在allocate—fill_buffer訊號397 產生真值，以配置一個裝填暫存區(fill buffer)，以將快取線放入系統記憶體，或自微處理機100的另一個快取記憶體取得快取線，例如取自堆疊快取124或某個第二級快取記憶體，細節後述。控制邏輯電路302也會在例外事件訊號399產生真值，以表示有例外事件發生，使微處理機轉而執行微程式 30 12945敏wf.“ 碼記憶體128之内的例外事件處理程式，細節後述。控制邏輯電路302也產生spec—sc一load—mux sel訊號 391、normal—sc—load一mux一sel 訊電路號 393、以及The cache 102 includes a first level (leveM = instruction cache storage_connected to the microprocessor (10) system $, a fetched instruction' such as a push and pop command: a stacked shirt of the overlay indicator register 152; Stop = access to the stack in the system memory. = The processor 100 is also packaged with the bus interface of the instruction cache 1〇2 = (bUS mterfaee unit) 118. The bus interface unit (10) is connected to the microprocessor bus (proce bus) 132, the microprocessor 100 is connected to the system memory via a microprocessor bus (four) cess 〇 rbus) m. Busbar interface unit 118 is the interface between each of the four (4) components of microprocessor (10) and microprocessor busbar 132. For example, bus interface unit u8 fetches instructions from system memory to instruction cache 1〇2. In addition, the bus interface unit 118 will read or write data in the system § memory, for example in the system memory, and the stack indicator 152 indicates the stack of top addresses: the microprocessor 100 also includes a light connection. The instruction fetcher 104 of the instruction cache 102. The instruction fetcher 1〇4 fetches instructions from the instruction cache 1〇2. The instruction fetcher 104 sequentially fetches the next instruction specified by the instruction pointer register in the register group 112, unless an event that changes the program flow, such as a branch instruction, is encountered. The instruction fetcher 1〇4 will start fetching the instruction at the target address of the branch instruction, or the exception event (eXCepti〇n), at which point the instruction fetcher 104 will start fetching the corresponding exception handler (exception). Handler routine). The microprocessor 100 also includes a microprogram 16 1294^1 f.doc/m code memory 128 coupled to the instruction fetcher 104. The microcode memory 128 stores instructions to be fetched by the instruction fetcher 104. More specifically, the microcode memory 128 includes instructions for the exception event handler to handle various exception events generated by the microprocessor 1. In one embodiment, the microprocessor 1 will generate an exception event when detecting a pop-up or push-in guess error to correct the stack access state of the microprocessor, as described in more detail below. Microprocessor 100 also includes an instruction translator 106 coupled to instruction fetcher 1〇4. The instruction translator 1-6 receives instructions from the instruction fetcher 104, such as push and pop instructions, decodes the instructions, and then translates the miCroinstructions performed by other portions of the pipeline of the microprocessor 100. In one embodiment, the other portion of the official line of the microprocessor (8) includes a reduced instruction set computer (red in instruction set c〇mpmer, abbreviated as Risc) core that executes the microcode. In another embodiment, the instruction translator 1-6 generates a mdicator for each instruction to indicate that the translated code is translated, popped, loaded, or stored. Called the macro code. Microprocessor 100 also includes an instruction scheduler 108 that interfaces with instruction ^. The instruction scheduler 108 receives the translated microcode from the instruction switch 22, and issues the microcode 134, and executes the execution unit of the stone horse (execud〇nun(4) ιΐ4. 108 receives the intra-valley value of the microcode 134, n4 + ° 152, and executes the microcode. In the embodiment only, the execution unit 114 includes an integer unit 17 1294569 14394twf.doc/m (integer 仙it), A floating point unit (fl〇atingp〇intunit), a multimedia extension unit (MMX unit), a stream single instruction multiple data extension unit (SSE unit), a branch unit (branch dish, one load unit (1(10)d unit) ): and a storage unit (st〇re unit). The loading unit execution is an instruction to load data from the system memory into the microprocessor 1 , including the pop-up finger storage unit to execute the storage instruction, that is, to store the data from The microprocessor (8) stores instructions to the system memory, including push instructions. • The microprocessor 100 also includes a write-back stage 116 coupled to the execution unit 114. The write back unit 116 receives the execution unit 114 execution曰々, , , α, and write the result, such as the data of the pop-up instruction, back to the register group 112. ..... The microprocessor 1 also includes the data cache 126, data The cache 126 is coupled to the bus interface unit 118 through the bus bar 136 and coupled to the execution unit 114 via the bus bar 138. In an embodiment, the lean cache 126 is the first level data cache memory. The data cache 126 includes a stack cache 124 and a non-stack cache 122. The bus interface unit 118 grabs data from the system memory, takes 126, and extracts data from the data. Cache 126 writes the system memory. The status of the sheep, the field, the bus interface unit 118 will write the cache line from the stack cache 124 and the non-stack cache 122 (four) unified memory, and from the system memory The body reads the cache line to write the allocated entry of the stack cache 124 and the non-stack cache 122. More specifically, the bus interface unit 118 will be stacked in the system memory. The data specified by the push and pop instructions is transmitted between the stack cache 124 and the stack cache 124. 569 14394twf.d〇c/n In an embodiment, the non-stacked cache 122 A body includes a conventional-level cache memory, and the design goal is to face a randomly distributed j memory address, The access time is (10) (four). In m but the 'non-heap cache 122' includes a four-way set associative cache. However, the storage unit will use the non-pushing instruction data to determine whether to store the data in the heap = ancestor 24 or f heap read 122. The storage unit will push the second instruction of the instruction;: put it on the stack * wire 124 instead of the non-stack cache 122, and will not be expected, such as 5 put instructions # material 'stored in non-stack cache and non-stack cache 122 is not worthy of the traditional cache memory. The Heap Express 124 will be further explained later in conjunction with Figure 2. In the rle1r embodiment, the 'microprocessor 100 also includes a second level to take the second level; to support the level - _ cache is stored from the first level of data:: = second level disk swap w... take 2 (including non Stack cache 122: The cache line removed from ίι_4) and the first level data cache 126 抓 grab the cache line from the first level cache. Referring to Fig. 2, Fig. 2 shows the storage unit (st〇rage ei~s) of the pile tL 2memory according to the invention. Although 124 is itself a stack, it is with the heap t end within the system memory. It is indicated by the content value of the stack indicator register 152. The heap $cache 124 is a stack of storage from system memory. 19 1294569 14394twf.doc/m The embodiment of Figure 2 includes 16 storage units, or entries, which are labeled 〇15. The topmost record unit is called record unit 0, and the bottom line is called record unit 15. However, the present invention is not limited to a specific number of recording units in the stack cache 124. Each record unit has space to accommodate the data of the cache line 2〇6, the address tag 204 of the cache line 2〇6, and the cache status 202 of the cache line 206. In one embodiment, the cache state 2〇2 generally follows four well-known states representing cache coherency (c〇herenCy), ie modified, exclusive, shared. And invalid (Invalid), collectively known as MESI. In one embodiment, a cache line 206 contains 64-bit data. In another embodiment, the address tag 204 contains the physical address of the cache line 206. In one embodiment, the address tag 204 includes upper significant bits of the physical address of the cache line 206 for uniquely identifying the cache line 206. In one embodiment, the microprocessor 1 includes a memory paging system responsible for translating virtual memory addresses into physical memory addresses, and the address tags 204 also contain fast Take the virtual address of line 206. In one embodiment, this virtual address is actually a hash result of the virtual address bit to reduce the required storage space. The following will detail how to utilize the virtual address portion of the address tag 204 to make speculative loads on the stack cache 124. The stack cache 124 receives a new cache state to pass the sc-write MESI signal 212. Put into the top of the record unit of the cache type 20 1294569 14394twf.doc / m evil 202 stop. The stack cache 124 receives the new address tag to be placed in the address tag 204 of the topmost record unit via the sc_write one tag signal 214. The stack cache 124 receives the new cache line to place the data of the topmost record unit 2〇6 by sc-write-data signal 216. The stack cache 124 also receives the push-sc signal 232 from the control logic circuit (c〇ntr〇11〇gic) 302 of FIG. When the control logic circuit 3〇2 pushes the push-sc port number 232a to a true value, the stack cache 124 will translate a d-recording unit downward, that is, the bottom recording unit will be removed from the stack. The cache 124, each of the remaining recording units will accept the storage content of the previous recording unit, and the contents of the sc-writeJViESI signal 212, the sc-write-tag signal 214, and the sc-write-data signal 216 will be written in the heap f. Cache 124 records at the top of the record unit. In one embodiment, each double word of the cache line 206 can be individually written via the sC-Write_data signal 216. In another embodiment, a double block contains four bytes. The present invention also includes other embodiments in which each word group (word, i.e., 2 bytes) or each byte of the cache line 206 among the stack caches 124 can pass through sc-write data. Signals 216 are written individually. The stack cache 124 provides the MESI state 202 of sixteen recording units in scJV [ESI[15:0] signal 222. Stack cache 124 provides an address tag 204 of sixteen record units with sc_tag[15:0] signal 224. The stack cache 124 provides sixteen record units of cache line data 206 with an sc-data[15:0] signal 226. The cache line 206 of the topmost recording unit is provided by sc_data[0;| signal, the cache line 206 of the second recording unit is provided by sc_data[lj signal, and so on, the cache of the lowest recording unit. Line 2〇6 is 21 1294569 143 94twf.doc/m is provided by SC_data[15] signal. The address tag 2〇4 and the Mesi state 2〇2 are also provided in the same manner. Stack cache 124 also receives pGp-se signal 234 from control logic 302 of FIG. #控制逻辑电路3〇2 When the P〇P_SC signal 234 is set to true, the stack cache 124 will be flattened by f-record units, that is, the topmost record unit will be removed from the heap and the decision 124 will be taken. The remaining parent records will receive the next record unit 1 content. In the β-embodiment, 'When a recording unit pops up the stack cache i Μ day:, when the p〇p_sc signal 234 is true, the MESI state 2G2 of the bottom record unit of the stack cache 124 is updated to invalid. The MESI state 2〇2 of all the recording units of the stacking cache 124 just started is invalid. ..., please refer to Figure 3, which is a block diagram of additional components of the stacked cache 124 in the Figure, in accordance with the present invention. Stack cache 124 includes a control circuit 302. The 0 control logic circuit 302 receives the push-instr signal 342 from the storage unit of the execution unit 114. When the push-instr signal 342 is a true value, it indicates that the storage unit is requesting to store the data in the data cache 126 of the map in response to the push command from the instruction scheduler 1〇8 of the figure. Control logic circuit 302 also receives a pop-instr signal 344 from the load unit of execution unit 114. When the pop-instr signal 344 is true, it indicates that the load unit is requesting loading data from the data cache 126 in response to the pop-up instruction received from the instruction scheduler 108. Control logic circuit 302 also receives load-instr signal 346 from the load unit of execution unit 丨14. When the load-instr signal 346 is true, 22 1294569 14394twf.doc/m indicates that the load unit is requesting loading data from the data cache 126 in response to the load instruction received from the instruction scheduler 108. Control logic circuit 302 also receives store-instr signal 348 from the storage unit of execution unit 114. When the store-instr signal 348 is true, it indicates that the storage unit is requesting to store data to the data cache 126 in response to the store instruction received from the instruction scheduler 108. Control logic circuit 302 also receives an add_sp_instr signal 352 from the integer unit of execution unit 114. When the Add_sp_instr signal 352 is true, it indicates that the integer unit is notifying the data cache 126 that there is an add instruction (add t0 the stack pointer instruction) from the instruction scheduler 108, such as the ADD instruction of χ86. In one embodiment, this instruction adds a constant to the stack indicator register, just like the ADD ESP, imm instruction. Heap 4: The cache 124 also contains the address generator generat〇r) 306. The address generator 306 receives the operands from the register set 122 of FIG. 1, such as base values, shift values (0ffsets), and memory descriptor values, and generates based on the received values. Virtual address 334. The virtual address 334 is a virtual memory address that accesses a memory instruction, such as a push, pop, load, or store instruction. In the case of an input instruction, virtual address 334 is the virtual address of the data source. In the case of a store instruction, virtual address 334 is the virtual address of the data destination. In the case of a pop-up instruction, virtual address 334 is the virtual address of the pop-up data source. In terms of push instructions, virtual address 334 is the virtual address of the destination of the push data. In one embodiment, the load and store unit includes the address 23 1294569 143 94twf.doc/m generator 306. The stack cache 124 also includes a translation lookaside buffer (TLB) 308 coupled to the address generator 306. The translation buffer 308 stores information of the page tabie to translate the virtual address 334 into a physical address 336. In one embodiment, the translation buffer 308 translates only the upper portion of the physical address 336, while the lower portion is the corresponding lower portion of the virtual address 334. In another embodiment, a page has at least 4 KB, so the lowest 12 bits of the physical address 336 are not translated. The stack cache 124 also includes two comparators 312 coupled to the address generator 306. Each comparator 312 receives a virtual address 334, each. One of the comparison states 312 receives the virtual address portion ' of the sc__tag[0] signal 224 of FIG. 2 and the other comparator 312 receives the virtual address portion of the sc_tag[1] signal 224. That is, the two comparators 312 each receive the virtual address portion of the address tag 204 of the top two recording units of the stack cache 124 of FIG. 2, and compare the corresponding virtual sc_tag signal 224 and virtual address 334, respectively. . If the virtual sc_tag[0] signal 224 is equal to the virtual address 334, the first comparator 312 will generate a true value at the VA-match[0] signal 362, and the VA-match[0] signal 362 will be provided to the control logic. Circuit 302. Similarly, if the virtual scjag[l] signal 224 is equal to the virtual address 334, the second comparator 312 will generate a true value in the VA-match[l] signal 362, and the VA-match[l] signal 362 will also be provided. Control logic circuit 302. Control logic circuit 302 also receives the sc-MESI[15:0] signal 222 from stacker 124 in FIG. The control logic circuit 302 will use the VA-matchHiO] signal 362 and the sc-MESI[1:〇] signal to determine whether the virtual address 334 hits the top two recording units of the stack cache 124, The guessload is loaded from the stack cache 124, as described in more detail below. That is, the 'control logic circuit 3〇2 uses the VA-matchniO' signal 362 and the sc-MEsi[1:〇] signal 222 to determine whether the virtual address 334 is equal to any of the virtual sc-tag[1:〇] signals 224. A valid virtual address portion. In one embodiment, the virtual address tag 204 is a hash result of the virtual address bit, and the virtual address 334 ® is hashed before being supplied to the comparator 312. It should be noted that although in one embodiment of FIG. 3, the top two recording units of the stack cache 124 are checked to determine whether or not to enter the ^^^; loading, the present invention also includes checking two or more tops. Record other real-implementation units of the unit, and in another embodiment, only the topmost one record ♦ unit is checked. The more items that are checked, the greater the chances of detecting a fast load. Therefore, the larger the cache line, the fewer record units that need to be checked. The embodiment of Figure 3 is to examine 128 bytes. #Stack cache 124 also includes sixteen comparators 314 coupled to translation buffer 308. These comparators 314 each receive a physical address 336 and also each receive one of the corresponding scj;ag[15:0] signals 224. That is, the comparator 314 receives the physical address portion of the address tag 204 corresponding to the sc-tag signal 224 and compares it with the physical address 336. If the entity sc_tag[0] sfl 224 and the physical address 336 are the same, the first comparator 314 will generate a true value in the PA-match[0] signal 364, and the PA_match[0] signal 364 will be provided to the control logic. Circuit 3〇2; if the entity sc_tag[i] 25 1294 shed-signal 364 224 is the same as the physical address 336, the second comparator 314 generates a true value in the PA-match[l] number 364, pA"natch The [l] signal 364 is also provided to the control logic circuit 3〇2; and so on, up to the sixteenth comparison state 314. The control logic circuit 〇2 uses the pA-match[15:〇] signal 364 and the sc-MESI[15:0] signal 222 to determine whether the physical address 336 hits any of the record units of the stack cache 124 to Stacking cache 124 loads the poor material' and decides whether the pop-up and guessing loaded data is positive or not. Details will be described later. That is, the control logic circuit 3〇2 uses the PA-match[15:0] signal 364 and the sc-MESI[15:0] signal 222 to determine whether the physical address 336 is associated with the sc-tag[15:〇] signal 224. A valid entity address is partially the same. The control logic circuit 302 also generates a sc-hit signal 389, which is provided to the load and store unit of the execution unit U4 to indicate the cache line involved in the pop-up, push-in, load, or store instructions. At least on the guess, there is a stack cache 124. In the case of a pop-up instruction, the control logic 302 asserts a true value at the sc-hit signal 389 to respond to the true value of the pop-instr signal 344 before confirming that the pop-up source address hits the stack cache 124. It will be detailed later in conjunction with Figure 5. In the case of a push instruction, when sc-MESI[15:0] signal 222 and pa-match[15:0] signal 364 indicate that physical address 336 is equal to one of the valid entity address labels of stack cache 124, or When the stack cache 124 configures the cache line involved in the physical address 336, the control logic circuit 302 will generate a true value in the sc_hit signal 389, which will be described in detail later in FIG. For the load instruction, when the sc-MESI[1:0] signal 222 and the VA_match[l:0] signal 362 indicate 26 I294H. - When virtual address 334 is equal to one of the valid virtual address labels of the topmost record unit of stack cache 124, control logic 3〇2 will guess the true value at % signal 389, or at sc_MESI[15:〇] The signal 222 and the PA-matchtKO] signal 364 indicate that the physical address 336 is equal to one of the valid physical address labels of the stack cache 124, and the true value is generated non-guessed at the sc^hk signal 389, which will be explained later with reference to FIG. In the case of a store instruction, when the 'sc-MESIfao' signal 222 and the PA_match[15:〇] signal 364 indicate that the physical address 336 is equal to one of the valid entity address labels of the stack cache 124, the control logic circuit 3〇2 will The signal produces a true value, which will be further explained later in conjunction with FIG. Control logic circuit 302 is also self-illustrated! The non-stack cache 122 receives the n〇n-sc-hit signal 366. When the physical address 336 hits the non-stack cache 122, the non-SC_hit signal 366 is a true value. Control logic circuit 3〇2 also produces push-sc number 232 and p〇p sc signal 234 of FIG. The stack cache 124 also includes a φ-offset register 322 coupled to the control logic circuit 3〇2 to store a value called fp-〇ffset. The register 322 outputs its value as an fp-offset signal 396 and is then supplied to the control logic circuit 302. The value of the fp-offset register 322 is used to perform a quick pop-up procedure from the stack cache 124, as described in detail below. As can be seen from the following figures, especially the flowcharts of FIGS. 5-7, the fp_ffset register 322 specifies the most recent push command stored in the cache of the top recording unit of the stack cache 124. Location of the data. That is, the fp_ffset register 322 specifies the data location of a push command, and this data has not been popped up in the stack within the main memory. In one embodiment, the power supply saturates 27 1294569 14394 twf.doc/m register 322 contains a four-bit value to specify sixteen of the cache lines 206 stored in the topmost record unit of the stack cache 124. The displacement value of one of the double word groups. The control logic circuit 302 monitors the pop-up, push-in, and add to stack pointer instructions to anticipate changes in the heap index register 152 and to make the values of the fp-fset register 322 and the heap The bits [5:2] of the indicator register ία are consistent. In an embodiment, the control logic circuit 〇2 will update the fp when the load, store, or integer unit of the execution unit ii4 分别 respectively indicates a pop-up, push-in, or heap: £#曰&add instruction. 〇 ffset register 322. In another embodiment, when control logic 302 updates fp_ffset register 322, it does not wait for write-back unit 116 to update stack indicator register 152. In this way, j push-in command, stack indicator addition instruction, or other pop-up instructions; the buckle can use the expected value of the stack indicator register 152 without waiting for the write-back unit 116 to update the stack indicator register. After 152, the bits [5:2] of the stack indicator register 152 are obtained. The Stacker 124 also includes a multiplexer 318 that is lightly connected to the fp-〇ffset register and has six inputs. In one embodiment, the sixteen inputs of multiplexer 318 each receive one of sixteen doublewords of sc-data[〇] signal 226. The multiplexer 318 receives the I 〇 chest signal = as a transfer person, and takes one of the sixteen double words of the se-data [G] job, one of which is output at the fp-data signal 398, and the % sequence is quickly popped up. The instructions are supplied to the pop-up instructions, which are described later. The counterfeit cache 124 also includes an arithmetic unit 304 coupled to the control logic 302. The arithmetic unit 3〇4 receives fp_〇fftet information 28 1294569 14394twf.doc/m No. 396. Arithmetic unit 304 also receives a decrement signal 384 from control logic circuit 3〇2. If control logic 3〇2 produces a true value at decrement signal 384, arithmetic unit 304 decrements the value received from fp_〇ffset signal 396 and provides the result to output 372. If the decrementing procedure causes an underflow, the arithmetic unit 304 will generate a true value at the underbit signal 388 and provide it to the control logic 302. Arithmetic unit 304 also receives an increment signal 386 from control logic circuit 3〇2. If control logic 302 generates a true value at increment signal 386, arithmetic unit 304 increments the value received from fp_〇ffset signal 396 and provides the result to output 372. If the incrementing program causes an overflow, the arithmetic unit 304 will generate a true value at the overflow signal 392 and supply it to the control logic circuit 3〇2. Arithmetic unit 304 also receives an add signal 382 from control logic circuit 3〇2. If control logic 302 generates a true value at summing signal 382, arithmetic unit 304 adds the value received from fp_0ffset signal 396 to the value received from add_sp_val signal 394 and provides the result to output 372. If the addition causes an overflow, the arithmetic unit 304 generates a true value at the overflow signal 392. In an embodiment, the add_sp_val signal 394 is provided by an integer unit of the execution unit 114 of FIG. The value of add_sp_val signal 394 is the value specified by adding a value to the stack indicator register I%. The heap cache 124 also includes a multiplexer 316 that is coupled to the two inputs of the fp-〇ffset register 322. The output of the multiplexer 316 is coupled to the input of the 29 1294 shed ftf.doc/m φ_full et register 322. The multiplexer 316 receives the bits [5:2] of the output of the stacked index register 152 at the other input with the output 372 of one input terminal 410. The multiplexer 316 receives the control signal 368 from the control logic circuit 302 as a selection input to select one of the two inputs and output to the fp_〇ffset register 322. The stack cache 124 also includes a multiplexer 326 coupled to the control logic circuit 〇2, having sixteen inputs. Each of the inputs of the multiplexer 326 receives the stack cache 124 and provides one of the sixteen cache lines 206 of the sc_data[15:0] signal 226. The multiplexer 326 selects one of the sixteen sc_data[15:0] signals 226 based on the Writeback_mux-sel signal 328 generated by the control logic circuit 〇2. The output of multiplexer 326 is provided as an input to a writeback line buffer 324. The output of the write back line buffer 324 is provided through the bus 136 to the bus interface unit 118 of FIG. Control logic circuit 302 also generates a writeback_request signal 338, which is also provided to bus interface unit 118. The write back line buffer 324 and the writeback_request signal 338 are used to write the cache line from the stack cache 124 back to the system memory, as described in more detail below. The control logic 302 generates a true value in the allocate_fill_buffer signal 397 to configure a fill buffer to place the cache line into the system memory, or another cache memory from the microprocessor 100. The body obtains the cache line, for example, from the stack cache 124 or a second level cache memory, which will be described later. Control logic circuit 302 also generates a true value on exception event signal 399 to indicate that an exception event has occurred, causing the microprocessor to execute the microprogram 30 12945 sensitive wf. "Exception event handler within code memory 128, details The control logic circuit 302 also generates a spec-sc-load-mux sel signal 391, a normal-sc-load-mux-sel circuit number 393, and

Ll—niux一sel訊號395，後面都會配合圖4解說。請參照圖4，圖4是一個方塊圖，繪示根據於本發明的’圖的第一級資料快取126的多工邏輯電路(muxing logic)。資料快取126包括四個輸入端的多工器4〇2，其輸鲁出提供給圖1的匯流排138。更明確的說，多工器4〇2在輪出端138提供彈出與載入資料給圖丨的執行單元114的載入單元。 _ 多工器402的第一輸入端自圖1的非堆疊快取122接 • 收輪出資料432 ’以提供資料給發自非堆疊快取122的載 : ^程序。多工器402的第二輸入端接收一個有十六個輸入端的多工器404的輸出424 ’以提供資料給自堆疊快取i % =猜測載入程序。多工器搬的第三輸入端接收另一個有 ^六個輸入端的多工器406的輸出426，以提供資料給發。。堆豐快取m的正常（或稱為非猜測）載入程序。多工 =402的第四輸入端接收圖3的fp—data訊號398，以提供賁料給快速彈出程序。多工器4〇4接收-個雙輸入多工器412輸出的快取線 422的十六個雙字組。多工器404根據圖3的實體位址336 的位元[5¾選取快取線422的十六個雙字組其中之一。多工器406接收—個有十六個輸入端的多工器408輸出的快取線428的十六個雙輪。多工器概根據實體位 31 1294569 14394twf.doc/m 址336的位元[5:2]選取快取線428的十六個雙字組其中之一。多工器412的兩個輸入端經由sc—data[l:0]訊號226 接收堆疊快取124最頂端的兩個記錄單位的快取線。多工器412根據圖3的spec—sc—load一mux—sel訊號391選取 sc—data[l:0]訊號226的兩個快取線其中之一做為輸出的訊號422 ’而SpeC-SC一i〇ad—mux__sel訊號391是控制邏輯電路 302 根據 l〇ad—instr 訊號 346、VA一match[l:0]訊號 362、以及sc—MESI[l:0]訊號222的數值產生，細節後述。多工器408的十六個輸入端經由sc—data[15:0]訊號 226,各自接收堆疊快取124的十六個記錄單位的快取線。多工器 408 根據圖 3 的 normal一sc一load一mux—sel 訊號 393 選取sc—data[15:0]訊號226的十六個快取線其中之一做為輸出的訊號428 ’而normal一sc—load mux sel訊號393 是控制邏輯電路302根據load一instr訊號346、 PA一match[15:0]訊號 364、以及 sc—MESI[15:0]訊號 222 的數值產生，細節後述。圖5繪示根據本發明的，發自於圖1當中的堆疊快取 124的快速彈出程序流程圖。流程從步驟5〇2開始。在步驟502，圖1的指令轉譯器1〇6會將彈出指令解碼，圖1的指令排程器108會將彈出指令發給圖丨執行單元114的載入單元。接下來，載入單元會在圖3的p〇Rjnstr 訊號344產生真值。流程進行至步驟5〇4。在步驟504，多工器318從堆疊快取124最頂端的記 32 1294569 14394twf.doc/m 錄早位的快取線SC—data[〇]⑽選取適當的雙字組，以根據圖3 fp一offset暫存器322 _容數值提供办址訊號 398。為回應pop一峨訊號344的真值，圖3的控制賴電路302會在Ll—mux—sel訊號395輸出一數值，促使圖4 的多工器402選取圖3的fp—data訊號观為輸入，然後經由匯流排138提供給執行單元114的載入單元，以供給，的=出指令，然後寫回單元116會將印一她訊號观 Φ 、入至彈出指令所指定的，目1的暫存器組m的其中- • 個，器。例如說，如果彈出指令是X86的RET指;，彈出資料會載人暫存裔組1丨2的指令指標暫存器。另舉一 • 例’如果彈出指令是x86的LEAVE指令，則彈出資料會暫存态組112的EBP暫存器。再舉一例，如果彈出指 ' :疋X86的P〇P指令，則彈出資料會載入暫存器組112當 POP指令所指定的暫存器。由圖5可知，資料是猜測欧地供給載人單％。說翻是因為還不確定稍後在步驟 516w在實體位址336產生的，彈出指令的來源位址，會與 ΐ豐快取124的最頂端記料位所提供給載人單元的彈出貝料的位址相同。另外為了回應p〇pJnstr訊號344的真值，控制邏輯電路302會在圖3的scJlit訊號389產生真 sc—hit訊號389提供到執行單元114之載入單元。流程接著進行到步驟506。在步驟506，控制邏輯電路3〇2會在遞增訊號386產生真，，使得算術單元304遞增fp—offset訊號3%並且於輪出糕372提供遞增的結果，然後控制邏輯電路3〇2會透 33 1294569 14394twf.doc/m 過控制訊號368使多工器316選取這個遞增結果，以載入至圖3的fp_〇ffset暫存器322。接下來，在步驟508，控制邏輯電路3〇2會檢查溢位訊號392以決定步驟506的遞增程序是否造成每一〇饱以暫存态322溢位。也就是說，控制邏輯電路3 指令是否會造成堆疊指標暫存_旨向下一；=出如果是’流程會進入步驟512，否則進入步驟514。在步驟512控制邏輯電路302會在p〇p—%訊號234 產生真值，以彈出堆疊快取124最頂端的記錄單位。彈出最頂端的記錄單位，是為了使堆疊快取124盥的快取-致，因為目前的彈出指令正在將最頂端珊g 所儲存的快取_最後-個雙字轉出系統帥體在一實施例中，步驟512是在步驟518之後進行，正二後述，如此可比較實體位址336與步驟5〇4當中提供資料的記錄單位的sc_tag[0]訊號224數值。在一實施例中，、步驟 5〇4使用的sc—tag[〇]訊號224數值會儲存下來留待後面，步驟518使用。雖然在一實施例中，fp—〇饱的暫存器幻2 是一個雙字組的位移值，以因應雙字組的推入盥彈°出俨令，本發明也包含其他實施例’其中的推入與彈/出資料又小各有不同，例如單字組(word)、位元組(byte)、或四字組會計算彈出指接下來，在步驟514，位址產生器3〇6 令的來源虛擬位址334。接下來，在步驟516，轉譯暫存區3〇8會產生彈出指 34 1294總 twf.doc/m 令的來源實體位址336。接下來，在步驟518，圖3的比較器314的其中之一會比較在步驟516產生的實體位址336，以及圖2的實體 sc—tag[0]訊號 224，以產生圖 3 的 PA—match[〇]訊號 364。接下來，在決策步驟522，控制邏輯電路302會檢查 s:—MESI[0]訊號 222 與 pA—match[0]訊號 364，以決定堆宜快取124的最頂端記錄單位是否有效，以及彈出指令的來源實體位址336是否等於堆疊快取124的最頂端記錄單位的^體位址標籤204，也就是說，實體位址336是否擊中堆疊快取124的最頂端記錄單位。在一實施例中，實體位址336的位元[5··2]也會與Φ—offset訊號396的數值比較’後者是用於選取fp—data訊號398 的雙字組，以Ll-niux-sel signal 395, will be explained later with Figure 4. Please refer to FIG. 4. FIG. 4 is a block diagram showing the muxing logic of the first level data cache 126 according to the present invention. The data cache 126 includes four input multiplexers 4〇2, which are provided to the bus 138 of FIG. More specifically, the multiplexer 4〇2 provides a loading unit at the round-trip 138 that pops up and loads the data to the execution unit 114 of the map. The first input of multiplexer 402 is coupled to the non-stacked cache 122 of Figure 1 to receive data 432' to provide data to the load from non-stacked cache 122: ^ program. The second input of multiplexer 402 receives the output 424' of a multiplexer 404 having sixteen inputs to provide data to the self-stack cache i% = guess loader. The third input of the multiplexer receives another output 426 of the multiplexer 406 having six inputs to provide data to the transmitter. . The normal (or non-guessing) loader of the heap cache. The fourth input of multiplex = 402 receives the fp_data signal 398 of Figure 3 to provide a trick to the fast pop-up procedure. The multiplexer 4〇4 receives sixteen double blocks of the cache line 422 output by the two-input multiplexer 412. The multiplexer 404 selects one of the sixteen double blocks of the cache line 422 according to the bit of the physical address 336 of FIG. The multiplexer 406 receives sixteen two-wheels of the cache line 428 that is output by a multiplexer 408 having sixteen inputs. The multiplexer selects one of the sixteen doubles of the cache line 428 based on the bits [5:2] of the physical bit 31 1294569 14394twf.doc/m address 336. The two inputs of the multiplexer 412 receive the cache lines of the two top recording units at the top of the stack cache 124 via the sc_data[l:0] signal 226. The multiplexer 412 selects one of the two cache lines of the sc_data[l:0] signal 226 as the output signal 422' according to the spec_sc_load-mux-sel signal 391 of FIG. 3 and the SpeC-SC An i〇ad_mux__sel signal 391 is generated by the control logic circuit 302 according to the values of the l〇ad_instr signal 346, the VA-match[l:0] signal 362, and the sc-MESI[l:0] signal 222. . The sixteen inputs of multiplexer 408 each receive a cache line of sixteen record units of stack cache 124 via sc-data[15:0] signal 226. The multiplexer 408 selects one of the sixteen cache lines of the sc_data[15:0] signal 226 as the output signal 428' according to the normal-sc-load-mux-sel signal 393 of FIG. The sc_load mux sel signal 393 is generated by the control logic circuit 302 based on the values of the load-instr signal 346, the PA-match[15:0] signal 364, and the sc-MESI[15:0] signal 222, which will be described later. 5 is a flow chart of a quick pop-up procedure from the stack cache 124 of FIG. 1 in accordance with the present invention. The process begins with step 5〇2. At step 502, the instruction translator 1 图 6 of FIG. 1 decodes the pop-up instructions, and the instruction scheduler 108 of FIG. 1 sends the pop-up instructions to the load unit of the map execution unit 114. Next, the load unit will generate a true value at the p〇Rjnstr signal 344 of FIG. The flow proceeds to step 5〇4. At step 504, the multiplexer 318 selects the appropriate double word from the top line of the stack cache 124, 32 1294569 14394 twf.doc/m, the cache line SC_data[〇] (10), to fp according to FIG. An offset register 322 _ value provides an address signal 398. In response to the true value of the pop signal 344, the control circuit 302 of FIG. 3 outputs a value at the L1-mux-sel signal 395, causing the multiplexer 402 of FIG. 4 to select the fp-data signal view of FIG. Then, the load unit is provided to the loading unit of the execution unit 114 via the bus bar 138 to supply the = instruction, and then the write back unit 116 will print the signal Φ of the her signal to the specified position of the pop-up instruction. Among the registers m are - • , the device. For example, if the pop-up command is the RET finger of the X86; the pop-up data will carry the command indicator register of the temporary bank group 1丨2. Another example: If the pop-up instruction is an x86 LEAVE instruction, the pop-up data will temporarily store the EBP register of the state group 112. As another example, if the P〇P instruction of ':疋X86 is popped up, the popup data will be loaded into the scratchpad group 112 as the temporary register specified by the POP instruction. As can be seen from Figure 5, the data is a guess. This is because it is not certain that the source address of the pop-up instruction generated at the physical address 336 at step 516w will be provided to the manned unit's pop-up material with the top-most record bit of the HSBC cache 124. The address is the same. In addition, in response to the true value of the p〇pJnstr signal 344, the control logic 302 generates a true sc-hit signal 389 to the load unit of the execution unit 114 at the scJlit signal 389 of FIG. The flow then proceeds to step 506. At step 506, the control logic circuit 〇2 will generate true at the increment signal 386, causing the arithmetic unit 304 to increment the fp-offset signal by 3% and provide an incremental result in the round-out 372, and then the control logic circuit 3〇 will pass through. 33 1294569 14394twf.doc/m The over control signal 368 causes the multiplexer 316 to select this incremental result for loading into the fp_〇ffset register 322 of FIG. Next, at step 508, control logic circuit 3 检查 2 will check overflow signal 392 to determine if the incrementing of step 506 causes each of the suffocating states 322 to overflow. That is to say, whether the control logic circuit 3 instruction causes the stack indicator to be temporarily stored is determined to be the next one; = output if yes, the process proceeds to step 512, otherwise to step 514. At step 512, control logic 302 generates a true value at p〇p_% signal 234 to eject the topmost record unit of stack cache 124. The top-most record unit is popped up in order to make the stack cache 124 盥 cache, because the current pop-up instruction is transferring the cache _ last-double word stored in the top-end g In the embodiment, the step 512 is performed after the step 518, which is described later, so that the value of the sc_tag[0] signal 224 of the record unit providing the data in the physical address 336 and the step 5〇4 can be compared. In one embodiment, the value of sc-tag[〇] signal 224 used in step 〇4 is stored and left for later, step 518 is used. Although in an embodiment, the fp-saturated register illusion 2 is a double-word displacement value, the present invention also includes other embodiments in which the double-word group is pushed in. The push-in and the bullet-out data are different, for example, a word, a byte, or a quad will calculate the pop-up finger. Next, in step 514, the address generator 3〇6 The source virtual address 334 of the order. Next, at step 516, translation of the scratchpad 3〇8 will result in a source entity address 336 of the pop-up finger 34 1294 total twf.doc/m command. Next, at step 518, one of the comparators 314 of FIG. 3 compares the entity address 336 generated at step 516 with the entity sc_tag[0] signal 224 of FIG. 2 to produce the PA of FIG. Match[〇] signal 364. Next, at decision step 522, control logic 302 checks s: -MESI[0] signal 222 and pA-match[0] signal 364 to determine if the topmost recording unit of stack cache 124 is valid and pops up. Whether the source entity address 336 of the instruction is equal to the body address tag 204 of the topmost record unit of the stack cache 124, that is, whether the physical address 336 hits the topmost record unit of the stack cache 124. In one embodiment, the bit [5··2] of the physical address 336 is also compared with the value of the Φ-offset signal 396. The latter is a double word for selecting the fp-data signal 398 to

Sint是否正確。如果彈出齡的來源實體位址 m隹宜主快取124的最頂端記錄單位，流程至此結束，也就疋說，知測快速彈出程序提供了正確的彈出資料。否則流程會進入步驟524。㈣坪出貝村否在步驟524,控制邏輯電路3〇 ί ί二:微/理機_執行例外事件處理 =處中，例外事件产^序提供錯誤資料的狀況。在一實施例指標暫存器疊r124 ’並載入堆疊實施例中Γί的料。在一 Iβ二也包括將堆疊快快取線寫回系統記憶且124其中的有效個第一級快取。流程在步驟 35 12945獻— 524結束。到^前可知，也會在後面配合圖1〇詳細說明，圖5 的=速彈出程序可以使彈出指令收卿出㈣，比不會分辨彈出與載人指令的傳統快取記㈣，快上幾個時脈週3期。圖6繪示根據本發明的，推入圖丨當中的堆疊快取i24 的推入程序流程圖。流程從步驟6〇2開始。在步驟602，圖1的指令轉譯器1〇6會將推入指令解碼，然後指令排程器1〇8會發出推入指令至執行單元114 的存放單元。然後存放單元會在圖3訊號342 產生真值。接下來’在步驟604，控制邏輯電路302會在遞減訊號384產生真值，使得算術單元3〇4遞減fp—〇饱以訊號 396，並於輸出端372輸出遞減結果，控制邏輯電路3〇2 會利用控制訊號368使多工器316選取遞減結果，以將它载入圖3的fp一offset暫存322。此外，為了回應push_instr 訊號342的真值，控制邏輯電路3〇2會在sc_hit訊號389 產生真值，以供給執行单元114的存放單元。接下來，在步驟606，位址產生器306會計算推入指令的目標虛擬位址334。接下來，在步驟608，轉譯暫存區308會產生推入指令的目標實體位址336。接下來，在步驟612，圖3的比較器314的其中之一會比較在步驟516產生的實體位址336，以及圖2的實體 scjag[0]訊號 224,以產生圖 3 的 PA_match[0]訊號 364。 36 l294mi,〇c/m 接下來，在步驟614，控制邏輯電路3〇2會檢杳 s:_MESI[0]喊 222 與 PA_match[〇]訊號 364，以決定^ ®快取124的最頂端記錄單位是否有效，並決定推乂指八的目標實體位址336是否等於堆疊快取124的最頂端^ =的^紐_綱，也就是說，實齡址η6是否擊2堆㈣取124时頂軌鮮位。二 == 則會進⑽18。在-實施例中， .w 擊中的5己錄單位不在堆疊快取124的最頂端，取124會在其巾的有效記錄單位寫_統記憶體之後h空。然後流程進入步驟616。在入抢16推人資料會經由sc~write-data訊號216 的取124的最頂端記錄單位，存入實體位址336 、位το [5:2]所指定的快取線2〇6的雙字組位移值。必要，最頂端記錄單位的MESI狀態2〇2會經由 _ 212更新，例如更新為已修改狀態 =-貝料疋來自暫存器組112當中，推入指令所指定的暫。例如說’如果推入指令是χ㈣Call指令推入二=計算自暫存器組112的指令指標暫存器的下一個Is Sint correct? If the source entity address of the pop-up age m is the topmost record unit of the main cache 124, the process ends here, and the test pop-up program provides the correct pop-up data. Otherwise, the process proceeds to step 524. (4) Pingbeibei Village No In step 524, the control logic circuit 3〇 ί ί2: micro/computer _ execution exception event processing = where the exception event production provides the status of the error data. In an embodiment, the index register stack r124' is loaded into the stacking embodiment. The Iβ2 also includes writing the stacked cache lines back to the system memory and 124 of the active first level caches. The process ends in step 35 12945 - 524. As you can see before, you will also be described in detail later in Figure 1. The quick pop-up program in Figure 5 can make the pop-up command clear (4), which is faster than the traditional cache (4) that does not distinguish pop-up and manned commands. Several clocks are in week 3. 6 is a flow chart showing the push procedure of the stack cache i24 pushed into the map according to the present invention. The process begins with step 6〇2. At step 602, the instruction translator 1〇6 of Fig. 1 decodes the push-in instruction, and then the instruction scheduler 1〇8 issues a push-in instruction to the storage unit of the execution unit 114. The storage unit will then generate a true value in Figure 3, signal 342. Next, at step 604, the control logic circuit 302 generates a true value at the decrement signal 384, causing the arithmetic unit 3〇4 to decrement fp_〇 to the signal 396, and output a decrement result at the output terminal 372, and the control logic circuit 3〇2 The control signal 368 is utilized to cause the multiplexer 316 to select the decrement result to load it into the fp-offset temporary storage 322 of FIG. In addition, in response to the true value of the push_instr signal 342, the control logic circuit 3〇2 generates a true value at the sc_hit signal 389 to supply the storage unit of the execution unit 114. Next, at step 606, the address generator 306 calculates the target virtual address 334 of the push command. Next, at step 608, the translation buffer 308 will generate the target entity address 336 of the push command. Next, at step 612, one of the comparators 314 of FIG. 3 compares the entity address 336 generated at step 516 with the entity scjag[0] signal 224 of FIG. 2 to produce the PA_match[0] of FIG. Signal 364. 36 l294mi, 〇c/m Next, in step 614, the control logic circuit 3〇2 checks s:_MESI[0] shout 222 and PA_match[〇] signal 364 to determine the topmost record of the ^® cache 124. Whether the unit is valid, and decides whether the target entity address 336 of the referral finger is equal to the topmost ^= of the stack cache 124. That is, whether the real-age address η6 hits 2 heaps (four) takes 124 times. Track position. Two == will enter (10)18. In the embodiment, the 5 recorded units hit by .w are not at the top of the stack cache 124, and 124 will be empty after the effective recording unit of the towel is written. The flow then proceeds to step 616. The data of the push button 16 will be stored in the topmost record unit of the sc~write-data signal 216, and stored in the physical address 336, the double line of the cache line 2〇6 specified by the bit το [5:2] The word shift value. If necessary, the MESI status 2〇2 of the topmost recording unit will be updated via _212, for example, updated to the modified status =- 贝疋 from the register group 112, the temporary specified by the push instruction. For example, 'If the push command is χ (4) Call command push 2 = Calculate the next command from the register group 112 command indicator register

ΒΝ= ϋ指標。-另舉一例，如果推入指令是x86的 ^ . 9 7，推入貧料就是x86的暫存器組112的EBP =器的内容數值。再舉一例，如果推入指令是的令所指㈣暫存ϋ。餘在_16結束Γ中在决策步驟618，因為推入資料的目標位址336錯過 37 I294H-/m 堆璺=取124，堆疊快取124必須在最頂端配置一個新的圮錄單位，以儲存推入目標位址336所在的快取線。為此堆疊快取124會往下平移一個記錄單位，最底下的記錄單位會移出堆疊快取124。因此，控制邏輯電路3〇2會檢查 sc一MESI[15] 222以決定堆疊快取124最底下的記錄單位疋否有效。如果有效’流程會進入步驟622，否則會進入步驟624。在步驟622，控制邏輯電路3〇2會給堆疊快取124的最底下的記錄單位的寫回做排程，方法是在 writeback—mux一select訊號328產生真值，使多工器326選取sc—data[15]訊號226 ’也就是堆疊快取124的最底下的記錄單位的快取線，以供給寫回線緩衝區324，然後在 writeback—request訊號338產生真值，以請求圖1的匯流排介面單元118將上述的快取線寫回第二級快取或系統記憶體。接下來，在步驟624，控制邏輯電路3〇2會設立push_sc > 訊號232，使堆疊快取124向下平移一個記錄單位，以分別透過8(；一\¥1^一(1&〖3訊號216、3〇」¥1^」&2訊號214、以及SC一write-MESI訊號212存放推入資料和它的位址標籤與MESI狀態。接下來，在步驟626，控制邏輯電路3〇2會配置一個裝填暫存區，使堆疊快取124準備容納推入指令的目標位址336所在的快取線’並且將這個快取線存入堆疊快取 124。在一實施例中，步驟626也包括探視非堆疊快取122 38 1294歡 f.doc/m 流程於步驟628結束^錄早位的推人資料相結合。理媳=請/照圖7，圖7是根據本發明的，® 1的― 二，疊指標加法指令的程序流程圖。如:^ 用:部分行為良好的程式的= 應的彈出指令跟隨著=就ΐ說’每個推人指令都有對 =遞副程式所需的參數(。„^ 憶體C Ϊ言的函式(fUnetkm)參數是利用系統記二，Γ傳遞。4此會執行—連串PUSH指令將參數推，豐，母織令推人—項參數。舉例來說，在吟叫接收五個四位元組參數的函式之前，負責呼叫的 :五次_旨令’將五個參數推入堆疊。然後呼;^ ^LLf ^ ’、將返回位址推人堆疊，並且將控制權 :、、、口 ^耘式。副程式的最後一個指令是RET，它會從堆聂，出返回位址。_呼叫的程式必須恢復參數所^據的= 燮空間，一種方法是連續執行五次POP指令，使堆疊指標回到推入參數之前的數值。然而，由於呼叫的函式並不需要這些參數，大部分的編譯程式（C〇mpiler)都是直接用 39 1294掀一 ADD扣令，將參數佔據的空間大小加回堆疊指標。如此編潭私式就只產生一個ADD指令，而不是五個指令，這樣程式執行會比較快，程式碼也會比較小。在上面的範例中，呼叫的程式會給堆疊指標加上20。這是最常見的，推入與彈出指令不一致的狀況。所以在一實施例中，本發明的快速彈出裝置會找出給堆疊指標加上數值的指令，並且據以調整fp—offset 322的數值。這個流程是從圖7的步驟702開始。在步驟702，圖1的指令轉譯器1〇6將一個目標為堆疊才曰松暫存為152的加法指令解碼，指令排程器1〇8會將加法指令發給執行單元114的整數單元。接著整數單元會在圖3的add—sp一instr訊號3M產生真值。接下來，在步驟704，控制邏輯電路3〇2會在加法訊號382產生真值，使算數單元3〇4將add—印―訊號加上fp—offset訊號396，並且於輸出端372提供結果，控制邏輯電路302會透過控制訊號368使多工器316選取這個結果，以載入至圖3的fp—0ffset暫存器322。接下來，在決策步驟706，控制邏輯電路3〇2會檢查溢位訊號392以決定步驟704的加法程序是否造^ fp一offset暫存态322溢位。也就是說，控制邏輯電路3的會決定上述的加法指令是否會使堆疊指標暫存器152指向另一個快取線。在步驟706,溢位是指加法造成堆疊^標暫存器152+再指向堆疊快取m最頂端的儲存單元的快取線。更明確的說，如果加法造成溢位，堆疊指標暫存器 I294^i,oc/1 152通常指向記憶體位址鄰接於且大於堆疊快取124最頂端的儲存單元的快取線的位址的快取線。因此，堆疊快取 124必須做彈出動作，使正確的快取線處於頂端的記錄單位。在一實施例中，控制邏輯電路302會發出使堆疊指標暫存器152溢位超過一條快取線的加法指令。在此實施例中，在接下來的步驟708，堆疊快取124必須彈出的#絲單位數量N是以下财式計算，假娜取線的大小為 1 位元組·· 一oriset + add—sp一val) / 64 因此，假如N大於1，表示有溢位發生，此時流程合步驟708，否則流程結束。 3 =驟708 ’控制邏輯電路3〇2會在卿—sc訊號说 J真f，以彈出堆疊快取124最頂端的記錄單位。不過快否$決定最頂端的記錄單位的 3 有如果有效’就會將有效餘線寫回系統動作力…非程，就像圖咐如士的錄早位—般。如步驟7〇6所述，在-實祐 |中’ Ν的數值是由計算而來’堆疊快取以合彈ν 而且其中狀態為有效的都會寫回二程在步驟叠快程η根據本發明的，從圖1的堆與8C。簡單的說的,私，。圖8包含圖8八、犯、序；圖8Β緣示堆疊^ f不堆受快取124的猜測载入程且、取124的正常载入程序；圖8C則繪 f.doc/m 示從非堆疊快取122的載入程序。流程從步驟開始。在一般程式中，系統記憶體堆疊有另一個主要用途，就疋配置副程式的區域變數(l〇cal variables)所需的空間。副私式配置空間的方法，是將堆疊指標減去區域變數所需的空間數量。然後區域變數會被載入指令取用，其中載入指令的目標位址是根據堆疊指標計算出的相對位址。因此載入的貧料报可能與最近推入的資料在同一個快取線。此鲁外，，程式报可能執行載入指令，以取用呼叫程式為它推入堆豐的參數。被推入的參數很有可能跨越兩個快取線，也就是說，其中一個推入指令會使堆疊指標指向下一個快取線，就如圖6的步驟618到628所述。於是，某些參數 - 會在堆豐快取124的頂端的第二個記錄單位的快取線，而 ^ 非最頂端的記鮮位，錢錄在__三個記錄單 — 位，諸如此類。因此，在一實施例中，發自堆疊快取124 的，測載入程序會利用這一點，檢查堆疊快取124，看載入資料是否存在於最頂端的兩個記錄單位。藉著直接檢查 • 彳頂端的兩個記錄單位，可避免傳統快取記憶體的列^ 程序(row decode operation)，可以省下一個時脈週期。 ’此外，在一實施例中，猜測載入可以再省下一個時脈週期，方法是用載入指令的虛擬位址334，而非實體位址 33^以比較位址標籤，看載入資料是否存在於最頂端的兩個記錄單位。如果虛擬位址符合最頂端的兩個記錄單位其中之一，載入資料就很可能在剛才擊中的記錄單位，雖^ 因為可能是虛擬重址(virtual aliasing)而不確定。在微處理 42 1294569 14394twf.doc/m 機100的一實施例中，堆疊快取124會在猜測載入時提供錯誤資料，是因為作業系統切換工作(task)，因而更新記憶體分頁資訊(paging information)，造成假的虛擬位址符合。在一貫施例中，特別是在使用堆疊位址區段暫存器（sta淡 address segment register)，例如 χ86 架構的 SS 暫存器，的，處理機100,堆疊快取124在猜測載入時提供錯誤資料，疋因為堆疊區段暫存器受到更新，影響到有效位址計算， •而可能造成假的虛擬位址符合。 # 雖然圖8敘述一個實施例，其中堆疊快取124最頂端的兩個記錄單位會做為猜測載入的候選者而受檢查，猜測载入程序並不限於檢查特定數量的最頂端的記錄單位，本，明也包括其他實施例，其中猜測載入會檢查各個不同數量的最頂端記錄單位。在步驟802，圖1的指令轉譯器106會將一個載入指令解碼，指令排程器1〇8會將載入指令發給執行單元U4 的載入單元。接著載入單元會在圖3的loadjnstr訊號346 > 產生真值。接下來，在步驟804，位址產生器306會計算載入指令的來源虛擬位址334。接下來，在步驟806，圖3的兩個比較器312會比較步驟804產生的虛擬位址336與圖2的虛擬sc_tag[l:0]訊號224 ’以產生圖3的VA—match[l:0]訊號362。接下來，在決策步驟808，控制邏輯電路302會檢查 SeJV[ESI[l:〇]訊號 222 與 VA_match[l:0]訊號 362 以決定 43 I294m,〇c/m 堆疊快取124的最頂端的兩個記錄單位的任何一個是否有載人指令的來源虛擬位址334是否符合堆疊快取乂24的？頂端的兩個記錄單位的位址標籤綱的虛擬部 =山也就是„兒’虛擬位址334是否擊中堆疊快取以的最頂㈣削@記錄單位。如果是，餘會進人步驟812，否則進入圖8Β的步驟824。也I、r ί ^驟812 ’為了回應1〇ad—instr訊號346的真值，控一，^路302會在spec—sc—1〇ad—細、如訊號391產生值’使多工器412選取在決策步驟_決定的，有 :口載之指令的來源虛擬位址334的有效位址標籤204 、’堆豐快取124的兩個快取線（也就是％一她[岡訊號 m 的其中之一，以提供給圖4的訊號422。此外，多工 = 404會從快取、線422選取實體位址说的位* [5:2]指 ^的雙字組，以供給圖4的訊號424。而且，控制邏輯電 3〇2會在圖3的L1—mux—此丨訊號產生一個數值，使 y的多工器術選取輸入424以經由匯流排m提供給，行單元114的載人單元’讀供給載人齡，猶後寫回會將輸入424載入到暫存器組112其中，載入指二Πα的暫存器。由圖SA可知，資料是猜測性的提供 ^入單兀。說猜測是因為尚未確認將在後面的步驟 81^在實體位址336產生的載入指令的來源實體位址，是否會，於從堆疊快取124的最頂端的兩個記錄單位的其中 ^供給載入單元的載入資料的位址。由於步驟別8偵測到虛擬位址334擊中堆疊快取m的最頂端的兩個記錄 44 1294569 14394twf.doc/m 單位，控制邏輯電路302會在SCJlit訊號389產生真值，以提供給執行單元114的載入單元。接下來，在步驟814，轉譯暫存區308會產生載入指令的來源實體位址336。接下來’在步驟816 ’圖3的兩個比較器314會比較步驟814產生的實體位址336與實體sc_tag[i:〇]訊號 224 ’在步驟812，載入資料就是由此猜測性地輸出，以產ΒΝ = ϋ indicator. - As another example, if the push command is x86 ^. 9, 7, push the poor material is the content value of the EBP = device of the x86 register group 112. As another example, if the push command is the order (4) temporary storage. In the _16 end 在 in decision step 618, because the target address 336 of the push data misses 37 I294H-/m stack=take 124, the stack cache 124 must be configured with a new logged unit at the top to Store the cache line where the push target address 336 is located. To do this, the Stack Cache 124 will pan down one record unit and the bottom record unit will move out of the Stack Cache 124. Therefore, the control logic circuit 3〇2 checks sc_MESI[15] 222 to determine whether the bottommost record unit of the stack cache 124 is valid. If it is active, the process proceeds to step 622, otherwise step 624 is entered. At step 622, the control logic circuit 〇2 will schedule the write back of the bottommost record unit of the stack cache 124 by generating a true value in the writeback_mux-select signal 328, causing the multiplexer 326 to select sc. The -data[15] signal 226' is the cache line of the bottommost record unit of the stack cache 124 to supply the write back line buffer 324, and then generates a true value at the writeback_request signal 338 to request the sink of Figure 1. The interface unit 118 writes the cache line described above back to the second level cache or system memory. Next, in step 624, the control logic circuit 3〇2 sets up a push_sc > signal 232 to cause the stack cache 124 to translate downward by one recording unit to respectively transmit 8 (; 1 \ 1 1 1 (1 & 3 The signal 216, 3""1"" & 2 signal 214, and the SC-write-MESI signal 212 store the push data and its address tag and MESI status. Next, at step 626, the control logic circuit 3 2 will configure a load staging area so that the stack cache 124 is ready to accommodate the cache line where the push target's target address 336 is located and store the cache line into the stack cache 124. In an embodiment, the steps 626 also includes accessing the non-stacked cache 122 38 1294 欢 f.doc / m process at step 628 end ^ recording the combination of the push information in the early position. 媳 = please / according to Figure 7, Figure 7 is in accordance with the present invention, ® 1 - 2, the program flow chart of the stacking index addition instruction. For example: ^ Use: Part of the well-behaved program = The pop-up instruction should follow = Say, 'Every pusher has the right = sub-program The required parameters (.^^ Recalling the function of the body C Ϊ ( (fUnetkm) is to use the system to record two, Γ transfer. 4 Will execute - a series of PUSH instructions to push the parameters, abundance, and weave orders to push the human-item parameters. For example, before calling the function that receives five four-byte parameters, the call is responsible: five times Let 'push five parameters into the stack. Then call; ^ ^LLf ^ ', push the return address to the stack, and control: ,,, and port. The last instruction of the subroutine is RET, which The returning address will be returned from the heap. The program to be called must restore the = space of the parameter. One method is to execute the POP instruction five times in succession, so that the stack indicator returns to the value before the parameter was pushed. However, due to The calling function does not need these parameters. Most of the compilers (C〇mpiler) directly use the 39 1294掀 ADD deduction to increase the size of the space occupied by the parameters back to the stacking indicator. Generate an ADD instruction instead of five instructions, so the program execution will be faster and the code will be smaller. In the above example, the calling program will add 20 to the stack indicator. This is the most common, push Inconsistent with pop-up instructions Thus, in one embodiment, the fast pop-up apparatus of the present invention will find an instruction to add a value to the stack indicator and adjust the value of fp-offset 322. This flow begins at step 702 of Figure 7. The instruction translator 1〇6 of FIG. 1 decodes an add instruction that targets the stack to be temporarily stored as 152, and the instruction scheduler 1〇8 sends the add instruction to the integer unit of the execution unit 114. Then the integer unit The true value will be generated in the add_sp-instr signal 3M of FIG. Next, in step 704, the control logic circuit 3〇2 generates a true value at the addition signal 382, causing the arithmetic unit 3〇4 to add the add-print signal to the fp-offset signal 396, and provide the result at the output 372. The control logic 302 causes the multiplexer 316 to select the result via the control signal 368 to load into the fp_0ffset register 322 of FIG. Next, at decision step 706, the control logic circuit 〇2 checks the overflow signal 392 to determine if the add procedure of step 704 creates an offset of the offset state 322. That is, the control logic circuit 3 determines whether the above-described addition instruction causes the stack indicator register 152 to point to another cache line. In step 706, the overflow means that the addition causes the stacking buffer 152+ to point to the cache line of the topmost storage unit of the stack cache m. More specifically, if the addition causes an overflow, the stack indicator register I294^i, oc/1 152 usually points to the address of the cache line whose memory address is adjacent to and larger than the topmost storage unit of the stack cache 124. Cache line. Therefore, the stack cache 124 must be ejected so that the correct cache line is at the top of the record unit. In one embodiment, control logic 302 issues an add instruction that causes stack indicator register 152 to overflow more than one cache line. In this embodiment, in the next step 708, the number of line units N that the stack cache 124 must pop out is calculated by the following formula, and the size of the false line is 1 byte. · An oriset + add-sp One val) / 64 Therefore, if N is greater than 1, it indicates that an overflow occurs, and the flow is combined with step 708, otherwise the flow ends. 3 = step 708 ’ control logic circuit 3〇2 will say J true f in the qing-sc signal to pop up the topmost recording unit of the stack cache 124. However, if you decide that the top 10 of the record unit is valid, you will write the effective remaining line back to the system. The action is not the same as the record. As described in step 7〇6, the value of '' in '-yous| is calculated by the calculation of 'stacking cache' and the state of being valid is written back to the second step in the step stacking speed η according to this Invented from the stack of Figure 1 with 8C. Simply said, private. Figure 8 contains Figure 8 VIII, sin, order; Figure 8 Β 示堆叠 ^ ^ f f f f f f f f f 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 124 Loader for non-stacked cache 122. The process begins with a step. In the general program, the system memory stack has another main purpose, that is, the space required to configure the subprogram's area variables (l〇cal variables). The method of de-private configuration space is to subtract the amount of space required for the area variable from the stack indicator. The region variable is then loaded into the instruction fetch, where the target address of the load instruction is the relative address calculated from the stacking metric. Therefore, the poor information reported may be on the same cache line as the recently pushed data. In addition to this, the program may execute a load command to retrieve the parameters of the heap for the calling program. The pushed parameter is likely to span two cache lines, that is, one of the push commands will cause the stack indicator to point to the next cache line, as described in steps 618 through 628 of Figure 6. Thus, some parameters - will be in the cache line of the second record unit at the top of the heap cache 124, and ^ is not the topmost record bit, the money is recorded in the __ three record sheets - and so on. Thus, in one embodiment, the loader from the stack cache 124 takes advantage of this and checks the stack cache 124 to see if the loaded data is present at the top two record units. By directly checking the two recording units at the top of the 彳, the traditional cache memory row decode operation can be avoided, saving one clock cycle. In addition, in an embodiment, the guess load can save another clock cycle by using the virtual address 334 of the load instruction instead of the physical address 33^ to compare the address tags to see the load data. Whether there are two record units at the top. If the virtual address matches one of the top two record units, the load data is likely to be in the record unit just hit, although ^ is not determined because it may be virtual aliasing. In an embodiment of the microprocessor 42 1294569 14394 twf.doc/m machine 100, the stack cache 124 provides error data during guess loading because the operating system switches tasks and thus updates memory paging information (paging Information), resulting in a false virtual address match. In a consistent embodiment, particularly when using a stack address segment register, such as an SS register of SS architecture, processor 100, stack cache 124 is guessing load time. Error data is provided, because the stack segment register is updated, affecting the effective address calculation, and may cause false virtual addresses to match. # Although Figure 8 illustrates an embodiment in which the top two record units of the Stack Cache 124 are checked as candidates for guess loading, the guess loader is not limited to checking a certain number of topmost record units. Others are also included in this, where the guess load checks for a different number of topmost record units. At step 802, the instruction translator 106 of Fig. 1 decodes a load instruction, and the instruction scheduler 〇8 sends the load instruction to the load unit of the execution unit U4. Then the load unit will generate a true value in the loadjnstr signal 346 > of Figure 3. Next, at step 804, the address generator 306 calculates the source virtual address 334 of the load instruction. Next, at step 806, the two comparators 312 of FIG. 3 compare the virtual address 336 generated in step 804 with the virtual sc_tag[l:0] signal 224' of FIG. 2 to produce the VA-match[l: of FIG. 0] Signal 362. Next, at decision step 808, control logic 302 checks SeJV[ESI[l:〇] signal 222 and VA_match[l:0] signal 362 to determine 43 I294m, 最c/m stack cache 124 topmost Does any of the two record units have the source virtual address 334 of the manned command conforming to the stack cache 24? The top part of the two record units of the address label class virtual part = mountain is the "child" virtual address 334 whether to hit the top of the stack cache (four) cut @ record unit. If yes, the rest will enter step 812 Otherwise, it enters step 824 of Figure 8A. Also I, r ί ^ 812 ' In response to the true value of 1〇ad-instr signal 346, control one, ^ path 302 will be in spec-sc-1 〇ad-fine, such as The signal 391 generates a value 'to enable the multiplexer 412 to be selected in the decision step _, which includes: a valid address tag 204 of the source virtual address 334 of the command of the port, and two cache lines of the stack cache 124 ( That is, % one of her [one of the gang signals m, to provide the signal 422 of Figure 4. In addition, multiplex = 404 will select the bit of the physical address from the cache, line 422 * [5:2] The double word group of ^ is supplied to the signal 424 of Fig. 4. Moreover, the control logic 3〇2 will generate a value in L1_mux of Fig. 3, so that the multiplexer of y selects the input 424 to pass The bus bar m is provided to the manned unit of the row unit 114 to read the supply age, and the write back will load the input 424 into the register group 112. Loading the register of the binary Πα. As can be seen from the graph SA, the data is a speculative provisioning unit. The guess is because the loading instruction generated at the physical address 336 will not be confirmed in the following step 81^. The source entity address, whether or not, is supplied to the address of the load data of the load unit from the top two record units of the stack cache 124. Since the step 8 detects that the virtual address 334 hits The top two records 44 1294569 14394 twf.doc/m of the stack cache m, the control logic 302 will generate a true value at the SCJlit signal 389 to provide the load unit to the execution unit 114. Next, at step 814 The translation buffer 308 will generate the source entity address 336 of the load instruction. Next, at step 816, the two comparators 314 of FIG. 3 compare the entity address 336 generated by step 814 with the entity sc_tag[i:〇 ] Signal 224 'At step 812, the loaded data is outputted speculatively to produce

生對應的PA—match[l:0]訊號364。接下來，在決策步驟818，控制邏輯電路302會檢查對應於在步驟812猜測提供載入資料的堆疊快取124的記錄單位的PA一match[l:0]訊號364，以決定載入指令的來源實體位址336是否等於上述的記錄單位的實體位址標籤 2〇4，也就是說，實體位址336是否擊中記錄單位。如果載入指令的來源位址336擊中堆疊快取124的猜測記錄單位1流ΐ就此結束，也就是說，猜測载入程序提供了正確的彈出資料。否則，流程進入步驟822。在步驟822，控制邏輯電路3〇2會在例外事件訊號39< 產生真值’使微處理機刚執行例外事件處理程式，以處 if,人程序提供錯誤㈣的航。㈣事件處理㈣ :使載入指令接_正確㈣。在—實施财，例外事 =:ί4Ϊ從非堆疊快取122或系統記憶體或L2快取載入正確貝料。流程在步驟822結束。㈣ΓίΓ知，後面也會參照圖11詳加說明，圖8Α的猜測載入程序使載人資顺供給載人指令，可以轉統^ 45 c/m 12945激_ 取記憶體快上幾個時脈週期。在圖8B的步驟824，轉譯暫存區3〇8會產生載入指令的來源實體位址336。接下來，在步驟826，圖3的比較器314會比較步驟 =4產生的實體位址336以及堆疊快取124的十六個記錄單位所各自對應的實體sc_tag[15叫訊號224，以產生 PA_match[15:0]訊號 364。接下來，在步驟828，控制邏輯電路302會檢查 s:JV[ESI[15:0]訊號 222 與 PA—match[15:0]訊號 364，以決定堆疊快取124是否有記錄單位的狀態為有效，以及載入才曰令的來源實體位址336是否等於有效的記錄單位的實體位址標籤204，也就是說，實體位址336是否擊中堆疊快取124。如果是，流程會進入步驟832，否則會進入圖8C 的步驟834。在步驟832，為回應l〇ad一instr訊號346的真值，並且回應載入指令的來源虛擬位址334錯過堆疊快取124最頂碥的兩個記錄單位，以及回應載入指令的來源實體位址 336擊中堆疊快取124，控制邏輯電路3〇2會在 normal—scjoad一mux一sel訊號393產生一個數值，使多工器408選取已經在步驟828決定，有等於載入指令的來源 λ體位址336的有效實體位址標鐵204的堆叠快取124的十六個快取線（也就是sc一data[15:0] 226)的其中之一，以輪出於圖4的訊號428。此外，多工器406會從快取線 428選取實體位址336的位元[5:2]指定的雙字組，以輸 46 1294569 14394twf.doc/m 出於圖4的訊號426。而且控制邏輯電路搬會在 L1 一mux—seUfU虎395產生一個數值，使多工器4〇2選取輸 ^ 426卩經由匯流排138提供給執行單元川的載入單 =以供給載入指令，稍後寫回單元ιΐ6會將訊號426的 :載人至載人指令所指定的，圖丨的暫存器組ιΐ2其中田個暫存器。由於步驟US偵測到實體位址336擊中堆 :陕取124 ’控制邏輯電路3〇2會在sc—⑽訊號389產生 _ ^以提供給執行單元114的载入單元。流程在步驟832 w 結束。，此可知，後面也會配合圖12詳加解釋，圖8B的發 ^疊快取124的正常載入程序使得載入資料提供給載入 - 曰7，可以比傳統的快取記憶體，例如非堆疊快取122， . 快上至少一個時脈週期。在圖8C的步驟834，非堆疊快取122會接收圖8B的步驟82=產生的實體位址336的索引部分(index_㈣，並且接著對索引進行列解碼，以在麵疊快取122選取一 • 列，或稱為一個欄集合(set of ways)。接下來，在步驟836，非堆疊快取122比較步驟824 產生的實體位址336的高位部分，或標籤部分，以及步驟 834選取的集合其中每一攔(wa力的實體位址標籤。接下來，在決策步驟838，非堆疊快取122會檢查步驟836的比較結果以及受選取的攔的有效位元，以決定載 =的實體位址336是否擊中非堆疊快取122。如果擊中，流程會進入步驟842，如果沒有，流程則會進入步驟844。 47 twf.doc/m 1294569 在，驟,，非堆疊快取122會從它被實體位址336 擊中的記料位提供載人資料。流程在步驟842結束。在步驟844，由於㈣咖&定載入的實體位址336 錯過非堆疊快取122，非堆疊快取122會在其巾配置-個記錄單位’以載入錯過的載入指令的實體位址336所在的快取線。，下來，在步驟846，非堆疊快取122會從系統記憶體或第一級快取，將錯過的快取線載入步驟844配置的記錄單位。、接下來’在步驟848，非堆疊快取122會從步驟846 載入的快取線提供載入資料。流程在步驟848結束。在一實施例中，圖8C的步驟834到848是以傳統快取記憶體的方式執行。也就是說，圖8C繪示傳統的非堆豐快取122，在載入實體位址336錯過堆疊快取124時的傳統載入程序。請參照圖9，圖9為根據本發明的，進入圖1的第一級快取126的存放程序流程圖。流程從步驟9〇2開始。在步驟902，圖1的指令轉譯器1〇6會將存放指令解碼’指令排程器108會將存放指令發給執行單元114的存放單元。緊接著，存放單元會在圖3的storejnstr訊號348 產生真值。接下來，在步驟904，位址產生器306會計算存放指令的目標虛擬位址334。接下來，在步驟906，轉譯暫存區308會產生存放指 48 f.doc/m 令的目標實體位址336。接下來，在步驟908，圖3的比較器314會比較步驟 906產生的實體位址336與堆疊快取124的十六個記錄單位各自對應的實體sc—tag[15:0]訊號224，以產生 PA—match[15:0]訊號 364 〇接下來，在決策步驟912，控制邏輯電路3〇2會檢查 scJV[ESI[l5:〇]訊號 222 與 PA—match[15:0]訊號 364 以決The corresponding PA-match[l:0] signal 364 is generated. Next, at decision step 818, the control logic circuit 302 checks the PA-match[l:0] signal 364 corresponding to the record unit of the stack cache 124 that provided the load data at step 812 to determine the load instruction. Whether the source entity address 336 is equal to the physical address label 2〇4 of the above-mentioned recording unit, that is, whether the physical address 336 hits the recording unit. If the source address 336 of the loaded instruction hits the guessing record unit 1 of the stack cache 124, the flow ends, that is, the guess loader provides the correct pop-up data. Otherwise, the flow proceeds to step 822. At step 822, control logic circuit 3 会 2 will generate a true value at the exception event signal 39 < cause the microprocessor to execute the exception event handler for the if, human program to provide the error (four) of the flight. (4) Event Processing (4): Make the load instruction _ correct (4). In the implementation of the money, the exception = = ί4 Ϊ from the non-stack cache 122 or system memory or L2 cache to load the correct material. Flow ends at step 822. (4) ΓίΓ知, will be described later with reference to Figure 11, the guessing loader of Figure 8Α enables the manned personnel to supply manned instructions, can be transferred to ^ 45 c / m 12945 _ _ take the memory faster on several clocks cycle. In step 824 of Figure 8B, translation of the scratchpad 3〇8 will result in the source entity address 336 of the load instruction. Next, at step 826, the comparator 314 of FIG. 3 compares the entity address 336 generated by step = 4 and the entity sc_tag [15 call signal 224 corresponding to the sixteen record units of the stack cache 124 to generate a PA_match. [15:0] Signal 364. Next, at step 828, control logic 302 checks s:JV[ESI[15:0] signal 222 and PA-match[15:0] signal 364 to determine if stack cache 124 has a record unit status of Valid, and whether the source entity address 336 of the load command is equal to the physical address tag 204 of the valid record unit, that is, whether the physical address 336 hits the stack cache 124. If so, the flow proceeds to step 832, otherwise it proceeds to step 834 of Figure 8C. In step 832, in response to the true value of l〇ad-instr signal 346, and in response to the source of the load instruction virtual address 334 misses the top two records of the stack cache 124, and the source entity that responds to the load instruction The address 336 hits the stack cache 124, and the control logic circuit 3〇2 generates a value in the normal-scjoad-mux-sel signal 393, causing the multiplexer 408 to select the source that has been determined in step 828, which is equal to the source of the load instruction. The effective physical address of the λ body address 336 is one of the sixteen cache lines of the stack cache 124 (ie, sc_data[15:0] 226) to round the signal of FIG. 428. In addition, multiplexer 406 will select the doubleword specified by bit [5:2] of physical address 336 from cache line 428 to input 46 1294569 14394 twf.doc/m for signal 426 of FIG. Moreover, the control logic circuit moves a value in the L1-mux-seUfU tiger 395, so that the multiplexer 4〇2 selects the load 426 and supplies the load order to the execution unit via the bus 138 to supply the load command. Write back to the unit ιΐ6 later will be the signal 426: the manned to the manned command specified, the map of the register group ι ΐ 2 of the field register. Since the step US detects that the physical address 336 hits the heap: the capture 124' control logic circuit 3〇2 generates _^ at the sc-(10) signal 389 to provide the load unit to the execution unit 114. The process ends at step 832w. This can be seen later, as will be explained in detail with FIG. 12, the normal loading procedure of the stacking cache 124 of FIG. 8B allows the loading data to be supplied to the loading-曰7, which can be compared to a conventional cache memory, for example. The non-stack cache 122, . is fast on at least one clock cycle. At step 834 of FIG. 8C, the non-stack cache 122 receives the index portion of the generated physical address 336 of step 82 of FIG. 8B (index_(4), and then performs column decoding on the index to select one in the overlay buffer 122. Columns, or a set of ways. Next, at step 836, the non-stack cache 122 compares the high-order portion of the physical address 336 generated by step 824, or the label portion, and the set selected in step 834. Each block (the physical address tag of the wa force. Next, at decision step 838, the non-stack cache 122 checks the comparison result of step 836 and the valid bits of the selected block to determine the physical address of the load = 336 whether to hit the non-stack cache 122. If hit, the process will proceed to step 842, if not, the process will proceed to step 844. 47 twf.doc/m 1294569 In, the non-stack cache 122 will be from it The bearer data hit by the physical address 336 provides manned data. The flow ends at step 842. At step 844, the non-stack cache is missed because the (four) coffee & loaded physical address 336 misses the non-stack cache 122 122 will be configured in its towel - a record Bit 'to load the cache line where the physical address 336 of the missed load instruction is located. Down, in step 846, the non-stack cache 122 will cache from the system memory or the first level, and the cache will be missed. The line loads the record unit configured in step 844. Next, at step 848, the non-stack cache 122 will provide the load data from the cache line loaded in step 846. The flow ends at step 848. In an embodiment, Steps 834 through 848 of Figure 8C are performed in the manner of conventional cache memory. That is, Figure 8C illustrates the conventional non-stack cache 122, the tradition of missing the stack cache 124 when loading the physical address 336 Referring to Figure 9, Figure 9 is a flow diagram of a deposit procedure for entering the first stage cache 126 of Figure 1 in accordance with the present invention. The flow begins with step 9: 2. At step 902, the instruction translation of Figure 1 The device 1〇6 will decode the store instruction. The instruction scheduler 108 will send the store instruction to the storage unit of the execution unit 114. Then, the storage unit will generate a true value in the storejnstr signal 348 of Fig. 3. Next, in the step 904, the address generator 306 calculates the storage instruction The target virtual address 334. Next, at step 906, the translation buffer 308 will generate the target entity address 336 of the store pointer f.doc/m. Next, at step 908, the comparator 314 of FIG. The physical sc_tag[15:0] signal 224 corresponding to the physical address 336 generated by the step 906 and the sixteen recording units of the stack cache 124 is compared to generate a PA-match[15:0] signal 364. Down, in decision step 912, the control logic circuit 3〇2 checks the scJV[ESI[l5:〇] signal 222 and the PA-match[15:0] signal 364 to determine

定堆疊快取124是否有狀態為有效的記錄單位，以及存放指令的目標實體位址336是否等於堆疊快取124當中，一個有效的記錄單位的實體位址標籤2〇4，也就是說，實體位址336是否擊中堆疊快取124。如果是，流程會進入步驟914，否則會進入步驟916。Whether the fixed stack cache 124 has a record unit whose status is valid, and whether the target entity address 336 of the store instruction is equal to the stack address cache 124, a valid record unit physical address label 2〇4, that is, the entity Whether address 336 hits stack cache 124. If so, the process proceeds to step 914, otherwise step 916 is entered.

外在步驟914，存放資料會放入決策步驟912當中位址符合的堆疊快取124的有效記錄單位，存放資料是透過 s^_:rite_data訊號216存放至實體位址336的位元[5:2] 指定的快取線206的雙字組位移值。如有必要，最頂端記錄單位的MESI狀態2G2會透過se—writeJvlESI訊號212 例如更新為已修改狀態。存放資料來自存放指令所曰疋的暫存器或#己憶體位置。例如，如果存放指令是x86 指令，指個—般用途暫存器為資料來源，存貝二就在暫存驗m當巾，受Μ〇)ν指令的來源運算 +&田^的暫存器。由於步驟912彳貞测到實體位址336擊中隹”’控制邏輯電路3〇2會在％，訊號—產、，提供給執行單％ 114的存放單元。餘在步驟914 49In an external step 914, the stored data is placed in the valid record unit of the stack cache 124 where the address matches in the decision step 912, and the stored data is stored in the bit address of the physical address 336 through the s^_:rite_data signal 216 [5: 2] The double word shift value of the specified cache line 206. If necessary, the MESI status 2G2 of the topmost recording unit is updated to the modified state by, for example, the se-writeJvl ESI signal 212. The stored data comes from the location of the scratchpad or #存忆。. For example, if the deposit instruction is an x86 instruction, the general purpose register is the data source, and the memory is stored in the temporary storage, and the source of the ν instruction is +& Device. Since step 912 detects that physical address 336 hits 隹" control logic circuit 3 〇 2 will be at %, signal-produced, supplied to the storage unit of execution unit % 114. The remaining step 914 49

Jm 結束。一在步驟f 16，步驟9〇6產生的實體位址幻6的標籤部分會與非堆®快取122當巾，實體位址336的索引部分選取的集合的每一攔的實體標籤做比較。接下來，在決策步驟918，控制邏輯電路302會檢查 non-sc—hit矾號366以決定存放指令的目標實體位址336 是否擊中非堆疊快取122。如果擊中，流程會進入步驟 922，否則會進入步驟924。在步驟922，存放資料會被存放入非堆疊快取122在㈣918受選取的集合當中’有效且比較結果相等的欄。流程在步驟922結束。在步驟924,由於決策步驟918判定存放實體位址336 錯過f堆疊快取122，非堆疊快取122會在其中配置一個記錄單位，以容納存放指令的實體位址336所在的快取線。、接下來，在步驟926，非堆疊快取122會從系統記憶體或第一級快取，將前面錯過的快取線載入在步驟Μ#配置的非堆疊快取122的記錄單位。接下來，在步驟928，非堆疊快取122會將存放資料存放入在步驟926載入的快取線。流程在步驟928結束。在一實施例中，圖9的步驟902到906以及步驟916 到928都是以傳統快取記憶體的方式進行。也就是說，步，902到906以及步驟916到928是存放位址336錯過^ s:快取124時，在傳統的非堆疊快取丨22進行的傳統存放程序。 50 1294歡㈤m 固川，圖10疋根據本發明的，圖5所繪示， ^堆疊快取124進行快速彈出程相時賴。圖1G包括四 ^丁(column)，標示為i到4，對應於微處理機ι〇〇的四個 1 脈也包括五铜(row)，各自代表微處理機 ι〇〇β的-個動作或結果。圖1G的每個行列交會處的方格，白就ί「彈出」’以指示彈出指令在微處理機100 的g線(pipeline)中所在的位置。在時脈週期1，根據圖1G的第—列，圖！的執行單元 1H的載入單元會在圖3的p〇p_instr訊號344產生真值，以請求彈出資料給彈出指令，就如圖5的步驟搬所示。在時脈週期2，根據第二列，堆疊快取124 i-offsl訊號396財的最頂端記錄單位的快取線^供貝料給彈出指令，就如圖5的步驟5〇4所示。更明確的說，多工器318;會從堆疊快取124的最頂端記錄單位提供在 sc_data[0]職226的十六個雙字組當巾，選取电〇訊號396指定的雙字組398，而且多工器搬會選取电―杨，入二8。此外，堆疊快取124會以^⑽訊號谢通知載入早TL有彈出指令發生擊中。也就是說，堆疊快取以會通知載人單元，彈出指令的資料正位於堆疊快取124。如同前面配合圖5的說明，s〇it訊號389的猜測的’因為還沒確定稍後會在時脈獅3產生的彈= 令來源位址’會等於將在時脈職3從堆疊快取124的^ 頂端記錄單位提供給載人單元㈣出資料的位址。在施例中，sc一hit職389通知載入單元有彈出指令發生^ 51 I2945369fd〇c/m 甲，會經過圖 π、MbSI[0] w 双证凡杷關，如此在，頂端的記錄單元為有效時，堆疊快取124才會通知載入單元有彈純令發生擊巾。也就是說，儘管控制邏輯電路302在告知彈出擊中之前不會確認位址符合，至少會確認堆疊快取m的最頂端記錄單位為有效。曰在時脈週期2，根據第三列圖3的虛擬位址334，就如彈出iUri ’根據第四列，轉譯暫存區柳會產生 == = =圓5的步_所示。 _堆疊快取m提==控制邏輯電路302會偵步驟训至524所了不正確的彈出資料，就如圖5的 13可知，快速彈出程料給彈出指令，比傳統记憶體快上幾個時脈週、比較圖10與後面會說明的圖可以使第—級資料快取126提供資 =不會區分彈出與•指令的快取朋0 ▼ · 賞體位址336沾a - rr m 取資料的雙字组，的位兀[5:2]是用來立脈週^提供，而不是在=脈=^。396，而資料是在日個行，標示為的時序圖。圖η包. 期。圖11也包括六個列，ς自;^理機刚的四個時脈沒作或結果。圖u的每個 ϋ理機1GG的—個鸯均處的方格，不是空白就男 52 1294械 4twf.doc/m 「載入」，以指示載入指令在微處理機1〇()的管線中所在的位置。在時脈週期1，根據圖11的第一列，圖1的執行單元 114的載入單元會在圖3的i〇ad-instr訊號346產生真值，以請求載入資料給載入指令，就如圖8的步驟8〇2所示。在時脈週期2，根據第二列，位址產生器306會計算圖3的虛擬位址334，就如圖8的步驟804所示。在時脈週期3，根據第三列，圖3的比較器312會進行虛擬標籤比較，以產生VA一match[1:〇]訊號362，就如圖8的步驟806所示。此外，控制邏輯電路3〇2會根據 VA—match[l:0]訊號 362 與圖 2 的 scJVIESI[l:0]訊號 222 產生spec—scjoad一mux—sel訊號391，如同圖8的步驟812 所示。此外，堆疊快取124會以sc—hit訊號389通知載入單元有載入指令發生擊中，就如圖8的步驟812所示。也就是說，堆疊快取124會通知載入單元，載入指令所需的資料正位於堆疊快取124。如同前面關於圖8的說明，擊中指示是猜測的，因為還沒確定在時脈週期3產生的載入指令的來源實體位址336，會等於將在時脈週期4從堆疊快取124 k供給載入早元的載入貧料的位址。在時脈週期3，根據第四列，轉譯暫存區308會產生載入指令的來源實體位址336，就如圖8的步驟814所示。在時脈週期4，根據第五列，載入資料會提供給載入單元’就如圖8的步驟812所示。更明確的說，圖$的多工器412會根據spec一sc一load—mux一sel訊號391選取 53 1294569 14394twf.doc/m sc—datanw]訊號226的兩個快取線其中之—，多工，根據實體位址336的位元[5:2]選取合 =4 多工器402會選取輸入424。又予、、且，而在時脈週期4，根據第六列，控測到堆疊快取124提供了工/ 2會偵驟議到822所示。錯與的載入資料。就如圖8的步比較圖11與後面會說明的圖13可知，猜 ::使第-級資料快取126提供資料給載入指令载的快取記憶體快上幾個時脈週期。比傳統請參照圖12’圖12是根據 ==124進行正常載入程序，二=程序的時序圖。圖12包括五個行，入私處理機 loo ## ^ ^Ν ί應於微表微處理機二五個列，各自代處的方格，不是空白就β ί，果。圖12:的每個行列交會處理機刚的管線中所二ίί」。’以指不載入指令在微 114 όΓ載!’根據81 12的第—列’圖1的執行單元以姓^圖3的1〇心贈訊號346產生真值’ 給载入指令，就如圖8的步驟搬所示。在寺脈週期2，根據第二列Jm is over. In step f 16, the label portion of the entity address phantom 6 generated in step 〇6 is compared with the non-heap® cache 122, the physical label of each block of the selected set of the index portion of the physical address 336. . Next, at decision step 918, control logic 302 checks non-sc-hit apostrophes 366 to determine if the target entity address 336 of the store instruction hits non-stack cache 122. If it is hit, the flow proceeds to step 922, otherwise it proceeds to step 924. At step 922, the deposited data is stored in a column of non-stacked caches 122 that are valid and equal in the selected set of (four) 918. The process ends at step 922. At step 924, since decision step 918 determines that the storage entity address 336 misses the f-stack cache 122, the non-stack cache 122 configures therein a record unit to accommodate the cache line where the entity address 336 of the instruction is located. Next, at step 926, the non-stack cache 122 will fetch from the system memory or the first level, and load the previously missed cache line into the record unit of the non-stack cache 122 configured in step Μ#. Next, at step 928, the non-stack cache 122 stores the deposited data into the cache line loaded at step 926. The process ends at step 928. In one embodiment, steps 902 through 906 and steps 916 through 928 of FIG. 9 are all performed in a conventional cache memory. That is, steps 902 through 906 and steps 916 through 928 are conventional deposit procedures performed on the conventional non-stack cache 22 when the address 336 misses the s: cache 124. 50 1294 Huan (5) m Gu Chuan, Fig. 10疋 According to the present invention, as shown in Fig. 5, the stacking cache 124 performs a fast pop-up phase. Figure 1G includes four columns, labeled i to 4, and the four pulses corresponding to the microprocessor ι〇〇 also include five rows, each representing the action of the microprocessor ιβ. Or the result. The square of each row and column intersection of Figure 1G is "popped" to indicate where the pop-up instruction is located in the g-line of microprocessor 100. In the clock cycle 1, according to the first column of Figure 1G, Figure! The load unit of the execution unit 1H will generate a true value in the p〇p_instr signal 344 of FIG. 3 to request pop-up information to the pop-up instruction, as shown in the step of FIG. In the clock cycle 2, according to the second column, the cache line of the topmost recording unit of the cached 124 i-offsl signal 396 is supplied to the pop-up command, as shown in step 5〇4 of FIG. More specifically, the multiplexer 318; will provide the sixteen double-words in the sc_data[0] 226 from the topmost recording unit of the stack cache 124, and select the double-word 398 specified by the electronic signal 396. And the multiplexer will choose the electric Yang, enter the second 8. In addition, the stack cache 124 will be notified by ^(10) signal loading. The early TL has a pop-up command hit. That is to say, the stack cache will notify the manned unit that the data of the pop-up instruction is located in the stack cache 124. As previously explained in conjunction with Figure 5, the guess of the s〇it signal 389 'because it has not been determined that the bomb generated by the lion 3 will be later = the source address will be equal to the cached 3 will be cached from the stack. The top record unit of 124 is provided to the address of the manned unit (4) out of the data. In the example, the sc-hit job 389 informs the loading unit that a pop-up instruction occurs. ^ 51 I2945369fd〇c/m A, will pass through the figure π, MbSI[0] w double certificate, so at the top of the recording unit When it is valid, the stack cache 124 will notify the loading unit that there is a bullet to produce a scarf. That is, although the control logic circuit 302 does not confirm the address match before notifying the pop-up hit, at least confirms that the topmost record unit of the stack cache m is valid.曰 In clock cycle 2, according to the virtual address 334 of the third column of Figure 3, just like the pop-up iUri ’ according to the fourth column, the translation of the temporary storage area will produce == = = step 5 of circle 5. _Stacking cache m == control logic circuit 302 will detect the step to the 524 incorrect pop-up data, as shown in Figure 5, 13 quickly pop-up instructions to the pop-up instructions, faster than the traditional memory The clock cycle, the comparison chart 10 and the figure that will be described later can make the first-level data cache 126 provide the cost = the quick pop-up and the instruction-free cache 0 ▼ · the tour address 336 a - rr m The double word of the data, the position [5:2] is used to provide the pulse of the week ^, not = pulse = ^. 396, and the data is in the line of the day, marked as a timing chart. Figure η package. Period. Figure 11 also includes six columns, from which the four clocks of the processor have not been made or result. Figure u is a box of each of the processors 1GG, not blank, male 52 1294 4twf.doc / m "load" to indicate the load command in the microprocessor 1 () The location in the pipeline. In the clock cycle 1, according to the first column of FIG. 11, the loading unit of the execution unit 114 of FIG. 1 generates a true value in the i〇ad-instr signal 346 of FIG. 3 to request loading of the data into the load instruction. This is shown in step 8〇2 of Fig. 8. In clock cycle 2, according to the second column, address generator 306 calculates virtual address 334 of FIG. 3, as shown in step 804 of FIG. In clock cycle 3, according to the third column, comparator 312 of FIG. 3 performs a virtual tag comparison to generate a VA-match[1:〇] signal 362, as shown in step 806 of FIG. In addition, the control logic circuit 3〇2 generates a spec-scjoad-mux-sel signal 391 according to the VA-match[l:0] signal 362 and the scJVIESI[l:0] signal 222 of FIG. 2, as in step 812 of FIG. Show. In addition, the stack cache 124 will notify the load unit with a sc-hit signal 389 that a load command hit has occurred, as shown in step 812 of FIG. That is, the stack cache 124 notifies the load unit that the data required to load the instruction is located on the stack cache 124. As previously described with respect to Figure 8, the hit indication is guessed because the source entity address 336 of the load instruction generated during clock cycle 3 has not been determined to be equal to 124 k from the stack cache at clock cycle 4. Supply the address of the poor material loaded in the early yuan. In clock cycle 3, according to the fourth column, the translation buffer 308 will generate the source entity address 336 of the load instruction, as shown in step 814 of FIG. In clock cycle 4, according to the fifth column, the load data is provided to the load unit as shown in step 812 of FIG. More specifically, the multiplexer 412 of FIG. $ selects two cache lines of 53 1294569 14394 twf.doc/m sc-datanw] signal 226 according to spec-sc-load-mux-sel signal 391. If the multiplexer 402 is selected according to the bit [5:2] of the physical address 336, the input 424 is selected. In addition, and in the clock cycle 4, according to the sixth column, it is detected that the stack cache 124 provides the work / 2 will be detected to 822. Wrong and loaded data. As can be seen from the comparison of Fig. 11 and Fig. 13 which will be described later, the guess: causes the first level data cache 126 to provide data to the cache memory of the load instruction for several clock cycles. Referring to Fig. 12', Fig. 12 is a timing chart of the normal load program according to ==124, and the second = program. Figure 12 consists of five rows. The private processor loo ## ^ ^Ν ί should be in the second five columns of the micro-processor, the squares of the respective generations, not blank, β ί, fruit. Figure 12: Each line of intersections is the same as in the pipeline of the processor. 'The instruction is not loaded in the micro 114 load! 'According to the first column of 81 12', the execution unit of Figure 1 generates the true value by the first letter of the first name ^ Figure 3, giving the true value' to the load instruction, just like The steps of Figure 8 are shown. In the temple cycle 2, according to the second column

虛擬位址334,如_8的步驟綱所示。H 截入3 ’根據第三列’轉譯暫存區308會產生二二^源實體位址336’就如圖8的步驟824所示。、週J 4，根據第四列，圖3的比較器314會進 54 1294¾¾ 4twf.doc/m 行實體標籤比較，以產生圖3的pA—match[15:〇]訊號 364 ’就如圖8的步驟836所示。此外，控制邏輯電路3〇2 會根據 PA—match[15:0]訊號 364 與圖 2 的 sc_MESI[15:0] 訊號222產生normal—sc」〇ad一muX-Sel訊號393，就如圖8 的步驟832所示。此外，堆疊快取124會以sc—hit訊號389 通知載入單元有載入指令發生擊中，就如圖8的步驟832 所示。The virtual address 334 is as shown in the step of _8. H truncation 3 ‘translating temporary storage area 308 according to the third column' produces a binary source entity address 336' as shown in step 824 of Fig. 8. According to the fourth column, the comparator 314 of FIG. 3 will enter 54 12943⁄43⁄4 4twf.doc/m row physical label comparison to generate the pA-match[15:〇] signal 364 ' of FIG. 3 as shown in FIG. Step 836 is shown. In addition, the control logic circuit 3〇2 generates a normal-sc"〇ad-muX-Sel signal 393 according to the PA_match[15:0] signal 364 and the sc_MESI[15:0] signal 222 of FIG. 2, as shown in FIG. Step 832 is shown. In addition, the stack cache 124 notifies the load unit with a sc-hit signal 389 that a load command hit has occurred, as shown in step 832 of FIG.

0 一在時脈週期5，根據第五列，載入資料會提供給載入單元，就如圖8的步驟832所示。更明確的說，圖4的多工器408會根據normai—sc—i〇ad—mux—訊號選取 sc—data[15:〇]訊號226的十六個快取線其中之一，多工器 406會根據實體位址336的位元[5:2]選取正確的雙字組，而多工器402會選取輸入426。比較圖12與稍後說明的圖13可知，正常的載入程序 =使第一級資料快取126提供資料給載入指令，比傳统的快取記憶體更快。0 In clock cycle 5, according to the fifth column, the load data is provided to the load unit as shown in step 832 of FIG. More specifically, the multiplexer 408 of FIG. 4 selects one of the sixteen cache lines of the sc-data[15:〇] signal 226 according to the normai_sc_i〇ad-mux-signal, the multiplexer. 406 will select the correct double word based on the bit [5:2] of the physical address 336, and the multiplexer 402 will select the input 426. Comparing Figure 12 with Figure 13 described later, the normal loader program = causes the first level data cache 126 to provide data to the load instruction, which is faster than conventional cache memory.

% >圖13，圖13是根據本發明的圖8所繪示從非堆疊快取丨22進行載入程序的時序圖。圖13包括= 標示為！至6，對應於微處理機刚的六個時脈週期 ^ 3也包括六個列’各自代表微處理機刚的一個動作、、Ό 。圖13的每個行列交會處的方格，不是空白就是厂 ^」。，以指示載入指令在微處理機刚的管線中所在的圖1的執行單元在時脈週期1，根據圖13的第一列， 55 1294569 14394twf.doc/m =1、載入單几會在圖3的1〇ad-instr訊號346產生真值，以请求載人資料給載人指令，就如圖8的步驟所示。在時脈週期2 ’根據第二列，位址產生器遍會計算圖3的虛擬位址334，就如圖8的步驟_所示。在時脈週期3 ’根據第三列，轉譯暫存區，會產生載入指令的來源實體位址336,就如圖8的步驟824所示。% > Figure 13, Figure 13 is a timing diagram of the loading procedure from the non-stacked cache 22, as depicted in Figure 8 of the present invention. Figure 13 includes = marked as! Up to 6, the six clock cycles corresponding to the microprocessor ^ 3 also includes six columns 'each representing one action of the microprocessor, Ό . The square of each row and column in Figure 13 is either blank or factory ^". To indicate that the load instruction is in the pipeline of the microprocessor, the execution unit of Figure 1 is in clock cycle 1, according to the first column of Figure 13, 55 1294569 14394twf.doc/m =1, loading a few sessions The true value of the 1〇ad-instr signal 346 of FIG. 3 is generated to request the manned data to the manned command, as shown in the steps of FIG. In the clock cycle 2' according to the second column, the address generator traverses the virtual address 334 of FIG. 3 as shown in step _ of FIG. In the clock cycle 3', according to the third column, the temporary storage area is translated, and the source entity address 336 of the load instruction is generated, as shown in step 824 of FIG.

在時脈週期4，根據第四列，非堆疊快取Η2會根據實體位址336的索引部分進行傳統的列解碼，從列解碼的結果所指定的集合的每一攔讀取資料。在時脈週期5，根據第五列，非堆疊快取122會用實體位址，的標籤部分與受選取的集合的每一欄的▲籤i 行實體標勤b較。根據標鐵比較、絲與每—棚的有效位 70，非堆疊快取122會產生一個攔選取訊號(way sdect signal)以選取符合且有效的攔。In clock cycle 4, according to the fourth column, the non-stack cache 2 performs conventional column decoding based on the index portion of the physical address 336, and reads data from each block of the set specified by the result of the column decoding. In clock cycle 5, according to the fifth column, the non-stack cache 122 will use the physical address, and the tag portion is compared with the ▲ sign entity class bid b of each column of the selected set. Based on the standard rail comparison, the wire and the effective position of each shed 70, the non-stack cache 122 generates a way sdect signal to select a valid and valid barrier.

在時脈週期6，根據第六列，非堆疊快取122會選出欄選取A號所指定的快取線，並且根據實體位址336的低位位元，在剛才選出的快取線當中，選出正確的雙字組。除了圖10到圖13演示的範例之外，本發明也包含其他實施例，其中前面提到的各種功能，例如位址比較與多工選取(multiplexing)，是併入不同的時脈週期，而且快速彈出、猜測載入、正常載入、以及從非堆疊快取丨22的載入程序，都不侷限於以上的實施例。從以上的說明可知，堆疊快取124與非堆疊快取122 分離的優點是，比起不會區分堆疊與非堆疊存取的傳統的 56 1294歡 f.doc/m 单-快取記憶體，可以有效增加第一級資料快取i26的容量，而且不會增加第-級資料快取126的存取時間。此外，由於非堆叠快取m不儲存堆叠資料，基於程式存取的非堆疊快取122比同樣大小的傳統快取記憶體更有效率。此外，堆疊快取124可以加速大部分的彈出指 =，這料卿4舰先“紐，料齡所要求的資 3可=在堆疊快取124的頂端，因為最頂端的資料很 ^就疋取近推入堆疊快取124的資料，也就是最新資堆豐快取124在決定彈出位址是否真的擊中堆疊快取124之前就會猜測提供彈出資料。再者，堆疊快取以可以加快大部分取鱗疊#料的載人齡，這也是由於先出的雜，要狀的資料很可能位在靠近堆疊快 ίΓΙ頂^一個或多個快取線之中。因此’堆疊快取124 址比較以確定載入資料是否存在之前，就先根據虛擬位址比較，從頂端的記錄單位其中之-猜測提這使得堆叠快取124在大部分情況下:= 更咏供载入貢料’因為不必等待虛擬位址被轉譯為實體 =j比較實體位址。最後，如果載入的虛擬位址沒擊測接供Μλ 2 ί 早位’使得載入資料不能猜彳"’堆疊快取124會在載人的實體位址擊墼中二=124時提供載入資料。如果載入的實體位址沒你曰f、，124 ’就由非堆疊快取122提供载入資料。易二:2ί取124讀取資料所需的時間會變動，越容動作所需的時脈週期就越少。時間變動的原 57 1294569 14394twf.doc/m 因之一是讀取堆疊快取124 是資1在堆疊快取124之内的位^。1另一個變動原因 1400 ^方持^14,圖14是根據本發明的管線式微處理機 100，口 θ f/ °微處理機_類似於圖1的微處理機 β /、疋政處理機觸的第-級資料快取14G2不包含堆 =取124。圖14的第一級資料快取】搬包含一 1402進仃快速彈出，細節後述。 —睛參照圖15，圖15是根據本發明的，圖14當申第一級貝料快取1402的方塊圖。其中有幾個元件類似於圖3 的對應it件，作用也類似，類似的元件都使用相同標號。更明確的說，資料快取14〇2包括··位址產生器3〇6，負責接收運算元332並產生虛擬位址334 ;轉譯暫存區3〇8，負責接收虛擬位址334並產生實體位址336 ;算術單元304，負責接收加法訊號382、遞減訊號384、以及遞增訊號386，並產生欠位訊號388與溢位訊號392;以及多工器316、多工器 318、fp—0ffset 暫存器 322、add一sp一val 訊號 394、堆疊指標暫存器152的位元[5:2]、輸出訊號372、以及 fp一offset訊號396，除了下面說明的差別之外，以上元件的作用都類似於圖3當中標號相同的元件。資料快取1402 也包括控制邏輯電路1502，其作用在某些方面類似於圖3 的控制邏輯電路302。控制邏輯電路1502接收類似於控制邏輯電路 302 的 pushjnstr 訊號 342、pop_instr 訊號 344、以及add_sp_instr訊號352。控制邏輯電路1502產生類似 58 1294569 14394twf.doc/m 於圖3的控制訊號368。控制邏輯電路15〇2產生例外事件訊號399以回應快速彈•序的錯誤，除了下面說明的差異之外’就如同圖3的對應訊號。資料快取1402也包含儲存陣列(伽零dement array) 1504 ’以械錄錄轉，从它們的健標籤與快取狀態’例如MESI狀態。在圖15的實施例中，儲存陣列丄504有N」固列’或稱為集合，以及四個行，或稱為攔。也就疋說’麟快取1402是-個四攔集合關聯式快取記伊體。不過’本發明並不偈限於具有特定數量的搁的快取^ 憶體。在-實施财，儲存_ i綱所儲存的快取線大小為64位元組。資料快取1402也包括列解碼器(r〇w如⑶加。15〇6。列解碼器1506接收指定儲存陣列！ 5〇4的N個列其中之一的列訊號(row signal) 1552。列解碼器15〇6會在^數個許取訊號[N-l:〇] 1542之中，列訊號1552所指定的一個產= 真值。緊接著，儲存陣列15〇4會輸出上述為真的讀取訊號 [队1:^| 1542所指定列的訊號1594。也就是說，受選取的^ 歹J的母一攔的快取線資料、位址標籤、以及mesi狀熊都會輸出於訊號1594。在圖15的實施例中，訊號1594 :面會輪出四個各包含十六個雙字組的快取線，以及各個快取線所對應的位址標籤1574與MESI狀態中的有效位元 1576 〇資料快取1402也包含一個耦接於儲存陣列15〇4，四個輪入端的多工器1528。多工器1528的四個輸入端各 59 1294總 4twf.doc/m 自接收儲存陣列1504輸出的四個快取線1594的其中之。夕工器1528根據控制輸入1596選取四個快取線其中 ^以輸出於訊號1592。受選取的快取線經過訊號1592 提供至多工器318，後者會根據fp_0ffset訊號396在匯流排138上提供一個雙字組。資料快取1402也包含由控制邏輯電路15〇2產生的 fastjop汛號1564。控制邏輯電路15〇2會在fast j〇p訊號 1564產生真值，以回應p〇p—匕批訊號討4的真值，從資料快取1402進行快速彈出作業。 f料快取1402也包含一個儲存單位堆疊，或記錄單位堆疊，稱為fP-row堆疊1516，耦接於控制邏輯電路 15=。fp—row堆疊1516由多數個儲存單位組成，每個儲位儲存一個數值，指向儲存陣列1504的一個列。在一，鉍例中，fp—row堆疊1516的每個單位儲存個位元，其中N是儲存陣列1504的列數。fp 一 row堆疊1516的多個儲存單位構成—個堆疊，包括最頂端的記錄單位 15M以存放最近推入的列數值(row vaiue)，後者是由控制邏輯電路1502透過new—row訊號1554提供。也就是說， new一row戒號1554會指出儲存陣列1504之内，存放包含 $最近的推入指令資料的快取線的列，後面會配合圖詳細說明。儲存最近的推人資料所在的列，可讓資料快取 1402進行快速彈出程序，細節後述。电―·^堆疊⑸6也控制邏輯電路15〇2接收push—r〇w訊號1562。當控制邏輯電路1502在push—row訊號1562產生真值，_堆 12945獻疊1516會向下平移一個記錄單位，也就是說，最底下的記錄單位會移出fp_row堆疊1516，其餘的每個記錄單位會接收上一個記錄單位的内容，而且new_r〇w訊號1554的内容會被寫入fp—row堆疊1516的最頂端記錄單位。fp_row 堆疊1516也從控制邏輯電路1502接收p〇p_r〇w訊號 1558。當控制邏輯電路15〇2在p〇p—row訊號1558產生真值，fp—row堆疊1516會向上平移一個記錄單位，也就是說’最頂端的記錄單位會移出fp—r〇w堆疊1516，而其餘的母個記錄單位會接收下一個記錄單位的内容。資料快取1402也包含一個有兩個輸入端的多工器 1512，耦接於fp—r〇w堆疊1516。多工器1512的一個輸入端接收fp—row堆疊1516的最頂端記錄單位1514的内容，，標示為fp一row訊號1556。多工器1512的另一個輸入端接收來自轉譯暫存區308的實體位址336的索引部分，或稱為列選取部份1548。在一實施例中，索引1548就是實體位址336的低位位址位元。如果fast』〇p訊號1564 為，值，多工器1512會選取fp一row訊號1556，以輸出於列訊號1552，提供至列解碼器15〇6 ;否則，多工器1512 會選取索引1548以輸出於列訊號1552。資料快取也包含另-個儲存單位，或記錄單位的，豐’稱為fp—way堆疊1534，耦接於控制邏輯電路〇2。fp一way堆疊1534由多個儲存單位組成，每個單位 $存-個數值，指向儲存_ 15G4的—個攔。在圖15 的實施例巾，fP—way堆疊1534的每解_儲存兩個位 1294撤 ^twf.doc/m 兀，以指出儲存陣列1504的四個欄的其中之一。fp_way 堆，1534的多個儲存單位構成一個堆疊，包括最頂端的記錄單位1532，以存放最近推入的攔數值(way vahie)，後者是控制邏輯電路1502以new一way訊號1582提供。也就是說’ new—way訊號1582會指出儲存陣列15〇4之中， new一row訊號1554指定的列當中，存放含有最近推入指令的資料的快取線的攔，後面會配合圖17詳細說明。儲存含 φ 有最近的推入資料的攔，可讓資料快取1402執行快速彈出程序，細節後述。fp—way堆疊1534也從控制邏輯電路丨5〇2 接收push—way訊號1588。當控制邏輯電路15〇2在 push一way汛號1588產生真值，fp—way堆疊ι534會向下：平移一個記錄單位，也就是說，最底下的記錄單位會移出 * 巧-way堆豐I534’其餘的每個記錄單位會接收上一個記錄單位的内容，而且new—way訊號1582的内容會被寫入 fp—way堆疊1534的最頂端記錄單位1532。fp—way堆疊 1534也從控制邏輯電路15〇2接收p〇p一way訊號1586。^ _ 控制邏輯電路1502在p〇p—way訊號1586產生真值，堆疊U34會向上平移一個記錄單位，也就是說，最頂端的記錄單位會移出fp—way堆疊1534，而其餘的每個記錄單位會接收下一個記錄單位的内容。在一實施例中，fp—row堆疊1516與fP-Way堆疊1534 是由單一堆疊組成，其中每個記錄單位各存放一個列數值與一個欄數值。資料快取1402也包含一個有兩個輸入端的多工界 62 c/m 1294棚_ 1526,耦接於fp一way堆疊1534。多工器1526的一個輸入端接收fp—way堆疊1534的最頂端記錄單位1532的内容，標示為fp一way訊號1584。多工器1526的另一個輸入端接收 normal—way—select 訊號 1578。如果 fast」)〇p 訊號 1564 為真值，多工器1526會選取fp一way訊號1584，以輸出於多工選取訊號(mux select signal) 1596，提供至多工器 1528，否則’多工态1526會選取normai—way—seiect訊號 1578以輸出於多工選取訊號1596。在一實把例中’Φ—way堆叠1534和fp_r〇w堆疊1516 的母個5己錄單位都包含一個有效位元（valid bit)，而且 fastj)〇p訊號1564會由最頂端記錄單位1514以及最頂端記錄單位1532的有效位元做邏輯或運算的結果把關。也就疋说，儘管控制邏輯電路1502不會在執行快速彈出之前，檢查彈出來源位址符合與否，至少會先檢查φ—Γ(ην堆疊 1516的最頂端記錄單位1514以及fp—way堆疊ι534的最頂端七錄單位1532都是有效。在本實施例中，fp way堆疊1534和fp 一 row堆疊1516的每一次彈出時，向上平移之後，最底下的記錄單位的有效位元都會設為偽值(false)。資料快取1402也包括搞接於控制邏輯電路1502的攔選取產生器（way select generator) 1524。欄選取產生器1524 從儲存陣列1504受選取的列當中，接收每一個位址標箴 1574以及有效位元1576。欄選取產生器1524也接收來自轉譯暫存區308的實體位址336的位址標籤部分1546。攔選取產生器1524會比較實體位址標籤1546與儲存陣列 63 f.doc/m 1504輸出的每一個標籤1574，其中實體位址標籤1546可能來自彈出、推入、載入、或存放指令。如果位址標籤1574 的其中之一等於實體位址標籤1546,而且對應的有效位元 1576表示這個位址標籤1574為有效，攔選取產生器1524 就會在φς：供給控制邏輯電路1502的cache一hit訊號1572產生真值。此外，欄選取產生器1524會將有效且相等的欄的數值，也就是擊中儲存陣列1504的欄，輸出於 normal一way-select訊號1578,並提供給控制邏輯電路15〇2 與多工器1526。資料快取1402也包括耦接於儲存陣列15〇4的檢查邏輯電路(check logic) 1508。檢查邏輯電路15〇8接收實體位址 336、fastjpop 訊號 1564、fp—row 訊號 1556、fp—way 訊號1584、位址標籤1574、有效位元1576、以及fp__0ffset 说號396。檢查邏輯電路1508會做檢查以決定，在快速彈出程序中猜測提供給彈出指令的資料是否正確。檢查邏輯電路1508會決定由fp-row訊號1556和fp—way訊號1584 分別提供的正確列數值與欄數值，是否於快速彈出程序中用來從儲存陣列1504選取正確的快取線，以提供正確的彈出資料。在一實施例中，檢查邏輯電路1508會在快速彈出程序中比較fp 一row訊號1556的值，以及fp—way訊號1584 所指定的攔的位址標籤1574。在一實施例中，檢查邏輯電路1508也會比較在快速彈出程序中使用的fp_r〇w訊號 1556的值’以及實體位址336的對應位元。在一實施例中，檢查邏輯電路1508也會比較在快速彈出程序中使用的 64 Ϊ294569 l4394twf.doc/m Φ〜offset訊號396的值，以及實體位址336的對應位元。檢查邏輯電路1508也會確認fp—way訊號1584所指定的攔的有效位元1576指出在快速彈出程序中取用的快取線為有效。如果上述的快取線並非有效，或是沒有取用正確的快取線，檢查邏輯電路15〇8會在fp—check訊號1544產生偽值，以提供給控制邏輯電路15〇2。否則，檢查邏輯電路1508會在fp—check訊號1544產生真值，以提供給控制 > 邏輯電路1502。請參照圖16，圖16為根據本發明的，從圖15的資料快取1402進行快速彈出程序的流程圖。流程從步驟16〇2 開始。在步驟1602，指令轉譯器106會將彈出指令解碼，指令排程器108會將彈出指令發給圖14的執行單元114的載入單元。然後載入單元會在p〇p—instr訊號344產生真值。接下來’在步驟1604，為回應p0pjnstr訊號344的真值，控制邏輯電路15〇2會在fastj)〇p訊號1564產生真 | 值。因此’多工器1512會選取fpjr〇w訊號1556以經由列afU虎1552輸出至列解碼器15〇6。接著列解碼器15〇6會在碩取sfl唬[N-l:〇] 1542當中，由fp—row訊號1556指定的一個之上產生真值。接著儲存陣列15〇4會在輸出訊號 1594輸出一列處於真值的讀取訊號…丄別。為了回應 festjop訊號1564的真值，多工器1526會選取电吖訊號^!^4，以經由多工選取訊號1596提供給多工器1528。接著多工器1528會選取來自fp_way訊號1584所指定的 65 /m I294m,d〇c 攔的快取線’以輸出於訊號1592。多工器318合〜。口 1528輸出的快取線1592選取正確的雙字J攸: 最後寫回:-^早70 114的载人單元，以提供給彈出指令， 112當中’彈出指令所指定的暫存器。舉例而言暫= 3RET指令，彈出資料會載人暫存器組112 L FAV日日標暫存器。另一個例子，如果彈出指令是X86 的LE=VE ‘令，彈出資料會載人暫存器組112當中的EBP ^存器。又一個例子，如果彈出指令是χ86的p〇p指令，彈^資料會載人暫存器組112當中，PQp指令所指定^暫存态。由圖16可知，資料是猜測性地提供給載入單元。說猜測是因為尚未確定將在步驟1616產生於實體位址336 的彈出指令來源位址，會等於從儲存陣列15〇4當中，由 fp—row訊號1556與fP-Way訊號1584指定的記錄單位，提供至載入單元的彈出資料的位址。接下來，在步驟1606，控制邏輯電路1502會在遞增訊號386產生真值，接著算術單元3〇4會遞增fp_〇fftet訊號396，然後將遞增後的數值輸出於訊號372，控制邏輯電路1502會透過控制訊號368，使多工器316選取這個數值，以載入到fp_〇ffset暫存器322。接下來’在決策步驟1608，控制邏輯電路1502會檢查溢位訊號392,以決定步驟1606的遞增程序是否造成 fp-〇ffset暫存器322溢位。也就是說，控制邏輯電路1502 66 I2945l^twf.doc/m =定^ 會造成堆4指標152指向下—個快取、^二流程會進人步驟1612,否則會進人步驟1614。ッ驟1612，控制邏輯電路15 1558產生真值，以强φ f 曰牡P〇P—r〇w虎压王具值Μ無出fp—r〇w堆疊151 位，控制邏輯電路15〇2 ★合力n ^己錄早也曰在P叩一way訊號1586產生真 Ϊ镇上P，ay堆疊1534的最頂端記錄單位。這是為 2匕們…钱記憶體快取—致，因為在儲存陣列15〇4 畜中，由fp—謂堆疊1516的最頂端記錄單位i5i4與 Φ—堆® 1534 6^¾頂端簡單位1532所指㈣記錄單位1存的快取線的上一個雙字組，正被彈出指令彈出系統咖體堆疊。在-實施例中，步驟1612是在後面說明的步驟1618之後執行。在另一實施例中，用於步驟！刪的 fp一row訊號1556和fp—way訊號1584的數值會保存下來’留待步驟1618使用。接下來，在步驟1614，位址產生器306會計算彈出指令的來源虛擬位址334。接下來，在步驟1616，轉譯暫存區308會產生彈出指令的來源實體位址336。接下來，在步驟1618，檢查邏輯電路15〇8會比較產生於步驟1616的實體位址336的對應部分，以及fp_way 訊號1584所選取的位址標籤1574,並且比較實體位址336 的對應部分與fp—row訊號1556，並且比較實體位址336 的對應部分與fp_〇ffset訊號396 ’並檢查fp_way訊號 1584選取的有效位元1576,以產生提供給控制邏輯電路 67 I294H〇c/m 1502 的 fp_check 訊號 1544。接下來，在決策步驟1622，控制邏輯電路15〇2會檢查fp一check訊號1544，以決定彈出指令的來源實體位址 336是否擊中儲存陣列1504當中，fp—r〇w堆疊i5i6盥 fp一way堆疊1534的最頂端記錄單位所指定的記錄單位了如果擊中，流程在此結束，也就是說，猜測快速彈出程序提供了正確的彈出資料。否則流程會進入步驟1624。在步驟1624，控制邏輯電路1502會在例外事件訊號 399產生真值，使微處理機14〇〇執行例外事件處理程式，以處理猜測快速彈出程序提供錯誤資料的狀況。例外事件處理程式會使彈出指令收到正確的資料。在一實施例中，例外事件處理程式會清空fp—row堆疊1516與电一〜矽堆疊 1534，並且將堆疊指標暫存器152的位元[5:2]一的正確$ 料載入到fp—offset暫存器322。流程在步驟1624結束。、、时由此可知，後面也會配合圖19詳細說明，圖16的快速彈出程序使得傳統的快取記憶體提供彈出資料給彈出指令，可以比沒有快速彈出裝置快上幾個時脈週期。請參照圖17,圖17是根據本發明的，圖15的資料快取1402進行推入程序的流程圖。流程從步驟17〇2開始。' ^在步驟1702，指令轉譯器106會將推入指令解碼:然，指令排程器108會將推入指令發給執行單元114的存放單元。接著存放單元會在pUSh—instr訊號342產生真值。接下來，在步驟1704，控制邏輯電路15〇2會在遞減訊號384產生真值，接著算術單元3〇4會遞減电―〇饱以訊 68 c/m Ι294^,〇號396，將遞減後的數值輸出於訊號372，控制邏輯電路 1502會透過控制訊號368使多工器316選取這個數值，以將它載入fp_offset暫存器322。接下來，在決策步驟1706，控制邏輯電路1502會檢查欠位訊號388，以決定步驟1704遞減fp—offset訊號396 是否造成fp一offset暫存器322欠位。也就是說，控制邏輯電路1502會決定，推入指令是否會造成堆疊指標暫存器 152指向前一個快取線。如果是，流程會進入決策步驟 1716，否則會進入決策步驟17〇8。在決策步驟1708,控制邏輯電路1502會檢查cacheJiit 訊號1572，以決定推入指令的目標實體位址336是否擊中儲存陣列1504。如果擊中，流程會進入步驟1712，如果沒有，則會進入步驟1714。在步驟1712，資料快取1402會將目前的推入指令視為擊中資料快取1402的正常推入指令。也就是說，資料快取1402會以資料快取領域所熟知的傳統方法處理這個推入指令。因為推入動作不會轉而指向上一個快取線，不需要更新fp—row堆疊1516與fp_way堆疊1534 ;於是，下一個彈出程序彳艮可能會指定fp_row堆疊1516的最頂端記錄單位1514與fp_way堆疊1534的最頂端記錄單位1532 所指定的快取線其中的資料。流程在步驟1712結束。在步驟1714，控制邏輯電路1502會在例外事件訊號 399產生真值，使微處理機1400執行例外事件處理程式，以更新fp—row堆疊1516與fp_way堆疊1534。在一實施 69 I2945ld。- 例中，例外事件處理程式會清空fp—row堆疊1516與 fp—way1534,並且將堆疊指標暫存器152的位元[5:2] 的正確資料載入到fp-0ffset暫存器322。接下來，流程會進入步驟1726。在決策步驟1716,控制邏輯電路15〇2會檢查cache_hit afl號1572以決定推入指令的目標實體位址336是否擊中儲存陣列1504。如果擊中，流程會進入步驟丨718，否則會進入步驟1726。在步驟1718，控制邏輯電路15〇2會決定擊中儲存陣列1504的列與欄。擊中的列是由索引1548指示。擊中的攔是由normal—way—select訊號1578指示。控制邏輯電路 1502會以new一way訊號1582，將擊中的欄提供給fp_way 堆4: 1534。此外，控制邏輯電路1502會以newjrow訊號 1554，將擊中的列提供給fp—row堆疊1516。接下來，在步驟1722，控制邏輯電路1502會在 push—row訊號1562產生真值’以將new_row 1554提供的數值推入fp_row堆疊1516。控制邏輯電路1502也會在 push_way訊號1588產生真值，以將new_way 1582提供的數值推入fp_way堆疊1534。接下來，在步驟1724，資料快取1402會將目前的推入指令視為擊中資料快取1402的正常推入指令。也就是說，在步驟1722更新fp_row堆疊1516和fp_way堆疊1534 之後，資料快取1402會以資料快取領域習知的傳統方法處理這個推入指令。流程在步驟Π24結束。 ►c/m 在步驟1726,控制邏輯電路1502會決定儲存陣列 1504當中，被索引1548指定的列當中，要被錯過的推入位址336所涉及的快取線取代的攔，這個快取線必須現在就載入資料快取1402。在一實施例中，控制邏輯電路bo] 會選取受選取的列當中’最久沒使用的攔。控制邏輯電路 1502會透過new-way訊號1582，將用來取代的攔提供給 fp一way堆疊1534。此外，控制邏輯電路15〇2會透過 new一row訊號1554，將索引1548所選取的列提供給fp_r〇w 堆疊1516。接下來，在步驟1728，控制邏輯電路15〇2會在 push一row訊號1562產生真值，以將new—r〇w訊號1554 長:供的值推入fp—row堆豐1516。控制邏輯電路1別2也會在push_Way訊號1588產生真值，以將new_way訊號158a2 提供的值推入fp_way堆疊1534。接下來，在步驟1732，資料快取1402會將目前的推入指令視為錯過資料快取1402的正常推入指令。也就是說j在^驟1728更新fp—row堆疊1516和母一way堆疊1534 之，，資料快取14G2會以倾快取領_知的傳統方法處理這個推入指令。流程在步驟1732結束。明參照圖18，圖18為根據本發明的，圖14的微處理機1400處理㈣指標加法齡的流糊。流程從步驟 1802 開始。田在步驟1802，指令轉譯器106會將目標為圖14的堆豐指標暫存器152的加法指令解碼，而指令排程器綱會 71 12945傲— 將這個加法指令發給執行單元114的整數單元。然後整數單元會在add—spjnstr訊號352產生真值。接下來，在步驟1804，控制邏輯電路1502會在加法訊號382產生真值，然後算術單元3〇4會將add—sp—vd訊號394加上fp—〇ffset訊號396，將總和輸出於訊號372，控制邏輯電路1502會透過控制訊號368，使多工器316選取這個總和’以載入到fp__offset暫存器322。In the clock cycle 6, according to the sixth column, the non-stack cache 122 selects the cache line specified by the A number in the column, and selects among the cache lines just selected according to the low bit of the physical address 336. The correct double word group. In addition to the examples illustrated in Figures 10 through 13, the present invention also encompasses other embodiments in which the various functions previously mentioned, such as address comparison and multiplexing, incorporate different clock cycles, and The fast pop-up, guess load, normal load, and loader from the non-stack cache 22 are not limited to the above embodiments. As can be seen from the above description, the advantage of the stack cache 124 being separated from the non-stack cache 122 is that compared to the conventional 56 1294 f.doc/m single-cache memory that does not distinguish between stacked and non-stacked access, The capacity of the first level data cache i26 can be effectively increased, and the access time of the first level data cache 126 is not increased. In addition, since the non-stack cache m does not store stacked data, the program-based access non-stack cache 122 is more efficient than the same size of conventional cache memory. In addition, the stack cache 124 can speed up most of the pop-up finger =, this is the first of the four ships, the required amount of money can be = at the top of the stack cache 124, because the top information is very ^ The data that is pushed into the stack cache 124, that is, the latest resource cache 124 will guess to provide pop-up data before deciding whether the pop-up address actually hits the stack cache 124. Further, the stack cache can be Accelerate the age of most of the scales, which is also due to the first-out, the information is likely to be located near the stacking speed ^ ^ one or more cache lines. So 'stack cache Before the comparison of the addresses to determine whether the loaded data exists, it is based on the virtual address comparison, from the top of the record unit - guess that this makes the stack cache 124 in most cases: = more for loading the tribute 'Because there is no need to wait for the virtual address to be translated into entity = j to compare the physical address. Finally, if the loaded virtual address is not tested, 早 λ 2 ί early bit 'so that loading data can't guess 彳' stacking fast Take 124 will be killed at the manned physical address 2 = 124 when the loading data is provided. If the loaded physical address is not you 曰 f,, 124 ' is provided by the non-stack cache 122 to load the data. Yi 2: 2 ί take 124 time to read the data will The change, the more the clock cycle required for the more action, the one of the time changes of the original 57 1294569 14394twf.doc / m because the read stack cache 124 is the bit 1 of the stack 1 within the stack cache 124. 1 another variation reason 1400 ^ square hold ^ 14, Figure 14 is a pipeline microprocessor 100 according to the present invention, port θ f / ° microprocessor _ similar to the microprocessor of Figure 1 β /, 疋政处理机触The first-level data cache 14G2 does not include the heap=take 124. The first-level data cache of Fig. 14 is loaded with a 1402 fast pop-up, the details will be described later. - The eye is referenced to Fig. 15, which is a diagram according to the present invention. Figure 14 is a block diagram of the first-level shell material cache 1402. Several of the components are similar to the corresponding ones of Figure 3, and the functions are similar. Similar components use the same label. More specifically, the data is fast. Taking 14〇2 includes address generator 3〇6, responsible for receiving operand 332 and generating virtual address 334; translation temporary storage The area 3〇8 is responsible for receiving the virtual address 334 and generating the physical address 336; the arithmetic unit 304 is responsible for receiving the addition signal 382, the decrementing signal 384, and the incrementing signal 386, and generating the under-bit signal 388 and the overflow signal 392; The multiplexer 316, the multiplexer 318, the fp_0ffset register 322, the add-sp-val signal 394, the bit [5:2] of the stack indicator register 152, the output signal 372, and the fp-offset signal 396, except for the differences described below, the above elements function similarly to the elements of the same reference numerals in FIG. The data cache 1402 also includes a control logic circuit 1502 that functions in some respects similar to the control logic circuit 302 of FIG. Control logic circuit 1502 receives pushjnstr signal 342, pop_instr signal 344, and add_sp_instr signal 352 similar to control logic circuit 302. Control logic circuit 1502 produces control signal 368 similar to that of Fig. 3 of 58 1294569 14394 twf.doc/m. Control logic circuit 15 产生 2 generates an exception event signal 399 in response to a fast bullet sequence error, except for the difference described below, which is like the corresponding signal of FIG. The data cache 1402 also includes a storage array (gamma zero dement array) 1504' for mechanical recording, from their health label and cache state 'for example MESI status. In the embodiment of Figure 15, storage array 504 has N"solids" or a collection, and four rows, or blocks. It is also said that 'Linku takes 1402 is a four-block set-associated cache. However, the present invention is not limited to a cache with a certain number of holds. In the implementation, the cache line size stored in the storage directory is 64 bytes. The data cache 1402 also includes a column decoder (r〇w such as (3) plus .15〇6. The column decoder 1506 receives the specified storage array! The row signal of one of the N columns of 5〇4 is 1552. The decoder 15〇6 will be in the ^1 number of signals [Nl:〇] 1542, the one specified by the signal 1552 = true value. Then, the storage array 15〇4 will output the above true reading. Signal [team 1: : 15 | 1542 specified column signal 1594. That is to say, the cache line data, address label, and mesi bear of the selected parent 一J will be output on signal 1594. In the embodiment of FIG. 15, signal 1594: the face will take four cache lines each containing sixteen double blocks, and the address tag 1574 corresponding to each cache line and the valid bit in the MESI state. The data cache 1402 also includes a multiplexer 1528 coupled to the storage array 15〇4, four wheel-in ends. The four inputs of the multiplexer 1528 are each 59 1294 total 4 twf.doc/m self-receiving storage array 1504 The output of the four cache lines 1594. The studio 1528 selects four cache lines according to the control input 1596, wherein the output is At signal 1592, the selected cache line is provided via signal 1592 to multiplexer 318, which provides a double word on bus bar 138 based on fp_0ffset signal 396. Data cache 1402 also includes control logic circuit 15 〇 2 The fastjop nickname 1564. The control logic circuit 15 〇 2 will generate a true value in the fast j 〇 p signal 1564 in response to the true value of the p 〇 p - 匕 batch signal 4, from the data cache 1402 for a quick pop-up operation. The material cache 1402 also includes a storage unit stack, or a recording unit stack, called an fP-row stack 1516, coupled to the control logic circuit 15=. The fp-row stack 1516 is composed of a plurality of storage units, each storage location A value points to a column of the storage array 1504. In one example, each unit of the fp-row stack 1516 stores one bit, where N is the number of columns of the storage array 1504. fp a plurality of row stacks 1516 The storage unit constitutes a stack, including the topmost recording unit 15M to store the most recently pushed column value (row vaiue), which is provided by the control logic circuit 1502 via the new-row signal 1554. That is, new-r The ow ring number 1554 will indicate the column of the cache line containing the most recent push command data in the storage array 1504, which will be described later in detail. The column of the most recent pusher data is stored to allow the data to be cached. The 1402 performs a quick pop-up procedure, which will be described later in detail. The electric---stack (5) 6 also controls the logic circuit 15〇2 to receive the push-r〇w signal 1562. When the control logic circuit 1502 generates a true value in the push-row signal 1562, the _ heap 12945 splicing 1516 will translate down one recording unit, that is, the bottom recording unit will move out of the fp_row stack 1516, and each of the remaining recording units The contents of the previous record unit will be received, and the contents of the new_r〇w signal 1554 will be written to the topmost record unit of the fp-row stack 1516. The fp_row stack 1516 also receives the p〇p_r〇w signal 1558 from the control logic circuit 1502. When the control logic circuit 15〇2 generates a true value in the p〇p_row signal 1558, the fp_row stack 1516 will shift up one recording unit, that is, the topmost recording unit will move out of the fp_r〇w stack 1516, The remaining parent records unit will receive the contents of the next record unit. The data cache 1402 also includes a multiplexer 1512 having two inputs coupled to the fp-r〇w stack 1516. An input of multiplexer 1512 receives the contents of the topmost recording unit 1514 of fp-row stack 1516, designated as fp-row signal 1556. The other input of multiplexer 1512 receives an index portion of physical address 336 from translation buffer 308, or column selection portion 1548. In one embodiment, index 1548 is the lower address bit of physical address 336. If fast 〇p signal 1564 is the value, multiplexer 1512 will select fp-row signal 1556 for output to column signal 1552, to column decoder 15〇6; otherwise, multiplexer 1512 will select index 1548 to Output to column signal 1552. The data cache also contains another storage unit, or a recording unit, which is called fp-way stack 1534, and is coupled to the control logic circuit 〇2. The fp-way stack 1534 consists of multiple storage units, each of which has a value of $1, pointing to a block that stores _15G4. In the embodiment of Figure 15, each solution of the fP-way stack 1534 stores two bits 1294 and withdraws ^twf.doc/m 以 to indicate one of the four columns of the storage array 1504. The fp_way heap, 1534's multiple storage units form a stack, including the topmost recording unit 1532, to store the most recently pushed block value (way vahie), which is provided by the control logic circuit 1502 with the new one way signal 1582. In other words, the 'new-way signal 1582' will indicate that among the columns specified by the new one row signal 1554, the cache line containing the data of the most recent push command is stored in the storage array 15〇4, which will be detailed later with FIG. Description. Store the block containing φ with the most recent push data, and let the data cache 1402 execute the quick pop-up procedure, which will be described later. The fp-way stack 1534 also receives the push-way signal 1588 from the control logic circuit 丨5〇2. When the control logic circuit 15〇2 generates a true value in the push-way nickname 1588, the fp-way stack ι534 will go down: pan one record unit, that is, the bottom record unit will be shifted out* Qiao-way Hefeng I534 Each of the remaining recording units will receive the contents of the previous recording unit, and the contents of the new-way signal 1582 will be written to the topmost recording unit 1532 of the fp-way stack 1534. The fp-way stack 1534 also receives the p〇p-way signal 1586 from the control logic circuit 15〇2. ^ _ control logic circuit 1502 generates a true value at p〇p-way signal 1586, and stack U34 translates one record unit upwards, that is, the topmost record unit is shifted out of fp-way stack 1534, and each of the remaining records The unit will receive the contents of the next record unit. In one embodiment, the fp-row stack 1516 and the fP-Way stack 1534 are comprised of a single stack in which each record unit stores a column value and a column value. The data cache 1402 also includes a multiplexed 62 c/m 1294 shed _ 1526 with two inputs coupled to the fp-way stack 1534. An input of multiplexer 1526 receives the contents of the topmost recording unit 1532 of fp-way stack 1534, designated as fp-way signal 1584. The other input of multiplexer 1526 receives a normal-way-select signal 1578. If the fast ") 〇p signal 1564 is true, the multiplexer 1526 will select the fp-way signal 1584 for output to the mux select signal 1596, which is provided to the multiplexer 1528, otherwise the 'multiple mode 1526' The normai-way-seiect signal 1578 is selected for output to the multiplex selection signal 1596. In a real example, the parent 5 recording units of the 'Φ-way stack 1534 and the fp_r〇w stack 1516 all contain a valid bit, and the fastj) 〇p signal 1564 is determined by the topmost recording unit 1514. And the most significant bit of the top record unit 1532 is the result of a logical OR operation. That is to say, although the control logic circuit 1502 does not check whether the pop-up source address is consistent or not before performing the fast pop-up, at least φ-Γ is first checked (the topmost recording unit 1514 of the ην stack 1516 and the fp-way stack ι534) The topmost seven-record unit 1532 is valid. In this embodiment, each time the fp way stack 1534 and the fp-row stack 1516 are popped up, after the upward shift, the effective bit of the bottommost record unit is set to false. Value (false) The data cache 1402 also includes a way select generator 1524 that is coupled to the control logic circuit 1502. The column selection generator 1524 receives each of the selected columns from the storage array 1504. The flag 1574 and the valid bit 1576. The column selection generator 1524 also receives the address tag portion 1546 from the entity address 336 of the translation buffer 308. The bar selector generator 1524 compares the entity address tag 1546 with the storage array 63. F.doc/m 1504 outputs each tag 1574, where the physical address tag 1546 may be from a pop-up, push-in, load, or store instruction. If the address tag 1574 is One is equal to the physical address tag 1546, and the corresponding valid bit 1576 indicates that the address tag 1574 is valid, and the block selection generator 1524 generates a true value at the cache-hit signal 1572 of the φς: supply control logic circuit 1502. In addition, the column selection generator 1524 outputs the value of the valid and equal column, that is, the column hitting the storage array 1504, to the normal one way-select signal 1578, and supplies it to the control logic circuit 15〇2 and the multiplexer. 1526. The data cache 1402 also includes a check logic 1508 coupled to the storage array 15〇4. The check logic 15〇8 receives the physical address 336, the fastjpop signal 1564, the fp_row signal 1556, fp— The way signal 1584, the address tag 1574, the valid bit 1576, and the fp__0ffset statement 396. The check logic circuit 1508 will check to determine if the data provided to the pop-up instruction is correct in the fast pop-up procedure. The check logic circuit 1508 will Determines whether the correct column values and column values provided by fp-row signal 1556 and fp-way signal 1584, respectively, are used in the fast pop-up procedure from storage array 150. 4 Select the correct cache line to provide the correct pop-up data. In one embodiment, the check logic 1508 compares the value of the fp-row signal 1556 in the fast pop-up procedure, and the block specified by the fp-way signal 1584. Address tag 1574. In one embodiment, the check logic circuit 1508 also compares the value of the fp_r〇w signal 1556 used in the fast pop-up procedure with the corresponding bit of the physical address 336. In one embodiment, the check logic 1508 also compares the value of the 64 Ϊ 294569 l4394 twf.doc/m Φ~offset signal 396 used in the fast pop-up procedure with the corresponding bit of the physical address 336. The check logic circuit 1508 also confirms that the valid bit 1576 of the block specified by the fp-way signal 1584 indicates that the cache line taken in the fast pop-up procedure is valid. If the cache line described above is not active or the correct cache line is not taken, the check logic circuit 15 8 generates a dummy value at the fp-check signal 1544 for supply to the control logic circuit 15〇2. Otherwise, check logic circuit 1508 will generate a true value at fp-check signal 1544 for supply to control > logic circuit 1502. Referring to Figure 16, Figure 16 is a flow diagram of a quick pop-up procedure from the data cache 1402 of Figure 15 in accordance with the present invention. The process begins with step 16〇2. At step 1602, the instruction translator 106 decodes the pop-up instructions, and the instruction scheduler 108 sends the pop-up instructions to the loading unit of the execution unit 114 of FIG. The load unit then generates a true value at p〇p-instr signal 344. Next, in step 1604, in response to the true value of the p0pjnstr signal 344, the control logic 15 〇 2 will generate a true | value at fastj) 〇 p signal 1564. Therefore, the multiplexer 1512 selects the fpjr〇w signal 1556 to output to the column decoder 15〇6 via the column afU tiger 1552. Then the column decoder 15〇6 will generate a true value above the one specified by the fp-row signal 1556 among the masters sfl唬[N-l:〇] 1542. The storage array 15〇4 then outputs a list of true read signals in the output signal 1594. In response to the true value of the festjop signal 1564, the multiplexer 1526 will select the power signal ^!^4 to provide the multiplexer 1528 via the multiplex selection signal 1596. The multiplexer 1528 then selects the cache line ' from the 65/m I294m,d〇c block specified by the fp_way signal 1584 to output the signal 1592. The multiplexer 318 is combined with ~. The cache line 1592 of the port 1528 selects the correct double word J攸: Finally writes back: -^ 70 114 the manned unit to provide the temporary register specified by the pop-up instruction in the pop-up instruction 112. For example, for the temporary = 3RET command, the pop-up data will be loaded into the register group 112 L FAV day-day register. As another example, if the pop-up command is X86's LE=VE ‘order, the pop-up data will be loaded into the EBP register in the scratchpad group 112. As another example, if the pop-up command is a p〇p command of χ86, the data will be loaded into the scratchpad group 112, and the PQp command specifies the temporary state. As can be seen from Figure 16, the data is provided to the loading unit speculatively. The guess is because the pop-up instruction source address that would be generated at the physical address 336 in step 1616 has not been determined, which would be equal to the recording unit specified by the fp-row signal 1556 and the fP-Way signal 1584 from the storage array 15〇4. Provide the address of the popup data to the load unit. Next, in step 1606, the control logic circuit 1502 generates a true value at the increment signal 386, and then the arithmetic unit 3〇4 increments the fp_〇fftet signal 396, and then outputs the incremented value to the signal 372, and the control logic circuit 1502 The multiplexer 316 is selected by the control signal 368 to load the value into the fp_〇ffset register 322. Next, at decision step 1608, control logic circuit 1502 checks for overflow signal 392 to determine if the incrementing of step 1606 caused the fp-〇ffset register 322 to overflow. That is to say, the control logic circuit 1502 66 I2945l ^ twf.doc / m = will cause the heap 4 indicator 152 to point to the next - cache, the second process will enter step 1612, otherwise it will enter step 1614. Step 1612, the control logic circuit 15 1558 generates a true value, with a strong φ f 曰〇 P〇P-r〇w tiger pressure king value Μ no fp-r〇w stack 151 bits, control logic circuit 15 〇 2 ★ Heli n ^ has recorded as early as the P叩 oneway signal 1586 to produce the real top town P, ay stack 1534 the top record unit. This is for 2 ......money memory cache, because in the storage array 15〇4 animals, the topmost recording unit i5i4 and Φ-heap® 1534 6^3⁄4 top 1532 of the stack 1516 by fp- The previous double word group of the cache line stored in the recording unit 1 is being ejected by the pop-up command to pop up the system. In the embodiment, step 1612 is performed after step 1618 described later. In another embodiment, for the steps! The values of the deleted fp-row signal 1556 and the fp-way signal 1584 are saved and left for use in step 1618. Next, at step 1614, the address generator 306 calculates the source virtual address 334 of the pop-up instruction. Next, at step 1616, the translation buffer 308 will generate the source entity address 336 of the pop-up instruction. Next, in step 1618, the check logic circuit 15 8 compares the corresponding portion of the physical address 336 generated in step 1616 with the address tag 1574 selected by the fp_way signal 1584, and compares the corresponding portion of the physical address 336 with The fp-row signal 1556, and compares the corresponding portion of the physical address 336 with the fp_〇ffset signal 396' and checks the valid bit 1576 selected by the fp_way signal 1584 to generate a control logic circuit 67 I294H〇c/m 1502. Fp_check signal 1544. Next, at decision step 1622, the control logic circuit 15〇2 checks the fp-check signal 1544 to determine if the source entity address 336 of the pop-up instruction hits the storage array 1504, fp_r〇w stack i5i6盥fp one Way stacks the recording unit specified by the topmost record unit of 1534. If it hits, the process ends here, that is, the guess quick popup program provides the correct popup data. Otherwise the process will proceed to step 1624. At step 1624, control logic circuit 1502 generates a true value at exception event signal 399, causing microprocessor 14 to execute an exception event handler to process the condition of the guessing fast pop-up program providing the erroneous data. The exception handler will cause the popup to receive the correct data. In one embodiment, the exception handler will empty the fp-row stack 1516 and the power-to-stack stack 1534, and load the correct $1 of the stack indicator register 152 into the fp. —offset register 322. The process ends at step 1624. As can be seen from the following, the detailed pop-up procedure of Figure 16 allows the conventional cache memory to provide pop-up data to the pop-up command, which can be several clock cycles faster than without the fast pop-up device. Referring to Figure 17, Figure 17 is a flow diagram of the data cache 1402 of Figure 15 for a push procedure in accordance with the present invention. The process begins with step 17〇2. ' ^ At step 1702, the instruction translator 106 decodes the push instruction: however, the instruction scheduler 108 will send the push instruction to the storage unit of the execution unit 114. The storage unit then generates a true value at pUSh_instr signal 342. Next, in step 1704, the control logic circuit 15 〇 2 will generate a true value at the decrement signal 384, and then the arithmetic unit 3 〇 4 will decrement the power - 〇以 68 68 c / m Ι 294 ^, 〇 396, will be decremented The value is output to signal 372. Control logic circuit 1502 causes multiplexer 316 to select this value via control signal 368 to load it into fp_offset register 322. Next, at decision step 1706, control logic circuit 1502 checks for under-signal 388 to determine if step 1704 decrements fp-offset signal 396 to cause the fp-offset register 322 to be under-asserted. That is, control logic circuit 1502 determines if the push command causes stack metric register 152 to point to the previous cache line. If so, the process proceeds to decision step 1716, otherwise it proceeds to decision step 17〇8. At decision step 1708, control logic circuit 1502 checks cacheJiit signal 1572 to determine if the target entity address 336 of the push command hits storage array 1504. If hit, the flow proceeds to step 1712, and if not, proceeds to step 1714. At step 1712, the data cache 1402 treats the current push command as a normal push command to hit the data cache 1402. That is to say, Data Cache 1402 will process this push command in the traditional way known in the field of data caching. Since the push action does not point to the previous cache line, there is no need to update the fp-row stack 1516 and the fp_way stack 1534; thus, the next pop-up program may specify the topmost record unit 1514 and fp_way of the fp_row stack 1516. Stacks the data of the cache line specified by the topmost record unit 1532 of the 1534. The process ends at step 1712. At step 1714, control logic circuit 1502 generates a true value at exception event signal 399, causing microprocessor 1400 to execute an exception event handler to update fp_row stack 1516 and fp_way stack 1534. In one implementation 69 I2945ld. - In the example, the exception handler will empty the fp_row stack 1516 and fp-way 1534 and load the correct data for the bits [5:2] of the stack indicator register 152 into the fp-0ffset register 322. Next, the flow proceeds to step 1726. At decision step 1716, control logic 15 〇 2 checks cache_hit afl number 1572 to determine if the target entity address 336 of the push instruction hits memory array 1504. If it is hit, the flow proceeds to step 718, otherwise it proceeds to step 1726. At step 1718, control logic 15 〇 2 will decide to hit the columns and columns of storage array 1504. The column hit is indicated by index 1548. The hit is indicated by the normal-way-select signal 1578. Control logic circuit 1502 will provide the hit column to fp_way heap 4: 1534 with new one way signal 1582. In addition, control logic circuit 1502 provides the hit column to fp-row stack 1516 with newjrow signal 1554. Next, at step 1722, control logic 1502 will generate a true value at push_row signal 1562 to push the value provided by new_row 1554 into fp_row stack 1516. Control logic circuit 1502 also generates a true value at push_way signal 1588 to push the value provided by new_way 1582 into fp_way stack 1534. Next, at step 1724, the data cache 1402 treats the current push command as a normal push command to hit the data cache 1402. That is, after updating the fp_row stack 1516 and the fp_way stack 1534 at step 1722, the data cache 1402 processes the push command in a conventional manner known in the art of data caching. The process ends at step Π24. ►c/m At step 1726, the control logic circuit 1502 determines the cache line that is replaced by the cache line 336 involved in the index 548 specified by the index 1548 in the storage array 1504. The data cache 1402 must be loaded now. In one embodiment, the control logic circuit b] selects the oldest unused block among the selected columns. The control logic circuit 1502 provides the replaced block to the fp-way stack 1534 via the new-way signal 1582. In addition, the control logic circuit 15〇2 provides the column selected by the index 1548 to the fp_r〇w stack 1516 through the new one row signal 1554. Next, in step 1728, the control logic circuit 15〇2 generates a true value in the push-row signal 1562 to push the new_r〇w signal 1554 long: the value supplied to the fp-row stack 1516. The control logic circuit 1 also generates a true value at the push_Way signal 1588 to push the value provided by the new_way signal 158a2 into the fp_way stack 1534. Next, at step 1732, the data cache 1402 treats the current push command as a normal push command to miss the data cache 1402. That is to say, j updates the fp_row stack 1516 and the parent one way stack 1534 in step 1728, and the data cache 14G2 will process the push command in the traditional way of taking the _ know. The process ends at step 1732. Referring to Figure 18, there is shown in Figure 18 that the microprocessor 1400 of Figure 14 processes (iv) an index plus age paste. The process begins at step 1802. In step 1802, the instruction translator 106 decodes the addition instruction destined for the heap indicator register 152 of FIG. 14, and the instruction scheduler outline 71 12945 is proud - the integer that sends this addition instruction to the execution unit 114 unit. The integer unit then generates a true value at add_spjnstr signal 352. Next, in step 1804, the control logic circuit 1502 generates a true value at the addition signal 382, and then the arithmetic unit 3〇4 adds the add_sp_vd signal 394 to the fp_〇ffset signal 396, and outputs the sum to the signal 372. The control logic circuit 1502 causes the multiplexer 316 to select the sum 'to load into the fp__offset register 322 via the control signal 368.

接下來，在決策步驟1806，控制邏輯電路15〇2會檢查溢位訊號392,以決定步驟1804的加法程序是否造成 fp一offset暫存器322溢位。也就是說，控制邏輯電路i5〇2 會決定加法指令是否會使堆疊指標暫存器152指向另一條，取線。在步驟1’，溢位狀況是指力σ法程序使得堆疊指枯暫存為152不再指向資料快取14〇2當中，电_r〇w堆疊 1山516的最頂端記錄單位1514與fp—way堆疊⑽的最$ 端記錄單位1532所指向的記錄單位存放的快取線。更明確 =說，如果加法造成溢位，堆疊指標暫存器152通常會指向記憶體位址鄰接於且大於資料快取14()2當中，丽，疊1516的最頂端記錄單位1514與胃堆疊i53—4的 f頂端。己錄單位1532所指向的記錄單位存放的快取線的 =止的的快取線。因此，fp—贿堆疊1516 堆疊出動作，使fp 一贿堆叠1516的最頂端記錄 I 4/、年一Way堆豐1534的最頂端記錄單位1532指幻吏存H施制邏輯電路1502會發口口 52 /庞位起過一條快取線的加法指 72 I29451^4twf.doc /m 令。在此實施例中，在接下來的步驟1808，fp一row堆疊 1516與fp一way堆疊1534彈出的記錄單位數量n是以下^ 方式計算，假設快取線的大小為64位元組·· N = (fp一offset + add—sp—val) / 64 因此，假如N大於1，表示有溢位發生，此時流程會進入步驟1808，否則流程結束。Next, at decision step 1806, control logic circuit 15 会 2 will check overflow signal 392 to determine if the add procedure of step 1804 caused the fp-offset register 322 to overflow. That is to say, the control logic circuit i5〇2 determines whether the addition instruction causes the stack indicator register 152 to point to the other line and take the line. In step 1 ', the overflow condition refers to the force σ method so that the stacking finger is temporarily stored as 152 and no longer points to the data cache 14 〇 2, the electric _r 〇 w stack 1 mountain 516 the topmost recording unit 1514 and fp — The cache line in which the highest unit of the stack (10) records the unit of record pointed to by 1532. More specifically = say, if the addition causes an overflow, the stack indicator register 152 will usually point to the memory address adjacent to and larger than the data cache 14 () 2, the topmost record unit 1514 of the stack 1516 and the stomach stack i53 - 4 f top. The cache line of the cache line stored in the recording unit pointed to by the recorded unit 1532 is the cache line of the stop line. Therefore, the fp-bribery stack 1516 stacks out the action, so that the fp bribe stacks the topmost record of the 1516 I 4 /, the first top record unit of the year 1ay Hays 1534 refers to the magic circuit H logic circuit 1502 will mouth Port 52 / Pang position has been added to a cache line 72 I29451 ^ 4twf.doc / m order. In this embodiment, in the next step 1808, the number of recording units popped up by the fp-row stack 1516 and the fp-way stack 1534 is calculated by the following method, assuming that the size of the cache line is 64 bytes. = (fp_offset + add_sp_val) / 64 Therefore, if N is greater than 1, it indicates that an overflow occurs, and the flow proceeds to step 1808, otherwise the flow ends.

在步驟1808，控制邏輯電路1502會在卿―贿訊號 1558產生真值，以彈出fp—row堆疊1516的最頂端記錄單^ 位，控制邏輯電路1502也會在p0p—way訊號1586產生真值，以彈出fp 一way堆疊1534的最頂端記錄單位。如同^ 驟1806的說明，在一實施例中，N的數值是由計算而來，而且fp一row堆疊1516與fP-Way堆疊1534各彈出N個記錄單位。流程在步驟1808結束。 °At step 1808, the control logic circuit 1502 generates a true value in the bristles 1558 to pop up the topmost recording unit of the fp-row stack 1516, and the control logic circuit 1502 also generates a true value at the p0p-way signal 1586. To pop up fp a way to stack the topmost record unit of 1534. As explained in step 1806, in one embodiment, the value of N is calculated, and the fp-row stack 1516 and the fP-Way stack 1534 each pop up N record units. The process ends at step 1808. °

明參照圖19，圖19為根據本發明的，從圖ι5的資料快取1402進行圖16的快速彈出程序的時序圖。圖19包括四個行’標示為1至4，對應於微處理機14〇〇的四個時脈週期。圖19也包括六個列，各自代表微處理機i铜的一 ”作，結果。® 19的每個行列交會處的方格，不是空白就是「彈出」，以指示彈出指令在微處理機14〇〇所在的位置。 ^ Τ 在時脈週期1，根據圖19的第一列，執行單元ii4# ::== P〇P-1讀訊號344產生真值，以請求彈出賣枓給弹出私令，就如圖16的步驟16〇2所示。在時脈週期2，根據第二列，列解碼器1506會將 73 I29451^4twf.doc/m fp_row訊號1556提供的列數值解碼，以在讀取訊號[N-l:〇] 1542的其中之一產生真值，如同圖16的步驟1604所示。接著儲存陣列1504會輸出真值的讀取訊號[N-1:0] 1542所選取的列的四欄當中，每個記錄單位的快取線、位址標籤、與狀悲’如同圖16的步驟1604所示。在時脈週期2，根據第三列，位址產生器306會計算虛擬位址334，如同圖16的步驟1614所示。在時脈週期3，根據第四列，多工器1528會選出 ® fp—way訊號1584指定的快取線1592,而且多工器318會選出剛才選出的快取線1592當中，fp_〇ffset訊號396指定的正確雙字組，如同圖16的步驟1604所示。在一實施 : 例中，選出的是快取線1592當中，實體位址336的低位位 . 元[5:2]所指定的雙字組。在時脈週期3，根據第五列，轉譯暫存區308會產生彈出指令的來源實體位址336，如同圖16的步驟1616所不〇肇在時脈週期4 ’根據第六列’控制邏輯電路合檢查fp—check訊號1544以決定前面進行的猜測彈出程序是否不正確，如同圖16的步驟1618至1624所示。在一實施例中，從圖15的資料快取1402執行載入指令的時序，類似於從圖1的非堆疊快取122執行載入指^ 的時序；因此，圖13也可以用來說明從資料快取以们曰^ 行載入指令的的時序。比較圖19與圖13可知，圖16的快速彈出程序使得資料快取1402提供資料給彈出指令，可以 74 c/m Ι294^^〇比不包含圖15的快速彈出裝置，而且不區分彈出與載入指令的傳統快取記憶體快上幾個時脈週期。曰在一實施例中，會使用虛擬位址334的位元[5:2]來選取雙子組，而不是用fp—〇ffset訊號396。雖然本發明與其目的、技術特徵、與優點已詳細說明如亡，本發明亦包含其他實施例。例如上述的堆疊快取或堆豐記憶體可以用各種方式實施，以達成具有後進先出功能的記憶體。其中-個實施例是功能為環狀先進先出記憶Referring to Figure 19, there is shown a timing diagram of the quick pop-up procedure of Figure 16 from the data cache 1402 of Figure 1 in accordance with the present invention. Figure 19 includes four rows ' labeled 1 through 4 corresponding to four clock cycles of the microprocessor 14A. Figure 19 also includes six columns, each representing a "micro" of the microprocessor i. As a result, the square of each row of intersections of the ® 19 is either blank or "popped" to indicate pop-up instructions at the microprocessor 14. The location of the 〇〇. ^ Τ In the clock cycle 1, according to the first column of Figure 19, the execution unit ii4#::== P〇P-1 read signal 344 generates a true value to request the pop-up to sell the private order, as shown in FIG. Steps 16〇2 are shown. In clock cycle 2, according to the second column, column decoder 1506 decodes the column values provided by 73 I29451^4twf.doc/m fp_row signal 1556 to generate one of the read signals [Nl:〇] 1542. The true value is as shown in step 1604 of FIG. Then, the storage array 1504 outputs a true value read signal [N-1:0] 1542 of the selected columns of the four columns, each record unit cache line, address label, and sorrow 'like the same as Figure 16 Step 1604 is shown. In clock cycle 2, according to the third column, address generator 306 calculates virtual address 334 as shown in step 1614 of FIG. In clock cycle 3, according to the fourth column, the multiplexer 1528 selects the cache line 1592 specified by the ® fp-way signal 1584, and the multiplexer 318 selects the cache line 1592 just selected, fp_〇ffset The correct double word specified by signal 396 is as shown in step 1604 of FIG. In an implementation: in the example, the lower word of the physical address 336, the double word group specified by the element [5:2], is selected in the cache line 1592. In clock cycle 3, according to the fifth column, the translation buffer 308 will generate the source entity address 336 of the pop-up instruction, as in step 1616 of Figure 16, which is not in the clock cycle 4 'based on the sixth column' control logic The circuit checks the fp-check signal 1544 to determine if the previously made guess pop-up procedure is incorrect, as shown in steps 1618 through 1624 of FIG. In one embodiment, the timing of the load instruction is executed from the data cache 1402 of FIG. 15, similar to the timing of executing the load pointer from the non-stack cache 122 of FIG. 1; therefore, FIG. 13 can also be used to illustrate The data cache takes the timing of the load instruction. Comparing FIG. 19 with FIG. 13, it can be seen that the quick pop-up program of FIG. 16 causes the data cache 1402 to provide data to the pop-up command, which can be 74 c/m Ι 294 ^ ^ 〇 does not include the quick pop-up device of FIG. 15 , and does not distinguish between pop-up and load The traditional cache memory that enters the instruction is on several clock cycles.一 In one embodiment, the bits [5:2] of the virtual address 334 are used to select the dual subgroup instead of the fp_〇ffset signal 396. Although the present invention has been described in detail with reference to its purpose, technical features, and advantages, the present invention includes other embodiments. For example, the stacked cache or stack memory described above can be implemented in a variety of ways to achieve a memory with a last in first out function. One of the embodiments is a function of circular FIFO memory.

^ircular FIFO memory)的暫存器组，有堆疊頂端與底S 指標、，以決定下次要推入或彈出哪個記錄單位，以及堆疊何時清空。再者，雖然前面的實施例皆以χ86架構的指令為主，堆疊是往記憶體位址逐漸減少的方向成長，本發明也可以應用於，堆疊指令會使堆疊往記憶體位址逐漸增加的方向成長的微處理機。此外，雖然以上的實施例只揭露一種快取線的大小，其他大小不同的快取線也可以用於本發明。一另外，雖然本發明與其目的、技術特徵、與優點已詳細谠明如上，本發明亦包含其他實施例。除了以硬體實施之外，本發明也能以電腦可接受（例如可讀取）的媒體所儲存的數碼來實施（例如用電腦程式碼與資料等方式）。上述的電腦數碼可實現本發明的功能或製造，或兩者皆有。例如實現方式可以是一般程式語言（例如C 、C++、 JAVA ’諸如此類）；GDSII資料庫；包括VerilogHDL、 VHDL、Altera HDL (AHDL)在内的硬體描述語言（hard ware 75 f.doc/m description languages，即HDL)，諸如此類；或相關技術領域的其他程式與/或電路設計工具。上述的電腦數碼可存放於電腦可接受（例如可讀取）的任何已知媒體，包括半導體記憶體、磁碟(magnetic disk)、光碟（optical disk，例如CD-ROM與DVD-ROM，諸如此類），也能以電腦資料訊號的形式，内含於電腦可接受（例如可讀取）的傳輸媒體，例如載波(carrier wave)或任何其他媒體，包括數位、癱光學、與類比式媒體。因此上述的電腦數碼可透過通訊網路傳遞，包括網際網路(Internet)與内部網路(intranet)。本發明也能以智慧財產（intellectual property，即IP)核心，例如微處理機核心，的電腦數碼（例如為其中一部分）實 : 知，或以系統級設計，例如單晶片系統（System on Chip， • 即s〇c)實施，並且於積體電路的製程中轉換為硬體。此 — 外，本發明也能以硬體與電腦數碼的組合方式實施。最後，熟習本發明相關技術領域者應能以此處揭露的概念與實施例為基礎，輕易設計或修改其他結構，以實現 • 與本發明相同之目的，而不背離附加於後的申請專利範圍所界定的本發明的精神與範圍。【圖式簡單說明】圖1為根據本發明的管線式微處理機的方塊圖。圖2為根據本發明，繪示圖丨的堆疊快取的方塊圖。圖3為根據本發明，繪示圖丨的堆疊快取的額外元件的方塊圖。圖4為根據本發明，繪示圖丨的第一級資料快取的多 76 1294織 4twf.doc/m 工選取邏輯電路的方塊圖。圖5為根據本發明，綠示田彈出程序的流程圖。 127、豐快取進行快速的流程圖圖6為根據本發明，進行推人程序至圖丨的堆叠快取加法圖本發明，繪示圖丨微處理機執行堆叠指標程序的圖：:據本發明的，緣示從圖1堆叠快取進行載入資料發明的’進行存放程序至圖丨的第-級快速:二:=明’從圖1的權取進行圖5的入程===本伽，從堆4快取進行心的猜測載圖12為根據本發明田入，即非猜測载人程序的時序圖豐快取進行圖8的正常載程序=2_本發明’從非堆疊快取進行圖8的載人 ===的:線的方塊圖。方塊圖。杨圖14的第-級資料快取的圖16為根據本發明出程序的流程圖。貧料快取進行快速彈 77 1294拍贫—加圖17為根據本發明，進行推入程序至圖ls的資料快取的流程圖。圖18為根據本發明，緣示圖的微處理機執行堆疊指標加法指令的流程圖。圖19為根據本發明，從圖15的資料快取進行圖16 的快速彈出程序的時序圖。【主要元件符號說明】 0〜3 儲存陣列的攔編號 1〜6 時脈週期 100 管線式微處理機 102 指令快取 104 指令抓取器 106 指令轉譯器 108 指令排程器 112 暫存器組 114 執行單元 116 寫回單元 118 匯流排介面單元 122 非堆疊快取 124 堆疊快取 126 第一級資料快取 128 微程式碼記憶體 132 微處理機匯流排 134、136、138、142 ··資料訊號 78 /m 1294¾^ 152 ··堆疊指標暫存器 202 ··快取狀態 204 :位址標籤 206 :快取線資料 212〜216、222〜226、232、234 :資料訊號 302 :控制邏輯電路 304 :算術單元 306 :位址產生器 308 ··轉譯暫存區 312、314 :比較器 316、318 :多工器 322 : fp—offset 暫存器 324 :寫回線緩衝區 326 :多工器 328 :資料訊號 332 :運算元 334 ··虛擬位址 336 :實體位址 338〜372 :資料訊號 382 :加法訊號 384 ··遞減訊號 386 :遞增訊號 388 ··欠位訊號 389〜391 ··資料訊號 79 1294歡 f.doc/m 392 :溢位訊號 393〜398 :資料訊號 399 :例外事件訊號 402〜412 :多工器 422〜432 ··資料訊號 502 :解碼並發出彈出指令 504 :猜測提供彈出資料 506 :遞增 fp offset 508 :溢位？ 512 :堆疊快取彈出 514 :計算虛擬位址 516 :自轉譯暫存區查詢 518 :比較實體位址與堆疊快取最頂端的記錄單位的實體位址標籤 522 :擊中？ 524 :產生例外事件以修正錯誤 602 :解碼並發出推入指令 604 :遞減 fp—offset 606 :計算虛擬位址 608 :自轉譯暫存區查詢 612:比較實體位址與堆疊快取最頂端的記錄單位的實體位址標籤 614 :擊中？ 616 :存放資料於堆疊快取最頂端的記錄單位 1294歡 f.doc/m 618 :堆疊快取最底下的記錄單位為有效？ 622 :排程寫回堆疊快取最底下的記錄單位 624 :將新的資料、位址標籤與狀態推入堆疊快取頂 626 :配置裝填暫存區 628 :將接收到的快取線併入堆疊快取的記錄單位 702 :解碼並發出堆疊指標加法指令 704 : fp—〇ffset = fp—0ffset + vaiUe^ircular FIFO memory) has a stack of top and bottom S metrics to determine which record unit to push or pop next time, and when the stack is emptied. Furthermore, although the foregoing embodiments are mainly based on the instructions of the χ86 architecture, the stack is grown toward the direction in which the memory address is gradually reduced, and the present invention can also be applied to, the stacking command causes the stack to grow in the direction in which the memory address is gradually increased. Microprocessor. Moreover, while the above embodiments only disclose the size of a cache line, other cache lines of different sizes may be used in the present invention. In addition, the present invention has been described in detail with reference to the appended claims. In addition to being implemented in hardware, the present invention can also be implemented in a computer-readable (e.g., readable) medium (e.g., by computer code and data). The above computer digital can implement the function or manufacture of the present invention, or both. For example, the implementation can be a general programming language (such as C, C++, JAVA ', etc.); GDSII database; hardware description language including VerilogHDL, VHDL, Altera HDL (AHDL) (hardware 75 f.doc/m description Languages, ie HDL), and the like; or other programming and/or circuit design tools in the related art. The above computer digital data can be stored in any known medium that is acceptable (eg, readable) by a computer, including semiconductor memory, magnetic disk, optical disk (such as CD-ROM and DVD-ROM, and the like). It can also be embodied in a computer data signal, such as a carrier wave or any other medium, including digital, optical, and analog media. Therefore, the above computer digital can be transmitted through the communication network, including the Internet and the intranet. The invention can also be implemented as an intellectual property (i.e., IP) core, such as a microprocessor core, such as a computer digital (for example, part of it), or at a system level, such as a system on chip (System on Chip, • That is, s〇c) is implemented and converted to hardware in the process of the integrated circuit. In addition, the present invention can also be implemented in a combination of hardware and computer digital. Finally, those skilled in the art of the present invention should be able to easily design or modify other structures based on the concepts and embodiments disclosed herein to achieve the same objectives as the present invention without departing from the scope of the appended claims. The spirit and scope of the invention as defined. BRIEF DESCRIPTION OF THE DRAWINGS Figure 1 is a block diagram of a pipeline microprocessor in accordance with the present invention. 2 is a block diagram of a stack cache of the diagrams in accordance with the present invention. 3 is a block diagram showing additional components of a stacked cache of the drawing in accordance with the present invention. 4 is a block diagram showing a plurality of 76 1294 woven 4 twf.doc/m selection logic circuits of the first level data cache of the figure 根据 according to the present invention. Figure 5 is a flow diagram of a green field pop-up procedure in accordance with the present invention. 127. Fast Flowchart for Fast Flowchart FIG. 6 is a diagram of a stacking cache adder diagram for performing a push procedure to a map according to the present invention, and showing a diagram of a stacking index program executed by the graph microprocessor: According to the invention, the reason for loading the data from the stacking cache of Fig. 1 is to carry out the storage procedure to the level of the map-level fast: two: = Ming' from the weight of Fig. 1 to proceed to the entry of Fig. 5 === Benga, guessing from the heap 4 cache. Figure 12 is a timing diagram of the non-guessing manned program according to the present invention. The normal load procedure of Figure 8 is performed. The cache is a block diagram of the line of the manned === of Figure 8. Block diagram. The first level data cache of Yang Figure 14 is a flow chart of the program according to the present invention. The poor material cache is fast-moving. 77 1294 is poor-plus. Figure 17 is a flow chart of the data flow from the push-in procedure to Figure ls in accordance with the present invention. Figure 18 is a flow diagram of a microprocessor executing a stacked index addition instruction in accordance with the present invention. Figure 19 is a timing diagram showing the quick pop-up procedure of Figure 16 from the data cache of Figure 15 in accordance with the present invention. [Main component symbol description] 0~3 Storage array block number 1~6 Clock cycle 100 Pipeline microprocessor 102 Instruction cache 104 Instruction fetcher 106 Instruction translator 108 Instruction scheduler 112 Register group 114 Execution Unit 116 write back unit 118 bus interface unit 122 non-stack cache 124 stack cache 126 first level data cache 128 micro code memory 132 microprocessor bus 134, 136, 138, 142 · · data signal 78 /m 12943⁄4^ 152 ··Stacking index register 202 ··Cache state 204: Address tag 206: Cache line data 212~216, 222~226, 232, 234: Data signal 302: Control logic circuit 304: Arithmetic unit 306: address generator 308 · translation temporary storage area 312, 314: comparator 316, 318: multiplexer 322: fp_offset register 324: write back line buffer 326: multiplexer 328: data Signal 332: Operational unit 334 · Virtual address 336: Physical address 338~372: Data signal 382: Addition signal 384 ··Decrement signal 386: Increment signal 388 ··Under signal 389~391 ··Data signal 79 1294 Huan f.doc/m 3 92: overflow signal 393~398: data signal 399: exception event signal 402~412: multiplexer 422~432 ··data signal 502: decode and issue pop-up instruction 504: guess provides pop-up data 506: increment fp offset 508: Overflow? 512: Stack Cache Popup 514: Calculate Virtual Address 516: Self-Translating Scratchpad Query 518: Compare Physical Address with Physical Address Label of the Top Recording Unit of Stacked Cache 522: Hit? 524: Exception event is generated to correct error 602: Decode and issue push command 604: Decrement fp_offset 606: Calculate virtual address 608: Self-translating buffer area query 612: Compare physical address with stack topmost record Unit Physical Address Label 614: Hit? 616: Store the data at the top of the stack cache. 1294 Huan f.doc/m 618: The lowest record unit of the stack cache is valid? 622: Schedule write back to the bottom of the stack cache unit 624: Push new data, address label and status into the stack cache top 626: Configure the fill buffer area 628: Incorporate the received cache line Stacking cached record unit 702: Decode and issue a stack indicator addition instruction 704: fp - 〇 ffset = fp - 0ffset + vaiUe

706 :溢位？ 708 :資料快取彈出，並排程寫回彈出的有效快取線 802 ·解碼並發出載入指令 ' 804 :計算虛擬位址快取最頂端的兩個虛擬位 806 :比較虛擬位址與堆疊址標籤 808 :擊中？706: Overflow? 708: The data cache pops up, and the program writes back the valid cache line 802 that is popped up. · Decodes and issues the load instruction. 804: Calculates the top two virtual bits 806 of the virtual address cache: Compare the virtual address with the stack address. Tag 808: Hit?

料812 ··從有效且符合的堆疊快取記錄單位猜測提供資 814:自轉譯暫存區查詢快取記錄單 816 ··比較實體位址與有效且符合的堆疊位的實體位址標籤 ^ 818 :擊中？ 822:產生例外事件以修正錯誤 824 :自轉譯暫存區查詢 826 :比較實體位址與所有堆疊快取記錄單位的實體 1294》組w— 位址標籤 828 :擊中？ 832 :從有效讀合的堆疊快取記錄單位提斜 834:在非堆疊快取進行列解碼以選取集人/、、枓 836 :比較實體位址與受選取的集合當中口體位址標籤攔的實 838 :擊中？ 842 :從有效且符合的麵疊錄供 844:配置記錄單位早位徒供貝枓 846 :將錯過的快取線载入配置的記錄單位 848 :從非堆疊快取提供資料 902 :解碼並發出存放指令 904 :計算虛擬位址 906 :自轉譯暫存區查詢 _:比較實體位址與所有堆疊快取位址標籤只體 912 :擊中？ 914 ·存放資料至有效且符合的堆疊快取記錄單位 916:比較實體位址與非堆疊快取當中受選取的集合的每一欄的實體位址標籤一口 918 ··擊中？ 922 ·存放資料至有效且符合的非堆疊快取記錄單位 924 :於非堆疊快取配置記錄單位 926 :將錯過的快取線載入配置於非堆疊快取的記錄 82 1294棚— 單位 928 :存放資料於非堆疊快取 1400 :管線式微處理機 1402 :第一級資料快取 1502:控制邏輯電路 1504 :儲存陣列 1506 :列解碼器 1508 :檢查邏輯電路 1512 :多工器 1514 : fp_row堆疊頂端 1516 : fp row 堆疊 1524 :欄選取產生器 1526、1528 :多工器 1532 : fp_way堆疊頂端 1534 : fp_way 堆疊 1542 :讀取訊號[N_1:0] 1544 :資料訊號 1546 ··標籤 1548 :索引 1552 ··列訊號 1554〜1558、1562〜1564、1572 :資料訊號 1574 :位址標籤 1576 :有效位元 1578、1582〜1586、1588、1592〜1596 :資料訊號 83 1294總 ^twf.doc/m 1602 :解碼並發出彈出指令 1604 :猜測提供彈出資料 1606 :遞增 fp—offset 1608 :溢位？ 1612 : fp_row堆疊與fp way堆疊彈出 1614 :計算虛擬位址 1616 :自轉譯暫存區查詢 1618 ··比較實體位址與實體位址標籤 1622 :擊中？ 1624 :產生例外事件以修正錯誤 1702 :解碼並發出推入指令 1704 :遞減 fp—offset 1706 :欠位？ 1708 :擊中？ 1712 :視為擊中的一般推入指令處理 1714 ··產生例外事件以更新fp_row堆疊與fp_way堆疊 1716 :擊中？ 1718 :決定擊中的列與攔 1722 :將擊中的列推入fp_row堆疊，將擊中的欄推入fp一way堆疊 1724 :視為擊中的一般推入指令處理 1726 :在受選取的列當中決定取代的欄 1728 :將取代的列推入fp_row堆疊，將取代的欄推 84 I294^twf.doc/m 入fp—way堆疊 1732 ··視為錯過的一般推入指令處理 1802 :解碼並發出堆疊指標加法指令 1804 : fp—offset 二 fp—offset + value 1806 :溢位？ 1808 : fp row堆疊與fp way堆疊彈出 85Material 812 · · from the valid and consistent stack cache record unit guess provider 814: self-translating temporary storage area query cache record 816 · · compare physical address with valid and consistent stack bit physical address label ^ 818 :hit? 822: Exception event is generated to correct the error 824: Self-translating buffer area query 826: Entity entity address and entity of all stacked cache record units 1294 Group w- address label 828: hit? 832: Lifting the recording unit from the valid read-and-kick line 834: performing column decoding on the non-stack cache to select the set person/, 枓 836: comparing the physical address with the selected address set of the mouth address tag Real 838: Hit? 842: From the effective and consistent face stacking 844: configuration record unit early for the Bellow 846: loading the missed cache line into the configured record unit 848: providing data from the non-stack cache 902: decoding and issuing Store instruction 904: Calculate virtual address 906: Self-translating temporary area query _: Compare physical address with all stacked cached address tags only body 912: hit? 914 · Store data to a valid and consistent stack cache record unit 916: Compare the entity address with the physical address label of each column of the selected set among the non-stack caches. 918 ··Hit? 922 · Store data to a valid and consistent non-stack cache unit 924: In non-stack cache configuration record unit 926: Load missed cache line into a non-stack cache record 82 1294 shed - Unit 928: Storing data on non-stack cache 1400: pipelined microprocessor 1402: first level data cache 1502: control logic circuit 1504: storage array 1506: column decoder 1508: check logic circuit 1512: multiplexer 1514: fp_row stack top 1516 : fp row stack 1524 : column selection generator 1526, 1528 : multiplexer 1532 : fp_way stack top 1534 : fp_way stack 1542 : read signal [N_1:0] 1544 : data signal 1546 · · tag 1548 : index 1552 · · Column signal 1554~1558, 1562~1564, 1572: Data signal 1574: Address label 1576: Valid bits 1578, 1582~1586, 1588, 1592~1596: Data signal 83 1294 Total ^twf.doc/m 1602 : Decode and issue a pop-up instruction 1604: Guess provides pop-up data 1606: increment fp-offset 1608: overflow? 1612 : fp_row stacking and fp way stack popping 1614: Calculating virtual address 1616: Self-translating buffer area query 1618 · Comparing entity address with physical address label 1622: Hit? 1624: Exception event is generated to correct the error 1702: Decode and issue the push instruction 1704: Decrement fp_offset 1706: Under? 1708: Hit? 1712: General push instruction processing as a hit 1714 · Generate exception events to update fp_row stack with fp_way stack 1716: Hit? 1718: Decide on the hit column and block 1722: Push the hit column into the fp_row stack, push the hit bar into the fp-way stack 1724: see the general push command hit 1726: in the selected Column 1728 in the column is determined to be substituted: push the replaced column into the fp_row stack, push the replaced column 84 I294^twf.doc/m into the fp-way stack 1732 · see the missed general push instruction processing 1802: decode And issue the stack indicator addition instruction 1804: fp-offset two fp-offset + value 1806: overflow? 1808 : fp row stacking and fp way stacking popup 85

Claims

12943⁄4^ twf.doc/m X. Patent application scope: 1. A fast pop-up device for random access memory, including · a first-in first-out memory, storing a plurality of column values y after the first memory The topmost recording unit includes a storage-reciting value; a multiplexer, including: a column value first-input terminal, receiving the latest n-person end receiving access machine access cache from the topmost recording unit A selected part of a memory address of a memory of a memory; the two ends of the memory are selected by a value to select the input terminal, the type of the instruction is specified, and the type of the input finger (four) is popped up. Command, the multi-guard is selected to provide a rain end to provide at the output. , μ input 2 · as described in the scope of the patent application, i = speed = set 'if the selected wheel input end of the specified type is 2 曰 7, then the multiplexer selects the second input 3. Random as described in item 1 of the patent application: Take: Out. a quick pop-up device, wherein if the selection ===body=out command' then the more than one selects the second round of the person's end "provided in the == as in the patent application scope item r-speed ejecting device, which is stored in The column values each include a parent of the target address of the push command. 86 1294 欢 f.doc/m A push-in system that has not been ejected from the stack memory is coupled to the left and A~ And 4 piles should be sU 12. If Shen = 1! · Feng Jilong - microprocessor. The body's quick pop-up month wears the random access cache memory described in item 9 of the track, and hits the random "take = two: u:=: the object in the memory ^:::^ ... Exposure and storage of the information: = in the package of a paste, its patent scope of the item mentioned in the item of the quick pop-up device, including: Determine the first second after the first tB memory 'storage most of her value = In-first-out memory contains storage - 1 top of the latest block value = record [ 15. As described in the patent application scope item 14 of the rapid bomb (four) set, including: (4) take δ 隐 hidden one second The device includes: the first first input end of the memory, receiving the latest block value from the second last in first out recording unit; and a second input receiving a column selecting a value of the output end, providing a value to select a block of the memory; and a type of the instruction to select the input terminal to specify access to the random access cache 88 1294$, if the order is Two more work crying selection ^ input ^ day type is pop-up finger 16 such as Shen-I II - the input end to provide The output end. I6. For example, the quick pop-up device of the body described in the patent application of the 帛专利帛帛帛帛帛帛帛戍戍戍戍戍戍戍戍戍戍戍戍戍选取选取选取选取选取选取选取选取选取选取选取选取选取选取选取选取The type of the input pointer is a load instruction, and the first is provided at the input end. The location of the random access cache bar according to item 16 of the monthly input range of the input body, wherein the block The selected value contains a block, and the point is here. 'Dai is equal to a tag part of the source address of the load instruction is. For example, please refer to the random access cache memory pop-up device described in item (4), wherein If the selected input end finger of the second multiplexer is not a pop-up command, the second multiplexer selects the second input port to provide the output terminal. 駚^9·If the patent scope is 15th The random access cache memory - one, one, pop-up device, wherein the second last-in first-out memory stores a number of the block values respectively designated in the random access cache memory, save push A block of the target data of the instruction. The cache memory value is divided into: wherein the output number of the second second multiplexer is listed with the block, and the random access cache memory is selected as one of the cache lines to supply the command, wherein the The cache line contains the source data of the capture. 0 2!·For example, the random access cache memory described in the second paragraph of the patent scope of the application

12945 — 体快速 , , , , , , 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 945 945 945 945 945 945 945 945 945 945 945 945 945 945 945 945 945 945 945 945 945 945 945 945 945 The specified cache line. If you have eaten your body, you will take the column 22. If you apply for a patent, Fan Yidi, the body's quick pop-up device, including:, \^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Cache memory as a towel, the output value specified by the work = the value of the column - the cache line in the cache line (four) 22 pin to access the cache memory access device, where the displacement value will Respond to - pop up refers to people and increments. 7 24. For example, in the fast pop-up device of the random access cache according to claim 23, wherein the increment of the Wei displacement value causes the displacement value to overflow, the random access cache memory is from the The last column value of the last in first out miscellaneous. </ RTI> </ RTI> <RTIgt; </ RTI> <RTIgt; </ RTI> <RTIgt; </ RTI> <RTIgt; </ RTI> <RTIgt; </ RTI> <RTIgt; 26. The fast pop-up device for random access memory according to the scope of the patent application, further comprising: " a comparison logic circuit coupled to the topmost recording unit to receive the latest column value 'and Comparing the latest column value with a portion of the memory address 'where the 6H address contains a source data address of the pop-up instruction. I294^§twf.doc/m 27· The quick pop-up device of the portable body described in claim 26 of the patent application further includes: a ton machine access cache memory exception event wheel & end, (4) 比较 the new column The value does not match the 2 way recorded in view (4) in the most recent condition. T彳日#2························································································ A computer can use the media, and the bean type in the bean is used by the 耘 code calculation device. /, in the middle of the dip-type product a match - the signal is provided 'and the computer data signal contains 3 〇 · - a random pop-up memory quick pop-up method, the column steps: imitation of a push-in command data In the random access cache memory, a column of values specified by a column; after the storage, the field pushes the column value to a topmost recording unit of a last-in first-out memory; and after the pushing, A request is received to read the random access cache, wherein the request specifies a request category. 31. The fast pop-up method of the random access cache memory twf.doc/m I2945^4 body as described in claim 30, further includes one of the following steps: if the request type is a pop-up instruction, Reading the random access cache memory according to the column value stored in the topmost recording unit of the last in first out memory; and if the request type is a load instruction, a memory address specified according to the request Read the random access cache memory. 32. The fast pop-up method of the random access cache according to claim 30, wherein the + file storing the push instruction includes:, v^ storing the data of the push instruction to One of the columns of the random access cache memory ^, wherein the barrier is specified by a block value. 33. The fast pop-up method of the random access body as described in claim 32 of the patent application further includes the following steps: After entering the memory of the daily memory, the first top of the first incoming and outgoing memory is blocked. Recording unit 34. A fast pop-up method for a random access fast body as described in claim 33, wherein the first and the same are the same as the latter. The quick pop-up method of the 33rd 顼申请申请 , , , , , , , , , 申请 : 申请申请申请申请申请申请申请申请申请申请申请快速快速快速快速快速快速快速快速快速快速快速快速快速快速快速快速快速快速快速快速快速Randomly stored 36. The random pop-up memory of the random access cache 92 12945, as described in claim 31, provides a quick pop-up method for the μ·-/πι body, and further includes the following steps: Deciding to read the silk picking with the age (four) ❺ 槪Value, ^ Section lacks general (4) = used ^ The data specified by the pop-up command - block. A body, although stored 37. As claimed in the patent scope of the Honghong body, the method of rapid pop-up, which determines The steps further include: ', take the memory to compare the source of the pop-up instruction - source It is expected that her machine accesses the cache to provide the record - the record is selected according to the block value. The value of the armor iron is 38. The quick pop-up method of the body described in the third paragraph of the patent application includes the following steps. : Transfer the cache memory to the storage step and decrement a displacement value, & access a cache line of the cache memory. The material level: refers to: the middle cache line is located in the random memory, set: , the specified column. The value of the column is 39. For example, the pop-up method of the syllabus of the patent application scope includes the following steps: _ access cache memory storage scale (four) order, job _ value The method for reading the fast pop-up method of the body described in item 31 further includes the following steps: _ sparing, accessing the value for the reading step, m specifying the external machine accessing the cache memory, storing Today (4) 疋机 41 41. If you apply for a list of data specified by patent rH. 随机 40 items of random access cache memory 93 I294W, ^ f.doc / im a quick catch eject device, the computer can read The program mom includes: a majority of the _ the first top record unit; and 匕Store - the latest column value - a second code, providing a multiplexer, including · the latest column value; - the first - the input terminal receives the memory from the topmost record unit - the instruction: gas: body: Two of the memories are provided with a value of 2 to select the random access cache input terminal to specify; if the input terminal is selected to be provided at the round end. To select the first 95