TWI317065B - Method of accessing cache memory for parallel processing processors - Google Patents

Method of accessing cache memory for parallel processing processors

Info

Publication number
TWI317065B
TWI317065B TW95129660A
Authority
TW
Taiwan
Prior art keywords
cache
processor
accessing
data
parallel processing
Prior art date
Application number
TW95129660A
Other languages
Chinese (zh)
Other versions
TW200809502A (en)
Inventor
Shiwu Lo
Original Assignee
Shiwu Lo
Priority date
Filing date
Publication date
Application filed by Shiwu Lo filed Critical Shiwu Lo
Priority to TW95129660A priority Critical patent/TWI317065B/en
Publication of TW200809502A publication Critical patent/TW200809502A/en
Application granted granted Critical
Publication of TWI317065B publication Critical patent/TWI317065B/en

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Description

IX. Description of the Invention

[Technical Field]

The present invention relates to a cache memory design method, and in particular to a method of accessing cache memory data that is suitable for parallel processing processors.

[Prior Art]

Over the past few years, continuous advances in processor fabrication technology have markedly increased processor computation speed, while memory access speed has not improved to the same degree, so the speed gap between the processor and the memory keeps widening.

To address this gap, modern processors commonly use a hierarchical memory design to overcome the difference in access speed. A hierarchical memory design exploits the temporal locality and spatial locality of memory data to raise the speed at which the cache supplies data, thereby further improving processor performance. Because the cache is memory that is dynamically allocated by hardware, program execution time is highly correlated with the cache hit rate.

When SMT (simultaneous multithreading) / CMP (chip multiprocessor) technology is applied to a processor, the execution time of a program running on that processor is affected by the other programs running together with it. Although a cache partition can eliminate the mutual interference between programs running in parallel on the same physical processor, it does not allow the cache to be shared, so cache utilization still cannot be raised effectively.

Although this problem can be overcome by dynamically adjusting the size of the sub-cache dedicated to each mini-processor, such dynamic adjustment methods usually modify the replacement algorithm to resize the sub-caches. However, when only a small number of cache misses occur in the system, changing the size of a cache partition takes a considerable amount of time and produces latency; for programs with quality-of-service or timing constraints, this latency degrades service quality or causes deadline misses, so overall system performance still cannot be improved effectively.

For the system designer, therefore, locality has three kinds of impact.

First, the worst-case execution time (WCET) becomes difficult to estimate, because it depends on the cache hit rate. In program design, the WCET strongly affects predictions of how long the whole system will run, and accurately predicting the cache hit rate becomes a major challenge of WCET analysis. Moreover, since WCET estimation is the foundation of many embedded and real-time systems, a WCET that is hard to estimate makes software design difficult.

Second, instruction-level parallel programs may cause thrashing: when a large number of cache misses occur in the processor's level-one cache, that is, when the cache hit rate is low, the number of instructions the processor can complete per second drops abruptly. When the working sets of different programs executing in parallel map to the same group of cache lines, the programs keep evicting each other's working sets, thrashing occurs, and system performance suffers.

Third, designing the operating system scheduler becomes harder. Operating systems generally assume that the processors (including the simple processors and logical processors inside CMP/SMT processors) can use all hardware resources fairly and without affecting one another. Without a cache restriction, the same program will behave differently when it runs in parallel with a memory-intensive program than when it runs in parallel with a CPU-intensive program; this variability increases the difficulty of operating system design.

[Summary of the Invention]

One object of the present invention is to provide a method of accessing cache memory data that is suitable for parallel processing processors, so as to raise cache memory utilization effectively.

Another object of the present invention is to provide a method of accessing cache memory data that is suitable for parallel processing processors and allows the worst-case execution time (WCET) of the processor to be estimated more precisely.

A method of the present invention for accessing cache memory data in a parallel processing processor comprises:

providing a processor and a lower-level memory unit, the processor using a plurality of instruction processing elements and an upper-level memory unit, such as a first-level cache, which contains a plurality of sub-caches, each instruction processing element corresponding to one of the sub-caches;

using a first instruction processing element to access a predetermined datum in the first sub-cache corresponding to it;

when the predetermined datum is not found in the first sub-cache, accessing the other sub-caches until the predetermined datum is found in a second sub-cache different from the first;

when the predetermined datum is not found in the second sub-cache either, accessing the lower-level memory unit until the predetermined datum is found; and

when a cache miss occurs, loading the predetermined datum from the lower-level memory unit into one of the sub-caches in a predetermined order, namely:

first, into a dead cache line of the first sub-cache, that is, a line that has been declared unlikely to be used again;

second, into a dead cache line of one of the other sub-caches; and

third, into a position of the first sub-cache chosen according to a predetermined rule. The predetermined rule may be a first-in-first-out (FIFO) method, a random (RANDOM) method, a least-recently-used (LRU) replacement method, and so on.

The present invention has the following advantages:

1. The worst-case execution time (WCET) can be estimated by restricting each instruction processing element to the sub-cache corresponding to it.

2. The present invention does not need to dynamically partition the capacity of the first-level cache, so no additional latency is incurred, and the first-level cache is shared more efficiently, which shortens the response time of the instruction processing elements.

3. The most recently used (MRU) data of each instruction processing element can be kept in its corresponding sub-cache, which raises the probability of a cache hit.

4. The present invention avoids cache thrashing.
[Embodiments]

Referring to Fig. 1, which is a schematic diagram of a system architecture according to an embodiment of the present invention, the method of this embodiment for accessing cache memory data in a parallel processing processor comprises providing a processor 100 and a lower-level memory unit 200. The processor 100 may be an SMT/CMP (simultaneous multithreading / chip multiprocessor) processor and has a plurality of instruction processing elements 101 and an upper-level memory unit such as a first-level cache (L1 cache) 102. An instruction processing element 101 may be a mini processor; these mini processors form the cores of the physical processor 100. An instruction processing element 101 may also be a simple processor, a logical processor (a virtual processor), an instruction-executing program, and so on.

The lower-level memory unit 200 is defined relative to the upper-level memory unit: when the upper-level memory unit is a first-level cache, the lower-level memory unit 200 may be a second-level cache (L2 cache), a third-level cache (L3 cache), and so on, or the main memory.

The first-level cache 102 is divided by partitioning into a plurality of sub-caches 103, and each instruction processing element 101 corresponds to one of the sub-caches 103; that is, the i-th instruction processing element 101 has a corresponding i-th sub-cache 103 to which it has priority access.
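To make the architecture of Fig. 1 concrete, the following minimal sketch models a first-level cache statically partitioned into per-element sub-caches. It is an illustrative sketch only and is not taken from the patent; the type names, the number of instruction processing elements, and the line and sub-cache sizes are assumptions chosen for the example.

```c
/*
 * Illustrative sketch (not from the patent): one possible in-memory model of
 * a first-level cache partitioned into per-element sub-caches, as in Fig. 1.
 * All names and sizes are assumptions made for this example.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_ELEMENTS   2    /* instruction processing elements 101 (assumed) */
#define LINES_PER_SUB  8    /* cache lines per sub-cache 103 (assumed)       */
#define LINE_SIZE      32   /* bytes per cache line (assumed)                */

typedef struct {
    bool     valid;
    bool     dead;          /* line declared unlikely to be used again       */
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
} cache_line_t;

typedef struct {
    cache_line_t line[LINES_PER_SUB];   /* one partition of the L1 cache 102 */
} sub_cache_t;

typedef struct {
    /* The i-th instruction processing element has priority access to sub[i]. */
    sub_cache_t sub[NUM_ELEMENTS];
} l1_cache_t;

int main(void) {
    l1_cache_t l1 = {0};
    printf("L1 cache: %d sub-caches x %d lines x %d bytes\n",
           NUM_ELEMENTS, LINES_PER_SUB, LINE_SIZE);
    printf("element 0 -> sub-cache 0, first line valid? %d\n",
           (int)l1.sub[0].line[0].valid);
    return 0;
}
```

With such a layout, lookups by the i-th element always begin in l1.sub[i], which is what gives each element the priority region described above.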
Referring further to Fig. 2, which is a flowchart of a preferred embodiment of the present invention: for ease of explanation the preferred embodiment is described with two instruction processing elements 101, but the invention extends to the use of more. When the i-th instruction processing element 101 accesses its corresponding i-th sub-cache 103 (step 300), it is determined whether the i-th sub-cache 103 holds a predetermined datum (step 301). If the predetermined datum can be accessed in the i-th sub-cache 103, the i-th sub-cache 103 returns the result of the access (step 302).

If the predetermined datum is not found in the i-th sub-cache 103, the i-th instruction processing element 101 accesses a j-th sub-cache 103a (step 303), and it is determined whether the j-th sub-cache 103a holds the predetermined datum (step 304). If it does, the result is returned (step 305), where i is not equal to j; that is, the i-th instruction processing element 101 searches the sub-caches other than the i-th sub-cache 103 until the predetermined datum is found in the j-th sub-cache 103a.

If the predetermined datum is not found in the j-th sub-cache 103a either, that is, if none of the sub-caches holds it, the i-th instruction processing element 101 accesses the lower-level memory unit 200 until the predetermined datum is found, and the result is then returned (step 307).

When the i-th instruction processing element 101 finds the predetermined datum in the j-th sub-cache 103a or in the lower-level memory unit 200, a swap step (step 306) may further be used to load the predetermined datum into the first-level cache 102; this raises the probability that the next access hits on the first try and improves the access efficiency of the first-level cache 102.
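The lookup path of steps 300 to 307 can be summarized with a short software model. It is a simplified sketch rather than an implementation from the patent: each sub-cache is assumed to be direct-mapped, the swap of step 306 is reduced to a copy into the requesting element's sub-cache, and the helper names are invented for the example.

```c
/*
 * Illustrative sketch (not from the patent) of the lookup path of Fig. 2:
 * own sub-cache (300/301), other sub-caches (303/304), lower-level memory (307),
 * with a simplified stand-in for the swap step (306).
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_ELEMENTS  2
#define LINES_PER_SUB 8
#define MEM_WORDS     1024

typedef struct { bool valid; uint32_t tag; uint32_t data; } line_t;
typedef struct { line_t line[LINES_PER_SUB]; } sub_cache_t;

static sub_cache_t l1[NUM_ELEMENTS];        /* partitioned first-level cache 102 */
static uint32_t    lower_memory[MEM_WORDS]; /* lower-level memory unit 200       */

static line_t *probe(int sub, uint32_t addr) {      /* steps 301 and 304 */
    line_t *l = &l1[sub].line[addr % LINES_PER_SUB];
    return (l->valid && l->tag == addr) ? l : NULL;
}

static void fill(int sub, uint32_t addr, uint32_t data) { /* simplified step 306 */
    line_t *l = &l1[sub].line[addr % LINES_PER_SUB];
    l->valid = true;
    l->tag   = addr;
    l->data  = data;
}

/* Element i reads addr: own sub-cache, then the other sub-caches, then memory. */
static uint32_t cache_read(int i, uint32_t addr) {
    line_t *hit = probe(i, addr);                    /* steps 300 and 301 */
    if (hit)
        return hit->data;                            /* step 302 */
    for (int j = 0; j < NUM_ELEMENTS; j++) {         /* steps 303 to 305 */
        if (j == i)
            continue;
        hit = probe(j, addr);
        if (hit) {
            fill(i, addr, hit->data);                /* simplified step 306 */
            return hit->data;
        }
    }
    uint32_t data = lower_memory[addr % MEM_WORDS];  /* step 307 */
    fill(i, addr, data);
    return data;
}

int main(void) {
    for (uint32_t a = 0; a < MEM_WORDS; a++)
        lower_memory[a] = a * 3u;
    printf("element 0 reads 42 -> %u (miss, filled from lower memory)\n",
           (unsigned)cache_read(0, 42));
    printf("element 1 reads 42 -> %u (hit in element 0's sub-cache)\n",
           (unsigned)cache_read(1, 42));
    printf("element 0 reads 42 -> %u (hit in its own sub-cache)\n",
           (unsigned)cache_read(0, 42));
    return 0;
}
```

In the example run, the second element's read of the same address is served from the first element's sub-cache instead of the lower-level memory, which is the sharing behaviour the method is meant to preserve.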
When a cache miss occurs, as described above, the other sub-caches or the lower-level memory unit 200 are accessed until the predetermined datum is found. The predetermined datum newly loaded from the lower-level memory unit 200 into the cache is then placed, in a predetermined order, into a cache line of the corresponding cache set of one of the sub-caches; a brief software sketch of this placement order is given after the reference-numeral list below. The predetermined order is:

First, a cache line of the i-th sub-cache 103 that has already been declared unlikely to be used again; such lines are the dead cache lines. The detection and declaration of dead cache lines has been discussed in many published works, so a person of ordinary skill in the art can understand how it operates; it is not a technical feature of the present invention and is not described in further detail here.

Second, a dead cache line in one of the other sub-caches (not including the i-th sub-cache 103).

Third, the newly loaded datum is placed into a position of the i-th sub-cache 103 chosen according to a predetermined rule. The predetermined rule may follow an existing replacement policy, for example a first-in-first-out (FIFO) method, a random (RANDOM) method, or a least-recently-used (LRU) replacement method.

In this way, the worst-case execution time (WCET) can be estimated by restricting each instruction processing element 101 to the sub-cache 103 corresponding to it. In practice, real-time and non-real-time applications coexist in a system, so additional slack time can be produced and used to improve quality of service (QoS) or to let the processor enter a power-saving mode.

From the above preferred embodiments it can be seen that applying the present invention has the following advantages:

1. The worst-case execution time (WCET) can be estimated by restricting each instruction processing element 101 to its corresponding sub-cache 103.

2. The present invention does not need to dynamically partition the capacity of the first-level cache 102, so no additional latency is incurred, and the first-level cache 102 is shared more efficiently, shortening the response time of the instruction processing elements 101.

3. The most recently used (MRU) data of each instruction processing element 101 can be kept in the corresponding sub-cache 103, raising the probability of a cache hit.

4. The present invention avoids cache thrashing.

In the description and examples of the preferred embodiments, the upper-level memory unit takes the first-level cache as an example, but the scope of application of the present invention is not limited to the first-level cache. Other elements with storage characteristics, such as a second-level cache, a third-level cache, or a translation look-aside buffer (TLB), can also use this technique to obtain the effects provided by the present invention.

Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Persons skilled in the art may make various changes and modifications without departing from the spirit and scope of the invention, so the protection scope of the invention is defined by the appended claims.

[Brief Description of the Drawings]

To make the above and other objects, features, advantages and embodiments of the present invention easier to understand, the accompanying drawings are described as follows:

Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention.

Fig. 2 is a flowchart according to a preferred embodiment of the present invention.

[Description of Main Reference Numerals]

100: processor; 101: instruction processing element; 102: first-level cache; 103, 103a: sub-cache; 200: lower-level memory unit; 300-307: steps.
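As noted above, the three-step placement order used on a cache miss can likewise be sketched in software. The sketch is illustrative rather than the patent's hardware: the per-line dead bit, the FIFO fallback standing in for the predetermined rule, and all names and sizes are assumptions.

```c
/*
 * Illustrative sketch (not from the patent) of the predetermined placement
 * order on a cache miss:
 *   1) a dead line in the requesting element's own sub-cache,
 *   2) a dead line in any other sub-cache,
 *   3) otherwise a line in the own sub-cache chosen by a replacement rule
 *      (a FIFO pointer here; RANDOM or LRU would also fit).
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_ELEMENTS  2
#define LINES_PER_SET 4   /* lines of one cache set in each sub-cache (assumed) */

typedef struct { bool valid; bool dead; uint32_t tag; } line_t;
typedef struct { line_t set[LINES_PER_SET]; unsigned fifo; } sub_set_t;

/* One cache set of each sub-cache: the set the missing address maps to. */
static sub_set_t sub[NUM_ELEMENTS];

static line_t *dead_line_in(sub_set_t *s) {
    for (int k = 0; k < LINES_PER_SET; k++)
        if (s->set[k].dead)
            return &s->set[k];
    return NULL;
}

/* Choose where the line newly fetched for element i is placed. */
static line_t *choose_victim(int i) {
    line_t *v = dead_line_in(&sub[i]);          /* first: own dead line        */
    if (v)
        return v;
    for (int j = 0; j < NUM_ELEMENTS; j++) {    /* second: another sub-cache's */
        if (j == i)
            continue;
        v = dead_line_in(&sub[j]);
        if (v)
            return v;
    }
    sub_set_t *own = &sub[i];                   /* third: predetermined rule   */
    v = &own->set[own->fifo];
    own->fifo = (own->fifo + 1) % LINES_PER_SET;
    return v;
}

int main(void) {
    sub[1].set[2].dead = true;  /* pretend another sub-cache declared a line dead */
    line_t *v = choose_victim(0);
    printf("victim chosen from %s\n",
           (v == &sub[1].set[2]) ? "another sub-cache's dead line"
                                 : "the own sub-cache (FIFO fallback)");
    return 0;
}
```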

Claims (11)

X. Claims:

1. A method of accessing cache memory data for a parallel processing processor, comprising:

a. providing a processor and a lower-level memory unit, the processor using a plurality of instruction processing elements and an upper-level memory unit, the upper-level memory unit containing a plurality of sub-caches, each instruction processing element corresponding to one of the sub-caches;

b. using a first one of the instruction processing elements to access a predetermined datum in a first sub-cache corresponding to that instruction processing element;

c. when the predetermined datum is not accessed in the first sub-cache, accessing the sub-caches other than the first sub-cache until the predetermined datum is accessed in a second sub-cache different from the first sub-cache;

d. when, in step c, the predetermined datum is not accessed in the second sub-cache, accessing the lower-level memory unit until the predetermined datum is accessed; and

e. when a cache miss occurs, loading the predetermined datum accessed from the lower-level memory unit and placing it, in a predetermined order, into one of the sub-caches, wherein the predetermined order is:

first, into a dead cache line of the first sub-cache;

second, into a dead cache line of another one of the sub-caches; and

third, into a position of the first sub-cache chosen according to a predetermined rule.

2. The method of accessing cache memory data for a parallel processing processor as described in claim 1, wherein the predetermined rule uses a first-in-first-out (FIFO) method.

3. The method of accessing cache memory data for a parallel processing processor as described in claim 1, wherein the predetermined rule uses a random (RANDOM) method.

4. The method of accessing cache memory data for a parallel processing processor as described in claim 1, wherein the predetermined rule uses a least-recently-used (LRU) replacement method.

5. The method of accessing cache memory data for a parallel processing processor as described in claim 1, wherein each of the instruction processing elements is a mini processor.

6. The method of accessing cache memory data for a parallel processing processor as described in claim 1, wherein the upper-level memory unit is a first-level cache and each of the sub-caches is a partition of the first-level cache.

7. The method of accessing cache memory data for a parallel processing processor as described in claim 1, wherein the processor is an SMT/CMP (simultaneous multithreading / chip multiprocessor) processor.

8. The method of accessing cache memory data for a parallel processing processor as described in claim 1, wherein the processor uses a simple processor.

9. The method of accessing cache memory data for a parallel processing processor as described in claim 1, wherein the processor uses a logical processor.

10. The method of accessing cache memory data for a parallel processing processor as described in claim 1, wherein the processor uses an instruction-executing program.

11. The method of accessing cache memory data for a parallel processing processor as described in claim 1, wherein the upper-level memory unit is a first-level cache, a second-level cache, a third-level cache, or a translation look-aside buffer (TLB).
TW95129660A 2006-08-11 2006-08-11 Method of accessing cache memory for parallel processing processors TWI317065B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW95129660A TWI317065B (en) 2006-08-11 2006-08-11 Method of accessing cache memory for parallel processing processors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW95129660A TWI317065B (en) 2006-08-11 2006-08-11 Method of accessing cache memory for parallel processing processors

Publications (2)

Publication Number Publication Date
TW200809502A TW200809502A (en) 2008-02-16
TWI317065B true TWI317065B (en) 2009-11-11

Family

ID=44767163

Family Applications (1)

Application Number Title Priority Date Filing Date
TW95129660A TWI317065B (en) 2006-08-11 2006-08-11 Method of accessing cache memory for parallel processing processors

Country Status (1)

Country Link
TW (1) TWI317065B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8627014B2 (en) * 2008-12-30 2014-01-07 Intel Corporation Memory model for hardware attributes within a transactional memory system
US9921967B2 (en) 2011-07-26 2018-03-20 Intel Corporation Multi-core shared page miss handler

Also Published As

Publication number Publication date
TW200809502A (en) 2008-02-16

Similar Documents

Publication Publication Date Title
Kaseridis et al. Minimalist open-page: A DRAM page-mode scheduling policy for the many-core era
Li et al. Utility-based hybrid memory management
Stuecheli et al. The virtual write queue: Coordinating DRAM and last-level cache policies
Nesbit et al. Fair queuing memory systems
US8443151B2 (en) Prefetch optimization in shared resource multi-core systems
US8621157B2 (en) Cache prefetching from non-uniform memories
US9239798B2 (en) Prefetcher with arbitrary downstream prefetch cancelation
US8190820B2 (en) Optimizing concurrent accesses in a directory-based coherency protocol
US8904154B2 (en) Execution migration
Yedlapalli et al. Meeting midway: Improving CMP performance with memory-side prefetching
WO2013044829A1 (en) Data readahead method and device for non-uniform memory access
Kim et al. Hybrid DRAM/PRAM-based main memory for single-chip CPU/GPU
Bock et al. Concurrent migration of multiple pages in software-managed hybrid main memory
Racunas et al. Partitioned first-level cache design for clustered microarchitectures
Pimpalkhute et al. NoC scheduling for improved application-aware and memory-aware transfers in multi-core systems
US20090006777A1 (en) Apparatus for reducing cache latency while preserving cache bandwidth in a cache subsystem of a processor
Herrero et al. Thread row buffers: Improving memory performance isolation and throughput in multiprogrammed environments
TWI317065B (en) Method of accessing cache memory for parallel processing processors
Pimpalkhute et al. An application-aware heterogeneous prioritization framework for NoC based chip multiprocessors
Zhang et al. DualStack: A high efficient dynamic page scheduling scheme in hybrid main memory
Jahre et al. A high performance adaptive miss handling architecture for chip multiprocessors
Zhou et al. Real-time scheduling for phase change main memory systems
Jeon et al. Reducing DRAM row activations with eager read/write clustering
US7434001B2 (en) Method of accessing cache memory for parallel processing processors
Stuecheli et al. Coordinating DRAM and last-level-cache policies with the virtual write queue

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees