TW200809502A - Method of accessing cache memory for parallel processing processors - Google Patents

Method of accessing cache memory for parallel processing processors

Info

Publication number
TW200809502A
Authority
TW
Taiwan
Prior art keywords
cache
processor
accessing
memory
parallel processing
Prior art date
Application number
TW95129660A
Other languages
Chinese (zh)
Other versions
TWI317065B (en)
Inventor
Shi-Wu Lo
Original Assignee
Shi-Wu Lo
Priority date
Filing date
Publication date
Application filed by Shi-Wu Lo filed Critical Shi-Wu Lo
Priority to TW95129660A priority Critical patent/TWI317065B/en
Publication of TW200809502A publication Critical patent/TW200809502A/en
Application granted granted Critical
Publication of TWI317065B publication Critical patent/TWI317065B/en

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A method of accessing cache memory for parallel processing processors includes providing a processor and a lower-level memory unit. The processor uses multiple instruction processing members and multiple sub-cache memories, each sub-cache memory corresponding to one of the instruction processing members. The next step is using a first instruction processing member to access a first sub-cache memory. When the first instruction processing member does not find a piece of desired data in the first sub-cache memory, it accesses each of the sub-cache memories other than the first sub-cache memory. When the first instruction processing member does not find the desired data in any of the sub-cache memories, it accesses the lower-level memory unit until the desired data has been accessed. The instruction processing member then returns the result of accessing the desired data.

Description

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a method of designing cache memory, and more particularly to a method of accessing cache memory data that is suitable for parallel processing processors.

[Prior Art]

Over the past few years, continued advances in processor fabrication technology have markedly increased processor speed, while memory access speed has not improved to a comparable degree, so the gap between processor speed and memory access speed keeps widening.

To cope with this gap, modern processors typically adopt a hierarchical memory design to overcome the difference in access speed.

A hierarchical memory design exploits the temporal locality and spatial locality of memory data to raise the speed at which the cache delivers data, and thereby further improves processor performance. Because the cache is memory allocated dynamically by hardware, program execution time is highly correlated with the cache hit rate.

When SMT (simultaneous multithreading) or CMP (chip multiprocessor) techniques are applied to a processor, the execution time of a program running on that processor is affected by the other programs executing together with it. Although cache partitioning can eliminate the mutual interference between programs running in parallel on the same physical processor, it does not allow the cache to be shared, so cache utilization still cannot be raised effectively.

Although the above problem can be overcome by dynamically adjusting the size of the sub-cache that is private to each mini-processor, such dynamic-adjustment schemes usually resize the sub-caches by modifying the replacement algorithm. However, when only a small number of cache misses occur in the system, changing the size of a cache partition takes a considerable amount of time, producing latency. For programs with quality-of-service or timing constraints, this latency degrades the quality of service or causes deadline misses, so system performance cannot be improved effectively.

For the system designer, locality therefore has three consequences.

First, the worst-case execution time (WCET) becomes difficult to estimate because it depends on the cache hit rate. The WCET strongly affects prediction of the running time of the whole system, and accurately predicting the cache hit rate becomes a major challenge for WCET analysis. Moreover, since WCET estimates are the foundation of many embedded and real-time systems, a WCET that is hard to estimate makes software design difficult.

Second, instruction-level parallel programs may cause thrashing. When the level-one cache memory inside the processor suffers a large number of cache misses, that is, when the cache hit rate is low, the number of instructions the processor can complete per second drops suddenly. When the working sets of different programs being processed in parallel map onto the same group of cache lines, the programs overwrite one another's working sets, thrashing occurs, and system performance suffers.

Third, designing the operating-system scheduler becomes harder. Operating systems generally assume that processors (including the simple processors and logical processors inside CMP/SMT processors) can use all hardware resources fairly and without affecting one another. Without a restriction on the cache, the same program will behave differently when it runs in parallel with a memory-intensive program than when it runs in parallel with a CPU-intensive program, and this variability increases the difficulty of operating-system design.

SUMMARY OF THE INVENTION

One object of the present invention is to provide a method of accessing cache memory data for parallel processing processors that effectively raises the utilization of the cache memory.

Another object of the present invention is to provide a method of accessing cache memory data for parallel processing processors that allows the worst-case execution time (WCET) of the processor to be estimated more accurately.

A method of accessing cache memory data for parallel processing processors according to the present invention comprises:

providing a processor and a lower-level memory unit, the processor using a plurality of instruction processing elements and an upper-level memory unit such as a first-level cache (L1 cache), the first-level cache containing a plurality of sub-caches, wherein each instruction processing element corresponds to one of the sub-caches;

using a first one of the instruction processing elements to access a piece of predetermined data in the first sub-cache corresponding to it;

when the predetermined data is not accessed in the first sub-cache, accessing the sub-caches other than the first sub-cache until the predetermined data is accessed in a second sub-cache different from the first sub-cache;

when, in the previous step, the predetermined data is not accessed in the second sub-cache, accessing the lower-level memory unit until the predetermined data is accessed; and

when a cache miss occurs, loading the predetermined data from the lower-level memory unit into one of the sub-caches in a predetermined order, the predetermined order being:

first, into a dead cache line in the first sub-cache, that is, a cache line that has been declared unlikely to be used again in the future;

second, into a dead cache line in one of the sub-caches other than the first sub-cache; and

third, into a position in the first sub-cache chosen according to a predetermined rule. The predetermined rule may be a first-in-first-out (FIFO) method, a random (RANDOM) method, a least-recently-used (LRU) replacement method, and so on.

The present invention has the following advantages:

1. By restricting each instruction processing element to the sub-cache corresponding to it, the worst-case execution time (WCET) can be estimated.

2. The invention does not need to dynamically partition the capacity of the first-level cache, which avoids extra latency, and the first-level cache can be shared more efficiently, reducing the response time of the instruction processing elements.

3. The most-recently-used (MRU) data of each instruction processing element can be kept in its corresponding sub-cache, raising the probability of a cache hit.

4. The invention avoids cache thrashing.

[Embodiment]

Referring to Fig. 1, a schematic diagram of a system architecture according to an embodiment of the present invention is shown. The embodiment provides a method of accessing cache memory data for parallel processing processors, and comprises providing a processor 100 and a lower-level memory unit 200. The processor 100 may be an SMT/CMP (simultaneous multithreading / chip multiprocessor) processor and has a plurality of instruction processing elements 101 and an upper-level memory unit such as a first-level cache (L1 cache) 102. Each instruction processing element 101 may be a mini-processor, the mini-processors forming the core devices of the physical processor 100; an instruction processing element 101 may also be a simple processor, a logical processor (a kind of virtual processor), an instruction execution program, or the like.

The lower-level memory unit 200 is defined relative to the upper-level memory unit: when the upper-level memory unit is the first-level cache, the lower-level memory unit 200 may be a second-level cache (L2 cache), a third-level cache (L3 cache), and so on, or the main memory.
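As a purely illustrative aside that is not part of the patent text, the architecture of Fig. 1 and the access order summarized above can be sketched roughly in C. The structure names, the NUM_CORES and LINES_PER_SUBCACHE sizes, and the lower_level_read() helper are all assumptions made for the sketch, not elements of the disclosure; the sub-caches modeled here are the partitions of the first-level cache 102 described next, and the branches of access_data() correspond to the flow-chart steps of Fig. 2.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_CORES           4   /* instruction processing elements 101 (assumed count) */
#define LINES_PER_SUBCACHE 64   /* lines per sub-cache 103 (assumed size)              */
#define LINE_SIZE          64   /* bytes per cache line (assumed size)                 */

/* One cache line; "dead" marks a line declared unlikely to be used again. */
typedef struct {
    bool     valid;
    bool     dead;
    uint64_t tag;
    uint8_t  data[LINE_SIZE];
} cache_line_t;

/* One sub-cache 103: the partition of the L1 cache 102 owned by one element 101. */
typedef struct {
    cache_line_t lines[LINES_PER_SUBCACHE];
} sub_cache_t;

/* The shared first-level cache 102, statically partitioned per element. */
typedef struct {
    sub_cache_t sub[NUM_CORES];
} l1_cache_t;

/* Assumed helper standing in for the lower-level memory unit 200. */
extern void lower_level_read(uint64_t tag, uint8_t out[LINE_SIZE]);

/* Search one sub-cache for a tag; return the matching line or NULL. */
static cache_line_t *probe(sub_cache_t *sc, uint64_t tag)
{
    for (int k = 0; k < LINES_PER_SUBCACHE; k++) {
        if (sc->lines[k].valid && sc->lines[k].tag == tag)
            return &sc->lines[k];
    }
    return NULL;
}

/* Access by element i: its own sub-cache first, then the other sub-caches,
 * and only then the lower-level memory unit. */
void access_data(l1_cache_t *l1, int i, uint64_t tag, uint8_t out[LINE_SIZE])
{
    cache_line_t *hit = probe(&l1->sub[i], tag);             /* own sub-cache     */

    for (int j = 0; hit == NULL && j < NUM_CORES; j++) {      /* other sub-caches  */
        if (j != i)
            hit = probe(&l1->sub[j], tag);
    }

    if (hit != NULL) {
        memcpy(out, hit->data, LINE_SIZE);                    /* hit: return data  */
        return;
    }

    lower_level_read(tag, out);                               /* miss everywhere   */
}
```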
The first-level cache 102 is divided by partitioning into a plurality of sub-caches 103, and each instruction processing element 101 corresponds to one of the sub-caches 103; that is, the i-th instruction processing element 101 has a corresponding i-th sub-cache 103 to which it has priority of access.

Referring further to Fig. 2, a flow chart according to the preferred embodiment of the present invention is shown. For convenience of explanation, the preferred embodiment is described with two instruction processing elements 101, but the invention can be extended to more instruction processing elements 101. When the i-th instruction processing element 101 accesses its corresponding i-th sub-cache 103 (step 300), it is determined whether a piece of predetermined data can be accessed in the i-th sub-cache 103 (step 301). When the predetermined data can be accessed in the i-th sub-cache 103, the i-th sub-cache 103 returns the result of accessing the predetermined data (step 302).

If the predetermined data is not accessed in the i-th sub-cache 103, the i-th instruction processing element 101 accesses a j-th sub-cache 103a (step 303), and it is determined whether the predetermined data can be accessed in the j-th sub-cache 103a (step 304). When the predetermined data is accessed in the j-th sub-cache 103a, the i-th instruction processing element 101 returns the result, where i is not equal to j (step 305); in other words, the i-th instruction processing element 101 accesses the sub-caches other than the i-th sub-cache 103 until the predetermined data is accessed in the j-th sub-cache 103a.

If the predetermined data is not accessed in the j-th sub-cache 103a, that is, not accessed in any of the sub-caches, the i-th instruction processing element 101 accesses the lower-level memory unit 200 until the predetermined data is accessed, and then the i-th instruction processing element 101 returns the result (step 307).

When the i-th instruction processing element 101 accesses the predetermined data in the j-th sub-cache 103a or in the lower-level memory unit 200, a cache-data-loading step (step 306) may further be used to load the predetermined data into the first-level cache 102, which raises the probability of a first cache hit (1st hit rate) for the system as a whole and improves the access efficiency of the first-level cache 102.

When a cache miss occurs, as described above, the other sub-caches or the lower-level memory unit 200 are accessed until the predetermined data is obtained. The predetermined data newly loaded into the cache from the lower-level memory unit 200 is then placed, in a predetermined order, into a cache line of the corresponding cache set of one of the sub-caches. The predetermined order is as follows:

First, a cache line in the i-th sub-cache 103 that has already been declared unlikely to be used again in the future; such lines are the dead cache lines. The determination and declaration of dead cache lines has been discussed in many published documents, so a person of ordinary skill in the art can understand how it operates; it is not itself a technical feature of the present invention and is therefore not described in further detail here.
Second, a cache line that has already been declared unlikely to be used again (a dead cache line) in one of the other sub-caches, not including the i-th sub-cache 103.

Third, the newly loaded data is placed into a position in the i-th sub-cache 103 chosen according to a predetermined rule. The predetermined rule may follow an existing replacement policy, for example a first-in-first-out (FIFO) method, a random (RANDOM) method, or a least-recently-used (LRU) replacement method. A sketch of this fill order is given after the description of the drawings below.

In this way, by restricting each instruction processing element 101 to using only its corresponding sub-cache 103, the worst-case execution time (WCET) can be estimated. In practical applications, real-time and non-real-time applications coexist in the system, so a large amount of slack time can be produced and used to improve the quality of service (QoS) or to let the processor enter a power-saving mode.

As the preferred embodiment described above shows, applying the present invention has the following advantages:

1. The worst-case execution time (WCET) can be estimated by restricting each instruction processing element 101 to using only its corresponding sub-cache 103.

2. The invention does not need to dynamically partition the capacity of the first-level cache 102, which avoids extra latency, and the first-level cache 102 can be shared more efficiently, reducing the response time of the instruction processing elements 101.

3. The most-recently-used (MRU) data of each instruction processing element 101 can be stored in its corresponding sub-cache 103 to raise the probability of a cache hit.

4. The invention avoids cache thrashing.

In the description and examples of the preferred embodiment, the upper-level memory unit is exemplified by the first-level cache, but the scope of application of the present invention is not limited to the first-level cache. Other elements with storage characteristics, such as the second-level and third-level cache memories and the translation look-aside buffer (TLB), can also use this technique to achieve the effects provided and produced by the present invention.

Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make various changes and modifications without departing from the spirit and scope of the invention, and the scope of protection of the present invention is therefore defined by the appended claims.

[Brief Description of the Drawings]

To make the above and other objects, features, advantages, and embodiments of the present invention more comprehensible, the accompanying drawings are described as follows:

Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention.

Fig. 2 is a flow chart according to a preferred embodiment of the present invention.
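The fill order just described (first a dead cache line in the i-th sub-cache, then a dead cache line in another sub-cache, and otherwise a position in the i-th sub-cache chosen by the predetermined rule) can be sketched as follows, reusing the hypothetical structures of the earlier sketch; choose_by_rule() is an assumed stand-in for whichever replacement policy (FIFO, random, or LRU) the designer selects, not a function defined by the patent.

```c
/* Assumed hook: pick a victim index in a sub-cache by FIFO, random, or LRU. */
extern int choose_by_rule(sub_cache_t *sc);

/* Return the first dead (declared unlikely-to-be-reused) line, or NULL. */
static cache_line_t *find_dead(sub_cache_t *sc)
{
    for (int k = 0; k < LINES_PER_SUBCACHE; k++) {
        if (sc->lines[k].valid && sc->lines[k].dead)
            return &sc->lines[k];
    }
    return NULL;
}

/* Choose where a newly fetched line for element i is placed:
 * 1) a dead line in sub-cache i,
 * 2) a dead line in any other sub-cache,
 * 3) otherwise a line in sub-cache i picked by the predetermined rule. */
cache_line_t *choose_fill_slot(l1_cache_t *l1, int i)
{
    cache_line_t *slot = find_dead(&l1->sub[i]);

    for (int j = 0; slot == NULL && j < NUM_CORES; j++) {
        if (j != i)
            slot = find_dead(&l1->sub[j]);
    }

    if (slot == NULL)
        slot = &l1->sub[i].lines[choose_by_rule(&l1->sub[i])];

    return slot;
}
```

Because only the third step ever evicts a live line, and then only from sub-cache i, each instruction processing element keeps priority over its own partition while dead lines elsewhere can still be reused, which is the property the advantages listed above rely on.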

[Main component symbol description]

100: processor
101: instruction processing element
102: first-level cache
103: sub-cache
103a: sub-cache
200: lower-level memory unit
300-307: steps

Claims (1)

Patent application scope:

1. A method of accessing cache memory suitable for parallel processing processors, comprising:
a. providing a processor and a lower-level memory unit, the processor using a plurality of instruction processing elements and an upper-level memory unit, the upper-level memory unit containing a plurality of sub-caches, wherein each instruction processing element corresponds to one of the sub-caches;
b. using a first one of the instruction processing elements to access a piece of predetermined data in a first one of the sub-caches corresponding to it;
c. when the predetermined data is not accessed in the first sub-cache, accessing the sub-caches other than the first sub-cache until the predetermined data is accessed in a second sub-cache different from the first sub-cache;
d. when, in step c, the predetermined data is not accessed in the second sub-cache, accessing the lower-level memory unit until the predetermined data is accessed; and
e. loading the predetermined data accessed from the lower-level memory unit into one of the sub-caches in a predetermined order, the predetermined order being:
first, into a dead cache line in the first sub-cache;
second, into a dead cache line in another one of the sub-caches different from the first sub-cache; and
third, into a position in the first sub-cache placed according to a predetermined rule.

2. The method of accessing cache memory for parallel processing processors of claim 1, wherein the predetermined rule uses a first-in-first-out (FIFO) method.

3. The method of accessing cache memory for parallel processing processors of claim 1, wherein the predetermined rule uses a random (RANDOM) method.

4. The method of accessing cache memory for parallel processing processors of claim 1, wherein the predetermined rule uses a least-recently-used (LRU) replacement method.

5. The method of accessing cache memory for parallel processing processors of claim 1, wherein each of the instruction processing elements uses a mini-processor.

6. The method of accessing cache memory for parallel processing processors of claim 1, wherein the upper-level memory unit uses a first-level cache, and each of the sub-caches uses a partition of the first-level cache.

7. The method of accessing cache memory for parallel processing processors of claim 1, wherein the processor uses an SMT/CMP (simultaneous multithreading / chip multiprocessor) processor.

8. The method of accessing cache memory for parallel processing processors of claim 1, wherein the processor uses a simple processor.

9. The method of accessing cache memory for parallel processing processors of claim 1, wherein the processor uses a logical processor.

10. The method of accessing cache memory for parallel processing processors of claim 1, wherein the processor uses an instruction execution program.

11. The method of accessing cache memory for parallel processing processors of claim 1, wherein the upper-level memory unit is a first-level cache, a second-level cache, a third-level cache, or a translation look-aside buffer (TLB).
TW95129660A 2006-08-11 2006-08-11 Method of accessing cache memory for parallel processing processors TWI317065B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW95129660A TWI317065B (en) 2006-08-11 2006-08-11 Method of accessing cache memory for parallel processing processors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW95129660A TWI317065B (en) 2006-08-11 2006-08-11 Method of accessing cache memory for parallel processing processors

Publications (2)

Publication Number Publication Date
TW200809502A true TW200809502A (en) 2008-02-16
TWI317065B TWI317065B (en) 2009-11-11

Family

ID=44767163

Family Applications (1)

Application Number Title Priority Date Filing Date
TW95129660A TWI317065B (en) 2006-08-11 2006-08-11 Method of accessing cache memory for parallel processing processors

Country Status (1)

Country Link
TW (1) TWI317065B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI461912B (en) * 2008-12-30 2014-11-21 Intel Corp Memory model for hardware attributes within a transactional memory system
TWI654560B (en) 2011-07-26 2019-03-21 英特爾股份有限公司 Multi-core shared page miss handler

Also Published As

Publication number Publication date
TWI317065B (en) 2009-11-11

Similar Documents

Publication Publication Date Title
Ebrahimi et al. Prefetch-aware shared resource management for multi-core systems
Ausavarungnirun et al. Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems
Stuecheli et al. The virtual write queue: Coordinating DRAM and last-level cache policies
US9514051B2 (en) Cache memory with unified tag and sliced data
US8621157B2 (en) Cache prefetching from non-uniform memories
US8904154B2 (en) Execution migration
Sharifi et al. Addressing end-to-end memory access latency in noc-based multicores
US20110113199A1 (en) Prefetch optimization in shared resource multi-core systems
JP2000148518A (en) Cache architecture allowing accurate cache response property
EP3676713B1 (en) Utilization-based throttling of hardware prefetchers
Yedlapalli et al. Meeting midway: Improving CMP performance with memory-side prefetching
Li et al. Inter-core locality aware memory scheduling
Pimpalkhute et al. NoC scheduling for improved application-aware and memory-aware transfers in multi-core systems
Herrero et al. Thread row buffers: Improving memory performance isolation and throughput in multiprogrammed environments
Pimpalkhute et al. An application-aware heterogeneous prioritization framework for NoC based chip multiprocessors
TW200809502A (en) Method of accessing cache memory for parallel processing processors
Albericio et al. ABS: A low-cost adaptive controller for prefetching in a banked shared last-level cache
US11755477B2 (en) Cache allocation policy
Jahre et al. A high performance adaptive miss handling architecture for chip multiprocessors
US7434001B2 (en) Method of accessing cache memory for parallel processing processors
Prieto et al. CMP off-chip bandwidth scheduling guided by instruction criticality
Lu et al. Premier: A concurrency-aware pseudo-partitioning framework for shared last-level cache
Lira et al. Replacement techniques for dynamic NUCA cache designs on CMPs
Lira et al. LRU-PEA: A smart replacement policy for non-uniform cache architectures on chip multiprocessors
Das et al. Random-LRU: a replacement policy for chip multiprocessors

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees