TW200949690A - Processor and early execution method of data load thereof - Google Patents

Processor and early execution method of data load thereof

Info

Publication number
TW200949690A
Authority
TW
Taiwan
Prior art keywords
instruction
early
processor
data
queue
Prior art date
Application number
TW97119412A
Other languages
Chinese (zh)
Other versions
TWI364703B (en)
Inventor
Shun-Chieh Chang
Yuan-Hwa Li
Yuan-Jung Kuo
Chin-Ling Huang
Chung-Ping Chung
Original Assignee
Faraday Tech Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Faraday Tech Corp
Priority to TW97119412A
Publication of TW200949690A
Application granted
Publication of TWI364703B

Landscapes

  • Advance Control (AREA)

Abstract

A processor and an early execution method of data load thereof are provided. In the instruction fetch stage, the method fetches an instruction and pre-evaluates it to obtain an estimation result. According to the estimation result, the method determines whether to load early the pre-fetch data corresponding to the instruction. If the pre-fetch data has not been written correctly, the target data is fetched according to the instruction in the instruction execution stage. If the pre-fetch data has been written correctly, the pre-fetch data is used as the target data.

Description

IX. Description of the Invention:

[Technical Field]

The present invention relates to a processor, and more particularly to a pipelined processor.

[Prior Art]

FIG. 1 illustrates a conventional pipelined processor. Only the pipeline 100 of the conventional processor is shown in FIG. 1. The pipeline 100 has an instruction fetch stage 110, an instruction queue 120, an instruction decode stage 130, an instruction execution stage 140, and a data write-back stage 150. In conventional processor designs, the instruction queue 120 separates the instruction fetch stage 110 from the instruction decode stage 130, which reduces the loss of processor performance caused by an unstable issue rate and fetch rate. As a result, most instructions do not enter the instruction decode stage 130 immediately after being fetched into the processor; they first wait in the instruction queue 120 for a period of time. The instruction fetch stage 110 fetches instructions from the instruction cache (or from main memory) and sends them to the instruction queue 120. The instruction queue 120 stores the instructions fetched by the instruction fetch stage 110 on a first-in first-out (FIFO) basis and supplies them to the instruction decode stage 130 in order.

In general, before executing an instruction the processor must decode the instruction code in the instruction decode stage 130. The decoded instruction is passed to the instruction execution stage 140. The instruction execution stage 140 contains an arithmetic and logic unit (ALU), which performs the instruction operation according to the decoding result of the instruction decode stage 130.

If the operation performed by the instruction execution stage 140 produces a result, the data write-back stage 150 writes the result back to the data cache (or to main memory).

In conventional processor designs, the load-use delay grows as the pipeline becomes deeper, and this delay seriously degrades processor performance. For example, consider the following instruction sequence:

    LOAD Rm, [mem addr]
    ADD Rd, Rn, Rm

These instructions are fetched in order from the instruction memory and stored in the instruction queue 120. After decoding, the instruction execution stage 140 first executes the LOAD instruction: the load/store unit (not shown) in the instruction execution stage 140 reads the data and then stores it in register Rm. This read is performed in the instruction execution stage 140. If n clock cycles are needed to complete the LOAD instruction, the ADD instruction must wait n clock cycles, and the wait becomes longer as the depth (number of stages) of the pipeline increases.

[Summary of the Invention]

The present invention provides a preload method for a processor. In the instruction fetch stage, the method fetches an instruction and pre-evaluates it to obtain a judgment result. According to the judgment result, it is determined whether to load early the preload data corresponding to the instruction. If the preload data has been loaded correctly, the preload data is used as the target data.

In an embodiment of the invention, if the preload data has not been loaded correctly, the target data is fetched according to the instruction in the instruction execution stage.

The present invention also provides a processor that includes an instruction fetch stage, an instruction decode stage, an instruction execution stage, and an early-load queue. The instruction fetch stage fetches an instruction; it contains a pre-decoding unit that pre-evaluates the instruction in the instruction fetch stage to obtain a judgment result. The instruction decode stage is coupled to the instruction fetch stage and decodes the instruction to obtain a decoding result. The instruction execution stage is coupled to the instruction decode stage and executes the instruction according to the decoding result. The early-load queue is coupled to the pre-decoding unit and determines, according to the judgment result, whether to load early the preload data corresponding to the instruction. If the preload data has not been loaded correctly, the instruction execution stage fetches the target data according to the instruction. If the preload data has been loaded correctly into the early-load queue, the preload data is used as the target data.

In an embodiment of the invention, if the judgment result indicates that the instruction belongs to a target type, and the register corresponding to the instruction is in the ready state in the register status table, the preload data corresponding to the instruction is loaded into the early-load queue.

In an embodiment of the invention, the instruction decode stage checks whether the data in the early-load queue is ready and valid. If it is, the destination register address specified by the instruction is changed to the address of the preload data in the early-load queue.

Because the invention uses the time an instruction waits in the instruction queue after being fetched to load early the preload data corresponding to the instruction, it can solve the problem of excessive load-use delay in deep-pipeline processor designs.

To make the above features and advantages of the invention easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.

[Embodiments]

FIG. 2 is a flowchart of a preload method of a processor according to an embodiment of the invention. When the instruction fetch stage fetches an instruction, it pre-evaluates the instruction to obtain a judgment result (step S210). According to the judgment result, the processor determines whether to load early the preload data corresponding to the instruction (step S220). If the preload data has not been loaded correctly, the instruction execution stage fetches the target data according to the instruction (step S230). If the preload data has been loaded correctly, the processor uses the preload data as the target data (step S240).

Those of ordinary skill in the art may modify the above embodiment as needed. For example, FIG. 3A is a flowchart of a preload method of a processor according to another embodiment of the invention. Compared with the previous embodiment, this embodiment adds a determination step (step S310) between steps S210 and S220. Referring to FIG. 3A, in step S210 the instruction fetch stage fetches an instruction from the instruction memory (or instruction cache) and pre-evaluates (pre-decodes) it. Thus, before the instruction enters the instruction queue, step S210 can identify early whether the instruction needs to fetch data from the data cache (or data memory).

According to the judgment result of step S210, step S310 decides whether to store the instruction in the early-load queue (ELQ). If the instruction does not belong to the target type (for example, it does not need to fetch data from the data cache), it is stored only in the instruction queue and not in the early-load queue. The instruction is then executed through the instruction decode stage and the instruction execution stage (step S320). Of course, if an instruction does not belong to the target type but still needs to fetch data from the data cache, the instruction execution stage fetches the data from the data cache according to the instruction in step S320.

Step S310 may also decide, according to the judgment result, to place the instruction in both the early-load queue and the instruction queue. If step S310 places the instruction in the early-load queue, step S220 checks in the register status table whether the register specified by the instruction is in the ready state, and then loads the preload data corresponding to the instruction from the data cache into the early-load queue. Thus, before the instruction execution stage (while the instruction is still waiting in the instruction queue), the instruction can be executed early from the early-load queue to load the corresponding preload data, which is then placed in the early-load queue. Meanwhile, the instruction stored in the instruction queue waits for its turn to execute. In this embodiment, the processor decodes the instruction in the instruction decode stage to obtain a decoding result. According to the decoding result, the processor checks the register status table to determine whether the preload data has been correctly loaded into the early-load queue. If the preload data has not been loaded correctly, the instruction execution stage fetches the target data from the data cache according to the instruction (step S230). If the preload data has been loaded correctly, the processor uses the preload data as the target data (step S240), and the instruction execution stage does not need to spend extra time fetching the target data from the data cache.

Those of ordinary skill in the art may, as needed, add an invalidation mechanism to the above embodiment by any means, to prevent the early-load operation from accessing wrong data.
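The flow of FIG. 3A (steps S210, S310, S220, S230, S240, and S320) can be sketched in Python as follows. This is only an illustrative model under stated assumptions, not the patented hardware: the names `Instr`, `predecode`, and `run_instruction`, the dictionary-based data cache, the choice of LDR and LDRB as the only target types, and the use of the base register as "the register specified by the instruction" are all hypothetical. Timing is reduced to a label that says where the loaded value came from.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Assumption: only these opcodes count as the "target type".
TARGET_OPS = {"LDR", "LDRB"}


@dataclass
class Instr:
    op: str                     # opcode mnemonic
    dest: str                   # destination register name
    base: Optional[str] = None  # base register of a load, if any
    offset: int = 0             # address offset of a load


def predecode(instr: Instr) -> bool:
    """Step S210: decide at fetch time whether this is a target-type load."""
    return instr.op in TARGET_OPS and instr.base is not None


def run_instruction(
    instr: Instr,
    regs: Dict[str, int],
    reg_ready: Dict[str, bool],
    dcache: Dict[int, int],
) -> Tuple[Optional[int], str]:
    """Return (loaded value, where it came from) for one instruction."""
    preload: Optional[int] = None

    # Steps S310/S220: a target-type instruction also goes to the early-load
    # queue and is preloaded only if its register is marked ready.
    if predecode(instr) and reg_ready.get(instr.base, False):
        preload = dcache.get(regs[instr.base] + instr.offset)

    # Step S320: non-target instructions just run through decode and execute.
    if not predecode(instr):
        return None, "no load"

    # Step S240: the preload succeeded, so it serves as the target data.
    if preload is not None:
        return preload, "early-load queue"

    # Step S230: fall back to a normal load in the execution stage.
    return dcache[regs[instr.base] + instr.offset], "data cache"
```

For instance, with `regs = {"r0": 0x100}` and `dcache = {0x100: 42}`, an `LDR` whose base register is ready is served from the early-load queue, while the same instruction with a register that is not ready falls back to the data cache.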

As an example of such an invalidation mechanism, if a second instruction (meaning any instruction) is decoded in the instruction decode stage, the state of the destination register specified by the second instruction is set to busy in the register status table, so that other instructions do not access that register. All records in the early-load queue are then searched. If a record in the early-load queue points to the destination register specified by the second instruction, that record is set to invalid. This avoids data dependency errors.

Likewise, if a second instruction (meaning any instruction) in the instruction execution stage is to write data to an address in memory, the early-load queue is searched. If a record in the early-load queue has the same memory address as the one specified by the second instruction, that record is set to invalid. This avoids memory dependency errors.

With the invalidation mechanism in place, the above steps may include the following operations. In the instruction decode stage, the early-load queue is checked to see whether its data is ready and valid. If the data in the early-load queue is ready and valid, the destination register address specified by the instruction is changed to the address of the preload data in the early-load queue.

Those of ordinary skill in the art may implement the above embodiments with any pipelined processor design. For example, FIG. 3B illustrates a pipelined processor according to an embodiment of the invention. Only the pipeline 300 of the processor is shown in FIG. 3B. The pipeline 300 has an instruction fetch stage 310, an instruction queue 320, an instruction decode stage 330, an instruction execution stage 340, and a data write-back stage 350. The instruction queue 320 is placed between the instruction fetch stage 310 and the instruction decode stage 330 to reduce the loss of processor performance caused by an unstable issue rate and fetch rate. The instruction fetch stage 310 fetches instructions from the instruction cache (or from main memory). After an instruction is fetched into the processor, it first waits in the instruction queue 320 for a period of time before entering the instruction decode stage 330. The instruction queue 320 stores the instructions fetched by the instruction fetch stage 310 on a first-in first-out (FIFO) basis and supplies them to the instruction decode stage 330 in order.
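The two invalidation rules described earlier in this passage (invalidate on a register conflict at decode, and invalidate on a memory-address conflict at execute) can be sketched as follows. This is a minimal illustration under stated assumptions: `ElqRecord`, `invalidate_on_decode`, and `invalidate_on_store` are hypothetical names, validity is reduced to a boolean, and the field `reg` simply stands for "the register the record points to", which the text does not define more precisely.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ElqRecord:
    reg: str            # register this record points to (assumed meaning)
    mem_addr: int       # memory address the early load reads from
    valid: bool = True  # cleared when the record is invalidated


def invalidate_on_decode(
    elq: List[ElqRecord], reg_busy: Dict[str, bool], dest_reg: str
) -> None:
    """A decoded instruction writes dest_reg: mark it busy, drop matching records."""
    reg_busy[dest_reg] = True
    for rec in elq:
        if rec.reg == dest_reg:
            rec.valid = False  # avoids a data dependency error


def invalidate_on_store(elq: List[ElqRecord], store_addr: int) -> None:
    """An executing instruction writes store_addr: drop records that read it."""
    for rec in elq:
        if rec.mem_addr == store_addr:
            rec.valid = False  # avoids a memory dependency error
```

A linear search over the queue is used here only for clarity; hardware would more likely compare all records in parallel, but the text does not specify this.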

Before an instruction is executed, it must be decoded by the instruction decode stage 330. The instruction execution stage 340 performs the instruction operation according to the decoding result of the instruction decode stage 330. If the operation performed by the instruction execution stage 340 produces a result, the data write-back stage 350 writes the result back to the data cache (or to main memory).

In this embodiment, the instruction fetch stage 310 includes a fetch unit 311 and a pre-decoding unit 312. The fetch unit 311 fetches instructions from the instruction cache (or from main memory). The pre-decoding unit 312 pre-evaluates the instruction fetched by the fetch unit 311 to obtain a judgment result.

The pipeline 300 further has an early-load queue (ELQ) 360. With respect to the instruction stream, the early-load queue 360 can be a small table parallel to the instruction queue 320. The early-load queue 360 is coupled to the pre-decoding unit 312. According to its judgment result, the pre-decoding unit 312 decides whether to write the instruction into the early-load queue 360. In another embodiment, the early-load queue 360 may itself decide, according to the judgment result, whether to record the instruction. In this embodiment, if the judgment result indicates that the instruction fetched by the fetch unit 311 belongs to the target type (for example, instruction types that load data into a register, such as LDR and LDRB), the pre-decoding unit 312 writes the instruction into both the instruction queue 320 and the early-load queue 360. Otherwise, if the judgment result indicates that the instruction does not belong to the target type, the pre-decoding unit 312 writes the instruction only into the instruction queue 320 and not into the early-load queue 360.

According to the judgment result of the pre-decoding unit 312, the processor can decide whether to fetch the preload data corresponding to the instruction into the early-load queue 360 early. If the preload data has not been correctly fetched into the early-load queue 360, the instruction execution stage 340 fetches the data (here called the target data) according to the instruction. If the preload data has been correctly fetched into the early-load queue 360, the processor uses the preload data stored in the early-load queue 360 as the target data. Taking the LDR instruction as an example, while the instruction is still waiting in the instruction queue 320, the processor can fetch early the data at the address specified by the LDR instruction (here called the preload data) into the early-load queue 360. When the LDR instruction enters the instruction execution stage 340, it can use the preload data in the early-load queue 360 instead of fetching the target data from the data cache (or main memory).

The early loading of data can be implemented in any manner. For example, the embodiment shown in FIG. 3B uses an early-load unit 370 to perform it. The early-load queue 360 holds the instructions provided by the fetch unit 311 and requests the early-load unit 370 to fetch the target data. The early-load queue 360 can be implemented with the data structure shown in Table 1. In Table 1, the state field State[1:0] records the state of each entry (instruction) in the early-load queue 360: for example, 00 means invalid (Invalid), 01 means busy (Busy), 10 means ready (Ready), and 11 means in use (Using). The program counter field PC[1:0] records the program counter content of the entry, that is, the address of the instruction. The register information fields Base_ID[3:0] and Offset[11:0] record the address of the destination register in which the instruction is to store data (base value and offset value). The field Adr_mode[1:0] records the addressing mode of the instruction, such as pre-index, post-index, or auto-index. The memory address field Adr[31:0] records the memory address from which the instruction is to load data.

The pre-decoding unit 312 in the instruction fetch stage 310 pre-decodes the instruction and decodes its base register index (Base index), offset value (Offset), and addressing mode. If the instruction has an address of the form …, it is placed in the early-load queue 360.

Table 1: Data structure of the early-load queue 360
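The entry layout described for Table 1 can be mirrored by a small Python record. This is a sketch under stated assumptions: the class and function names are hypothetical, the queue length is chosen by the caller, a `loaded_data` slot is included for the fetched value, and the range checks only enforce the bit widths the text states for Base_ID (4 bits), Offset (12 bits), and Adr_mode (2 bits).

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class ElqState(IntEnum):
    """State[1:0] encodings as listed in the description of Table 1."""
    INVALID = 0b00
    BUSY = 0b01
    READY = 0b10
    USING = 0b11


@dataclass
class ElqEntry:
    state: ElqState = ElqState.INVALID
    pc: int = 0           # address of the instruction
    base_id: int = 0      # Base_ID[3:0]
    offset: int = 0       # Offset[11:0]
    adr_mode: int = 0     # Adr_mode[1:0]
    adr: int = 0          # Adr[31:0], filled in once the address is known
    loaded_data: int = 0  # preload data, filled in once it is fetched


def allocate(
    queue: List[ElqEntry], pc: int, base_id: int, offset: int, adr_mode: int
) -> Optional[int]:
    """Place a pre-decoded load in the first invalid slot; return its index."""
    if not (0 <= base_id < 16 and 0 <= offset < 4096 and 0 <= adr_mode < 4):
        raise ValueError("field does not fit the stated bit width")
    for i, entry in enumerate(queue):
        if entry.state == ElqState.INVALID:
            queue[i] = ElqEntry(ElqState.BUSY, pc, base_id, offset, adr_mode)
            return i
    return None  # queue full: the load simply executes normally later
```

Returning `None` when the queue is full is a design choice of this sketch; it matches the fallback in the text, where an instruction that was not preloaded fetches its data in the execution stage.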

The early-load unit 370 is coupled to the early-load queue 360. When the early-load unit 370 is idle, the early-load queue 360 selects the instruction that was stored in it earliest and sends that instruction to the early-load unit 370 for execution. Thus, before the instruction (for example, an LDR instruction) enters the instruction execution stage 340 (while it is still in the instruction queue 320), the early-load unit 370 executes it early and places the corresponding preload data in the preload data field Loaded_data of the early-load queue 360.

FIG. 3B shows the early-load unit 370 as a dedicated circuit inside the processor; a detailed example is described below. This example describes the early-load unit 370 only in an intuitive way and should not be taken as limiting its implementation. For example, those skilled in the art may use the load/store unit (not shown) in a conventional instruction execution stage 340 to implement the function of the early-load unit 370, so that the early-load unit 370 shares hardware with the load/store unit in the instruction execution stage 340. In this embodiment, the early-load unit 370 includes a register read unit 371, an address generation unit 372, and a data fetch unit 373. The register read unit 371 checks whether the early-load queue 360 holds an instruction that needs its data loaded early, reads the instruction's base register data from the register file (not shown) inside the processor, and passes the instruction to the address generation unit 372. The address generation unit 372 generates the address used to fetch the data from the instruction and its base register data. The data fetch unit 373 then loads the data early from the data cache (or main memory) at the address generated by the address generation unit 372 and writes the preload data back to the early-load queue 360.

The instruction decode stage 330 can check whether the data in the early-load queue 360 is ready and valid. When the instruction has been sent from the instruction queue 320 to the instruction decode stage 330, the instruction decode stage 330 checks the state of the corresponding record in the early-load queue 360. If the data in the early-load queue 360 is ready and valid, the destination register address specified by the instruction is changed to the address of the preload data in the early-load queue 360. The instruction then no longer needs to fetch data from the data cache; in other words, the instruction execution stage 340 does not need to execute the instruction again. Subsequent instructions that depend on the same destination register can obtain the data they need from the early-load queue 360.

The above operation on the early-load queue 360 can also be implemented in any other manner. In this embodiment, a register status table 380 coupled to the instruction decode stage 330 is provided to record the state of all registers inside the processor. If the judgment result of the instruction fetch stage 310 indicates that the instruction belongs to the target type (for example, an LDR or LDRB instruction), and the register status table 380 shows that the register specified by the instruction is in the ready state, the preload data to be fetched by the instruction is loaded into the early-load queue 360 in advance. The register status table 380 can be implemented with the data structure shown in Table 2. In Table 2, the register field records the address of each register inside the processor. The state field State[1:0] records the state of each register: for example, 00 means ready (Ready), 01 means forwarding (Forwarding), 10 means renamed (Renaming), and 11 means busy (Busy). The early-load queue address field ELQ_id[2:0] records the address in the early-load queue 360 to which the register has been renamed.

Table 2: Data structure of the register status table 380
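The three sub-units of the early-load unit 370 and the decode-stage renaming through the register status table 380 can be sketched together as follows. This is an illustrative model only. Assumptions not taken from the text: the address is computed as base plus offset regardless of addressing mode, the data cache is a dictionary, the entry is switched to the in-use state when it is renamed, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List


class RegState(IntEnum):
    """State[1:0] encodings as listed in the description of Table 2."""
    READY = 0b00
    FORWARDING = 0b01
    RENAMING = 0b10
    BUSY = 0b11


# Early-load queue entry states as listed in the description of Table 1.
ELQ_INVALID, ELQ_BUSY, ELQ_READY, ELQ_USING = 0b00, 0b01, 0b10, 0b11


@dataclass
class ElqEntry:
    base_id: int           # index of the base register
    offset: int            # address offset
    state: int = ELQ_BUSY  # waiting for the early-load unit
    adr: int = 0           # generated memory address
    loaded_data: int = 0   # preload data


@dataclass
class RegStatus:
    state: RegState = RegState.READY
    elq_id: int = 0        # ELQ entry the register is renamed to


def early_load(entry: ElqEntry, regfile: List[int], dcache: Dict[int, int]) -> None:
    """Model units 371 to 373 for one entry."""
    base = regfile[entry.base_id]                   # register read unit 371
    entry.adr = (base + entry.offset) & 0xFFFFFFFF  # address generation unit 372
    entry.loaded_data = dcache[entry.adr]           # data fetch unit 373
    entry.state = ELQ_READY                         # written back to the queue


def rename_at_decode(
    rst: List[RegStatus], dest_reg: int, elq: List[ElqEntry], elq_id: int
) -> bool:
    """If the entry is ready, rename dest_reg to it and report success."""
    if elq[elq_id].state != ELQ_READY:
        return False  # the execution stage must perform the load itself
    rst[dest_reg] = RegStatus(RegState.RENAMING, elq_id)
    elq[elq_id].state = ELQ_USING  # assumption: entry is now in use
    return True
```

After a successful rename, a later instruction that reads the destination register would look up `rst[dest_reg].elq_id` and take `loaded_data` from that entry instead of waiting for a cache access.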

The instruction decode stage 330 decodes the instruction and, according to the decoding result, checks the register status table 380 to determine whether the preload data required by the instruction has been correctly loaded into the early-load queue 360. Finally, according to the result of this check, the instruction decode stage 330 passes the decoded instruction to the instruction execution stage 340.

Table 3 shows, for an example program segment, the processing timing of each instruction in the pipeline when the processor uses the above preload method. Table 4 shows the processing timing of each instruction in the pipeline when the processor executes the same program segment as Table 3 without using the preload method. In the tables, IF denotes instruction fetch, ID denotes instruction decode, EXE denotes instruction execution, MEM denotes data fetch, and WB denotes data write-back. A further marker in the tables indicates where the preload method takes place.

Table 3: Processing timing of each instruction in the pipeline when the preload method is used

Table 4: Processing timing of each instruction in the pipeline when the preload method is not used

    Instruction         Stages in successive cycles
    CMP r1, #10         IF ID EXE MEM WB
    BEQ loop            IF ID EXE MEM WB
    LOAD r2, [r0 #0]    IF ID EXE MEM WB
    ADD r3, r3, r2      IF ID stall stall EXE MEM WB
    ADD r1, r1, #1      IF stall stall ID EXE MEM WB

As Table 4 shows, because the processor must wait for the instruction "LOAD r2, [r0 #0]" to fetch data from the data cache into register r2, the following instructions "ADD r3, r3, r2" and "ADD r1, r1, #1" are delayed by several cycles (marked stall in Table 4) until the instruction "LOAD r2, [r0 #0]" completes its data fetch (marked MEM in Table 4). As Table 3 shows, with the preload method of the above embodiment, by the instruction decode stage ID the early-load unit 370 has already fetched the preload data of the instruction "LOAD r2, [r0 #0]" from the data cache into the early-load queue 360, so the data fetch operation MEM of this instruction does not need to access the data cache again.
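The cost of the stalls in Table 4 can be worked out with a short calculation. The sketch below assumes that one instruction enters IF per cycle starting at cycle 1 and that each listed slot takes exactly one cycle; these assumptions, and the names `TABLE4`, `finish_cycles`, and `without_stalls`, are not from the source. The stall-free variant is a hypothetical comparison that only removes the stall slots; it is not the contents of Table 3.

```python
from typing import List, Tuple

Row = Tuple[str, List[str]]

# Stage sequences as listed in Table 4 (no preload).
TABLE4: List[Row] = [
    ("CMP r1, #10", ["IF", "ID", "EXE", "MEM", "WB"]),
    ("BEQ loop", ["IF", "ID", "EXE", "MEM", "WB"]),
    ("LOAD r2, [r0 #0]", ["IF", "ID", "EXE", "MEM", "WB"]),
    ("ADD r3, r3, r2", ["IF", "ID", "stall", "stall", "EXE", "MEM", "WB"]),
    ("ADD r1, r1, #1", ["IF", "stall", "stall", "ID", "EXE", "MEM", "WB"]),
]


def finish_cycles(rows: List[Row]) -> List[int]:
    """Cycle in which each instruction leaves WB, with one fetch per cycle."""
    return [start + len(stages) - 1 for start, (_, stages) in enumerate(rows, start=1)]


def without_stalls(rows: List[Row]) -> List[Row]:
    """Hypothetical schedule with every stall slot removed."""
    return [(name, [s for s in stages if s != "stall"]) for name, stages in rows]


if __name__ == "__main__":
    print(finish_cycles(TABLE4))                  # [5, 6, 7, 10, 11]
    print(finish_cycles(without_stalls(TABLE4)))  # [5, 6, 7, 8, 9]
```

Under these assumptions the last instruction completes in cycle 11 with the stalls and in cycle 9 without them, so the two stall cycles account for the whole difference.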

Consequently, the following instruction "ADD r3, r3, r2" does not have to wait and can perform its instruction execution operation EXE immediately after completing its instruction decode operation ID. The above embodiment thus uses the time an instruction spends waiting in the instruction queue after being fetched to load the preload data corresponding to that instruction ahead of time, which addresses the overly long load-use delay in pipelined processor designs. The deeper the pipeline (the more stages it has), the more pronounced the improvement in load-use delay achieved by the preload method.

To confirm whether the preload data required by an instruction has been correctly loaded into the early-load queue 360, the processor of this embodiment can apply an invalidation policy that checks whether the data is correct. If the instruction decode stage 330 decodes a second instruction (meaning any instruction), the status field of the destination register specified by the second instruction in the register status table 380 is set to busy. For example, if the destination register of the second instruction is R2, the status field State[1:0] of register R2 in the register status table is set to "11", so that other instructions do not access register R2. The processor then searches all records in the early-load queue 360. If a record in the early-load queue 360 points to the destination register specified by the second instruction (for example, register R2), the processor sets the status field State[1:0] of that record to "00" (indicating the invalid state). Data dependency errors are thereby avoided.

In addition, if a second instruction (meaning any instruction) in the instruction execution stage 340 is about to write data to an address in the data cache or memory, the processor searches the early-load queue 360. If the search shows that a record in the early-load queue 360 has the same memory address as the one the second instruction is about to write, the processor sets the status field State[1:0] of that record to "00" (indicating the invalid state). Memory dependency errors are thereby avoided.

In summary, the mechanism adopted in this embodiment has two parts: an early load policy and an invalidation policy.
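The two invalidation checks described above can be sketched as follows. This is an illustrative model rather than the patent's circuit: only the "00" (invalid) encoding comes from the text, while the `READY` value, the string "busy", and the names `Entry`, `on_decode` and `on_store` are assumptions made for the example.

```python
from dataclasses import dataclass

INVALID = 0b00   # from the text: State[1:0] = "00" means the record is invalid
READY = 0b01     # assumed encoding for a usable record; the text does not give it


@dataclass
class Entry:
    state: int      # status field of the early-load queue record
    base_reg: str   # base register the early load used for its address
    mem_addr: int   # memory address the early load read from


def on_decode(dest_reg, reg_status, queue):
    """A decoded instruction will overwrite dest_reg (data dependency check)."""
    reg_status[dest_reg] = "busy"
    for entry in queue:
        if entry.base_reg == dest_reg:
            entry.state = INVALID


def on_store(store_addr, queue):
    """An executing instruction writes store_addr (memory dependency check)."""
    for entry in queue:
        if entry.mem_addr == store_addr:
            entry.state = INVALID


queue = [Entry(READY, "r0", 0x100), Entry(READY, "r4", 0x200)]
reg_status = {}
on_decode("r0", reg_status, queue)   # invalidates the record that used r0 as base
on_store(0x200, queue)               # invalidates the record that read address 0x200
print([entry.state for entry in queue])
```

A record that has been invalidated is simply ignored later, so the load falls back to fetching its data in the normal way.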

The early load policy moves data from the cache memory into the early-load queue 360 ahead of time. The operation of the early load policy is briefly as follows:

1. Before an instruction is placed in the instruction queue 320, the instruction is pre-decoded. If it meets the early-load conditions (for example, the instruction is LDR, LDRB, or the like, its addressing mode is immediate (pre-/post-indexed) offset, and the status of its base register in the register status table 380 is Ready), the instruction is placed in the early-load queue 360, and the early-load unit 370 then loads the data from the cache or memory into the early-load queue 360 ahead of time.
2. When the instruction reaches the instruction decode stage 330, the data in the early-load queue 360 is checked to see whether it is complete and valid. If so, the destination register of the instruction is renamed to the corresponding entry or address in the early-load queue 360.

Letting a load instruction fetch data from the cache or memory ahead of time, in the instruction fetch stage 310, can cause two kinds of error: data dependency errors and memory dependency errors.
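The two early-load steps listed above can be sketched in a few lines. This is a minimal model under assumptions: memory is a dictionary, the address is simply base plus immediate offset (base write-back for pre-/post-indexed modes is not modeled), and the names `Load`, `Entry`, `try_early_load` and `rename_at_decode` as well as the `READY` encoding are hypothetical.

```python
from dataclasses import dataclass

INVALID, READY = 0b00, 0b01            # "00" = invalid is from the text; READY's value is assumed
EARLY_LOAD_OPCODES = {"LDR", "LDRB"}   # the examples named in the text


@dataclass
class Load:
    pc: int
    opcode: str
    dest: str
    base: str
    offset: int
    mode: str = "imm_offset"


@dataclass
class Entry:
    state: int
    pc: int
    base: str
    addr: int
    data: int


def try_early_load(ins, reg_status, regs, memory, queue):
    """Step 1: pre-decode check, then load into the queue ahead of time."""
    eligible = (ins.opcode in EARLY_LOAD_OPCODES
                and ins.mode == "imm_offset"
                and reg_status.get(ins.base) == "ready")
    if not eligible:
        return None                    # the load is handled normally in the execution stage
    addr = regs[ins.base] + ins.offset
    entry = Entry(READY, ins.pc, ins.base, addr, memory[addr])
    queue.append(entry)
    return entry


def rename_at_decode(ins, queue):
    """Step 2: return the index of a valid queue entry for this instruction, else None."""
    for index, entry in enumerate(queue):
        if entry.pc == ins.pc and entry.state == READY:
            return index
    return None


regs, memory, queue = {"r0": 0x100}, {0x104: 42}, []
ldr = Load(pc=0x20, opcode="LDR", dest="r2", base="r0", offset=4)
try_early_load(ldr, {"r0": "ready"}, regs, memory, queue)
print(rename_at_decode(ldr, queue))
```

Returning `None` in either step models the fallback path, in which the load fetches its data in the instruction execution stage as it would without the mechanism.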

A data dependency error occurs when another instruction is still computing the value of the base register, so that the early-loading instruction may read the old value of the base register for its memory access and thus fetch wrong data from a wrong address. A memory dependency error occurs when the early-loading instruction and a store instruction access the same memory address, so that the data fetched by the early-loading instruction may not yet have been updated. The invalidation policy checks whether the loaded data is correct by detecting both situations. If either occurs, the corresponding record in the early-load queue 360 is set to invalid ahead of time, and when the instruction execution stage 340 actually executes the instruction, the correct data is fetched again from the cache or memory. The operation of the invalidation policy is briefly as follows:

Case 1: Checking whether the base register is valid.
When any instruction passes through the instruction decode stage 330, the status field of its destination register in the register status table 380 is set to Busy, and the early-load queue 360 is searched for any instruction that uses that register as its base register. If one is found, the status field of the corresponding entry in the early-load queue 360 is set to invalid.

Case 2: Checking whether the memory address is valid.
When a store instruction generates a memory address in the instruction execution stage 340, the early-load queue 360 is searched for the same memory address. If it is found, the status field of the corresponding entry in the early-load queue 360 is set to invalid.

In summary, this embodiment provides an early load mechanism that uses the time an instruction spends waiting in the instruction queue to move data ahead of time from the cache or memory into an early-load queue inside the processor, together with an effective method for checking whether the fetched data is correct. If the pipeline 300 succeeds in preloading the data into the early-load queue, the load-use delay can be effectively reduced. Conversely, if the early load fails, the original performance of the processor is not affected.

Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. A person of ordinary skill in the art may make some changes and refinements without departing from the spirit and scope of the invention.
Therefore, the scope of protection of the present invention is defined by the appended claims.

[Brief Description of the Drawings]
FIG. 1 illustrates a conventional pipelined processor.
FIG. 2 is a flowchart of a preload method of a processor according to an embodiment of the present invention.
FIG. 3A is a flowchart of a preload method of a processor according to another embodiment of the present invention.
FIG. 3B illustrates a pipelined processor according to an embodiment of the present invention.

[Description of Main Reference Numerals]
100, 300: pipeline
110, 310: instruction fetch stage
120, 320: instruction queue
130, 330: instruction decode stage
140, 340: instruction execution stage
150, 350: data write-back stage
311: fetch unit
312: pre-decode unit
360: early-load queue
370: early-load unit
371: register read unit
372: address generation unit
373: data fetch unit
380: register status table
S210~S240, S310~S320: steps of the preload method of the processor


Claims (21)

X. Claims:

1. A preload method of a processor, comprising:
in an instruction fetch stage, fetching and pre-evaluating an instruction to obtain a determination result;
determining, according to the determination result, whether to load preload data corresponding to the instruction ahead of time; and
if the preload data has been correctly loaded, using the preload data as target data of the instruction.

2. The preload method of a processor according to claim 1, further comprising:
determining, according to the determination result, whether to place the instruction in an early-load queue;
before the instruction enters an instruction execution stage, executing the instruction to load the preload data corresponding to the instruction; and
placing the preload data in the early-load queue.

3. The preload method of a processor according to claim 2, wherein the early-load queue comprises a status field, a program counter field, a register information field, a memory address field, and a preload data field.

4. The preload method of a processor according to claim 3, further comprising:
in an instruction decode stage, decoding the instruction to obtain a decode result; and
checking a register status table according to the decode result to determine whether the preload data has been correctly loaded into the early-load queue.

5. The preload method of a processor according to claim 4, wherein the register status table comprises a status field and an early-load queue address field.

6. The preload method of a processor according to claim 4, further comprising:
if a second instruction is decoded in the instruction decode stage, setting the status of the destination register specified by the second instruction in the register status table to busy;
searching all records of the early-load queue; and
if a record in the early-load queue points to the destination register specified by the second instruction, setting the record to invalid.

7. The preload method of a processor according to claim 4, further comprising:
in the instruction execution stage, if a second instruction writes data to a memory address, searching the early-load queue; and
if a record in the early-load queue has the same memory address, setting the record to invalid.

8. The preload method of a processor according to claim 1, wherein the step of determining whether to load the preload data corresponding to the instruction ahead of time comprises:
checking a register status table; and
if the determination result indicates that the instruction belongs to a target type and the status of the register corresponding to the instruction in the register status table is ready, loading the preload data corresponding to the instruction into an early-load queue.

9. The preload method of a processor according to claim 1, wherein the step of using the preload data as the target data comprises:
in an instruction decode stage, checking whether the data in an early-load queue is ready and valid; and
if the data in the early-load queue is ready and valid, changing the destination register address specified by the instruction to the address of the preload data in the early-load queue.

10. The preload method of a processor according to claim 1, further comprising:
if the preload data has not been correctly loaded, fetching the target data according to the instruction in an instruction execution stage.

11. A processor, comprising:
an instruction fetch stage for fetching an instruction, wherein the instruction fetch stage comprises a pre-decode unit for pre-evaluating the instruction in the instruction fetch stage to obtain a determination result;
an instruction decode stage, coupled to the instruction fetch stage, for decoding the instruction to obtain a decode result;
an instruction execution stage, coupled to the instruction decode stage, for executing the instruction according to the decode result; and
an early-load queue, coupled to the pre-decode unit, for determining, according to the determination result, whether to load preload data corresponding to the instruction ahead of time;
wherein if the preload data has not been correctly loaded, the instruction execution stage fetches target data according to the instruction, and if the preload data has been correctly loaded into the early-load queue, the preload data is used as the target data.

12. The processor according to claim 11, wherein the early-load queue comprises a status field, a program counter field, a register information field, a memory address field, and a preload data field.

13. The processor according to claim 11, wherein the early-load queue determines, according to the determination result, whether to record the instruction.

14. The processor according to claim 11, further comprising:
an early-load unit, coupled to the early-load queue, for executing the instruction ahead of time, before the instruction enters the instruction execution stage, so as to place the preload data corresponding to the instruction in the early-load queue.

15. The processor according to claim 14, further comprising:
a register status table, coupled to the instruction decode stage, for recording the status of a plurality of registers in the processor;
wherein the instruction decode stage decodes the instruction and checks the register status table according to the decode result to determine whether the preload data has been correctly loaded into the early-load queue.

16. The processor according to claim 15, wherein the register status table comprises a status field and an early-load queue address field.

17. The processor according to claim 15, wherein if the instruction decode stage decodes a second instruction, the status of the destination register specified by the second instruction in the register status table is set to busy; the processor searches the early-load queue; and if a record in the early-load queue points to the destination register specified by the second instruction, the processor sets the record to invalid.

18. The processor according to claim [illegible], wherein if a second instruction in the instruction execution stage writes data to a memory address, the processor searches the early-load queue; and if a record in the early-load queue has the same memory address, the processor sets the record to invalid.

19. The processor according to claim 14, wherein the early-load unit shares hardware with a load/store unit in the instruction execution stage.

20. The processor according to claim 11, further comprising:
a register status table, coupled to the instruction decode stage, for recording the status of a plurality of registers in the processor;
wherein if the determination result indicates that the instruction belongs to a target type and the status of the register corresponding to the instruction in the register status table is ready, the preload data corresponding to the instruction is loaded into the early-load queue.

21. The processor according to claim 11, wherein the instruction decode stage checks whether the data in the early-load queue is ready and valid; and if the data in the early-load queue is ready and valid, the destination register address specified by the instruction is changed to the address of the preload data in the early-load queue.
Priority Applications (1)

Application Number Priority Date Filing Date Title
TW97119412A TWI364703B (en) 2008-05-26 2008-05-26 Processor and early execution method of data load thereof

Publications (2)

Publication Number Publication Date
TW200949690A true TW200949690A (en) 2009-12-01
TWI364703B TWI364703B (en) 2012-05-21

Family

ID=44871057

Family Applications (1)

Application Number Title Priority Date Filing Date
TW97119412A TWI364703B (en) 2008-05-26 2008-05-26 Processor and early execution method of data load thereof

Country Status (1)

Country Link
TW (1) TWI364703B (en)

Also Published As

Publication number Publication date
TWI364703B (en) 2012-05-21
