TWI335550B - Stream processor with variable single instruction multiple data (SIMD) factor and common special function

Info

Publication number
TWI335550B
Authority
TW
Taiwan
Prior art keywords
data
arithmetic logic
logic unit
format
short
Prior art date
Application number
TW96104282A
Other languages
Chinese (zh)
Other versions
TW200809690A (en)
Inventor
Prokopenko Boris
Paltashev Timour
Gladding Derek
Original Assignee
Via Tech Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Via Tech Inc
Priority claimed from US11/671,610 (US20070186082A1)
Publication of TW200809690A
Application granted
Publication of TWI335550B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8053 Vector processors
    • G06F15/8076 Details on data register access
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001 Arithmetic instructions
    • G06F9/30014 Arithmetic instructions with variable precision
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145 Instruction analysis, e.g. decoding, instruction word fields
    • G06F9/3016 Decoding the operand specifier, e.g. specifier format
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3877 Concurrent instruction execution, e.g. pipeline or look ahead using a slave processor, e.g. coprocessor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Image Processing (AREA)
  • Advance Control (AREA)

Description

IX. Description of the Invention

[Technical Field]

The present invention relates to a stream processor, and more particularly to a stream processor with a variable single instruction multiple data (SIMD) factor that can process data in different formats.

[Prior Art]

Since 2000, fixed-function graphics processing units (GPUs) have become increasingly programmable, giving users direct and flexible control over the processing of primitive, vertex, texture, and pixel streams in graphics chips. In many current GPUs at least one type of shader (primitive, vertex, and so on) is programmable, but such GPUs can generally handle only a few data types (for example, 32-bit floating point for vertices and 32-bit integers). The programmable shaders in the graphics pipeline are often arranged in series, passing data to fixed-function units or, where necessary, to one another through data format conversion. Parallel multiprocessor architecture principles are usually also incorporated into GPU designs. Applications of these principles often use several arithmetic logic units (ALUs) of the same type to process different types of data streams in non-uniform program threads.

Client's Docket No.: S3U04-0012-TW; TT's Docket No: 0608-A41110-TW/Final/LukeLee(李宗軒)/1, Feb, 2007

In many environments, if non-uniform program threads are interleaved, the ALUs are required to process different types of data in each clock cycle.

Under such a multiprocessor architecture, implementing complex mathematical functions (special functions) is one of many important issues. There are usually two ways to implement them: as special subroutines executed on the general ALU, or as special hardware units attached to the general ALU that produce results at the ALU's request. A software implementation of these functions causes a significant performance drop and may be unacceptable for real-time graphics applications. When multiple ALUs are combined in a SIMD architecture, a special hardware unit must be attached to every ALU, which noticeably increases hardware cost. Special functions are not used often in shader programs, so most of the time the special hardware units attached to each ALU sit idle.

Sharing one special function unit (SFU) among several ALUs solves part of the problem, but in a SIMD architecture a thread is stalled until all streams have obtained their results from the shared SFU, which processes their requests sequentially. Complex mathematical functions in a shader program may therefore cost several extra cycles of computation. A SIMD stream architecture should have special arrangements to reduce stall cycles and to provide smooth stream processing with the fewest extra cycles when non-uniform program threads are interleaved.

ALUs used for multiprocessing often sustain high data throughput. Such ALUs should be able to share the same hardware that processes longer formats in order to process more short-format data streams. In general, the ALUs of current GPUs handle only one floating-point format (for example, the 32-bit IEEE standard format) and are often inefficient when processing low-precision pixel and texture data. If another data format is supported, these ALUs often operate on the same number of streams regardless of the data format, with no (or only slight) throughput improvement and no variability of the SIMD factor. Furthermore, current ALUs usually cannot freely interleave instruction streams (they lack support for non-uniform program threads). In addition, current dual-format multiply-accumulate (MACC) units usually process integer data only.

Vector processors with a fixed data format and a fixed SIMD factor usually have little hardware overhead, and they usually process stream data relatively slowly when the number of elements in a vector stream is smaller than the width of a vector unit. In addition, current graphics shader architectures generally have limited instruction-set capability for processing data of different formats within the same instruction.

Therefore, a heretofore unaddressed need exists in the industry to address the deficiencies and inadequacies described above.

[Summary of the Invention]

Embodiments of a stream processor configured to process data in several different formats are provided. At least one embodiment of the stream processor includes a first scalar ALU that processes short-format data in response to a short-format control signal in a received instruction set and processes long-format data in response to a long-format control signal in the received instruction set. Several embodiments of the stream processor also include a second ALU configured to receive data processed by the first ALU and, according to a control signal in the instruction set, to process input data together with the data processed by the first ALU. Still other embodiments include an SFU that provides additional computational functions to the first ALU and the second ALU.

Also included in this specification are embodiments of a method for processing data in any of several different formats. At least one embodiment of the method includes determining whether received data is short-format data and, in response to the received data being short-format data, functionally partitioning the first ALU for processing according to an instruction set. Other embodiments of the method include sending the processed data to a functionally partitioned second ALU.

Also included in this specification are embodiments of a modular stream processor for processing data in different formats. At least one embodiment of the modular stream processor includes a first ALU configured to receive first input data and control data, the control data indicating the format of the received data. The first ALU is further configured to process short-format input data and long-format input data according to the control data. Some embodiments include a second ALU configured to receive the control data from the first ALU. The second ALU is further configured to process second input data, the second input data being related to the first input data, and to process short-format data and long-format data according to the control data. Still other embodiments include a third ALU configured to receive the control data from the second ALU. The third ALU is further configured to receive third input data, the third input data being related to the first and second input data, and to process short-format data and long-format data according to the control data. Some embodiments include a fourth ALU configured to receive the control data from the third ALU. The fourth ALU is further configured to receive fourth input data, the fourth input data being related to the first, second, and third input data, and to process short-format data and long-format data according to the control data.

The systems, methods, features, and advantages disclosed herein will become apparent to those skilled in the art upon reading the following drawings and detailed description. All such additional systems, methods, features, and advantages are intended to be included in the following description and to fall within the scope of this disclosure.

[Embodiments]

Fig. 1A is a flowchart illustrating the stream data processing steps of an exemplary processing unit that combines a vector ALU with an SFU. More specifically, Fig. 1A shows a stream vector processing unit with a conventional architecture 100. As shown, an input stream of three-dimensional (3D) graphics data vectors is sent to an input buffer conventional memory 102, which in this example passes the vector data to a vector ALU 104. As the successive instruction cycles show, each vector has four components: X, Y, Z, and W. The vectors are sent from the input buffer conventional memory 102 to the vector ALU 104 arranged so that the components of each vector are kept together. The vector ALU 104 and the SFU 106 can perform the desired operations to produce an output for each element of the current vector. An SFU is designed to handle many types of operations, such as sine, cosine, square root, fraction, exponent, and so on.

Fig. 1B is a flowchart, similar to Fig. 1A, illustrating steps that can be performed in an exemplary scalar processing unit. Fig. 1B illustrates vector data processing using a stream processor with four scalar ALUs 124. More specifically, a stream of 3D graphics data vectors is input to an input data buffer four-bank orthogonal access memory 122. The memory 122 in this example provides a vertical access pattern for data reads and a horizontal access pattern for data writes. This type of memory has special vector-element multiplexers and address generators for one or more memory banks, as disclosed and discussed in U.S. Patent Application 20040172517, filed September 19, 2003, which is incorporated herein in its entirety.

The input data buffer four-bank orthogonal access memory 122 can send the reorganized (vertical) vector data to the scalar ALUs 124a-124d. More specifically, it successively sends the elements of the first vector (W1, Z1, Y1, X1) to scalar ALU1 124a, the elements of the second vector (W2, Z2, Y2, X2) to scalar ALU2 124b, the elements of the third vector to scalar ALU3 124c, and the elements of the fourth vector to scalar ALU4 124d. The scalar ALUs 124a-124d and the SFU 126 can thus process the vector data and send the processed data to buffers S1, S2, S3, and S4, respectively. The output buffers S1-S4 then send the data to an output orthogonal converter 130, which converts the received data to horizontal vector format. In other words, the output orthogonal converter 130 can be designed to convert the processed data from a scalar-sequence (vertical) representation to a horizontal vector representation. The data can then be output as Xout, Yout, Zout, and Wout, as shown in the figure.

Note that although the vector processing unit with the conventional architecture 100 processes the vector data one vector at a time, vector data processing with the stream processor 120 having four scalar ALUs has no such requirement.
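
The horizontal-write, vertical-read behaviour of the orthogonal access memory and the output converter described for Fig. 1B can be pictured with a small software model. The sketch below is only an illustration under stated assumptions: the function names (`vertical_read`, `run_scalar_lanes`, `orthogonal_convert`) are hypothetical, the per-element operation is an arbitrary placeholder, and nothing here reflects the actual hardware multiplexers or address generators.

```python
from typing import Callable, List, Sequence, Tuple

def vertical_read(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Each row holds one vector (X, Y, Z, W), written horizontally.
    Reading column by column yields, per cycle, one component of every
    vector, i.e. one element for each scalar ALU lane (assumed model)."""
    return [list(column) for column in zip(*rows)]

def run_scalar_lanes(rows: Sequence[Sequence[float]],
                     op: Callable[[float], float]) -> List[List[float]]:
    """Feed lane i the components of vector i one per cycle and collect
    the results in per-lane buffers (standing in for S1..S4)."""
    buffers: List[List[float]] = [[] for _ in rows]
    for cycle_slice in vertical_read(rows):
        for lane, element in enumerate(cycle_slice):
            buffers[lane].append(op(element))
    return buffers

def orthogonal_convert(buffers: Sequence[Sequence[float]]) -> List[Tuple[float, ...]]:
    """Turn each lane's serial results back into one horizontal vector."""
    return [tuple(buf) for buf in buffers]

vectors = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15, 16)]
per_cycle = vertical_read(vectors)
results = orthogonal_convert(run_scalar_lanes(vectors, lambda e: e * 2))
```

In this model `per_cycle[0]` holds the X components of all four vectors, which is what the four lanes would receive in the same cycle, and `results` holds one processed four-component vector per lane.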

As shown, the elements of the vector data can be processed in any order and then reordered for output. Moreover, although both the stream processor 120 with four scalar ALUs and the vector processing unit with the conventional architecture 100 receive the vector data as a data set, this is not a requirement. The elements of the vector data can be received as scalars in any order and processed in SIMD fashion.

As mentioned earlier, a SIMD stream processor can be designed to perform complex mathematical operations (special functions) such as square root, sine, cosine, and others to support graphics data processing in contemporary GPUs. A vector ALU can have an attached (or otherwise accessible) SFU. The SFU starts working whenever an appropriate instruction reaches the ALU, and it can be regarded as an independent channel of the ALU.

Fig. 1C shows a stream processing SIMD architecture with a software implementation of complex mathematical functions. With SIMD scalar ALUs there are several options for implementing special functions. Here each ALU has a special additional lookup table and a slightly modified data path to execute the special-function computation sequence in a special routine (for example, the Newton-Raphson algorithm for computing a square root). The latency of special-function computation in this example equals the number of instructions in each special-function routine multiplied by the instruction execution cycle of the SIMD scalar ALU. The problem with this implementation is that the latency depends significantly on the number of instructions executed by each ALU.

Fig. 1D shows a stream processing SIMD architecture with a hardware implementation of complex mathematical functions, in which each ALU uses its own SFU. As shown in Fig. 1D, each scalar ALU has its own SFU hardware. The problem with this approach is the excess hardware, which is rarely used. However, the latency of special-function computation is then minimal, typically equal to the average instruction execution cycle.
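
To make the software option of Fig. 1C concrete, the following sketch runs a Newton-Raphson square root as a short routine and counts the instructions it would occupy, so that the latency rule stated above (instructions per routine multiplied by the instruction execution cycle) can be evaluated. The seed choice, the cost of three instructions per iteration, and the names `newton_sqrt` and `routine_latency` are assumptions made for illustration; the patent does not specify the routine.

```python
from typing import Tuple

def newton_sqrt(a: float, iterations: int = 6) -> Tuple[float, int]:
    """Approximate sqrt(a) with Newton-Raphson: x <- 0.5 * (x + a / x).
    Returns the estimate and an assumed instruction count."""
    if a < 0.0:
        raise ValueError("square root of a negative number")
    if a == 0.0:
        return 0.0, 0
    # Crude seed; a real ALU would more likely start from a lookup table.
    x = a if a >= 1.0 else 1.0
    instructions = 0
    for _ in range(iterations):
        x = 0.5 * (x + a / x)
        instructions += 3  # assumed cost: one divide, one add, one multiply
    return x, instructions

def routine_latency(instructions: int, cycles_per_instruction: int) -> int:
    """Latency of the routine = instruction count * instruction execution cycle."""
    return instructions * cycles_per_instruction

root, count = newton_sqrt(2.0)
latency = routine_latency(count, cycles_per_instruction=2)
```

With six iterations and the assumed three instructions per iteration, the routine occupies the ALU for 18 instructions, which illustrates why the text calls the instruction count the dominant factor in the latency of this option.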

Fig. 1E shows a stream processing SIMD architecture with a hardware implementation of complex mathematical functions, in which all ALUs use one common SFU. As shown, hardware cost is reduced by using common SFU hardware that can handle requests from multiple ALUs. One problem with this implementation is that all ALUs incur a noticeable stall while the SFU sequentially processes the requests of all ALUs and computes the values for all streams. In general, all ALUs remain stalled until the last ALU receives its return value from the SFU. The total latency of such an operation equals the processing period of the SFU multiplied by the total number of scalar ALUs connected to it.

Fig. 1F shows a stream processing SIMD architecture with a hardware implementation of complex mathematical functions and interleaved access to a common SFU. The latency of the SFU can be reduced by having several ALUs access the SFU in an interleaved manner. More specifically, Fig. 1F is an embodiment of a stream processing SIMD architecture that uses a common SFU. In this architecture, requests from the different scalar ALUs are staggered in time using special delay registers, and execution of the same SIMD instruction is rearranged accordingly. The latency of each stream then equals the latency of an individual SFU, and the remaining delay relative to the previous architecture is absorbed by the delay registers.

Another issue affecting the efficiency of a SIMD scalar stream processor when handling different types of input streams is the SIMD factor. Input streams may contain vertex, triangle, and pixel data, and accumulating the required input data can cause significant delay and also lengthen the data life span in local memory.
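
The difference between the shared SFU of Fig. 1E and the interleaved access of Fig. 1F can be illustrated with a toy timing model. This is a sketch under simplifying assumptions, not a description of the actual hardware: it assumes a fixed SFU processing period, one request per ALU, and a stagger of exactly one SFU period between consecutive ALUs; the function names are hypothetical.

```python
from typing import List, Tuple

def serial_shared_sfu(n_alus: int, sfu_cycles: int) -> int:
    """Fig. 1E style: all ALUs request in the same cycle, the SFU serves
    them one after another, and every ALU stalls until the last result.
    Total latency = SFU processing period * number of ALUs."""
    return n_alus * sfu_cycles

def interleaved_shared_sfu(n_alus: int, sfu_cycles: int) -> List[Tuple[int, int, int]]:
    """Fig. 1F style (assumed model): ALU k issues the same instruction
    k * sfu_cycles later, so the SFU is free when its request arrives.
    Returns (issue_cycle, result_cycle, wait_cycles) for each ALU."""
    schedule: List[Tuple[int, int, int]] = []
    sfu_free_at = 0
    for k in range(n_alus):
        issue = k * sfu_cycles
        start = max(issue, sfu_free_at)  # wait only if the SFU is still busy
        done = start + sfu_cycles
        sfu_free_at = done
        schedule.append((issue, done, done - issue))
    return schedule

stall_all = serial_shared_sfu(n_alus=4, sfu_cycles=3)
staggered = interleaved_shared_sfu(n_alus=4, sfu_cycles=3)
waits = [wait for _, _, wait in staggered]
```

In this model every ALU in the serial case is held for the full 12 cycles, while in the staggered case each ALU waits only for one SFU period, which matches the statement that the per-stream latency equals the latency of an individual SFU.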

Fig. 1G is an example of reducing the SIMD factor, showing vertex and triangle processing on a common shared SIMD architecture. As shown, Fig. 1G illustrates vertex and triangle stream processing on the same SIMD architecture when four ALUs process the stream data with a SIMD factor of 4. A vertex packet to be processed contains data for four vertices. A triangle packet to be processed contains 12 vertices, and the lead time needed to accumulate a complete packet can cause a significant delay before triangle vertex processing starts. This is why reducing the SIMD factor from 4 to 2 to 1 when processing triangles on the same four-ALU architecture has become an important issue for current graphics processing units.

Fig. 2A is a flowchart illustrating a scalar processing unit, similar to the flowchart of Fig. 1, with a SIMD factor of 4. As shown, Fig. 2A uses scalar ALUs to process vector stream data with a SIMD factor of 4 and long-format data. As in the data flow of Fig. 1B, the vector data need not flow as a single data set. As each data element arrives at its ALU (ALU0 204a, ALU1 204b, ALU2 204c, and ALU3 204d), the ALU can process the data according to ALU instructions that are delivered in step with the delay of the data. As also shown, ALU0 204a receives its data earlier than ALU1 204b. Likewise, ALU2 204c is delayed relative to ALU1 204b, and ALU3 204d is delayed relative to ALU2 204c. After the data is processed, it is sent to buffers S1, S2, S3, and S4 with the corresponding synchronization delays.

Note that the SIMD factor of Fig. 2A is 4 because the four ALUs perform substantially the same operation. In addition, as shown in Fig. 2A, each ALU processes 36-bit long-format data.

Fig. 2B is a flowchart illustrating a scalar processing unit, similar to the flowchart of Fig. 2A, with a SIMD factor of 1, in which the results of the four ALUs are combined into the result of ALU3. As shown, Fig. 2B illustrates several scalar ALUs processing vector stream data with a SIMD factor of 1 and long-format data. Whereas the structure of Fig. 2A shows vector data sent to the ALUs in a form other than a vector-element data set, the structure of Fig. 2B shows vector data sent to the ALUs as a vector data set. More specifically, in Fig. 2B data X1 is sent to ALU0. ALU0 can process the data and send at least part of the result to ALU1, and ALU0 also sends output data to the element shuffler 226. ALU1, with a delay, receives the output data of ALU0 together with data Y1, and sends its output data to the element shuffler 226 and to ALU2. ALU2 receives data Z1 and the data from ALU1, and sends its output data to the element shuffler 226 and to ALU3. ALU3 receives data W1 and the data from ALU2, and sends its output data to the element shuffler 226. The element shuffler 226 can send data to one or more of the outputs Xout, Yout, Zout, and Wout. In this example, if the operation is a vector dot product, this mode is preferable for processing a small number of streams, such as triangle and vertex packets, in fewer clock cycles.

Note that the SIMD factor of Fig. 2B is 1 because each ALU executes the same instruction but with a different number of operands. More specifically, because each ALU receives data from the preceding ALU, the ALUs perform different operations depending on their positions.
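
The chained data flow of Fig. 2B, in which each ALU combines its own operands with the partial result handed over by the preceding ALU, can be sketched for the dot-product case as a sequence of multiply-accumulate steps. The function name `chained_dot` and the use of plain Python numbers are illustrative assumptions; the sketch ignores the 36-bit format, the per-ALU delays, and the element shuffler.

```python
from typing import List, Sequence

def chained_dot(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Model four chained ALUs: ALU k computes D_k = A_k * B_k + D_(k-1),
    with D_(-1) = 0. Returns the output of every ALU in the chain."""
    if len(a) != len(b):
        raise ValueError("operand vectors must have the same length")
    partial = 0.0
    outputs: List[float] = []
    for a_k, b_k in zip(a, b):
        partial = a_k * b_k + partial  # multiply-accumulate in ALU k
        outputs.append(partial)
    return outputs

# (X1, Y1, Z1, W1) . (X2, Y2, Z2, W2)
stage_outputs = chained_dot((1.0, 2.0, 3.0, 4.0), (5.0, 6.0, 7.0, 8.0))
dot_product = stage_outputs[-1]  # the full result appears at the last ALU
```

Only the last stage holds the complete dot product, which is why the result is taken from the end of the chain and can then be routed to whichever output position is needed.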

perform different operations. As described above, an embodiment that implements the inner-product instruction with these ALUs has the following functions:

ALU0: D0 = A0*B0 + 0, implementing X1*X2
ALU1: D1 = A1*B1 + D0, implementing Y1*Y2 + X1*X2
ALU2: D2 = A2*B2 + D1, implementing Z1*Z2 + Y1*Y2 + X1*X2
ALU3: D3 = A3*B3 + D2, implementing W1*W2 + Z1*Z2 + Y1*Y2 + X1*X2

The final result is available at the output of ALU3 and can be moved to any vector position for later use. In addition, as in Fig. 2A, each ALU in Fig. 2B processes 36-bit long-format data.

Fig. 2C is a flow diagram illustrating a scalar processing unit, similar to the diagram of Fig. 2A, with a SIMD factor of 8 and short-format data. This scalar processing unit contains the same number of ALUs as Fig. 2A; in Fig. 2C, however, each ALU separately processes two short-format data streams (for example, 18-bit elements rather than 36-bit elements). As shown, Fig. 2C includes vector stream data processing by several scalar ALUs that handle short-format data with a SIMD factor of 8.
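The inner-product functions listed above for Fig. 2B chain each ALU's result into the next stage. A minimal sketch of that cascade follows, assuming plain Python floats in place of the 36-bit long-format values; the function name `inner_product_cascade` and the sample vectors are illustrative and do not come from the patent.

```python
def inner_product_cascade(a, b):
    """Chain multiply-add stages: D_i = A_i * B_i + D_(i-1), starting from 0.

    Each loop iteration stands in for one ALU of Fig. 2B (ALU0..ALU3).
    Returns the partial result seen at the output of every stage.
    """
    partial = 0.0          # ALU0 adds 0, as in D0 = A0*B0 + 0
    stage_outputs = []
    for a_i, b_i in zip(a, b):
        partial = a_i * b_i + partial   # D_i = A_i*B_i + D_(i-1)
        stage_outputs.append(partial)
    return stage_outputs


# Hypothetical operands: (X1, Y1, Z1, W1) and (X2, Y2, Z2, W2).
v1 = (1.0, 2.0, 3.0, 4.0)
v2 = (5.0, 6.0, 7.0, 8.0)
stages = inner_product_cascade(v1, v2)
print(stages[-1])  # output of the last stage (ALU3), i.e. the full inner product
```

Only the last stage carries the complete sum, which is consistent with the text's statement that the final result appears at the output of ALU3.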
A SIMD factor of 8 means that the ALUs can process eight sets of input data and produce eight results from the same instruction, issued with different delays. More specifically, the vector data can be 18 bits wide (short format) rather than the 36 bits (long format) described earlier. The vector element W1 of the previous example is now split into two short-format elements, W1.0 and W1.1. Likewise, X, Y, and Z, as well as the other data sets 2, 3, and 4, are represented as short-format data. In addition, Fig. 2B also indicates that the data input to the ALUs need not be related to a single vector element data set. More specifically,

just as the data input to each ALU need not be mutually related, the ALUs are not limited to processing vector data sets.

This embodiment also includes several split (bifurcated) ALUs that handle short-format data more efficiently. More specifically, data X1.0 is input to the left half of ALU0 (ALU0.0), while the right half of ALU0 (ALU0.1) receives data X1.1. The data input to ALU0.0 and ALU0.1 are processed and sent to output buffers S1.0 and S1.1, respectively. Likewise, data X2.0 and X2.1 are sent to the left half (ALU1.0) and the right half (ALU1.1) of ALU1. As shown, ALU1.0 and ALU1.1 process their data later than ALU0.0 and ALU0.1. After processing, ALU1.0 and ALU1.1 send their output data to output buffers S2.0 and S2.1, respectively.

In the same way, ALU2.0 and ALU2.1 receive data X3.0 and X3.1, respectively. After processing the received data, ALU2.0 and ALU2.1 send their output data to output buffers S3.0 and S3.1. The data processing of ALU2.0 and ALU2.1 also takes place later than that of the preceding ALU. As before, ALU3.0 and ALU3.1 receive data X4.0 and X4.1, respectively. After ALU3.0 and ALU3.1 process the received data (from ALU2.0 and ALU2.1), they send their output data to output buffers S4.0 and S4.1.

Because all eight ALUs (physically four dual-lane ALUs, each logically split into two halves) execute the same instruction, the SIMD factor of Fig. 2C is 8. In addition, the ALUs of Fig. 2C can receive and process 18-bit (short-format) data as well as 36-bit (long-format)

data.

Fig. 2D is a flow diagram illustrating a scalar processing unit, similar to the diagram of Fig. 2A, with a SIMD factor of 4 and long-format data. Fig. 2D includes vector stream data processing by short-format scalar ALUs. As shown, the data input to the ALUs, as in Fig. 2C, may or may not be organized as a data set. As in the previous example, data X1.0 and X1.1 are input to ALU0.0 and ALU0.1. In this example, however, ALU0.1 is slightly delayed relative to ALU0.0 and uses the result of ALU0.0: ALU0.1 receives input not only from X1.1 but also from the output of ALU0.0. Likewise, ALU1.0 receives data X2.0, processes it, and outputs the processed data to ALU1.1. ALU1.1 receives the output data of ALU1.0 together with data X2.1. After processing the received data, ALU1.1 outputs the processed data to output buffer S2.1. ALU2.0 receives data X3.0, processes it, and outputs the result to ALU2.1. ALU2.1 receives the output data of ALU2.0 and also receives data X3.1. ALU2.1 processes the received data and outputs the result to output buffer S3.1. ALU3.0 receives input data X4.0, processes it, and outputs the processed data to ALU3.1. ALU3.1 receives the output data of ALU3.0 and also receives data X4.1. ALU3.1 processes the received data and sends the processed data to output buffer S4.1.

Such an embodiment of the ALUs can be used to implement the following functions:

ALU0.0: d0.0 = a0.0*b0.0 + 0
ALU0.1: d0.1 = a0.1*b0.1 + d0.0
ALU1.0: d1.0 = a1.0*b1.0 + 0


ALU1.1: d1.1 = a1.1*b1.1 + d1.0
ALU2.0: d2.0 = a2.0*b2.0 + 0
ALU2.1: d2.1 = a2.1*b2.1 + d2.0
ALU3.0: d3.0 = a3.0*b3.0 + 0
ALU3.1: d3.1 = a3.1*b3.1 + d3.0

Because eight ALUs process data but only four outputs carry results, the SIMD factor of the logic in Fig. 2D is 4. In addition, since ALU0.0 passes data to ALU0.1, ALU0.1 has a slight processing delay relative to ALU0.0: ALU0.1 waits until ALU0.0 has processed data X1.0 before it receives and processes the output data of ALU0.0 together with data X1.1. Similar delays and processing occur in the remaining ALUs.

Fig. 3 shows the logical architecture of a pair of scalar ALUs capable of dual-format processing, illustrating the processing features of Fig. 1 and Figs. 2A-2D. More specifically, Fig. 3 covers several embodiments of a stream data processor that can be configured to process data in any of several formats. At least one embodiment includes a first ALU that processes a plurality of first sets of short-format floating-point data in response to a short-format control signal received from the instruction set. The first scalar ALU also processes a first set of long-format floating-point data in response to a long-format control signal received from the instruction set.
In addition, some embodiments include a second ALU that processes a plurality of second sets of short-format floating-point data in response to a short-format control signal received from the instruction set; processes a second set of long-format floating-point data in response to a long-format control signal received from the instruction set; receives processed data from the first ALU; and processes the input data together with the processed data from the first ALU according to a control signal of the instruction set. Some embodiments include an SFU that provides additional

computational functions to the first ALU and the second ALU. Furthermore, some embodiments are designed so that, on receiving short-format data, the stream processor can functionally split at least one pair of ALUs to support dual-format processing of short and long formats; in other words, its SIMD factor is variable. The instruction set used by some embodiments includes at least one instruction for processing in at least one of the following modes: short-format operand mode, long-format operand mode, and mixed-format operand mode. The instruction set of some embodiments is designed to control a variable SIMD folding mode that covers two different cases: the output data of the first ALU being passed to the second ALU as an operand in long-format mode, and the output data of the first ALU being passed to the second ALU as an operand in short-format mode.

More specifically, the two ALUs of Fig. 3, ALU0 310 and ALU1 320, can process the long data format and the short data format with SIMD factors of 2 and 4, respectively.
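To make the two SIMD factors concrete, the following is a minimal sketch of how one 36-bit operand can be treated either as a single long-format element or as two 18-bit short-format elements. It works on raw bit patterns only; the floating-point encodings are not modelled, and the assignment of the upper bits to the first lane, as well as the names `split_lanes` and `simd_factor`, are assumptions made for illustration.

```python
LONG_BITS = 36                       # long-format element width from the text
SHORT_BITS = 18                      # short-format element width from the text
SHORT_MASK = (1 << SHORT_BITS) - 1


def split_lanes(word, short_format):
    """Return the elements carried by one 36-bit operand word.

    Long format: one 36-bit element.
    Short format: two 18-bit elements (upper half first, by assumption).
    """
    if not 0 <= word < (1 << LONG_BITS):
        raise ValueError("operand must fit in 36 bits")
    if short_format:
        return [(word >> SHORT_BITS) & SHORT_MASK, word & SHORT_MASK]
    return [word]


def simd_factor(num_alus, short_format):
    """Results produced per instruction: one per ALU, or two when each ALU is split."""
    return num_alus * (2 if short_format else 1)


word = (5 << SHORT_BITS) | 9
print(split_lanes(word, short_format=True))    # [5, 9]
print(split_lanes(word, short_format=False))   # the single 36-bit element
print(simd_factor(2, False), simd_factor(2, True))  # 2 4
```

With two ALUs, the sketch gives the factors 2 (long format) and 4 (short format) stated for the ALU pair of Fig. 3.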
The architecture shown depicts a data path that includes local multipliers and adders, combined with local multiply-accumulate (MACC) registers that can handle both short-format and long-format data. In this example, data from the SFU is received in an accumulator register 370 for ALU0 and ALU1. Coupled to the accumulator are a cache memory input data module 372 and ALU input port P0 376. ALU input port P0 376 can handle 72 bits as four segments. Coupled to cache memory input data 372 is ALU input port P1 378. Like ALU input port P0 376, ALU input port P1 378 can handle 72-bit data as four 18-bit segments. Coupled to ALU input port P1 378 is ALU input port P2 380, which can handle 72-bit data as four segments of

18 bits.

Coupled to ALU input ports P0 376, P1 378, and P2 380 is ALU0 310, which includes input multiplexer 382a and input multiplexer 384a. Input multiplexer 382a includes output ports CH, A1H, B0L, A1L, and B1L. Input multiplexer 384a includes output ports A0H, B0H, A0L, B1H, and CL. Output port CH is coupled to adder 396a. Output ports A1H and B0L are coupled to multiplier 386a, which is also coupled to adder 396a. Output ports A1L and B1L are coupled to multiplier 388a; multiplier 388a is coupled to 13-bit shifter 371a, which is coupled to adder 396a.

From input multiplexer 384a, output ports A0H and B0H are coupled to multiplier 392a, which is also coupled to adder 399a. Output ports A0L and B1H are coupled to multiplier 390a; multiplier 390a is coupled to 13-bit shifter 373a, which is coupled to adder 399a. Output port CL is coupled to adder 399a. Adders 396a and 399a are coupled together through a 13-bit shifter and enable device 398a. Multiply-accumulate (MACC) units 394a and 397a are coupled to adders 396a and 399a, respectively. The outputs of adders 396a and 399a are coupled to low output port DL and high output port DH, respectively.

Input ports P0 376, P1 378, and P2 380 are also coupled to ALU1 320 through delay register 383. Delay register 383 is coupled to input multiplexers 382b and 384b. Output port CH of input multiplexer 382b is coupled to adder 396b. Output ports A1H and B0L are coupled to multiplier 386b, which is coupled to adder 396b. Output ports A1L

and B1L are coupled to multiplier 388b, which is coupled through 13-bit shifter 371b to adder 396b.

The outputs of input multiplexer 384b include A0H and B0H, which are coupled to multiplier 392b. Multiplier 392b is coupled to adder 399b. Output ports A0L and B1H are coupled to multiplier 390b, which is coupled through 13-bit shifter 373b to adder 399b. Output port CL is coupled to adder 399b. Adders 396b and 399b are coupled together through a 13-bit shifter and enable device 398b. Multiply-accumulate (MACC) units 394b and 397b are coupled to adders 396b and 399b, respectively. Adder 396b is coupled to low output port DL, and adder 399b is coupled to high output port DH. In this example, a bypass element 395 and an output CL data element 393 are coupled between ALU0 310 and ALU1 320, making use of the one-clock-cycle delay of ALU1 320.

Note that the elements drawn in Fig. 3 illustrate the logical architecture of the operation.
More specifically, the architecture drawn in Fig. 3 illustrates the design principles of ALUs with bifurcated data paths and a variable SIMD factor.

Fig. 4 illustrates a stream processing unit with paired scalar ALUs, similar to the architecture of Fig. 3. As shown, data is input to cache memory unit 472, which contains L0, L1, S0, S1, S2, S3, and so on. Cache memory unit 472 passes the stored data to memory output multiplexer 474, which is coupled to input ports P0 476, P1 478, and P2 480. Ports P0 476, P1 478, and P2 480 are also coupled to input multiplexer 482a, which is coupled to ALU0. In this example, ALU0 computes D0 = A0*B0 + C0, and the result is output on D0L.

Ports P0 476, P1 478, and P2 480 are also coupled to delay register 483. Delay register 483 is coupled to input multiplexer 482b, which in turn connects to ALU1. In this example, ALU1 computes D1 = A1*B1 + C1 + D0, and the result is output on D1L. Output port D0L of ALU0 is also coupled to ALU1. Those skilled in the art will appreciate that, in this particular example, ALU1 uses the output value of ALU0 in its computation. More specifically, ALU0 computes the value of D0, which is then passed to delay register 386; D0 passes from delay register 386 to ALU1 for the computation of D1.

Coupled to the outputs of ALU0 and ALU1 is multiplexer 484, which is coupled to SFU 470, shared by ALU0 and ALU1. SFU 470 is also coupled to the input of ALU0 and (through delay register 483) to the input of ALU1. The outputs of ALU0 and ALU1 are also coupled to cache memory unit 472 and to other units.

Fig. 4 also includes a SIMD microcode controller 488, which determines and delivers the required operation control signals to ALU0 and ALU1.
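The dependency between the two ALUs of Fig. 4 can be sketched as a small clock-by-clock model. This is only an illustration under stated assumptions: delay registers 483 and 386 are collapsed into one Python tuple, each issue is assumed to take exactly one clock, and the function name `run_alu_pair` and the operand values are hypothetical.

```python
def run_alu_pair(issues):
    """Model ALU0 feeding ALU1 one clock later.

    issues: list of (a0, b0, c0, a1, b1, c1), one tuple per clock.
    Returns a list of (d0, d1) pairs, where
        d0 = a0*b0 + c0          (ALU0)
        d1 = a1*b1 + c1 + d0     (ALU1, one clock after ALU0)
    """
    results = []
    delayed = None  # stands in for the delay registers: (a1, b1, c1, d0)
    for issue in list(issues) + [None]:  # one extra clock drains the pipeline
        if delayed is not None:
            a1, b1, c1, d0 = delayed
            results.append((d0, a1 * b1 + c1 + d0))   # ALU1 consumes delayed data
        if issue is None:
            delayed = None
        else:
            a0, b0, c0, a1, b1, c1 = issue
            d0 = a0 * b0 + c0                         # ALU0 result for this clock
            delayed = (a1, b1, c1, d0)                # held until the next clock
    return results


print(run_alu_pair([(1, 2, 3, 4, 5, 6)]))  # [(5, 31)]
```

In the sample call, ALU0 yields 1*2 + 3 = 5 and ALU1, one clock later, yields 4*5 + 6 + 5 = 31.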
Coupled to SIMD microcode controller 488 is ALU control and addressing element 490. Delay register 483 is coupled between ALU control and addressing element 490 and ALU1.

Note that Fig. 3 is an embodiment directed at short-format data processing, while Fig. 4 is an embodiment directed at long-format data processing. More specifically, although the embodiments disclosed here include the ability to handle short formats, long formats, mixed formats, and so on, many of them may also include the processing of rearranged data.

Fig. 5A is a table illustrating the arithmetic functions of the paired ALUs described in Figs. 3 and 4. The table lists all possible operations of ALU0 and ALU1. These operations can be performed on short 18-bit, long 36-bit,

and mixed 18/36-bit floating-point data. All operations are divided into major groups: regular, mixed, and cross operations. Each group has normal operations and quad/double-type operations on 18/36-bit data. The quad/double types use data passed between blocks of the same ALU or between ALU0 and ALU1. At the top of the table are columns named after the inputs of ALU0 and ALU1 in Fig. 3, together with the same data path control signals shown in that figure.

Each operation is described by two rows. The first row shows the input data from ALU ports P0, P1, and P2 (particular elements such as P0.0, P0.1, and so on) and the state of some data path control signals. The second row gives the formula whose result is delivered to output ports DL and DH. The last column contains the SIMD factor of the ALU pair for the particular operation. Such ALU pairs can be replicated several times to increase the overall SIMD factor. The right side of the table contains abbreviated notes for the operations.
The arithmetic functions of the ALU hardware use the multiplication sign "S" and the addition sign "s", and the table also covers the role of the multiply-accumulate (MAC) register in particular operations. The detailed instruction set description below illustrates the overall functionality of the stream processor.

Fig. 5B includes a GPU in which SIMD stream processors serve as the computing core. This example contains four stream processors, each containing four ALU pairs and two SFUs. Embodiments of the stream processor handle different types of data (geometry and pixels/texels); by using different instructions of the instruction set, a variable SIMD factor can be provided for the different types of data in use.

Stream processor instructions may be 3 to 9 bytes long, depending on the instruction type and addressing mode. An instruction consists of the following parts: (1) the body (a

general instruction or a flow control instruction); (2) instruction prefixes, which can forward the result of a general instruction to the SFU or repeat the execution of a general instruction; and (3) instruction modifiers, which can change the operand size, set flags, and control result write-back. The instruction encoding principles are listed below:

Instruction type | 1st byte | 2nd byte | 3rd byte | Address bytes
General instruction format | opcode | operand | operand | operand
Instruction prefix (special function unit) | prefix opcode | none | none | none
Instruction prefix (repeat control) | repeat opcode | immediate value | none | none
Instruction modifier prefix | modifier opcode | operand modifier | none | none
Data length prefix | data length opcode 1 | data length opcode 2 | none | none
Control flow instruction | control flow opcode 1 | control flow opcode 2 | displacement 1 | displacement 2

Table 1

Based on this format, the stream processor has the following instruction set, classified by function. An example of the stream processor instruction set is given in the following table:

Function | 1st byte | 2nd byte | 3rd byte | 4th-9th bytes

Arithmetic instructions (general instruction format)
MAC, multiply-accumulate | 0000 00sD | short address A | short address B / high part A | B and D addresses (note 1)
MUL, multiply | 0000 010D | short address A | short address B / high part A | B and D addresses (note 1)
ADD, add | 0000 100D | short address A | short address C / high part A | C and D addresses (note 1)
SUB, subtract | 0000 101D | short address A | short address C / high part A | C and D addresses (note 1)
MAD, multiply-add (no MACC) | 0000 11sD | short address A | short address B / high part A | B, C, and D addresses (note 1)
MAC, multiply-accumulate, long format | 01BB CCsD | D high part A7 | short address A | B, C, and D addresses
MAC, multiply-accumulate, short B address | 0100 0CsD | D high part A7 | short address A | B, C, and D addresses (note 1)
MAC, multiply-accumulate, long B address | 0101 0CsD | D high part A7 | short address A | B, C, and D addresses (note 1)
ADD, add, long format | 0110 0C0D | D high part A7 | short address A | B, C, and D addresses (note 1)
SUB, subtract, long format | 0110 0C1D | D high part A7 | short address A | B, C, and D addresses (note 1)
MOV, move | 0110 10xD | D high part A7 | short address A | B, C, and D addresses (note 1)
ADA, ACC add, long format | 0110 110D | D high part A7 | short address A | B, C, and D addresses (note 1)
SBA, ACC subtract, short format | 0110 111D | D high part A7 | short address A | B, C, and D addresses (note 1)
MAA, ACC multiply-accumulate | 0111 0CsD | D high part A7 | short address A | B, C, and D addresses (note 1)
MUA, ACC multiply | 0111 10sD | D high part A7 | short address A | B, C, and D addresses (note 1)
MPA, ACC multiply plus ACC | 0111 110D | D high part A7 | short address A | B, C, and D addresses (note 1)
MMA, ACC multiply minus ACC | 0111 111D | D high part A7 | short address A | B, C, and D addresses (note 1)

Cross product
XRS, cross product | 0001 0SsD | short address A | short address B / high part A | B, C, and D addresses

Blend
BLN, blend | 0010 0SsD | short address A | short address B / high part A | B, C, and D addresses (note 1)
DP2, inner product 2 | 0010 1SsD | short address A | short address B / high part A | B and D addresses (note 1)
BLF, folded blend | 0011 0SsD | short address A | short address B / high part A | B, C, and D addresses (note 1)
DPF, folded inner product | 0011 1SsD | short address A | short address B / high part A | B and D addresses (note 1)
BL8, short blend, SIMD 8 | 1101 0SsD | short address A | short address B / high part A | B, C, and D addresses (note 1)
DPM, inner product, mixed data | 1101 1SsD | short address A | short address B / high part A | B and D addresses (note 1)

Inner product 4
DP4, inner product 4 | 1100 0SsD | short address A | short address B / high part A | B, C, and D addresses (note 1)
DPI, inner product 4 with IDCT shuffle | 1100 1SsD | short address A | short address B / high part A | B, C, and D addresses (note 1)

Instruction prefixes | 1st byte | 2nd byte | 3rd byte | 4th-9th bytes

SFU forwarding prefixes
REC, forward to 1/X | 0001 1001
SQR, forward to SQRT | 0001 1011
RSQ, forward to 1/SQRT | 0001 1011
LOG, forward to LOG | 0001 1101
EXP, forward to EXP | 0001 1110
SIN, forward to SIN | 0001 1111

第2表 附註:、取決於當下運算元B、C、和目的地的運算元 長度 2-假如指令格式是短格式,「S」攔位只影響資 料交換(swap )而不影響寫入遮罩(write masking ) 3 -假如指令格式是短格式或混合格式’「S」爛位 只影響資料交換而不影響寫入遮罩 4- 假如内積和外積指令符號用於二次微分 5- 假如内積4指令符號用於二次微分和四次微 分;運算元C的位址的預設值為運算元A的位址 加1 功能 格式 重覆前置詞 Ist位元組 2nd位元組 3rd位元組 4th-9th位元組 REP重覆指令短無 MACC ΠΙΟΟπτ REP重覆指令短無 MACC lllOlrrrNote to Table 2: Depending on the current operation element B, C, and the operand length of the destination 2 - If the instruction format is short format, the "S" block only affects the data exchange (swap) without affecting the write mask. (write masking) 3 - If the instruction format is short format or mixed format 'S' rotten bit only affects data exchange without affecting write mask 4- if inner product and outer product instruction symbol are used for second derivative 5 - if inner product 4 The instruction symbol is used for the second derivative and the fourth differentiation; the default value of the address of the operation element C is the address of the operation element A plus 1 the function format repeats the preposition Ist byte 2nd byte 3rd byte 4th- 9th byte REP repeated instruction short no MACC ΠΙΟΟπτ REP repeated instruction short no MACC lllOlrrr

Client's Docket No.: S3U04-0012-TW TT’s Docket N〇:0608-A41110-TW/Final/LukeLee(李宗軒)/1,Feb, 2007 1335550Client's Docket No.: S3U04-0012-TW TT’s Docket N〇: 0608-A41110-TW/Final/LukeLee (Li Zongxuan)/1, Feb, 2007 1335550

| Group | Instruction | 1st byte | 2nd byte | 3rd byte | 4th-9th bytes |
|---|---|---|---|---|---|
| Repeat | REP: repeat, counting up, without MACC | 1000 rrrr | repeat_imm8 | | |
| Repeat | REP: repeat, counting down, without MACC | 1001 rrrr | repeat_imm8 | | |
| Repeat | REP: repeat, counting up, with MACC | 1010 rrrr | repeat_imm8 | | |
| Repeat | REP: repeat, counting down, with MACC | 1011 rrrr | repeat_imm8 | | |
| Instruction modifiers (instruction prefixes) | SCS: set scale | 1111 1100 | set_scale_imm8 | | |
| Instruction modifiers (instruction prefixes) | SCT: toggle scale | 1111 1101 | set_scale_imm8 | | |
| Instruction modifiers (instruction prefixes) | OPS: set operand fields | 1111 1110 | set_ops_imm8 | | |
| Instruction modifiers (instruction prefixes) | OPT: toggle operand fields | 1111 1111 | set_ops_imm8 | | |
| Instruction modifiers (instruction prefixes) | CFS: set condition flags | 1111 0fff | set_cf_imm8 | | |
| Instruction modifiers (instruction prefixes) | WBS: set conditional write-back | 0000 0111 | 0010 set_wb_imm4 | | |
| Instruction modifiers (instruction prefixes) | WBT: toggle conditional write-back | 0000 0111 | 0011 set_wb_imm4 | | |
| Data length prefixes | DLS: set data length | 0000 0111 | 0100 11 LL | | |
| Data length prefixes | DLT: toggle data length | 0000 0111 | 0101 11 LL | | |
| Flow control: branch and call | IF: condition with relative label trigger | 0000 0111 | 0001 00 WW | disp8 | |
| Flow control: branch and call | IF: condition with absolute label trigger | 0000 0111 | 0001 01 WW | disp16_low | disp16_high |
| Flow control: branch and call | JC: relative conditional jump | 0000 0111 | 0001 10 WW | disp8 | |
| Flow control: branch and call | JC: absolute conditional jump | 0000 0111 | 0001 11 WW | disp16_low | disp16_high |
| Flow control: branch and call | JMP: unconditional relative jump | 0000 0111 | 0000 10 00 | disp8 | |
| Flow control: branch and call | JMP: unconditional absolute jump | 0000 0111 | 0000 11 00 | disp16_low | disp16_high |
| Flow control: branch and call | CALL: unconditional relative call | 0000 0111 | 0000 10 01 | disp8 | |
| Flow control: branch and call | CALL: unconditional absolute call | 0000 0111 | 0000 11 01 | disp16_low | disp16_high |
| Flow control: branch and call | RET: unconditional return | 0000 0111 | 0000 10 10 | | |
| Flow control: branch and call | ENDIF: relative unconditional jump, set condition off | 0000 0111 | 0000 10 11 | disp8 | |
| Flow control: branch and call | ENDIF: absolute unconditional jump, set condition off | 0000 0111 | 0000 11 11 | disp16_low | disp16_high |
| Loop control | FOR: set loop index counter | 0000 0110 | set_cnt_imm8 | | |
| Loop control | LOOP: relative short loop | 0000 0111 | 0000 00 II | disp8 | |
| Loop control | LOOP: relative long loop | 0000 0111 | 0000 01 II | disp16_low | disp16_high |
| Table lookup | LKP: constant page table lookup | 0000 0111 | 0000 11 10 | short address A | high part A |

Table 3

Notes:
- L depends on the operand lengths of the operands B and C and the destination that are present.
- If the instruction format is the short format, the "S" field affects only data swapping, not write masking.
- With MACC: when repeating, the accumulation is initialized from operand C; when not repeating, it is not initialized (operand C is ignored).
- Without MACC: if bit "C" is set, the address of operand C = the address of operand B + "cc" + 1.
- With MACC and repeat, when initializing with "0", the "cc" field always selects the address of operand C.
- Operand addresses appear in the order {operand A, operand B, operand C, destination}, depending on which operands are present and on their lengths.

| Field symbol | Description |
|---|---|
| A | Operand A |
| B | Operand B |
| C | Operand C |
| D | Destination |
| d | Destination-to-ACC write enable |
| S | Swap folded parts |
| S | Miscellaneous sign, DP4 and outer product |
| rrr(r) | Repeat index |
| WW | Conditional branch and write-back control |
| II | Conditional loop control (same conditions as branch and write-back) |

Table 4

| D | Destination to ACC |
|---|---|
| 0 | Enable write to ACC |
| 1 | Disable write to ACC |

Table 5

| S | Swap folded parts |
|---|---|
| 0 | No swap |
| 1 | Swap folded parts |

Table 6

| WW | Conditional write-back control |
|---|---|
| 00 | Always write |
| 01 | Write only if the zero flag is set (= 0) |
| 10 | Write only if the sign flag is set (< 0) |
| 11 | Write only if the zero or sign flag is set (≤ 0) |

Table 7
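The write-back conditions of Table 7 can be read as a small predicate over the zero and sign flags. The Python sketch below is illustrative only: the function name `write_enabled` and the boolean flag arguments are assumptions made for the example, not names taken from the patent.

```python
def write_enabled(ww: int, zero: bool, sign: bool) -> bool:
    """Evaluate the 2-bit WW field of Table 7 against the condition flags.

    ww   -- 2-bit WW field (0b00 to 0b11)
    zero -- zero flag, set when the result equals 0 (hypothetical name)
    sign -- sign flag, set when the result is below 0 (hypothetical name)
    """
    if ww == 0b00:
        return True           # always write
    if ww == 0b01:
        return zero           # write only if the zero flag is set (= 0)
    if ww == 0b10:
        return sign           # write only if the sign flag is set (< 0)
    if ww == 0b11:
        return zero or sign   # write if the zero or sign flag is set (<= 0)
    raise ValueError("WW is a 2-bit field")
```

For instance, `write_enabled(0b11, zero=False, sign=True)` is true, because a negative result satisfies the "zero or sign" condition.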


| rrr | Repeat count |
|---|---|
| 000 | Repeat count taken from the set value |
| 001 | Repeat to 2 |
| 010 | Repeat to 3 |
| 011 | Repeat to 4 |
| 100 | Repeat to the loop or branch index |
| 101 | Repeat to 6 |
| 110 | Repeat to 7 |
| 111 | Repeat to 8 |

Table 8

| WW or II | Branch and loop control |
|---|---|
| 00 | [illegible] |
| 01 | Execute only if the zero flag is set (= 0) |
| 10 | Execute only if the sign flag is set (< 0) |
| 11 | Execute only if the zero or sign flag is set (≤ 0) |

Table 9

[Opcode matrix: rows 0x to Fx give the high nibble and columns x0 to xF the low nibble of the first instruction byte. The cell positions are not recoverable; the legible entries are listed in reading order.]
Rows 0x to 3x: MAC, BLN, BLF, ADD, REC, SUB, MAD, SQR, RSQ, LOG, SIN, DP2, DPF
Rows 4x to 7x: MAC short B address; MAC long B address; ADD and SUB, long form; MAC ACC multiply-accumulate
Row 8x: REP long format, counting up, no MACC
Row 9x: REP long format, counting down, with MACC
Row Ax: REP long format, counting up, no MACC
Row Bx: REP long format, counting down, with MACC
Row Cx: DP4, DPI
Row Dx: BL8, DPM
Row Ex: REP short format without MACC; REP short format with MACC
Row Fx: CFS, SCS, SCT, OPS, OPT

Table 10: Instruction encoding main matrix (first instruction byte)
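As a rough illustration of how a first instruction byte might be classified, the sketch below follows the bit patterns listed in Table 3 for the REP instruction and the prefix instructions. It is a minimal sketch under stated assumptions: the function name, the returned dictionary keys, and the "ESCAPE" and "OTHER" labels are hypothetical, the OPS pattern 1111 1110 is inferred from the neighbouring SCS, SCT, and OPT encodings, and opcodes outside these groups are not decoded.

```python
def classify_first_byte(byte: int) -> dict:
    """Classify a first instruction byte using the Table 3 bit patterns."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected an 8-bit value")
    high, low = byte >> 4, byte & 0xF

    # 1000 rrrr .. 1011 rrrr: REP; bit 4 selects counting down, bit 5 selects MACC.
    if 0x8 <= high <= 0xB:
        return {
            "mnemonic": "REP",
            "counting_down": bool(byte & 0x10),
            "with_macc": bool(byte & 0x20),
            "rrrr": low,
        }

    if high == 0xF:
        if low & 0x8 == 0:
            # 1111 0fff: CFS, with fff in the three low bits.
            return {"mnemonic": "CFS", "fff": low & 0x7}
        prefixes = {0xC: "SCS", 0xD: "SCT", 0xE: "OPS", 0xF: "OPT"}
        if low in prefixes:
            return {"mnemonic": prefixes[low]}

    if byte == 0x06:
        return {"mnemonic": "FOR"}      # 0000 0110: set loop index counter
    if byte == 0x07:
        return {"mnemonic": "ESCAPE"}   # 0000 0111: the second byte selects the instruction

    return {"mnemonic": "OTHER"}        # not covered by this sketch
```

For example, `classify_first_byte(0xA3)` reports a REP that counts up with MACC and carries `rrrr = 3`.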

[Second-byte matrix for first byte 07: columns x0 to xF; only row 0x is legible.]
Row 0x: LOOP rel, LOOP abs, followed by single-column entries beginning J, C, R, E, J, C, L (matching the Table 3 second bytes for JMP, CALL, RET, ENDIF, JMP, CALL, and LKP)


Table 11: Instruction encoding main matrix (second instruction byte)

Figure 6, which is similar to the ALUs of Figures 3 and 4, illustrates the logical structure and data flow of a stream processor with four scalar ALUs. As shown, input data is sent to four ALUs, labeled ALU0, ALU1, ALU2, and ALU3. More specifically, input data 602a is passed to an input port of ALU0. In addition, control and address signals 602e from the instruction decoder are input to ALU0, as is common data 602f. Data from SFU 670 is also input to ALU0. In instruction execution cycle 1, the data is processed in ALU0.

In execution cycle 1, input data 602b is held in delay register 683a and is then passed to the input port of ALU1. The control and address signals 602e from the instruction decoder are held in delay register 683d and are then passed to the input port of ALU1. Likewise, common data 602f is held in delay register 683e before being input to ALU1. Data from SFU 670 is passed without delay to

ALU1. In instruction execution cycle 2, ALU1 processes the received data.

In execution cycle 1, input data 602c is held in delay register 683b. In execution cycle 2, the data is held in a further delay register. Input data 602c is then passed to ALU2. ALU2 also receives the control and address signals 602e from the instruction decoder via a delay register and delay register 683g. Likewise, the common data is passed to ALU2 via delay registers 683e and 683h. The input data 602d received by ALU3 passes through delay register 683c in instruction execution cycle 1, delay register 683q in instruction execution cycle 2, and delay register 683f in instruction execution cycle 3. Likewise, the control and address signals 602e that ALU3 receives from the instruction decoder pass through delay register 683d in instruction execution cycle 1, delay register 683g in instruction execution cycle 2, and a further delay register in instruction execution cycle 3.
The common data received by ALU3 passes through delay register 683e in instruction execution cycle 1, delay register 683h in instruction execution cycle 2, and delay register 683j in instruction execution cycle 3. The output of ALU3 is passed to a four-slot output buffer 604 of width M and to multiplexer 672, which is coupled to an input port of SFU 670. The outputs of ALU0, ALU1, and ALU2 are also passed to multiplexer 672. The output of ALU2 is passed to the 4×M output buffer 604 via delay register 683o. The output of ALU1 is passed to the 4×M output buffer 604 via delay registers 683l and 683n. The output of ALU0 is passed to the 4×M output buffer 604 via delay registers 683r, 683k, and 683m. Note that in at least one embodiment, Figure 6 may be designed with logic that removes at least one delay from the data path.

Figure 7A is a flowchart illustrating, for a vector ALU, normalized vector

difference processing. More specifically, computing a normalized vector difference on a conventional vector ALU and on a stream SIMD scalar ALU can serve as an example. Figure 7A shows the data flow of the normalized vector difference computation. For example, the vector-architecture implementation of the normalized difference of vectors V1 and V2 is as follows:

    // Data layout: V1 -> r0.xyzw, V2 -> r1.xyzw (x, y, z, w are the components of a graphics data vector)
    // Program for the vector ALU
    SUB r2, r0, r1      // subtraction of all components
    DP3 r3.x, r2, r2    // dot product of three components, result in the x component
    RSQ r3.x, r3.x      // reciprocal square root of the x component
    MUL r2, r2, r3.x    // scale all components by the RSQ result

To process four data sets, this sequence is repeated four times, taking 16 instruction cycles.
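The four-instruction vector program above can be mirrored step by step in ordinary code. The Python sketch below is only an illustration of the arithmetic: the function names are hypothetical, Python floats stand in for the hardware number formats, and the cycle count simply multiplies four instructions by the number of data sets, as the text does.

```python
import math

# One entry per vector-ALU instruction in the program above.
VECTOR_PROGRAM = ("SUB", "DP3", "RSQ", "MUL")


def normalized_difference(v1, v2):
    """Return (v1 - v2) scaled by 1/sqrt of the 3-component dot product."""
    r2 = [a - b for a, b in zip(v1, v2)]   # SUB r2, r0, r1
    dp3 = sum(c * c for c in r2[:3])       # DP3 r3.x, r2, r2 (x, y, z only)
    rsq = 1.0 / math.sqrt(dp3)             # RSQ r3.x, r3.x (fails if dp3 is 0)
    return [c * rsq for c in r2]           # MUL r2, r2, r3.x


def vector_alu_cycles(n_data_sets):
    """Instruction cycles when the program is repeated once per data set."""
    return len(VECTOR_PROGRAM) * n_data_sets
```

With `v1 = [3, 0, 4, 1]` and `v2 = [0, 0, 0, 1]` the difference is `[3, 0, 4, 0]`, its three-component dot product is 25, and the result is approximately `[0.6, 0, 0.8, 0]`. Identical input vectors would divide by zero, which this sketch does not guard against.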
Using the scalar ALUs of Figures 7B-C, the same task can also be accomplished on the SIMD stream processor.

Example: normalized difference of vectors V1 and V2, comparing the conventional approach with the SIMD stream scalar ALU architecture. SIMD layout for the scalar ALUs: V1 -> r0.xyzw = r0[0], r0[1], r0[2], r0[3]; V2 -> r1.xyzw = r1[0], r1[1], r1[2], r1[3] (x, y, z, and w are the components of a graphics data vector; r[0-3] denotes separate scalars).

| Vector ALU | Stream SIMD scalar ALU | Comment |
|---|---|---|
| SUB r2, r0, r1 | Repl(j<3) SUB r2[j], r0[j], r1[j] | Subtraction of all components |
| DP3 r3.x, r2, r2 | Repl(j<3) MAC Null, r2[j], r2[j] | Dot product of all components into the x component, implemented by multiply-accumulate |


Figure 8 shows an example ALU module that implements the functions of the ALUs of Figure 6. More specifically, Figure 8 can be regarded as one embodiment of ALU0 of Figure 6. This embodiment comprises four parts: a data path, with a dual-format multiply-accumulate unit and the necessary input and output multiplexers 870 and 874; a register bank, containing delay register 883a, a write-back register, and several accumulators 878, one per thread; a part having a local ALU register table 880; and a local control unit, with the necessary state machine and address generator 882.

As shown, input data IN0 is passed to multiplexer 870 in the data path portion of the ALU. Input data IN1, IN2, and IN3 are passed respectively to

delay register 883c and two further delay registers, and are then output. The control and address signals CAI are passed to a delay register and then output; they are also passed to the state machine and address generator 882 in the local control portion of the ALU. The common data input CDI is passed to delay register 883b. From delay register 883b, the common data is passed to an output port and is also input to multiplexer 870.

Multiplexer 870 also receives data from the SRAM register table 880, data from write-back register 876, and data from the thread accumulators 878. In the figure, multiplexer 870 has three output ports, each of which can carry M bits of data. The output ports of multiplexer 870 are coupled to the dual-format multiply-accumulate unit (MACC) 872, which is described in detail below. The output port of the dual-format MACC 872 is coupled to the second input port of multiplexer 874 and to the input port of write-back register 876. As described earlier, the output port of write-back register 876 is coupled to an input port of multiplexer 870, and also to input port WDATA of the SRAM register table 880, to output port O0, and to output port FW. The output port of multiplexer 874 is coupled to the thread accumulator registers 878, which are in turn coupled to an input port of multiplexer 870. As noted above, the control and address signals CAI are coupled via a delay register to the control state machine and address generator 882. The control state machine and address generator 882 outputs data to input ports RA0, RA1, WA, and WE of the SRAM register table 880.

Figure 9, which is similar to the ALUs of Figures 3 and 4, illustrates a stream processor module with four ALUs, built from the scalar processor ALU described in Figure 8. This architecture shows how scalar processor modules identical to that of Figure 8 are used to build a

SIMD stream multiprocessor. This approach simplifies design and verification, and can be applied to building machine modules for SIMD stream processors of scalable size. As in Figure 8, the control and address signals (CAI in Figure 8) are input to CAI of ALU0, and the common data (CDI in Figure 8) is input to CDI of ALU0. As shown in Figure 9, ALU0 receives its data input from the 4×M input buffer directly at IN0. ALU0 then processes the received data, and the processed data is passed through three delay registers (such as delay registers 683r, 683k, and 683m of Figure 6). In Figure 9, these delays are realized by connecting output ports of ALU0 back to its input ports. Specifically, the data received at IN0 is processed and output at O0. O0 is coupled to IN3, which delays the data (first delay) before it is output at O3. O3 is coupled to IN2, which delays the data (second delay) before it is output at O2. O2 is coupled to IN1, which delays the data (third delay) before it is output at O1. The output of O1 is coupled to the 4×M output buffer.
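The delay balancing described for Figure 6, and realized in Figure 9 by looping ALU ports back through their delay paths, can be modelled with a chain of one-cycle registers. The Python sketch below is a simplified illustration: the class and function names are hypothetical, `None` marks cycles with no valid data, and the processing latency of the ALU itself is ignored.

```python
class DelayRegister:
    """One-cycle delay element: a value clocked in at cycle t comes out at t + 1."""

    def __init__(self):
        self._held = None  # None stands for "no valid data yet"

    def clock(self, value):
        """Advance one cycle: return the stored value and latch the new one."""
        out, self._held = self._held, value
        return out


def run_chain(samples, n_delays):
    """Push samples through n_delays registers in series, one sample per cycle."""
    regs = [DelayRegister() for _ in range(n_delays)]
    outputs = []
    # Extra idle cycles flush the values still held in the registers.
    for value in list(samples) + [None] * n_delays:
        for reg in regs:
            value = reg.clock(value)
        outputs.append(value)
    return outputs


def total_delay(alu_index, n_alus=4):
    """Input plus output register count for one ALU, per the Figure 6 description."""
    input_delays = alu_index                   # ALU0: 0, ALU1: 1, ALU2: 2, ALU3: 3
    output_delays = (n_alus - 1) - alu_index   # ALU0: 3, ALU1: 2, ALU2: 1, ALU3: 0
    return input_delays + output_delays
```

In this model every ALU sees the same total of three register delays, so results that enter the four ALUs on staggered cycles reach the output buffer together. `run_chain([1, 2, 3], 3)` shows the three-cycle shift that the O0 to IN3, O3 to IN2, and O2 to IN1 loop produces for ALU0.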
As for ALU1, the control and address signals and the common data signals are received at CAI and CDI of ALU1, respectively. As shown in Figures 6 and 8, these signals undergo one delay before ALU1 receives them. This delay is represented by the signals passing through ALU0, being output at CAO and CDO, and then being passed to CAI and CDI of ALU1. Input data from the 4×M input buffer is received at IN1 of ALU1. As shown in Figure 6, the input data undergoes one delay before being processed by ALU1 (delay register 683a in Figure 6). In Figure 9 this delay is achieved by coupling output port O1 to IN0. The data is processed and output at O0. Output port O0 is coupled to IN3 to create the output delay, as shown in Figure 6. The two output delays are created by the delay from IN3 to O3 (delay register 683l)

and the delay from IN2 to O2 (delay register 683n). After these delays, the output data can be passed to the 4×M output buffer.

As for ALU2, the control and address signals and the common data signals undergo two delays before reaching ALU2, and are then received at CAI and CDI of ALU2, respectively. Input data from the 4×M input buffer is received at IN2 of ALU2. As shown in Figure 6, so that the input data undergoes two delays before being processed by ALU2, the received signal is delayed (delay register 683b in Figure 6) and output at O2. This signal is received at IN1, delayed (delay register 683p), and output at O1. This signal is received at IN0, processed, and output at O0. To realize the output delay, the output data is passed to IN3, delayed (delay register 683o), and then passed to the 4×M output buffer.

As for ALU3, the control and address signals and the common data signals are received at CAI and CDI of ALU3 after three delays (through ALU0, ALU1, and ALU2). The input data is passed to IN3 and then undergoes three input delays. The first input delay is the delay between IN3 and O3 of ALU3 (delay register 683c).
The input data is passed from O3 to IN2, where it undergoes the second delay in ALU3 (delay register 683q). The input data is passed from O2 to IN1 and, after being delayed (delay register 683f), is output at O1. The input data is then passed to IN0, processed, and output at O0, from where it is passed to the 4×M output buffer.

In addition, as shown in Figure 8, the output data is coupled to output port FW for transfer to SFU 980. The output data may be passed to multiplexer 970. Multiplexer 970 is coupled to SFU 980, which can further process the output data for input to

the input port SF of each ALU.

One part of the stream ALU module is the multiply-accumulate unit, which supports processing with a variable SIMD factor: it can operate on two floating-point data formats and can fold (reduce) the SIMD factor while processing data in parallel. Note that in this disclosure the abbreviation "MAC" refers to the multiply-accumulate register, whereas "MACC" and "Multiply Accumulate Unit" refer to the dual-format multiply-accumulate unit, such as element 872 of Figure 8.

In addition, as shown in Figure 9, embodiments of ALU0, ALU1, ALU2, and ALU3 can receive operation data from the SFU, where the operation data indicates the operation to be performed on the received data. Likewise, in some embodiments, ALU0 can pass common data to ALU1, ALU1 can pass common data to ALU2, and ALU2 can pass common data to ALU3.

Figures 10A-C illustrate the data flow and formats of a MACC unit such as the MACC unit of Figure 8. More specifically, with reference to Figure 8, MACC unit 872 can process long-format data (floating point, integer, and so on), short-format data (floating point, integer, and so on), and mixed-format data (floating point, integer, and so on), and its throughput increases when it processes short-format data.

Figure 10A illustrates the logical data flow of the MACC unit and its ability to operate on two different data formats, namely the long and short floating-point formats. Floating-point data can be processed in the following steps, in accordance with a floating-point arithmetic algorithm:
1) Short and/or long exponent processing, in which the exponents of the multiplicands are added and the exponent of the addend is subtracted.
2) Multiplication of the mantissas of the short and/or long operands in the local multipliers.

3) Complementing of the short and/or long mantissas, according to the operation sign and the operand modifiers that define addition or subtraction.
4) Alignment of the short and/or long mantissas before addition or subtraction, by shifting according to the exponent difference.
5) Addition or subtraction of the short and/or long mantissas of the multiple addends.
6) Addition or subtraction of the short and/or long mantissas with the pre-aligned contents of the MACC register.
7) Normalization of the result, which may require a mantissa shift with a corresponding exponent update before the result is passed to the output register.

As shown in Figure 10B, long floating-point data may be 36 bits wide. The upper bits hold the high exponent bits and the low mantissa bits; bits 16-13 hold low exponent bits; bit 12 holds the mantissa sign S, which is part of the high mantissa, as are bits 11-0 (m24-m13). [Some bit positions in this passage are illegible in the source.]

Figure 10C illustrates short floating-point data. Two short-format values can be placed in the position of one long-format value of Figure 10B. More specifically, for the short floating-point data of channel 1, bit 35 is exponent bit e4, bits 34-31 are the high exponent bits e3-e0, the mantissa sign S is bit 30, and bits 29-18 are the high mantissa bits m11-m0. For the short floating-point data of channel 0, bit 17 is exponent bit e4, bits 16-13 are the low exponent bits e3-e0, the mantissa sign S is bit 12, and bits 11-0 are the low mantissa bits.

Figure 11, which is similar to the MACC unit of Figure 8 and implements the data flow of Figure 10A, shows in detail the internal logical structure of the floating-point data path of the MACC unit. More specifically, MACC unit 872, as shown in Figure 8, processes both short and long floating-point data. The floating-point data path of Figure 11 comprises the following main parts, which can process one set of long operands (ABC) or two sets of short operands (2×abc):
1) an exponent processing part, which processes long exponents and short exponents in the appropriate channels;
2) a mantissa processing part, which processes long mantissas and short mantissas.
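The short-format layout described above for Figure 10C packs two 18-bit values into one 36-bit word. The Python sketch below extracts the fields of each channel. It is a minimal sketch under stated assumptions: the function and key names are hypothetical, bit 35 as exponent bit e4 of channel 1 is read from a damaged passage, and no exponent bias or hidden mantissa bit is applied because the text does not specify them.

```python
def unpack_short(field18: int) -> dict:
    """Split one 18-bit short-format value into its raw fields."""
    if not 0 <= field18 < (1 << 18):
        raise ValueError("expected an 18-bit value")
    return {
        "exponent": (field18 >> 13) & 0x1F,  # e4 in bit 17, e3-e0 in bits 16-13
        "sign": (field18 >> 12) & 0x1,       # mantissa sign S in bit 12
        "mantissa": field18 & 0xFFF,         # 12 mantissa bits in bits 11-0
    }


def split_channels(word36: int):
    """Return (channel 1, channel 0) fields from a 36-bit word of two short values."""
    if not 0 <= word36 < (1 << 36):
        raise ValueError("expected a 36-bit value")
    channel1 = unpack_short((word36 >> 18) & 0x3FFFF)  # bits 35-18
    channel0 = unpack_short(word36 & 0x3FFFF)          # bits 17-0
    return channel1, channel0
```

Because both channels share the same 18-bit layout, shifted by 18 bit positions, a single helper serves both halves of the word.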
The floating-point data path of Figure 11 is implemented according to a floating-point multiply-add algorithm with an additional accumulation unit.

MACC unit 872 may include a Short Exponent Calculation and Scale unit for channel 0 (SECS0) 1120. SECS0 1120 receives the five high exponent bits of operand A (hereafter "a1") from channel 1. In addition, SECS0 1120 receives from channel 0 the five low exponent bits of operand B (hereafter "b0"), the five low exponent bits of operand a1, the five low exponent bits of operand b1, and the five exponent bits of the third operand c1 (c1 denotes the addend in ab+c). SECS0 1120 also receives the scale factors scal_c, scal_h, and scal_l of operands C, B, and A. The output of SECS0 1120 includes a 6-bit short exponent, which is passed to the Complement and Alignment Shifter Unit (CASU) 1139 so that the mantissas can be aligned before addition. SECS0 1120 also outputs a 6-bit short exponent to the final adder (CPA or CLA) and normalization unit 1147, which produces the final value of the exponent and provides the final output of the floating-point data path. The Long Exponent Calculation

and Scale unit (LECS) 1140 receives 10 bits of exponent data combining operands a0 and a1, 10 bits of exponent data combining operands b0 and b1, and 10 bits combining operands ch and cl. It also receives the operand scale factors scal_c and scal_l. LECS 1140 provides three 11-bit outputs to CASU 1139, as well as an 11-bit long exponent output to the final adder and normalization unit 1147.

The Mixed Exponent and Short Exponent Calculation and Scale unit for Channel 1 (MESEC1) 1130 receives the five low exponent bits of operand a0. In addition, MESEC1 1130 receives five bits each from the high exponent of operand b1, the high exponent of a further operand [illegible], the high exponent of operand b0, operand ch_e, the low exponent of operand b0, and the low exponent of operand b1, as well as ten high exponent bits of two operands [illegible], and cat(ch_e, cl_e). MESEC1 1130 also receives scal_c, scal_h, and scal_l.
MESEC1 1130 outputs three sets of data (6 bits or 11 bits, depending on the particular operands) to CASU 1139, and also outputs a 6-bit short exponent to the final adder and normalization unit 1147.

As for the mantissa part of channel 0, multiplier 1131 receives the low mantissa of operand a1 (13 bits) and the high mantissa of operand b1 (13 bits). Multiplier 1133 receives the high mantissa of operand a1 (13 bits) and the low mantissa of operand b0 (13 bits). Multipliers 1131 and 1133 each output 26 bits to CASU 1139. In addition, cl_m of channel 0 (13 bits) is received at CASU 1139, as are the sign bits sign_h, sign_l, and sign_c. Likewise, for channel 1, multiplier 1135 receives the high mantissa of operand a0 (13 bits) and the high mantissa of a second operand [illegible] (13 bits). Multiplier 1137 receives the low mantissa of operand a0

(13 bits) and the high mantissa of operand b1 (13 bits). The sign bits sign_h, sign_l, and sign_c (long format) and the operand modifiers abs_c and neg_c are all received by CASU 1139 in channel 1.

For short-format operands, CASU 1139 outputs six 26-bit values to the multi-input adder (MAD CSA unit) 1141, which implements the multiply-add (MAD) step. MAD CSA unit 1141 can be implemented with several carry-save adders (CSA) having four 37-bit signal inputs (long-format operands) and two 39-bit signal inputs. MAD CSA unit 1141 outputs two 2+26-bit outputs or one 2+40-bit (long-format) output to the multiply-accumulate carry-save adder (MAC CSA) unit 1145. MAC CSA unit 1145 can output two 5+26-bit short-format outputs and one 5+40-bit long-format output to the final adder and normalization unit 1147. MAC CSA unit 1145 also outputs 5+40 bits (long format) and two sets of 5+26 bits (short format) to the multiply-accumulate register 1143, which is coupled to the complement and alignment shifter (CASU) 1144. CASU 1144 outputs two 5+26-bit signals and one 5+40-bit long-format signal back to MAC CSA unit 1145. The final adder and normalization unit 1147 outputs two short-format results, each containing one sign bit, five exponent bits, and thirteen mantissa bits (s5e13m). In addition, in at least one embodiment, the final adder and normalization unit 1147 can output a long-format operand in the form s10e26m.

The following describes two possible implementations of the dual-format multiply-add-accumulate operation of Figure 11: one, in which separate circuits process the different data formats while sharing a single output data/result buffer, separates the short-format and long-format

data paths; the other, in which the same circuit with additional logic interleaves the processing of short-format and long-format data, combines the short-format and long-format data paths.

Figure 12 illustrates the separated exponent calculation, similar to the short exponent calculation channel of Figure 11. The short exponent channel receives the exponents of the three short operands, along with a further exponent, to compute the exponent part of the result and the number of shift bits needed to align the operand mantissas. The short channel contains four levels of exponent adders: the adders for multiplication 1212 and 1214; the adders for addition 1204, 1206, and 1208; the adders for multiply-accumulate 1216, 1218, 1222, and 1224; and the adders for operand scaling 1242, 1244, and 1246 (2x, 4x, and so on). The short channel also contains multiplexers 1210, 1226, 1232, 1234, and 1236, which help the adders and the MAC exponent register 1228 select the correct inputs. In addition, the short channel includes a priority encoder 1220, which generates control signals for several multiplexers based on whether the outputs of the selected adders are positive or negative.

The exponent channel's computation produces a small number of signals for use by the short mantissa channel. These signals are: the exponent part for channel 0, and a group of shift signals for mantissa alignment. The shift signals comprise the shift for the short operand c, the shift for the short operands a and b, and the shift for the MACC register value. Table 13 describes the output control function of the sign outputs of the carry-propagate adders (Carry

Propagate Adder, CPA) 1208, which defines the routing of each output signal (see inputs x1, x2, and x3 of the encoder in Figure 12):

Condition 0 (x1, CPA 1208A) | Condition 1 (x2, CPA 1208B) | Condition 2 (x3, CPA 1208C) | Code
NOT(cl > a1h*b0l) | NOT(a1l*b1h > a1h*b0l) | a1h*b0l > a1l*b1h |
NOT(cl > a1h*b0l) | a1l*b1h > a1h*b0l | NOT(a1h*b0l > a1l*b1h) |
NOT(cl > a1h*b0l) | a1l*b1h > a1h*b0l | a1h*b0l > a1l*b1h | 0
cl > a1h*b0l | NOT(a1l*b1h > a1h*b0l) | a1h*b0l > a1l*b1h | 1
cl > a1h*b0l | a1l*b1h > a1h*b0l | NOT(a1h*b0l > a1l*b1h) | 1
cl > a1h*b0l | a1l*b1h > a1h*b0l | a1h*b0l > a1l*b1h | 1
NOT(cl > a1h*b0l) | NOT(a1l*b1h > a1h*b0l) | NOT(a1h*b0l > a1l*b1h) | 2
cl > a1h*b0l | NOT(a1l*b1h > a1h*b0l) | NOT(a1h*b0l > a1l*b1h) | 2
MAC only | MAC only | MAC only | 3

Table 13
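As a rough illustration of how the rows of Table 13 could be read as a decision function, the Python sketch below maps the three comparison results to the code column. It is only one reading of the table, not the encoder circuit itself: the function and argument names are hypothetical, and the codes for the first two rows, which are not legible in the table, are assumed here to be 0 like the third row.

```python
def table13_code(c_gt_hl: bool, lh_gt_hl: bool, hl_gt_lh: bool,
                 mac_only: bool = False) -> int:
    """Return the Table 13 code for one combination of conditions.

    c_gt_hl  : condition 0, cl > a1h*b0l
    lh_gt_hl : condition 1, a1l*b1h > a1h*b0l
    hl_gt_lh : condition 2, a1h*b0l > a1l*b1h
    mac_only : the "MAC only" row
    All names are illustrative; they do not appear in the source.
    """
    if mac_only:
        return 3
    if not lh_gt_hl and not hl_gt_lh:
        return 2  # rows where conditions 1 and 2 are both negated
    # Assumption: rows whose code is illegible share the code of the
    # legible row with the same condition 0.
    return 1 if c_gt_hl else 0
```

Written this way, the legible rows reduce to a two-step priority check: conditions 1 and 2 are examined first, and condition 0 only decides between codes 0 and 1.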

More specifically, as described above, SECS0 1120 receives operand cl_e, the high exponent portion of operand b1_e (5 bits), the low exponent portion of operand a1_e (5 bits), the low exponent portion of operand b0_e (5 bits), and the high exponent portion of operand a1_e (5 bits). These inputs are coupled to a zero exponent detector 1202, which outputs a signal if an exponent portion is zero. In addition, the zero exponent detector 1202 passes the 5 bits of cl_e to carry-propagate adder (CPA) 1204, which is one of the CPAs for addition, and to input port "1" of multiplexer 1210. Two 5-bit values are passed to another CPA, 1212, and another two 5-bit values are passed from the zero exponent detector 1202 to CPA 1214. CPA 1212 sends 6-bit data to the CPA for addition 1204, the CPA for MAC 1218, and input port "0" of multiplexer 1210. The CPA for multiplication 1214 sends its output to the CPAs for addition 1206 and 1208, the CPA for MAC 1222, and input port "2" of multiplexer 1210.

The CPA for addition 1204 sends 6-bit data to input port "0" of multiplexer 1232 and to inverter 1250; inverter 1250

inverts the signal and sends the inverted signal to input port "1" of multiplexer 1234. The CPA for addition 1204 also generates a negative-value signal (< 0) to encoder 1220, which controls multiplexers 1232, 1234, and 1236. The CPA for addition 1206 sends 6 bits to input port "2" of multiplexer 1232 and to inverter 1254, which inverts the signal and sends the inverted signal to input port "1" of multiplexer 1236. The CPA for addition 1206 also generates a negative-value signal (< 0) to encoder 1220. The CPA for addition 1208 generates a negative-value signal (< 0) to encoder 1220, and sends a 6-bit signal to input port "2" of multiplexer 1234 and, via inverter 1252, to input port "0" of multiplexer 1236. The control input of multiplexer 1210 is coupled to OR circuit 1230 and encoder 1220. In addition, multiplexer 1210 outputs 6 bits to AND circuit 1240 and to channel 1.

The CPA for MAC 1216 sends 6-bit data to input port "3" of multiplexer 1232. The CPA for MAC 1218 sends 6-bit data to input port "3" of multiplexer 1234. The CPA for MAC 1222 sends 6 bits to input port "3" of multiplexer 1236. The CPA for MAC 1224 sends 6 bits to AND circuit 1240. Multiplexer 1226 receives 6 bits from multiplexer 1210 at input port "1" and 6 bits from the MAC exponent register 1228 at input port "0". The output of multiplexer 1226 goes to the input of the MAC exponent register 1228 and to the channel 0 output.
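The adder levels described above follow the usual pattern for aligning a product with an addend: add the multiplicand exponents, subtract to compare against the addend exponent, and let the sign of the difference choose which mantissa is shifted. The Python sketch below shows only that general pattern. It is not the circuit of Figure 12: exponent bias, operand scaling, the second product, and the MAC exponent are all left out, and the function name and return layout are assumptions made for illustration.

```python
def align_exponents(e_a: int, e_b: int, e_c: int) -> tuple[int, int, int]:
    """Return (result_exponent, product_shift, c_shift) for a*b + c.

    Unbiased integer exponents are assumed for simplicity.
    """
    e_prod = e_a + e_b      # role of an adder for multiplication
    diff = e_c - e_prod     # role of an adder for addition
    if diff < 0:
        # Negative difference: the product has the larger exponent,
        # so c is shifted right by the inverted difference.
        return e_prod, 0, -diff
    # Otherwise c has the larger (or equal) exponent and the product shifts.
    return e_c, diff, 0


print(align_exponents(3, 4, 5))  # product exponent 7 dominates
print(align_exponents(1, 1, 5))  # addend exponent 5 dominates
```

Computing both the difference and its inverse and then selecting by sign mirrors the text's pairing of each adder for addition with an inverter and a multiplexer input.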

Multiplexer 1232 outputs 6 bits to the adder for operand scaling (CPA for operand scale) 1242, which also receives scale_c; scale_c scales the operand by 2x, 4x, and so on, while scale_l scales the multiplication result by 2x, 4x, and so on before its addition with cl. The CPA for operand scale 1242 outputs the mantissa shift_cl data, which is used by the alignment shifter. The CPA for operand scale 1244 receives scale_l (which scales the result of multiplier XL) and the output of multiplexer 1234, and outputs 6 bits to mantissa shift_l0. The CPA for operand scale 1246 receives scale_h (which scales the corresponding multiplier result) and the output of multiplexer 1236, and outputs 6 bits to mantissa shift_h0, which is used by the mantissa alignment shifter. AND gate 1240 receives the 6-bit outputs of the CPA for MAC 1224 and multiplexer 1210, and outputs 6 bits to the mantissa shift_macc0 output, which is used by the MAC alignment shifter.

Figure 13 illustrates the short exponent calculation for channel 1, similar to the short exponent calculation of Figure 11. Short exponent channel 1 is almost symmetric to exponent channel 0 of Figure 12 and has similar functionality, except for the added possibility of adding the channel 0 short exponent value to the final output exponent. This feature supports a variable SIMD factor in short-operand processing mode. More specifically, as shown, the inputs include the high exponent of operand a0 (5 bits), the high exponent of operand b0 (5 bits), the low exponent of operand a0 (5 bits), the high exponent of operand b1 (5 bits), and the short operand exponent ch_e. Although the circuit of short exponent channel 0 (Figure 12) and short exponent channel 1 of Figure 13 (combined with the mixed exponent channel) are similar, one notable difference is the multiplexer 1355 that appears in Figure 13. Multiplexer 1355 receives the exponent from channel 0 (the output of Figure 12) and the output of the MAC register 1328. Multiplexer 1355 outputs data to the several CPAs for MAC of Figure 12. The input data is processed to provide the channel 1 exponent value and the mantissa shift signals Mantissa shift_mac1,

Mantissa shift_h1, Mantissa shift_l1, and Mantissa shift_ch.

Figure 14 shows a short mantissa path providing multiple channels, detailing

the mantissa path of Figure 11. The purpose of this architecture is to provide mantissa arithmetic for short floating-point operands. It implements the same operation d = a*b + c + MAC and contains the necessary hardware blocks. The short mantissa data path consists of two nearly symmetric parts: short mantissa channel 0 and channel 1 (the left and right halves of Figure 14, respectively). They contain regional multipliers 1431, 1433, 1435, and 1437, whose outputs go to the complement and alignment shifter units (denoted +/-/>>) CASU 1439a, b, c, d, e, f, g, and h, which align the operand mantissas according to the selected exponent values. These units also complement or negate the input mantissa values according to the operation sign (addition or subtraction). They are combined with adders 1441a, 1441b, 1445a, and 1445b, which implement a carry-save adder tree that sums the multiplication results with the operands c_low and c_high and adds macc_low and macc_high. The MAC short mantissa registers 1430a and 1430b hold the accumulated short mantissa values. The full adders and normalizers 1447a and 1447b produce the final short mantissa and short exponent values for the two channels.

More specifically, as described above, multiplier X0L 1431 receives the 13-bit low mantissa of operand b1 and the 13-bit low mantissa of operand a1. Multiplier X0H 1433 receives the 13-bit low mantissa of operand b0 and the 13-bit high mantissa of operand a1. CASU 1439a receives the 6-bit shift_cl, the 13-bit high mantissa of operand a1, and the 1-bit sign_c. CASU 1439b receives the 26-bit output of multiplier 1431, the 6-bit pre-alignment shift control signal shift_l0 (an output of the short exponent channel of Figure 12), and the sign value sign_l. CASU 1439c receives the 26-bit mantissa product from multiplier X0H 1433, the 6-bit pre-alignment shift control signal shift_h0 from the short exponent channel, and the sign value sign_h.
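The behavior attributed to a complement and alignment shifter unit (the "+/-/>>" blocks) can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the CASU hardware: the function name `casu`, the shift-then-negate ordering, and the use of signed Python integers instead of fixed-width two's-complement words are all illustrative choices.

```python
def casu(mantissa: int, shift: int, negative: bool, width: int = 26) -> int:
    """Align a mantissa by a right shift, then negate it if required.

    mantissa : unsigned value, truncated to `width` bits
    shift    : alignment amount supplied by the exponent channel
    negative : True when the term enters the sum with a minus sign
    """
    aligned = (mantissa & ((1 << width) - 1)) >> shift
    return -aligned if negative else aligned


# Three aligned terms can then simply be summed, as an adder tree would do.
total = casu(100, 0, False) + casu(64, 2, True) + casu(8, 1, False)
print(total)  # 100 - 16 + 4
```

A shift amount at least as large as the width drives the term to zero, which matches the intuition that an operand with a much smaller exponent stops contributing to the sum.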

The outputs of CASU 1439a, 1439b, and 1439c are input to MAD CSA tree 1441a (a related table shows the CSA levels and the extra bits). MAD CSA tree 1441a outputs 2+26 bits of data to MAC CSA tree 1445a and to multiplexer 1432. The extra bits prevent mantissa overflow in the MAC loop before alignment and normalization. The full adder and normalization unit 1447a receives 5+26 bits of mantissa data from MAC CSA tree 1445a and exponent data from short exponent channel 0. The 5 extra bits prevent possible mantissa overflow in the MAC loop. The full adder and normalization unit 1447a converts the mantissa from CSA format to ordinary binary format, normalizes the result, and outputs it. The result comprises a 1-bit sign, a 5-bit exponent, and a 13-bit mantissa (s5e13m), and is sent to output dl.

Also as described above, multiplier X1H 1435 receives the high mantissa of operand a0 and the high mantissa of operand b0. Multiplier X1L 1437 receives the low mantissa of operand a0 and the high mantissa of operand b1. CASU 1439d receives the output of multiplier X1H 1435 (a 26-bit mantissa product), the 6-bit shift_l1 used for mantissa alignment (an output of the exponent channel), and the 1-bit sign value sign_l. CASU 1439e receives 26 bits from multiplier 1437, the 6-bit shift_h1, and the 1-bit sign_h. CASU 1439f receives the 13-bit ch_m, the 6-bit shift_ch, and the 1-bit sign_c. MAD CSA tree 1441b receives the pre-aligned 26-bit mantissas from CASU 1439d, CASU 1439e, and CASU 1439f.
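For readers unfamiliar with carry-save addition, the Python sketch below shows the standard 3:2 compression step on which a CSA tree is built, followed by the single carry-propagating addition that converts the result from CSA format to ordinary binary. It illustrates the general technique only; the function name is hypothetical and the sketch does not model the bit widths or tree layout of units 1441a and 1441b.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """Compress three non-negative addends into a (sum, carry) pair.

    No carry ripples between bit positions; each position is handled
    independently, which is what makes the step fast in hardware.
    """
    partial_sum = x ^ y ^ z                        # per-bit sum without carry
    carry = ((x & y) | (x & z) | (y & z)) << 1     # per-bit majority, moved up
    return partial_sum, carry


s, c = carry_save_add(5, 6, 7)
# One ordinary addition at the end resolves the pair into a binary value.
print(s + c)  # equals 5 + 6 + 7
```

Only the final addition needs a full carry chain, which is why the text places a full adder after the MAC CSA tree rather than inside it.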

In addition, the MAC_h register 1430b receives 5+26 bits of data from MAC CSA tree 1445b. Multiplexer 1432 receives 5+26 bits of data from MAC_h 1430b, and data from the channel 0 MAD CSA tree 1441a. CASU

1439h receives 5+26 bits from multiplexer 1432 and the mantissa shift signal mantissa shift_macc1 from the exponent channel. MAC CSA tree 1445b receives 5+26 bits of data from CASU 1439h and 2+26 bits of data from MAD CSA tree 1441b. The full adder and normalizer 1447b receives exponent data from exponent channel 1 and 5+26 bits of data from MAC CSA tree 1445b. The full adder and normalizer 1447b sends the s5e13m result data to output dh.

Figure 15 illustrates the long exponent calculation for ALU0, similar to the exponent calculation of Figure 11. The example of Figure 15 contains four adder levels with appropriate multiplexers, similar to the short exponent channels of Figures 11 and 12. The difference is that this channel processes one 10-bit long exponent, whereas Figures 11 and 12 process 5-bit short exponents. The long exponent processing channel generates all operand shift signals used for mantissa alignment in the mantissa processing channel, and produces the exponent result for further normalization.

Table 14 shows the routing function of the long exponent channel:

Condition 0 (CPA 1503 sign output) | Condition 1 (CPA 1509 sign output) | Output multiplexer control signal for the C shift amount and the A*B shift amount
NOT(C > A*B) | NOT((A*B or C) > MAC) | 0
NOT(C > A*B) | (A*B or C) > MAC | 1
C > A*B | NOT((A*B or C) > MAC) | 2
C > A*B | (A*B or C) > MAC | 3

Table 14

The adder for multiplication (CPA for MUL) 1505 receives the 10-bit multiplicands A and B, formed as the combination of the high exponents of operands a0 and a1 and the combination of the high exponents of operands b0 and b1. The adder for multiply-add (CPA for MAD) 1503 receives the 10-bit exponent of operand C, formed as the combination of ch_e and cl_e, and the 11-bit exponent result of CPA for MUL 1505. Multiplexer 1511 receives data from CPA for MUL 1505 and also receives operand C.

The CPA for MAC 1501 receives the exponent of operand C, formed as the combination of ch_e and cl_e, and the output of the MAC exponent register 1515. The CPA for MAC 1507 receives data from the MAC exponent register 1515 and from CPA for MUL 1505. Multiplexer 1513 receives data from register 1515 and from multiplexer 1511. The data from multiplexer 1511 is also sent as the output exponent to ALU1. The output of multiplexer 1513 goes to register 1515 and to the exponent output port. CAT unit 1517 sends data to CPA for MAD 1503, a multiplexer, multiplexer 1513, CPA for MAC 1509, and the clock input of 1523. CAT unit 1517 merges two bit fields into one (h and l are combined into one double-width field; in this example, the negative-result flag from adder 1509 and the same flag from adder 1503). Multiplexer 1523 receives the signal "0" at input port "0", the inverted shift amount from CPA for MAD 1503 at input port "1", and the output of CPA for MAC 1507 at input ports "2" and "3". The adder for operand scaling (CPA for scale) 1527 receives the 11-bit output of multiplexer 1523 and the coefficient scale_h, and outputs the resulting shift amount for A*B. Multiplexer 1521 receives the output of CPA for MAC 1501 at input ports "3" and "2", the signal "0" at input port "1", and the output of CPA for MAD 1503 at input port "0". Multiplexer 1521 outputs 11 bits to CPA for scale 1529, which also receives scale_c. CPA for scale 1529 outputs the shift amount for operand C.

Figure 16 illustrates the long exponent calculation for ALU1, similar to the long exponent calculation of Figure 11. Although the ALU0 long exponent calculation of Figure 15 is similar to the ALU1 long exponent calculation of Figure 16, a notable difference is that multiplexer 1602 in Figure 16

receives the exponent input from ALU0 along with the combined input of ch_e and cl_e. In addition, the ALU1 long exponent calculation produces outputs for the exponent, the MAC shift amount, the A*B shift amount, and the C shift amount. Note that the function table for Figure 16 is identical to that for Figure 15.

Figure 17 illustrates the long mantissa data path for ALU0, detailing the data path of Figure 11. The purpose of this architecture is to provide mantissa arithmetic for long floating-point operands. It implements the mantissa computation of D = A*B + C + MAC and contains the necessary hardware blocks. The long mantissa data path has two nearly symmetric structures: the ALU0 long mantissa data path of Figure 17 and the ALU1 long mantissa data path of Figure 18. The ALU0 long mantissa data path contains four regional multipliers 1731, 1733, 1735, and 1737 with pre-shifters (including 1753), and complement and alignment shifters CASU 1739a, b, c, d, e, f, and g, denoted (+/-/>>), which align the operand mantissas according to the selected exponent values. These units also complement or negate the input mantissa values according to the operation sign (addition or subtraction). They are combined with adders 1741a, 1741b, and 1745 to implement a carry-save adder (CSA) tree, which adds the multiplication results to operand C and to the MAC register value. The MAC mantissa register 1759 holds the accumulated long mantissa value. The full adder and normalizer 1747 produces the final mantissa and exponent values.

More specifically, similar to the above, multiplier 1731 receives a high mantissa and the low mantissa of operand b0. Multiplier 1733 receives a mantissa of operand a1 and the low mantissa of operand b1. Multiplier 1735 receives the low mantissa of operand a0 together with a second mantissa input. Multiplier 1737 receives the high mantissa of operand b0 and the high mantissa of operand a0.

Multiplier 1731 sends 26 bits of data to CASU 1739a; CASU

1739a also receives sign_h and the mantissa shift_h. CASU 1739b receives 39 bits of data from multiplier 1735 via the 13-bit shifter 1743. CASU 1739c receives the 39-bit input data cl_m via the 13-bit shifter 1749. A characteristic of this architecture is its two-stage MAD adder, which has two parts: a 1/2 MAD adder and a MAD adder. This characteristic arises from using regional multipliers to process the long mantissa. The 1/2 MAD CSA tree 1741a receives data from CASU 1739a, 1739b, and 1739c. MAD CSA tree 1741b receives 1+40 bits of data from the 1/2 MAD CSA tree 1741a via the 13-bit shifter 1769, 37 bits of data from CASU 1739d, and, from CASU 1739e, 39 bits of data that pass from regional multiplier 1735 through the 13-bit shifter 1753. In addition, MAD CSA tree 1741b receives, from CASU 1739f, 37 bits of data originating from multiplier 1737.

MAD CSA tree 1741b sends the ALU0 mantissa data to the ALU1 mantissa output and to MAC CSA tree 1745. MAC CSA tree 1745 receives the mantissa shift signal mantissa shift_macc data via CASU 1739g. MAC CSA tree 1745 sends 5+40 bits of data to the full adder and normalizer 1747, which can also compute the exponent part for further adjustment during normalization. The extra 1 bit of the mantissa prevents mantissa overflow in the MAC loop. The full adder and normalizer 1747 sends the s10e26m long-format data to the output port cat(dh, dl), which combines the two halves dh and dl into D.

Figure 18 illustrates the long mantissa data path for ALU1, similar to the data path of Figure 17. More specifically, with a few exceptions, the ALU1 long mantissa data path is symmetric to the ALU0 long mantissa data path. Notably, multiplexer 1805 receives the mantissa from the ALU0 channel. In addition, multiplexer 1705 receives ch_m,

Client's Docket No.: S3U04-0012-TW TT’s DocketNo:〇608-A41110-TW/Final/LukeLee(李宗軒)/】,Feb,2007 1335550 其為ALU1運算元c的底數之部分。Client's Docket No.: S3U04-0012-TW TT’s DocketNo: 〇608-A41110-TW/Final/LukeLee (Li Zongxuan)/], Feb, 2007 1335550 This is the part of the base of the ALU1 operand c.
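The way a long mantissa can be processed with narrower multipliers and 13-bit shifts can be illustrated with a small sketch. The Python below shows only the general partial-product technique, assuming a 26-bit mantissa split into two 13-bit halves; the function names, the bit widths, and the grouping of the sums into two steps are illustrative assumptions and do not reproduce the exact wiring of Figures 17 and 18.

```python
HALF_BITS = 13                      # assumed width of one mantissa half
HALF_MASK = (1 << HALF_BITS) - 1


def split_halves(mantissa):
    """Return the (high, low) 13-bit halves of a 26-bit mantissa."""
    return (mantissa >> HALF_BITS) & HALF_MASK, mantissa & HALF_MASK


def long_mantissa_product(a, b):
    """Multiply two 26-bit mantissas using four 13x13 partial products."""
    a_hi, a_lo = split_halves(a)
    b_hi, b_lo = split_halves(b)
    # Four narrow multiplications, each producing at most 26 bits.
    hh = a_hi * b_hi
    hl = a_hi * b_lo
    lh = a_lo * b_hi
    ll = a_lo * b_lo
    # First step: low product plus the cross terms shifted by 13 bits.
    partial = ll + ((hl + lh) << HALF_BITS)
    # Second step: add the high product shifted by two half widths.
    return partial + (hh << (2 * HALF_BITS))
```

For any two 26-bit inputs the result equals the ordinary product, which is why cross terms need a 13-bit shift before they are summed with the unshifted terms.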

Mixed-format exponent considerations arise when an instruction uses operands in two different formats. For example, one of the multiplicands may be in short format while all the others are in long format (see Table 5). This architecture is very similar to the short-format one, except that it can also handle long-format exponents. The implementation has four exponent adders/subtractors at the same level, together with several appropriate multiplexers controlled by an encoder. The MAC exponent register is likewise sized for the 11-bit long exponent value. More specifically, CPA 1903 receives data combining the operand b0 low exponent and the operand b1 high exponent. CPA 1903 also receives the operand a0 low exponent. CPA 1905 receives data combining the operand b0 high exponent and a further high exponent [illegible], as well as the short-format a0 high exponent. CPA 1907 receives data combining ch_e and cl_e, as well as the output of CPA 1903. CPA 1909 receives the output of CPA 1905 and the input data ch_e and cl_e. CPA 1911 receives the output of CPA 1903 and the output of CPA 1905. Encoder 1920 provides clock signals to CPAs 1907, 1909, and 1911, and provides control signals to multiplexer 1913 and, via OR circuit 1925, to multiplexers 1923, 1935, 1937, and 1939.

Multiplexer 1913 receives data from CPA 1903 at input port "0", receives ch_e and cl_e at input port "1", and receives data from CPA 1905 at input port "2". CPA 1915 receives the input data ch_e and cl_e, as well as data from register 1943. CPA 1917 receives data from CPA 1913 and register 1943. CPA 1919 receives data from register 1943 and CPA 1905. CPA 1921 receives data from register 1943 and multiplexer 1913. Multiplexer 1923 receives data from multiplexer 1913 and register 1943 and outputs the result exponent. The MAC exponent register 1943 receives data from multiplexer 1923. Multiplexer 1935 receives data from CPA 1915 at input port "3", from CPA 1909 at input port "2", a "0" signal at input port "1", and data from CPA 1907 at input port "0". Similarly, multiplexer 1937 receives a "0" signal at input port "0", the output of CPA 1907 via inverter 1929 at input port "1", data from CPA 1911 at input port "2", and data from CPA 1917 at input port "3". Multiplexer 1939 receives the output of CPA 1911 via inverter 1931 at input port "0", the output of CPA 1909 via inverter 1933 at input port "1", a "0" signal at input port "2", and the output of CPA 1919 at input port "3". CPA 1949 receives data from multiplexer 1935 as well as the coefficient scale_c, and outputs the operand mantissa shift signal mantissa shift C.
CPA 1947 receives data from multiplexer 1937 as well as the coefficient scale_l, and outputs the half-product mantissa shift signal mantissa shift L. CPA 1945 receives data from multiplexer 1939 as well as the coefficient scale_h, and outputs the half-product mantissa shift signal mantissa shift H.

Figure 20 illustrates the mixed exponent calculation for ALU1, which is similar to the mixed exponent calculation of Figure 19. The circuit of Figure 20 is symmetrical to the circuit of Figure 19, with some differences. One notable difference is that the circuit of Figure 20 includes multiplexer 2001, which can receive data combining ch_e and ch_l as well as the result exponent from the ALU0 exponent channel.

Figure 21 illustrates the mixed mantissa data path of ALU0, detailing the data path of Figure 11. The mixed mantissa data path is similar to the long mantissa data path of Figure 17. More specifically, as in Figure 17, multiplier 2131 receives the short-format input data a1 high mantissa and b0 low mantissa. Multiplier 2133 receives the a1 low mantissa and the b1 low mantissa. Multiplier 2135 receives the b1 high mantissa and the a0 low mantissa. Multiplier 2137 receives the b0 high mantissa and the a0 high mantissa. CASU 2139a receives data from multiplier 2131 and also receives the sign bit and the mantissa shift signal mantissa shift A*B high. CASU 2139b receives, from multiplier 2133, data shifted by 13-bit shifter 2105. It also receives the mantissa shift signal mantissa shift c and sign_c. CASU 2139c receives data shifted by 13-bit shifter 2109, and receives mantissa shift c and sign_c. CASU 2139d also receives sign_c, the mantissa shift signal mantissa shift C, and ch_m. CASU 2139e receives, from multiplier 2135, data shifted by 13-bit shifter 2107, and receives the mantissa shift signal mantissa shift A*B low and sign_l. CASU 2139f receives data from multiplier 2137 and also receives the mantissa shift signal mantissa shift A*B high and sign_h. The 1/2 MAD CSA tree 2141a receives data from CASUs 2139a, 2139b, and 2139c. The MAD CSA tree 2141b receives data from the 1/2 MAD CSA tree 2141a and from CASUs 2139d, 2139e, and 2139f. The MAD CSA tree passes mantissa data to ALU1 and to the MAC CSA tree 2145. The MAC CSA tree 2145 also receives, via CASU 2139g, data from register 2143. The full adder and normalizer 2147 receives the input exponent and the output of the MAC CSA tree 2145. The full adder and normalizer 2147 outputs the result mantissa to the combined dh and dl.

Figure 22 illustrates the mixed mantissa data path of ALU1, which is symmetrical to the data path of Figure 21. The circuit of Figure 22 is similar to the circuit of Figure 21, with some differences. Notably, the ALU1 mixed mantissa data path of Figure 22 includes multiplexer 2202, which receives ch_m and the mantissa data from ALU0 of Figure 20. The circuit of Figure 21 outputs the result mantissa to dh and dl.
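Each CASU in the data paths above takes a mantissa or partial product together with a sign and a mantissa shift signal. The following Python sketch shows one plausible behavior for such a complement-and-shift step: align by a right shift, then take the two's complement when the sign is set. The function names, the 40-bit width, and the order of the two operations are assumptions made for illustration; the text does not specify them.

```python
WIDTH = 40                  # assumed data-path width in bits
MASK = (1 << WIDTH) - 1


def casu(value, sign, shift):
    """Right-shift an unsigned value, then negate it if sign is set.

    The result is returned as a WIDTH-bit two's-complement pattern.
    """
    aligned = value >> shift
    if sign:
        aligned = -aligned
    return aligned & MASK


def to_signed(pattern):
    """Interpret a WIDTH-bit pattern as a signed integer."""
    if pattern & (1 << (WIDTH - 1)):
        return pattern - (1 << WIDTH)
    return pattern
```

Because every term is already in two's-complement form after this step, an adder tree can sum the terms without treating subtraction as a special case.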

To process dual-format floating-point data on the same hardware architecture, we can use separate exponent calculation channels, because they are relatively small. The short and long mantissa processing paths, on the other hand, can be merged into a single hardware structure, because hardware that duplicates both a short and a long mantissa data path would incur a significant hardware cost. We can usually merge most of the hardware blocks used for the short and long mantissa data paths and add some extra logic to provide correct execution in short mode, long mode, and mixed mode.

Potential modifications for this design may include (but are not limited to):
1) Select the long exponent data path as the base architecture to be modified.
2) Add extra multiplexers on the operand and result paths to select the correct data for each processing mode.
3) Split all complement and alignment shift units into two parts, using special fence logic controlled by the data format.
4) Split the MACC register into two parts.
5) Split the MAC CSA and the final adder with normalizer into two parts by means of special fence logic.
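Items 3 to 5 rely on fence logic that lets one wide unit act as two independent narrow units. A minimal Python sketch of the idea for an adder follows, assuming that the fence simply blocks the carry between the low and high halves in short mode and passes it in long mode. The 13-bit half width and all names are illustrative assumptions rather than details of the design.

```python
HALF = 13                       # assumed width of each half
HALF_MASK = (1 << HALF) - 1


def fenced_add(x, y, short_mode):
    """Add two 26-bit words, optionally as two independent 13-bit lanes."""
    lo = (x & HALF_MASK) + (y & HALF_MASK)
    carry = lo >> HALF
    if short_mode:
        carry = 0               # fence closed: no carry crosses the halves
    hi = ((x >> HALF) & HALF_MASK) + ((y >> HALF) & HALF_MASK) + carry
    # Each half wraps at 13 bits; the halves are then packed together.
    return ((hi & HALF_MASK) << HALF) | (lo & HALF_MASK)
```

In long mode the function behaves as a single 26-bit adder, while in short mode an overflow of the low lane never disturbs the high lane.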
In addition, the following figures describe potential modifications for implementing a dual-mode ALU.

Figure 23 illustrates the merged mantissa data path, which is similar to the data path of Figure 11. More specifically, multiplier 2333 receives the operand a1 high mantissa and the operand b0 low mantissa. Multiplier 2331 receives the operand a1 low mantissa and the operand b1 low mantissa. Multiplier 2337 receives the operand b1 high mantissa and the operand a0 low mantissa. Multiplier 2335 receives the b0 high mantissa and the a0 high mantissa. CASU 2339a receives the output of multiplier 2333 and also receives shift H0 and sign_h0. CASU 2339b receives data from multiplexer 2308, which in turn receives data from multiplier 2331 and 13-bit shifter 2306. CASU 2339c receives data from multiplexer 2310, which in turn receives data from cl_m and 13-bit shifter 2302. CASU 2339c also receives sign_cl and shift CL. CASU 2339d receives ch_m, shift CH, and sign_ch. CASU 2339e receives data from multiplexer 2312 and also receives shift L1 and sign_l1. Multiplexer 2312 receives data from multiplier 2337 and 13-bit shifter 2304. CASU 2339f receives data from multiplier 2335 and also receives shift H1 and sign_h1. CASU 2339g is split by a fence circuit into a high part and a low part. The high part of CASU 2339g receives the shift ACCH signal and data from register 2342a. The low part of CASU 2339g receives the shift ACCL signal and data from register 2342b.
Register 2342 receives MAC, as well as data from MAC CSA tree 0 2345 and a clock signal from MAC CSA tree 1 2345.

The 1/2 MAD CSA tree 2341a receives data from CASUs 2339a, 2339b, and 2339c and passes the processed data to 13-bit shifter 2320. Multiplexer 2322 receives both the shifted and the unshifted data and outputs to multiplexer 2316. Multiplexer 2316 also receives the data "0". The MAD CSA tree 2341b receives data from multiplexer 2316 and from CASUs 2339d, 2339e, and 2339f, and outputs the processed data to MAC CSA tree 1 2345. MAC CSA tree 1 2345 also receives data from the low part of CASU 2339g.

For the short format, MAC CSA tree 0 (2345) is separated from MAC CSA tree 1 by a fence circuit. MAC CSA tree 0 (2345) receives data from the high part of CASU 2339g and from multiplexer 2318. Multiplexer 2318 receives data from the 1/2 MAD CSA tree 2341a, as well as the mantissa data passed from ALU0 to ALU1. MAC CSA tree 0 2345 passes data to CPA0 2347a, which for the short format is separated from CPA1 2347b by a fence circuit. CPA1 2347b receives data from MAC CSA tree 1 2345. CPA1 2347b outputs data to leading zero detector (LZD) L 2330 and LZD1 2332, and to shifter 1 2334b. CPA0 2347a outputs data to LZDL 2330, LZD0 2328, and shifter 0 2334a. LZD0 2328 and LZDL 2330 pass data to shifter 0 2334a. LZD0 2328 also passes data to multiplexer 2325. LZDL 2330 also passes data to shifter 1 2334b and to multiplexers 2325 and 2326. LZD1 2332 also passes data to shifter 1 2334b and to multiplexer 2326. Shifter 0 2334a and shifter 1 2334b pass data to output latch 2340.

CPA 2336a receives data from exponent multiplexer 2324, which receives data from short exponent channels 0 and 1, the long exponent channel, and the mixed exponent channel. CPA 2336a also receives data from multiplexer 2325 and CPA 2336b. Fence circuit 2338 separates CPA 2336a and CPA 2336b.
CPA 2336a and CPA 2336b pass data to output latch 2340. Output latch 2340 outputs s5e13m data to dl, s10e26m data to (dh, dl), and s5e13m data to dh.

In addition, a number of control signals, for which Table 15 gives an example, are used to set multiplexers L0, CL, L1, and MUX1-5. The outputs of Table 15 can be switched according to the different data formats processed in the ALU.

Mode / Multiplexer   L0   CL   L1   Mux1   Mux2   Mux3   Mux4   Mux5   ExpMX
Long mode            0    0    0    0      0      0      0      0      0
Mixed mode           0    0    0    1      0      0      0      0      1
Short mode           1    1    1    1      1      1      1      1      2
Table 15

Figure 24 illustrates the merged mantissa data path of ALU1, which is symmetrical to the ALU0 data path of Figure 23. More specifically, the circuit of Figure 24 is similar to the circuit of Figure 23, with some exceptions. The difference in Figure 24 lies in multiplexer 2302, which receives the result mantissa of ALU0 and the operand ch_m. This circuit outputs the result (dh, dl). The control of the multiplexers is generally the same as shown in the table for the merged ALU0. These multiplexers can be used to select particular inputs according to the different data formats processed in the ALU1 merged mantissa data path, as described in Table 16.
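The mode-dependent settings of Table 15 lend themselves to a simple lookup. The sketch below encodes the table rows as a Python dictionary; the identifier names and the helper function are hypothetical, and only the select values are taken from Table 15.

```python
# Column order follows Table 15.
MUX_NAMES = ("L0", "CL", "L1", "Mux1", "Mux2", "Mux3", "Mux4", "Mux5", "ExpMX")

# Select values per processing mode, copied row by row from Table 15.
TABLE_15 = {
    "long":  (0, 0, 0, 0, 0, 0, 0, 0, 0),
    "mixed": (0, 0, 0, 1, 0, 0, 0, 0, 1),
    "short": (1, 1, 1, 1, 1, 1, 1, 1, 2),
}


def mux_selects(mode):
    """Return a mapping from multiplexer name to its select value."""
    return dict(zip(MUX_NAMES, TABLE_15[mode]))
```

Written this way, it is easy to see that mixed mode differs from long mode only in Mux1 and in the exponent multiplexer setting.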

Figure 25A illustrates the merged shift and control logic, which can be used for the exponent and shift control signals of the merged mantissa paths of Figures 23 and 24. As described above, many of the changes made there introduce special multiplexers that route signals from the separate short-format, long-format, or mixed-format exponent processing channels to the merged mantissa processing. More specifically, multiplexer 2550 receives shift h0 and the mantissa shift signal mantissa shift h. Multiplexer 2552 receives shift l0 and the mantissa shift signal mantissa shift h. Multiplexer 2554 receives shift cl and the mantissa shift signal mantissa shift c. Multiplexer 2556 receives Shift MAC0 and the mantissa shift signal mantissa shift MAC. Multiplexer 2558 receives shift ch and the mantissa shift signal mantissa shift c. Multiplexer 2560 receives shift l1 and the mantissa shift signal mantissa shift h. Multiplexer 2562 receives shift h1 and the mantissa shift signal mantissa shift h. Multiplexer 2564 receives Shift MAC0 and the mantissa shift signal mantissa shift MAC.

Multiplexer 2566 receives shift h1 and the output of multiplexer 2550, and outputs Shift H0. Multiplexer 2568 receives shift h1 and the output of multiplexer 2552, and outputs Shift L0. Multiplexer 2570 receives a shift signal [illegible] and the output of multiplexer 2554, and outputs Shift CL. Multiplexer 2572 receives Shift MAC1 and the output of multiplexer 2556, and outputs Shift AccH. Multiplexer 2574 receives shift ch and the output of multiplexer 2558, and outputs Shift CH. Multiplexer 2576 receives Shift l1 and the output of multiplexer 2560, and outputs Shift L1. Multiplexer 2578 receives Shift h1 and the output of multiplexer 2562, and outputs Shift H1. Multiplexer 2580 receives Shift MAC1 and the output of multiplexer 2564, and outputs Shift AccL.

Table 17 lists the multiplexer control signals used for the shift control of each channel. As can be seen, these signals are quite regular, so we can arrange two lines to control the multiplexers from the instruction decode state machine.

Output signal   Short mode       Mixed mode       Long mode
Shift H0        2550:0 2566:1    2550:x 2566:1    2550:0 2566:0
Shift L0        2552:1 2568:0    2552:x 2568:1    2552:0 2568:0
Shift CL        2554:1 2570:0    2554:x 2570:1    2554:0 2570:0
Shift AccH      2556:1 2572:0    2556:x 2572:1    2556:0 2572:0
Shift CH        2558:1 2574:0    2558:x 2574:1    2558:0 2574:0
Shift L1        2560:1 2578:0    2560:x 2578:1    2560:0 2578:0
Shift H1        2562:1 2578:0    2562:x 2578:1    2562:0 2578:0
Shift AccL      2564:1 2580:0    2564:x 2580:1    2564:0 2580:0
Table 17

Figure 25B illustrates the sign path logic, which can be used to convert the sign signals generated by the separate channels into the sign signals of the merged dual-format mantissa data path of Figure 23. Multiplexer 2582 receives sign h0 and sign h. Multiplexer 2584 receives sign l0 and sign l. Multiplexer 2586 receives sign cl and sign c. Multiplexer 2588 receives sign ch and sign c. Multiplexer 2590 receives sign l1 and sign l. Multiplexer 2592 receives sign h1 and sign h.

Multiplexer 2594 receives sign h1 and the output from the multiplexer, and outputs Sign H0. Multiplexer 2596 receives sign l1 and the output of multiplexer 2584, and outputs Sign L0. Multiplexer 2598 receives a sign signal [illegible] and the output of multiplexer 2586, and outputs Sign CL. Sign AccH comes from Sign MAC. Multiplexer 2599 receives sign ch and the output of multiplexer 2588, and outputs Sign CH. Multiplexer 2597 receives sign l1 and the output of multiplexer 2590, and outputs Sign L1. Multiplexer 2595 receives sign h1 and the output of multiplexer 2592, and outputs Sign H1.
Sign AccL comes from Sign MAC.

To generate the switching signals for the multiplexers, we need to provide a special state machine that generates the switching signal of each multiplexer for particular instructions and for the different data formats being processed, as shown in Table 18. As can be seen, all of the multiplexers may be controlled by the same signals of the same state machine.

Output signal   Short mode       Mixed mode       Long mode
Shift H0        2550:0 2566:1    2550:x 2566:1    2550:0 2566:0
Shift L0        2552:1 2568:0    2552:x 2568:1    2552:0 2568:0
Shift CL        2554:1 2570:0    2554:x 2570:1    2554:0 2570:0
Shift AccH      2556:1 2572:0    2556:x 2572:1    2556:0 2572:0
Shift CH        2558:1 2574:0    2558:x 2574:1    2558:0 2574:0
Shift L1        2560:1 2578:0    2560:x 2578:1    2560:0 2578:0
Shift H1        2562:1 2578:0    2562:x 2578:1    2562:0 2578:0
Shift AccL      2564:1 2580:0    2564:x 2580:1    2564:0 2580:0
Table 18

Figure 26 illustrates a table of complement shift input and output formats, which can be used for the merged mantissa data paths of Figures 23 and 24. The table shows how the actual output or input data is treated, extended, interpreted, and/or modified in short, long, and mixed modes. The data formats are described from left to right and from top to bottom. All signal names refer to the merged data paths of Figures 23 and 24. The input-output data formats are used so that the CSAs in the data path process the data properly.

In practice, the 26-bit H0 and H1 outputs of the multipliers can be extended with 11 zero-valued least significant bits (LSBs). The outputs L0 and L1 of the other two multipliers can be extended by 13 LSBs and shifted right by 13 bits, with the most significant bits (MSBs) filled with zeros. The adder data input CH can be extended by 24 LSBs for subsequent use. The second row shows the data formats of the complement shift unit input-output data path for short mode, long mode, and mixed mode.

Figure 27A enlarges part of the mantissa addition data path of Figures 23 and 24. As shown, data format conversion takes place between several units and multiplexers, so that the different data formats are processed correctly at the final end of the fenced MAC CSA tree. More specifically, the circuit of Figure 26 includes the 1/2 MAD CSA tree 2741a. The 1/2 MAD CSA tree 2741a receives the 37-bit H0, the 39-bit L0, and the 37-bit CL. The 1/2 MAD CSA tree 2741a outputs 2+26 bits or 1+40 bits to 13-bit shifter 2752. After shifting the received data, 13-bit shifter 2752 passes the data to multiplexer 2754, which also receives data from the 1/2 MAD CSA tree 2741a. Multiplexer 2750 receives data from multiplexer 2754 and also receives a "0" signal at its other input port. Multiplexer 2750 passes its output data to the MAD CSA tree 2741b, which also receives the 37-bit CH, the 39-bit L1, and the 37-bit H1. The MAD CSA tree 2741b passes the 5+26 or 5+40 MSBs to the ALU1 mantissa, and the 2+40 MSBs to multiplexer 2756. Multiplexer 2756 also receives data from the 1/2 MAD CSA 2741a. Multiplexer 2756 outputs data to MAC CSA tree 0 2745a. Multiplexer 2756 also receives data from the MACC output. The short-format fence circuit 2746 separates MAC CSA tree 0 2745a and MAC CSA tree 1 2745b, so that two short-format operands can be processed in place of one long-format operand. MAC CSA tree 1 2745b receives data from the MAD CSA tree 2741a and from the MACC output.

Figure 27B illustrates the processing formats for short mode, long mode, and mixed mode, which can be used in the CSA units of Figure 27A. More specifically, diagram 2780a illustrates the short-mode processing of the 1/2 MAD CSA. As shown, the data H0 contains 26+11 bits and is input to the 1/2 MAD CSA tree 2741a, L0 contains 26+13 bits, and CL contains 13+13+11 bits. The 1/2 MAD CSA tree 2741a outputs 2+26 valid bits and 13 invalid bits. Diagram 2780b illustrates the short-mode processing of the MAD CSA 2741b. As shown, H0 contains 26+11 bits and is input to the MAD CSA tree 2741b, L0 contains 26+13 bits, and CL contains 13+13+11 bits. In addition, 1/2 MAD contains 00+00+00+0 bits. The MAD CSA tree 2741b outputs 2+26 valid bits and 13 invalid bits.

Diagram 2780c illustrates the long-mode processing format.
More specifically, H0 contains 26+11+0 bits and is input to the 1/2 MAD CSA tree 2741a, L0 contains 13+26 bits, and CL contains 13+13+11+0 bits. The 1/2 MAD CSA tree 2741a outputs 2+39 valid bits. Diagram 2780d illustrates the long-mode processing format. More specifically, H0 contains 26+11+0 bits and is input to the 1/2 MAD CSA tree 2741a, L0 contains 13+26 bits, CL contains 13+13+11+0 bits, and 1/2 MAD contains 13+X+X+26 bits. The MAD CSA tree 2741a outputs 3+39 valid bits.

Diagram 2780e illustrates the mixed-mode processing format. More specifically, H0 contains 26+11+0 bits and is input to the 1/2 MAD CSA tree 2741a, L0 contains 13+26 bits, and CL contains 13+13+11+0 bits. The 1/2 MAD CSA tree 2741a outputs 2+39 valid bits. Diagram 2780f illustrates the mixed-mode processing format. More specifically, H0 contains 26+11+0 bits and is input to the 1/2 MAD CSA tree 2741a, L0 contains 13+26 bits, CL contains 13+13+11+0 bits, and the 1/2 MAD CSA contains X+X+39 bits. The MAD CSA tree 2741a outputs 3+39 valid bits.

Figure 27C continues the processing formats of Figure 27B. Diagram 2780g illustrates the short-mode processing format of the MAC CSA. More specifically, MAC CSA trees 0 and 1 (2745a, 2745b) receive X+X+26 bits from the MAD and 5X+26 bits from the MACC. MAC CSA trees 0 and 1 (2745a, 2745b) output 5+26 bits on each of 2 channels. Diagram 2780h illustrates the long-mode processing format. More specifically, the MAD passes 14+3X+11 MSBs to MAC CSA tree 0 2745a. The MAC passes 12+5X+11 MSBs to MAC CSA tree 0 2745a. MAC CSA tree 0 2745a outputs a 12+5X+11 MSB result, of which 5+11 bits are valid. In diagram 2780i, the MAD passes 2+26 LSBs to MAC CSA tree 1 2745b. The MAC passes 0+0+0+2+26 LSBs to MAC CSA tree 1 2745b.
MAC CSA tree 1 2745b outputs 0+0+0+2+26 LSB, of which 2+26 bits are significant. Note that, to provide both short and long mantissa processing modes on the same hardware, fence logic can be used to split some of the CSAs and CPAs, as shown in Figures 23 and 24, and some logic can be added to the normalizer unit.

Figure 28A illustrates a fence circuit in the CPA, which can be used in the MACC of Figures 24 and 27. Using a special multiplexer controlled by the mode bit, the long adder can be split into two short parts. In long format, the carry signal is passed from one part of the adder to the other. In short format, only a zero value is passed. More specifically, half adder 2875a receives data from full adder 2876a. Full adder 2876a sends data to half adder 2875a and full adder 2876d. Full adder 2876c receives data from multiplexer 2877a and full adder 2876d. Multiplexer 2877a receives the signal "0" and receives data from full adder 2876e.
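The carry gating described for Figure 28A can be modeled in a few lines of software. The following Python is only a behavioral sketch under stated assumptions, not the circuit of the figure: the function name `fenced_add`, the 13-bit half width, and the boolean `long_mode` flag are hypothetical choices made for illustration. The `carry_into_hi` selection stands in for the mode-controlled multiplexer that picks either the carry from the low part or the constant 0.

```python
def fenced_add(a: int, b: int, long_mode: bool, half_bits: int = 13) -> int:
    """Add two words of 2*half_bits bits through an adder split by a fence.

    long_mode=True  : one long addition, the carry crosses the fence.
    long_mode=False : two independent short additions, the fence passes 0.
    All names and widths here are illustrative assumptions.
    """
    mask = (1 << half_bits) - 1
    a_lo, a_hi = a & mask, (a >> half_bits) & mask
    b_lo, b_hi = b & mask, (b >> half_bits) & mask

    lo_sum = a_lo + b_lo
    carry_from_lo = lo_sum >> half_bits  # carry out of the low part

    # Models the mode-controlled multiplexer: real carry or constant 0.
    carry_into_hi = carry_from_lo if long_mode else 0

    hi_sum = a_hi + b_hi + carry_into_hi
    # Carries out of each half are dropped so the result stays in range.
    return ((hi_sum & mask) << half_bits) | (lo_sum & mask)
```

For example, with `a = 0x1FFF` and `b = 1`, the long mode call propagates the carry into the upper half, while the short mode call discards it, so the two 13-bit lanes stay independent.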

Fence circuit 2878a separates multiplexer 2877a and full adder 28[illegible]. Half adder 2875b receives data from full adder 2876e. Full adder 2876e also sends data to full adder 2876f. Full adder 2876[illegible] sends data to full adder 2[illegible].

Figure 28B illustrates a fence circuit in the CPA, relating to the full adder and normalizer unit of Figure [illegible]. More specifically, [illegible] 2876j receives data, and full adder 2876j receives data from multiplexer 287[illegible]. Multiplexer 2878b receives the signal "0" and the output of full adder 2[illegible]. Fence circuit 2878b separates multiplexer 2877b and the full adder. The full adder receives data from half adder 2875c.

Figure 29 illustrates a fence circuit in the complement shift unit, which can be used in the data paths of Figures 22, 23, and 26. The upper left corner shows a top-level view of the fence circuit applied to the MAC CSA complement shift unit. More specifically, Figure 29 details CASU 2939a, fence circuit 2940, CASU low 2939b, and mode multiplexer 2914a. The channel 0 mode multiplexer 2914a receives long operand data and also receives data from the channel 0 mode multiplexer 2914b. The mode multiplexer provides input to function blocks 2901a and 2902a. Function block 2901a computes a predetermined function (as illustrated in the figure) and outputs N bits to function block 2902a. Function block 2902a computes a predetermined function (as illustrated in the figure) and outputs NZ bits to 3:1 multiplexer 2906a. Multiplexer 2906a also receives the signal "0", the mantissa M_H, and the inverted mantissa M_H. Multiplexer 2906a outputs 5+26 bits of data to barrel shifter H 2910a. Barrel shifter H 2910a also receives operand shift data from mode multiplexer 2908a, which receives the long data and the channel 0 data. Barrel shifter H 2910a outputs 5+26 bits of data to the CSA tree and the shiftoutH signal to shift data multiplexer 2912a. Shift data multiplexer 2912a also receives the signal "0" and outputs data to barrel shifter L 2910b. Fence circuit 2940 separates CASU 2939a and 2939b.

Mode multiplexer 2914b receives the channel 1 data and the long operand. Mode multiplexer 2914b provides data to function blocks 2901b and 2902b. Function block 2901b computes a predetermined function (as shown in the figure) and provides N bits to function block 2902b. Function block 2902b outputs NZ bits to 3:1 multiplexer 2906b. Multiplexer 2906b also receives the signal "0", the mantissa M_L, and the inverted mantissa M_L. Multiplexer 2906b sends data to barrel shifter L 2910b. Barrel shifter L 2910b also receives the operand shift signal from mode multiplexer 2908b and outputs data to the CSA tree. Mode multiplexer 2908b receives the long operand and the channel 1 data.
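The role of the shift data multiplexer in Figure 29 can be illustrated with a small behavioral model. This Python sketch is an assumption-laden illustration rather than the patented data path: the function name `fenced_shift_right`, the 13-bit lane width, and the restriction of shift amounts to `0..width` are all hypothetical, and the complementing 3:1 multiplexer is left out. In long mode the bits shifted out of the H shifter are fed into the L shifter; in short mode the multiplexer feeds the constant 0 instead, so each lane shifts on its own.

```python
def fenced_shift_right(hi: int, lo: int, long_mode: bool,
                       shift_hi: int, shift_lo: int, width: int = 13):
    """Right-shift a pair of lanes through shifters H and L joined by a fence.

    Returns (out_hi, out_lo). Names, widths, and the shift range are
    illustrative assumptions, not values taken from the figure.
    """
    if not (0 <= shift_hi <= width and 0 <= shift_lo <= width):
        raise ValueError("shift amounts must be in 0..width")
    mask = (1 << width) - 1
    hi &= mask
    lo &= mask

    if long_mode:
        # One long operand: both shifters use the same shift amount.
        shift_lo = shift_hi

    out_hi = hi >> shift_hi
    # Bits that fall off the bottom of H, aligned to the top of L.
    shift_out_hi = (hi << (width - shift_hi)) & mask

    # Models the shift data multiplexer: shifted-out bits or constant 0.
    fill = shift_out_hi if long_mode else 0

    out_lo = (lo >> shift_lo) | fill
    return out_hi, out_lo
```

In long mode the pair `(out_hi, out_lo)` equals the two halves of the 26-bit value `(hi << 13) | lo` shifted right as a whole; in short mode nothing crosses the fence.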
Figure 30A illustrates a normalizing shifter that can be used in the merged mantissa processing data path of Figures 23 and 24. More specifically, CPA 0 3047a receives 5+26 bits or 5+40 bits of data. CPA 0 3047a is separated from CPA 1 3047b by fence circuit 3048. CPA 1 3047b receives 5+26 bits of data. Leading zero detector LZD0 3028a receives data from CPA 0 3047a and sends data to shifter H 3034a. Leading zero detector LZDL 3030a receives data from CPA 0 3047a and CPA 1 3047b and outputs data to shifter H 3034a and shifter L 3034b. Leading zero detector LZD1 3032a receives data from CPA 1 3047b and outputs data to shifter L 3034b. Shifter L 3034b also receives data from LZDL 3030a, LZD1 3032a, and CPA 1 3047b. Shifter L 3034b outputs ML13. Similarly, shifter H 3034a receives data from LZD0 3028a, LZDL 3030a, and CPA 0 3047a. Shifter H 3034a outputs MH13.

Figure 30B illustrates the fence circuit of Figure 30A in detail. In this example, two shift amount control multiplexers are added to the shift data transfer multiplexer to implement the fence circuit and allow the unit to process either two short operands or one long operand. More specifically, mode multiplexer 3049 receives data from LZDL 3030b and LZD0 3028b. Mode multiplexer 3049 outputs shift amount data to shifter H 3034c, which also receives 2+13 bits of data as well as the output of shift data multiplexer 3045. Shifter H 3034c outputs 13 bits to output latch 3040.

Mode multiplexer 3041 receives data from LZD1 3032b and LZDL 3030b. Mode multiplexer 3041 sends shift amount data to shifter L 3034d, which also receives 2+13 bits of data. Shifter L 3034d sends data to shift data multiplexer 3045, which also receives the signal "0" and outputs to shifter H 3034c. Shifter L 3034d sends data to output latch 3040. The output latch outputs dl, (dh, dl), and dh.
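The selection between the per-lane leading zero detectors (LZD0, LZD1) and the long detector (LZDL) can be sketched as follows. This is a simplified software model under explicit assumptions: the names `leading_zeros` and `normalize`, the 13-bit lane width, and the returned tuple layout are hypothetical, and the model ignores the extra guard bits (the "2+" in 2+13) mentioned in the text. The `if long_mode` branch plays the part of the two mode multiplexers that pick the shift amount, and the combined left shift in long mode corresponds to bits crossing from shifter L into shifter H through the shift data multiplexer.

```python
def leading_zeros(value: int, width: int) -> int:
    """Count leading zeros of a non-negative value that fits in `width` bits."""
    return width - value.bit_length()


def normalize(hi: int, lo: int, long_mode: bool, width: int = 13):
    """Left-normalize one long operand or two short operands.

    Returns (out_hi, out_lo, (shift_hi, shift_lo)). Names and widths are
    illustrative assumptions, not values taken from the figures.
    """
    mask = (1 << width) - 1
    hi &= mask
    lo &= mask

    if long_mode:
        # LZDL: a single shift amount computed over both halves.
        wide = (hi << width) | lo
        shift = leading_zeros(wide, 2 * width)
        wide = (wide << shift) & ((1 << (2 * width)) - 1)
        return wide >> width, wide & mask, (shift, shift)

    # LZD0 and LZD1: each lane is normalized independently, and the
    # fence passes 0 between the shifters.
    shift_hi = leading_zeros(hi, width)
    shift_lo = leading_zeros(lo, width)
    return (hi << shift_hi) & mask, (lo << shift_lo) & mask, (shift_hi, shift_lo)
```

Calling `normalize(0, 1, True)` treats the input as one 26-bit value and moves the single set bit to the top of the high half, whereas `normalize(0, 1, False)` only moves it to the top of the low lane.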
Figure 31 illustrates a process flow for sending data to functionally split ALUs. More specifically, as shown in Figure 31, the computing system can determine whether received data is short format floating point data (block 3132). Upon determining that the data is short format floating point data, the computing system can functionally split a first ALU into a plurality of channels for processing according to an instruction set (block 3134). The computing system can functionally split a second ALU into a plurality of channels for processing according to the instruction set (block 3136). The computing system can send the processed data to the functionally split second ALU having a plurality of short format data channels (block 3138). Some embodiments of the process may include processing data in an SFU, where the SFU is configured to receive data from the first ALU and the second ALU.

The flowcharts discussed herein illustrate possible implementations of the logic architecture, functionality, and operation. Each block can therefore represent a module, a segment, or a portion of code, which may contain one or more executable instructions for implementing a specified logical function, circuit, or other type of logic. Note that in some embodiments the functions described in the blocks may occur out of the order shown. The invention is not limited to the data format sizes described; similar functionality can be implemented for other sizes, such as 34/68-bit or 64/128-bit data formats. In principle, any two related formats can be handled using the principles above. If the long format is not an integer multiple of the short format, some additional circuitry may be added to the data path for the case where some bits are unused. In addition, some embodiments may have a plurality of short format data channels and/or one long format data channel.

It is emphasized that the embodiments described above are merely examples of implementations of the invention, set forth only to clarify the principles of this disclosure. Many variations and modifications may be made to the embodiments described above without substantially departing from the spirit and scope of the claims.
All such modifications and variations are included within the scope of this disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

Many aspects of this disclosure can be better understood with reference to the following drawings. The components in the drawings are not necessarily drawn to scale; emphasis is instead placed on clearly illustrating the principles of this disclosure. In the drawings, like reference numerals designate corresponding parts. While several embodiments are described in connection with these drawings, there is no intent to limit the invention to those embodiments. On the contrary, this disclosure is intended to cover all alternatives, modifications, and equivalents.

Figure 1A is a flowchart illustrating the stream data processing steps of a vector processing unit.
Figure 1B is a flowchart illustrating the stream data processing steps of a scalar processing unit, similar to the steps of Figure 1A.
Figure 1C is a stream processing SIMD architecture with a software implementation of complex mathematical functions.
Figure 1D is a stream processing SIMD architecture with a hardware implementation of complex mathematical functions, in which each ALU uses its own SFU.
Figure 1E is a stream processing SIMD architecture with a hardware implementation of complex mathematical functions, in which all ALUs use a common SFU.
Figure 1F is a stream processing SIMD architecture with a hardware implementation of complex mathematical functions and interleaved access to a common SFU.
Figure 1G is an example of reducing the SIMD factor for vertex and triangle processing in a common SIMD architecture.
Figure 2A is a flowchart illustrating a scalar processing unit, similar to the flowchart of Figure 1, with a SIMD factor of 4.
Figure 2B is a flowchart illustrating a scalar processing unit, similar to the flowchart of Figure 1, with a SIMD factor of 1.
Figure 2C is a flowchart illustrating a scalar processing unit, similar to the flowchart of Figure 1, with a SIMD factor of 8 and data in short format.
Figure 2D is a flowchart illustrating a scalar processing unit, similar to the flowchart of Figure 1, with a SIMD factor of 4 and data in long format.
Figure 3 is a logical architecture of a paired ALU capable of dual-format processing, illustrating the processing features of Figures 1 and 2 as well as the functionality of a stream ALU.
Figure 4 is a stream processing unit with paired scalar ALUs in long format processing mode, similar to the architecture of Figure 3, also showing higher-level control circuitry and memory.
Figure 5A is a table illustrating the arithmetic functions of a paired scalar ALU, such as the ALUs of Figures 3 and 4, which can serve as the basis for developing a numeric processing instruction set.
Figure 5B is an image processing unit architecture with stream processors as the compute core; the stream processor is a scalable architecture that can include 2 to 16 ALUs and a reduced number of SFUs.
Figure 6 is a flowchart and logical architecture of a stream processor with four scalar ALUs and one SFU, similar to the ALUs of Figures 3 and 4.
Figure 7A is a flowchart illustrating a vector ALU processing a normalized vector difference.
Figure 7B is a flowchart illustrating a processing routine for the proposed stream scalar ALU combined with an SFU.
Figure 7C is a continuation of Figure 7B.
Figure 8 is an ALU module implementing the functionality of the ALU of Figure 6.
Figure 9 is a stream processor module combining four ALUs, similar to the ALUs of Figures 3 and 4.
Figures 10A-10C show a logical architecture and the data formats of a multiply-accumulate (MACC) unit, such as the multiply-accumulate unit of Figure 8.
Figure 11 is a multiply-accumulate unit architecture, similar to the multiply-accumulate unit of Figure 8.
Figure 12 illustrates a short exponent calculation, similar to the short exponent calculation of Figure 11.
Figure 13 illustrates a short exponent calculation combined with a mixed exponent, similar to the short exponent calculation of Figure 11.
Figure 14 illustrates a multi-channel short mantissa path, detailing the mantissa path of Figure 11.
Figure 15 illustrates a long exponent calculation, detailing the exponent calculation block of Figure 11.
Figure 16 illustrates a long exponent calculation for the other ALU of the pair, detailing the long exponent calculation block of Figure 11.
Figure 17 illustrates a long mantissa data path, detailing the data path of Figure 11.
Figure 18 illustrates a long mantissa data path for the other ALU of the pair, similar to the data path of Figure 11.
Figure 19 illustrates a mixed exponent calculation, detailing the mixed exponent calculation of Figure 11.
Figure 20 illustrates a mixed exponent calculation for the other ALU of the pair, similar to the mixed exponent calculation of Figure 19.
Figure 21 illustrates a mixed mantissa data path, detailing the data path of Figure 11.
Figure 22 illustrates a mixed mantissa data path for the other ALU of the pair, similar to the data path of Figure 21.
Figure 23 illustrates a merged mantissa data path that can process both short and long data formats, detailing a possible implementation of the data path of Figure 11.
Figure 24 illustrates a merged mantissa data path, similar to the data path of Figure 11.
Figure 25A illustrates merged shift and control logic, applicable to the multiply-accumulate unit of Figures 23 and 24.
Figure 25B illustrates sign control logic, applicable to the multiply-accumulate unit of Figures 23 and 24.
Figure 26 is a table illustrating the complement shift input and output formats, applicable to the multiply-accumulate unit of Figure 11.
Figure 27A illustrates a mantissa addition path, applicable to the multiply-accumulate unit of Figures 23 and 24.
Figure 27B illustrates processing formats applicable to the multiply-add (MAD) carry-save adder (CSA) tree unit of Figures 23 and 24.
Figure 27C is a continuation of the processing formats of Figure 27B.
Figure 28A illustrates a fence application in a carry-save adder, used in the multiply-accumulate unit of Figures 23 and 24.
Figure 28B illustrates a fence application in a carry-propagate adder (CPA), used in the multiply-accumulate unit of Figures 23 and 24.
Figure 29 illustrates a fence application in the complement shift unit, used in the multiply-accumulate unit of Figures 23 and 24.
Figure 30A illustrates a fence application in the normalizing shifter, used in the multiply-accumulate unit of Figures 23 and 24.
Figure 30B illustrates the fence of Figure 30A in more detail.
Figure 31 is a flowchart illustrating a process that can be used to send data to a functionally split ALU.

DESCRIPTION OF MAIN REFERENCE NUMERALS

Figures 1A-B: input buffer, conventional memory 102; input buffer, 4-bank orthogonal access memory 122; vector ALU 104; scalar ALU 124; SFU 106
Figures 2A-B: ALU0 204a; ALU1 204b; ALU2 204c; ALU3 204d; element shuffler 226
Figure 3: accumulation register 370; cache memory input data module 372; ALU port P0 376; ALU port P1 378; ALU port P2 380; input multiplexers 382a, 384a, 382b, 384b; delay register 383; multipliers 386a, 388a, 390a, 392a, 386b, 388b, 390b, 392b; adders 396a, 399a, 396b, 399b; multiply-accumulate units 394a, 397a, 394b, 397b; 13-bit shifters and enable devices 398a, 398b; bypass element 395; output CL data element 393
Figure 4: cache memory unit 472; memory output multiplexer 474; input ports P0 476, P1 478, P2 480; input multiplexer latches 482a, 482b; delay registers 386, 483; SIMD microcode controller 488; ALU control and addressing element 490; SFU 470; multiplexer 484
Figure 6: input data 602a-d; control and address signals of the instruction decoder 602e; common data 602f; SFU 670; delay registers 683a-q; output buffer 604; multiplexer 672
Figure 8: dual-format multiply-accumulate unit 872; multiplexers 870, 874; delay registers 883a-e; write-back register 876; accumulator 878; scratch SRAM for the local ALU register table 880; local control unit with state machine and address generator 882
Figure 9: multiplexer 970; SFU 980
Figure 11: SECS0 1120; LECS 1040; MESEC1 1130; final adder and normalization unit 1147; CASU 1139, 1144; multiply-add register 1143; MAD CSA unit 1141; MAC CSA unit 1145; multipliers 1131, 1133, 1135, 1137
Figure 12: adders 1212, 1214, 1204, 1206, 1208, 1216, 1218, 1222, 1224, 1241, 1244, 1246; multiplexers 1210, 1226, 1232, 1234, 1236; MAC exponent register 1128; priority encoder 1220; zero exponent detector 1202; inverters 1250, 1252, 1254; OR circuit 1230; AND circuit 1240
Figure 13: zero exponent detector 1302; MAC register 1328; multiplexer 1355
Figure 14: multipliers 1431, 1433, 1435, 1437; CASU 1439a-h; multiplexer 1432; carry-save adder trees 1441a, 1441b, 1445a, 1445b; MAC short mantissa registers 1430a, 1430b; full adders and normalizers 1447a, 1447b
Figures 15-16: adders 1501, 1503, 1505, 1507, 1509, 1257, 1259; MAC exponent register 1515; AND circuit 1225; multiplexers 1511, 1513, 1521, 1523; CAT unit 1517; multiplexer 1602
Figures 17-18: shifters 1743, 1749, 1753, 1769; multipliers 1731, 1733, 1735, 1737; CASU 1739a-g; adders 1741a, 1741b, 1745; MAC mantissa register 1747; full adder and normalizer 1747; multiplexer 1805
Figures 19-20: adders 1903, 1905, 1907, 1909, 1911, 1915, 1917, 1919, 1921, 1949, 1947, 1945; multiplexers 1913, 1923, 1935, 1937, 1939, 2001; encoder 1920; inverters 1929, 1931, 1933; MAC exponent register 1943; OR circuit 1925; AND circuit 1941
Figures 21-22: multipliers 2131, 2133, 2135, 2137; register 2143; 13-bit shifters 2105, 2107, 2109; multiplexer 2202; CASU 2139a-g; 1/2 MAD CSA tree 2141a; MAD CSA trees 2141b, 2145; full adder and normalizer 2147
Figures 23-24: multipliers 2331, 2333, 2337, 2339; CASU 2339a-g; registers 2342a, 2342b; multiplexers 2308, 2310, 2312, 2316, 2318, 2323, 2325, 2326, 2402; 13-bit shifters 2302, 2304, 2306, 2320; MAC CSA tree 2345; 1/2 MAD CSA tree 2341a; MAD CSA tree 2341b; CPA 2347a, 2347b; LZD 2328, 2330, 2332; output latch 2340; shifters 2334a, 2334b; CPA 2336a, 2336b; fence circuit 2338
Figures 25A-B: multiplexers 2550, 2552, 2554, 2556, 2558, 2560, 2562, 2564, 2566, 2568, 2570, 2572, 2574, 2576, 2578, 2580, 2582, 2584, 2586, 2588, 2590, 2592, 2594, 2596, 2598, 2599, 2597, 2595
Figures 27A-B: 1/2 MAD CSA tree 2741a; MAD CSA tree 2741b; MAC CSA trees 2745a, 2745b; fence circuit 2746; multiplexers 2750, 2754, 2756
Figures 28A-B: half adders 2875a-c; full adders 2876a-k
Figure 29: CASU 2939a, 2939b; fence circuit 2940; mode multiplexers 2908a, 2908b, 2914a, 2914b; 3:1 multiplexers 2906a, 2906b; shift data multiplexer 2912a; function blocks 2901a, 2901b, 2902a, 2902b; inverters 2904a, 2904b; barrel shifters 2910a, 2910b
Figures 30A-B: CPA 3047a, 3047b; fence circuit 3048; LZD 3028a, 3028b, 3030a, 3030b, 3032a, 3032b; shifters 3034a-d; mode multiplexers 3041, 3049; shift data multiplexer 3045; output latch 3040


Claims (1)

Case No. 096104282, amended version of September 20, ROC year 99 (2010)

X. Claims:

1. A stream processor capable of processing data of multiple formats, the stream processor comprising:
a first arithmetic logic unit (ALU) configured to:
process a plurality of first sets of short format data in response to a short format control signal received from an instruction set; and
process a first set of long format data in response to a long format control signal received from the instruction set; and
a second arithmetic logic unit configured to:
process a plurality of second sets of short format data in response to the short format control signal received from the instruction set;
process a second set of long format data in response to the long format control signal received from the instruction set;
receive processed data from the first arithmetic logic unit; and
process input data and the processed data from the first arithmetic logic unit according to a control signal of the instruction set;
wherein, when output data of the first arithmetic logic unit is sent to the second arithmetic logic unit as an operand in a long format mode, the instruction set is used to control a variable single instruction multiple data (SIMD) folding mode; and wherein output data of a first channel of the first arithmetic logic unit is sent, as an operand in a short format mode, to a second channel of the first arithmetic logic unit.

2. The stream processor of claim 1, further comprising a special function unit (SFU) configured to provide additional computational functions to the first arithmetic logic unit and the second arithmetic logic unit.

3. The stream processor of claim 1, wherein the first arithmetic logic unit is a scalar arithmetic logic unit.

4. The stream processor of claim 1, wherein the second arithmetic logic unit is a scalar arithmetic logic unit.

5. The stream processor of claim 1, wherein, in response to receiving the short format data, the stream processor functionally splits at least one pair of the arithmetic logic units to facilitate dual-format processing of short and long formats with a variable single instruction multiple data (SIMD) factor.

6. The stream processor of claim 1, wherein the instruction set includes an instruction for processing variable format data in a plurality of different modes.

7. The stream processor of claim 1, wherein the instruction set includes at least one of the following: a normal type instruction, a mixed type instruction, and a cross type instruction, for use in short format data processing and long format data processing.

8. The stream processor of claim 1, wherein the instruction set includes at least one instruction for processing in at least one of the following modes: a short format operand mode, a long format operand mode, and a mixed format operand mode.

9. The stream processor of claim 1, wherein the special function unit is coupled to the first arithmetic logic unit and the second arithmetic logic unit.

10. A method for processing data of multiple formats, the method comprising:
determining whether received data is short format data;
in response to determining that the received data is short format data, functionally splitting a first arithmetic logic unit into a plurality of channels for processing according to an instruction set;
functionally splitting a second arithmetic logic unit into a plurality of channels for processing according to the instruction set;
processing data in the first arithmetic logic unit; and
sending the processed data to the functionally split second arithmetic logic unit having a plurality of short format data channels;
wherein, when output data of the first arithmetic logic unit is sent to the second arithmetic logic unit as an operand in a long format mode, the instruction set is used to control a variable single instruction multiple data (SIMD) folding mode; and wherein output data of a first channel of the first arithmetic logic unit is sent, as an operand in a short format mode, to a second channel of the first arithmetic logic unit.

11. The method of claim 10, wherein the first arithmetic logic unit is configured to process short format data and long format data.

12. The method of claim 10, wherein the second arithmetic logic unit is configured to process short format data and long format data.

13. The method of claim 10, wherein the first arithmetic logic unit operates as a scalar arithmetic logic unit.

14. The method of claim 10, wherein the second arithmetic logic unit operates as a scalar arithmetic logic unit and has at least one of the following: a plurality of short format data channels and one long format data channel.

15. The method of claim 10, further comprising processing data in a special function unit, wherein the special function unit receives data from the first arithmetic logic unit and the second arithmetic logic unit.

16. The method of claim 10, wherein the instruction set includes an instruction for processing variable format data in a plurality of different modes.

17. The method of claim 10, wherein the instruction set includes at least one of the following: a normal type instruction, a mixed type instruction, and a cross type instruction.

18. A stream processor module capable of processing data of multiple formats, the stream processor module comprising:
a first arithmetic logic unit configured to receive first input data and control data, the control data indicating a format of the first input data, the first arithmetic logic unit further processing short format input data and long format input data according to the control data;
a second arithmetic logic unit configured to receive the control data from the first arithmetic logic unit, the second arithmetic logic unit further configured to process second input data, the second input data being related to the first input data, and the second arithmetic logic unit further processing short format input data and long format input data according to the control data;
a third arithmetic logic unit configured to receive the control data from the second arithmetic logic unit, the third arithmetic logic unit further configured to receive third input data, the third input data being related to the first input data and the second input data, and the third arithmetic logic unit further processing short format input data and long format input data according to the control data; and
a fourth arithmetic logic unit configured to receive the control data from the third arithmetic logic unit, the fourth arithmetic logic unit further configured to receive fourth input data, the fourth input data being related to the first, second, and third input data, and the fourth arithmetic logic unit processing short format data and long format data according to the control data;
wherein, when output data of the first arithmetic logic unit is sent to the second arithmetic logic unit as an operand in a long format mode, the instruction set is used to control a variable single instruction multiple data (SIMD) folding mode; and wherein output data of a first channel of the first arithmetic logic unit is sent, as an operand in a short format mode, to a second channel of the first arithmetic logic unit.

19. The stream processor module of claim 18, wherein the first arithmetic logic unit, the second arithmetic logic unit, and the third arithmetic logic unit are configured to receive operation data from a special function unit, the operation data indicating the operation to be performed on the received input data.

20. The stream processor module of claim 18, wherein the first arithmetic logic unit is further configured to receive common data, the first arithmetic logic unit sends the common data to the second arithmetic logic unit, the second arithmetic logic unit sends the common data to the third arithmetic logic unit, and the third arithmetic logic unit sends the common data to the fourth arithmetic logic unit.

21.
如專利申請範圍第18項的該串流處理器模組,其 中,至少下列其中之一用以處理短格式資料和長格式資 料:該第一算術邏輯單元,該第二算術邏輯單元,該第三 算術邏輯單元,以及該第四算術邏輯單元。 S3U04-0012-TW/0608-A41110-TW/Final 1 84S3U04-0012-TW/0608-A41110-TW/Finall 81 1335550 When the output data of the first arithmetic logic unit is sent to the second arithmetic logic unit as an operation element of a long format mode, the instruction set is used to control a variable single-instruction multi-data folding mode; wherein an output data of a first channel of the first arithmetic logic unit is sent to a second of the first arithmetic logic unit by an operation element in a short format mode aisle. 11. The method of claim 10, wherein the first arithmetic logic unit is configured to process short format data and I long format data. 12. The method of claim 10, wherein the second arithmetic logic unit is configured to process short format data and long format data. 13. The method of claim 10, wherein the first arithmetic logic unit operates as a scalar arithmetic logic unit. 14. The φ method of claim 10, wherein the second arithmetic logic unit operates as a scalar arithmetic logic unit and has at least one of the following: a plurality of short format data. Channel and channel for a long format data. 15. The method of claim 10, wherein the method of processing a plurality of format data further comprises processing data in a special function unit, wherein the special function unit is from the first arithmetic logic unit and the second arithmetic logic unit Receive data. 16. The method of claim 10, wherein the instruction set includes an instruction to process a plurality of different modes S3U04-0012-TW/0608-A41110-TW/Final 1 82 1335550 Variable format data. 17. The method of claim 10, wherein the instruction set comprises at least one of the following: a normal type instruction, a mixed type instruction, and a cross type instruction. 18. 
A streaming processor module capable of processing a plurality of format data, the streaming processor module comprising: a first arithmetic logic unit for receiving first input data and control data, the control data for indicating a format of the first input data, and the first arithmetic logic unit further processes the short format input data and the long format input data according to the control data; a second arithmetic logic unit for receiving the first arithmetic logic unit Controlling the data, the second arithmetic logic unit is further configured to process the second input data, and the second input data is related to the first input data, and the second arithmetic logic unit further processes the short format input data according to the control data. And a long format input data; a third arithmetic logic unit for receiving the control data from the second arithmetic logic unit, the third arithmetic logic unit further configured to receive the third input data, and the third input data and the The first input data is related to the second input data, and the third arithmetic logic unit is further based on the control poor Processing the short format input data and the long format input data, and a fourth arithmetic logic unit for receiving the control data from the third arithmetic logic unit, the fourth arithmetic logic unit further configured to receive the fourth input data, And the fourth input data is related to the first, second, and third input materials, and the fourth arithmetic logic unit processes the S3U04-0012-TW/0608-A41110-TW/Final 1 83 1335550 according to the control data. 
a format tribute and a long format tribute, wherein when the output data of the first arithmetic logic unit is sent to the second arithmetic logic unit as an operation element of a long format mode, the instruction set is used to control the variable single The instruction multi-data folding mode; wherein the output data of a first channel of the first arithmetic logic unit is sent to a second channel of the first arithmetic logic unit by an operation element in a short format mode. 19. The streaming processor module of claim 18, wherein the first arithmetic logic unit, the second arithmetic logic unit, and the third arithmetic logic unit are configured to receive a special function The operation data of the unit, the operation data is used to indicate the execution operation of the received input data. 20. The stream processor module of claim 18, wherein the first arithmetic logic unit is further configured to receive a common data, the first arithmetic logic unit transmitting the common data to the second arithmetic logic unit And the second arithmetic logic unit transmits the common data to the third arithmetic logic unit, and the third arithmetic logic unit transmits the common data to the fourth arithmetic logic unit. 21. The stream processor module of claim 18, wherein at least one of the following is for processing short format data and long format data: the first arithmetic logic unit, the second arithmetic logic unit, The third arithmetic logic unit, and the fourth arithmetic logic unit. S3U04-0012-TW/0608-A41110-TW/Final 1 84
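The dual-format behaviour recited in claims 1 and 10 can be pictured with a small software model. The sketch below is only an illustration under stated assumptions, not the patented hardware: the 16-bit short and 32-bit long widths, the two-channel split, and the names `ScalarALU`, `execute`, and `execute_folded` are all hypothetical and do not come from the claims. It shows one unit processing either a single long format operand or two independent short format channels, a folding step in which the first channel's output becomes a short format operand of the second channel, and a first unit's output being passed to a second unit as a long format operand.

```python
# Illustrative model only; widths and names are assumptions, not from the claims.
SHORT_BITS = 16                      # assumed short format width
LONG_BITS = 32                       # assumed long format width
SHORT_MASK = (1 << SHORT_BITS) - 1
LONG_MASK = (1 << LONG_BITS) - 1


def split_channels(word):
    """Split one long word into two short format channels (low, high)."""
    return [word & SHORT_MASK, (word >> SHORT_BITS) & SHORT_MASK]


def join_channels(channels):
    """Pack two short format channel results back into one long word."""
    return (channels[0] & SHORT_MASK) | ((channels[1] & SHORT_MASK) << SHORT_BITS)


class ScalarALU:
    """Hypothetical ALU that runs either one long lane or two short channels."""

    def execute(self, op, a, b, short_format):
        if not short_format:
            # Long format: the whole word is a single operand.
            return op(a, b) & LONG_MASK
        # Short format: the unit is functionally partitioned into channels,
        # and each channel wraps around independently of its neighbour.
        results = [op(x, y) for x, y in zip(split_channels(a), split_channels(b))]
        return join_channels(results)

    def execute_folded(self, op, a, b):
        # Folding inside one unit: the first channel's output is used as a
        # short format operand of the second channel. The high half of `a`
        # is ignored in this simplified model.
        a0, _ = split_channels(a)
        b0, b1 = split_channels(b)
        first = op(a0, b0) & SHORT_MASK
        second = op(first, b1) & SHORT_MASK
        return join_channels([first, second])


def add(x, y):
    return x + y


alu1, alu2 = ScalarALU(), ScalarALU()

# Same bit patterns, two formats: the carry crosses bit 16 only in long format.
long_result = alu1.execute(add, 0x0001FFFF, 0x00000001, short_format=False)
short_result = alu1.execute(add, 0x0001FFFF, 0x00000001, short_format=True)

# Channel 0 computes 2 + 5 = 7; channel 1 then computes 7 + 3 = 10.
folded_result = alu1.execute_folded(add, 0x00000002, 0x00030005)

# Output of the first unit forwarded to the second unit as a long operand.
chained = alu2.execute(add, long_result, 0x10, short_format=False)

print(hex(long_result), hex(short_result), hex(folded_result), hex(chained))
# prints: 0x20000 0x10000 0xa0007 0x20010
```

In this model the short format control signal is reduced to a boolean flag and the SIMD factor is fixed at two channels per unit; a closer model of the variable SIMD factor would make the channel count a parameter.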
TW96104282A 2006-02-06 2007-02-06 Stream processor with variable single instruction multiple data (simd) factor and common special function TWI335550B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US76557106P 2006-02-06 2006-02-06
US11/671,610 US20070186082A1 (en) 2006-02-06 2007-02-06 Stream Processor with Variable Single Instruction Multiple Data (SIMD) Factor and Common Special Function

Publications (2)

Publication Number Publication Date
TW200809690A TW200809690A (en) 2008-02-16
TWI335550B TWI335550B (en) 2011-01-01

Family

ID=38335357

Family Applications (1)

Application Number Title Priority Date Filing Date
TW96104282A TWI335550B (en) 2006-02-06 2007-02-06 Stream processor with variable single instruction multiple data (simd) factor and common special function

Country Status (1)

Country Link
TW (1) TWI335550B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI482085B (en) * 2011-02-03 2015-04-21 Intel Corp Method and apparatus of stream compaction for rasterization, and non-transitory computer readable medium
TWI514274B (en) * 2011-12-14 2015-12-21 Intel Corp System, apparatus and method for loop remainder mask instruction

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10223071B2 (en) * 2017-04-14 2019-03-05 Qualcomm Incorporated Energy-efficient variable power adder and methods of use thereof

Also Published As

Publication number Publication date
TW200809690A (en) 2008-02-16

Similar Documents

Publication Publication Date Title
TWI361379B (en) Dual mode floating point multiply accumulate unit
US4689738A (en) Integrated and programmable processor for word-wise digital signal processing
US6502117B2 (en) Data manipulation instruction for enhancing value and efficiency of complex arithmetic
US7680873B2 (en) Methods and apparatus for efficient complex long multiplication and covariance matrix implementation
TWI467477B (en) Vector friendly instruction format and execution thereof
RU2263947C2 (en) Integer-valued high order multiplication with truncation and shift in architecture with one commands flow and multiple data flows
TW310406B (en)
US9104510B1 (en) Multi-function floating point unit
US6480868B2 (en) Conversion from packed floating point data to packed 8-bit integer data in different architectural registers
US8239438B2 (en) Method and apparatus for implementing a multiple operand vector floating point summation to scalar function
EP2455854B1 (en) System and method for on-the-fly permutations of vector memories for executing intra-vector operations
JPH0850575A (en) Programmable processor,method for execution of digital signal processing by using said programmable processor and its improvement
US8051123B1 (en) Multipurpose functional unit with double-precision and filtering operations
JPH087083A (en) Three-input arithmetic and logic unit for formation of arithmetic and logic mixed combination
WO1998032071A9 (en) Processor with reconfigurable arithmetic data path
JPH07271969A (en) Device conductiong storage in memory with condition attachedfrom registor pair
WO1998032071A2 (en) Processor with reconfigurable arithmetic data path
TW200805146A (en) Instruction set encoding in a dual-mode computer processing environment
JPH086544A (en) Rotary register for orthogonal data conversion
TWI335550B (en) Stream processor with variable single instruction multiple data (simd) factor and common special function
IL169374A (en) Result partitioning within simd data processing systems
US8352528B2 (en) Apparatus for efficient DCT calculations in a SIMD programmable processor
US6212627B1 (en) System for converting packed integer data into packed floating point data in reduced time
US6922771B2 (en) Vector floating point unit
EP0333306B1 (en) Single chip integrated circuit digital signal processor with a slow or fast mode of operation