200529041 (1)

IX. Description of the Invention

[Technical Field of the Invention]

The present invention relates generally to the field of data processing. More particularly, the present invention relates to systems and methods for processing using a special-purpose processor.
[Prior Art]

Desktop computers and other data processing systems typically include a central processing unit (CPU) to perform arithmetic calculations, logic operations, control functions, and/or other processing. Many applications are processor-intensive. For example, when rendering a three-dimensional (3D) scene, each image object is typically depicted using hundreds, thousands, or even tens of thousands of geometric objects called primitives (usually triangles or other polygons). A scene may be represented by a combination of hundreds or thousands of primitives. The surface of each object may be textured and shaded to create a realistic-looking 3D image. The computation required to define, position, texture, shade, and render the primitives within a specified time limit can exceed the processing capacity (or bandwidth) of the CPU.

Many approaches have been developed to offload processing from the CPU. One approach is to add one or more additional general-purpose CPUs in a multiprocessing configuration. A disadvantage of this approach is that general-purpose CPUs may not be well suited to the computational needs of some applications. In addition, multiprocessing entails some synchronization and management overhead, which can create inefficiency in the main CPU.

Instead of adding CPUs, a special-purpose processor can offload specific tasks from the CPU. For example, in graphics applications, a special-purpose processor called a graphics processing unit (GPU) is sometimes used to offload from the CPU the computations associated with generating and/or rendering 3D graphics. Special-purpose processors can also be used to control data storage disks, network communications, or other functions. Under the control of an application program or operating system (OS), driver software is used to manage the interface to the special-purpose processor.

Known systems and methods that offload computation from a CPU to a special-purpose processor have various disadvantages, however.
For example, in the case of graphics processing, even the GPU can become overburdened. In addition, in known applications, when the special-purpose processor fails, the overall function performed by the special-purpose processor is lost.

Accordingly, there is a need for systems and methods that can accelerate a special-purpose processor such as a GPU.

SUMMARY OF THE INVENTION

Embodiments of the present invention use at least one auxiliary processor to accelerate at least one special-purpose processor, such as a GPU, or a driver that manages the special-purpose processor. Other embodiments of the invention may be implemented selectively. Any disclosed embodiment may employ more than one special-purpose processor and/or more than one auxiliary processor. Embodiments of the invention are fault-tolerant: if the auxiliary processor is inoperable, the GPU or other special-purpose processor can perform all of the computation itself, although possibly with reduced performance. The auxiliary processor may also be used selectively, based on performance considerations.

The features and advantages of the invention will become apparent from the figures and the detailed description that follow.

[Embodiments]

Embodiments of the present invention use an auxiliary processor to accelerate the processing of a special-purpose processor; a graphics processing unit (GPU) is one example of such a special-purpose processor. In describing embodiments of the invention, four functional architectures are presented with reference to FIGS. 1-4. A method of fault-tolerant operation, for example when the auxiliary processor is not operating, is described with reference to FIG. 5A. A method of selectively using the auxiliary processor is described with reference to FIG. 5B. Two examples of auxiliary processors are then provided with reference to FIGS. 6 and 7.
FIGS. 8 and 9 provide two applications of embodiments of the invention in the field of graphics processing: vertex shading acceleration and two-pass Z-cull, respectively.

The subheadings that follow are for organizational convenience only; a particular feature may be described under more than one heading.

Architecture

FIGS. 1-4 show alternative functional architectures of a system having application software, a driver component, a special-purpose processor, and an auxiliary processor that accelerates the special-purpose processor. In the illustrated embodiments, the driver is a graphics driver 110, the special-purpose processor is a GPU (120, 210, 310, and 410, respectively), and an auxiliary processor (115, 205, 305, and 405, respectively) is used to accelerate the GPU (120, 210, 310, and 410, respectively). In the illustrated embodiments, the application software 105 and the graphics driver 110 may reside on, or be executed by, a CPU (not shown). The graphics driver 110 manages the processing tasks performed on the auxiliary processor and/or the GPU.

FIG. 1 is a block diagram of a functional system architecture according to an embodiment of the present invention. As shown, the graphics driver 110 provides data A (125) to the auxiliary processor 115 and to the GPU 120. The auxiliary processor 115 outputs A' (130), a transformation of A (125), to the GPU 120. The GPU 120 then uses A (125) and A' (130) as inputs to produce output B (135). A' (130) enables the GPU 120 to produce output B (135) faster than it could from input A (125) alone.
The program] 1 〇 special ί The graphics architecture of GPU 4 1 0 can also be added.> 8- 200529041 (5) Fast graphics driver 1 1 0. In an embodiment of the present invention, the graphics driver 1 1 0 is based on application specific performance requirements or resource availability. Selectively implement more than two functional architectures. For example, for a processing task, the graphics driver 110 implements the functional architecture of FIG. 1 'and for different processing tasks, the graphics driver 100 implements the functional architecture of FIG. 4. Therefore, The embodiments of the present invention can be used alternately or in combination to provide a flexible processing solution. The above architecture can be modified without departing from the scope and spirit of the invention. For example, although each embodiment of Figures 1-4 refers to applications involving graphics processing, The invention can be used for other drivers or interfaces to replace the graphics driver 110, and another special-purpose processor can be used to replace the GPU (135, 210, 310, 4 10). In addition, any functional architecture of Figure 1-4 can be modified Make more The auxiliary processor provides conversion to a GPU (135, 2 10'3 1 0, or 4 1 0) or other special-purpose processors to speed up processing. In addition, in other embodiments, a single auxiliary processor can be used to accelerate multiple GPU (1 3 5, 2] 0, 3] 0 or 4 1 0) or other special-purpose processors. Therefore, according to application requirements, the embodiments of the present invention can be scaled. According to the application, the auxiliary processor (1 1 5 , 2 0 5, 3 5 5, 4 0 5, 6 2 5, 73 0) can have the ability to perform fairly simple tasks. For example, in a graphics processing environment, the auxiliary processor can perform the first z-cull processing (under (Described below). 
In other embodiments, the auxiliary processor (115, 205, 305, 405, 625, 730) may have all of the capabilities of the GPU (120, 210, 310, 410, 635, 735) or of the other special-purpose processor that the auxiliary processor accelerates.

Fault Tolerance

FIG. 5A is a flowchart of a fault-tolerance method according to an embodiment of the present invention. FIG. 5A shows a method of responding to a failure of the auxiliary processor 115, 305, or 405. As shown, the process begins at step 505 and advances to conditional step 510, which determines whether the auxiliary processor is operational. When the result of conditional step 510 is affirmative (yes), the process advances to step 515, where the GPU or other special-purpose processor operates on inputs A and A', or on A' alone (A' being the output of the auxiliary processor, as shown in FIGS. 1-4). When the result of conditional step 510 is negative (no), the process advances to step 520, where the GPU or other special-purpose processor operates on input A alone (i.e., without any result from the auxiliary processor).

Depending on design choices, the fault-tolerance process of FIG. 5A may be implemented for any of the architectures of FIGS. 1, 3, and 4.

If an auxiliary processor fails and the GPU or other special-purpose processor operates on A alone (e.g., at step 520), performance may degrade. For example, depending on design choices, it may be predetermined that when one or more auxiliary processors fail, one or more of pixel resolution, color resolution, or frame rate will be reduced.

Selective Use of the Auxiliary Processor

Even where one or more auxiliary processors are operational, using them does not necessarily improve performance compared to using the special-purpose processor alone. Selective use of the auxiliary processor can therefore be advantageous.
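The fault-tolerance flow of FIG. 5A can be sketched as a simple dispatch: the same output B is produced whether or not the auxiliary processor is operational, and only the division of labor changes. The worker functions below are hypothetical stand-ins, since the document does not specify the actual computation.

```python
# Sketch of FIG. 5A: if the auxiliary processor is inoperative, the GPU
# falls back to computing everything from input A alone (degraded path).

def aux_transform(a):
    # Auxiliary processor precomputes A' (hypothetical transformation).
    return [x * x for x in a]

def gpu_compute(a, a_prime=None):
    # Step 515: operate on A and A'. Step 520: operate on A alone.
    if a_prime is not None:
        return sum(a_prime)           # fast path: A' is precomputed
    return sum(x * x for x in a)      # degraded path: GPU does all the work

def process(a, aux_operational):
    # Conditional step 510: is the auxiliary processor operational?
    if aux_operational:
        return gpu_compute(a, aux_transform(a))   # step 515
    return gpu_compute(a)                          # step 520

# Either branch yields the same output B; only performance differs.
assert process([1, 2, 3], aux_operational=True) == \
       process([1, 2, 3], aux_operational=False)
```

The assertion at the end captures the fault-tolerance claim of the text: correctness is preserved on failure, at the cost of speed.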
FIG. 5B is a flowchart of a method of selectively using an auxiliary processor according to an embodiment of the present invention. As shown, the process begins at step 525 and advances to conditional step 530, which determines whether use of the auxiliary processor improves performance. Performance may be measured by processing speed, accuracy, or other criteria. When the result of conditional step 530 is affirmative (yes), the process advances to step 535, where the GPU or other special-purpose processor operates on inputs A and A', or on A' alone (A' being the output of the auxiliary processor, as shown in FIGS. 1-4). When the result of conditional step 530 is negative (no), the process advances to step 540, where the GPU or other special-purpose processor operates on input A alone (i.e., without any result from the auxiliary processor).

At least three embodiments of conditional step 530 are possible; they may be used alternately or in combination. In the first embodiment of conditional step 530, it is predetermined which applications or tasks achieve improved performance through use of the auxiliary processor. In this case, the operation of conditional step 530 is based on predetermined settings, which may be stored in a lookup table. In the second embodiment of conditional step 530, historical data (e.g., a log of actual processing times with and without the auxiliary processor) is used to determine whether use of the auxiliary processor improves performance. For example, the operation of conditional step 530 may include a comparison of average processing times with and without the auxiliary processor. In the third embodiment of conditional step 530, the determination of whether the auxiliary processor improves performance is based on instantaneous or near-instantaneous knowledge.
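The three embodiments of conditional step 530 can be sketched together as one decision function. The table contents, task names, timings, and deadline check below are hypothetical illustrations; the document specifies only the three decision strategies, not their parameters.

```python
# Sketch of conditional step 530: decide whether the auxiliary processor
# improves performance via (1) a predetermined lookup table, (2) historical
# average processing times, or (3) near-instantaneous knowledge (did A'
# arrive before the processing deadline?). All values are illustrative.

PREDETERMINED = {"vertex_shading": True, "memory_copy": False}  # embodiment 1

def avg(xs):
    return sum(xs) / len(xs)

def aux_improves_performance(task, history=None, aux_done_at=None, deadline=None):
    # Embodiment 1: predetermined settings in a lookup table.
    if task in PREDETERMINED:
        return PREDETERMINED[task]
    # Embodiment 2: logged processing times with and without the auxiliary.
    if history and (task, "with_aux") in history and (task, "without") in history:
        return avg(history[(task, "with_aux")]) < avg(history[(task, "without")])
    # Embodiment 3: near-instantaneous knowledge -- was A' ready in time?
    if aux_done_at is not None and deadline is not None:
        return aux_done_at <= deadline
    return True  # default: try the auxiliary path

history = {("blur", "with_aux"): [4.0, 4.2], ("blur", "without"): [3.1, 3.0]}
print(aux_improves_performance("vertex_shading"))                   # → True
print(aux_improves_performance("blur", history))                    # → False
print(aux_improves_performance("warp", aux_done_at=9, deadline=8))  # → False
```

As the text notes, these strategies may also be combined; the sketch simply tries them in order of availability.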
For example, referring to FIG. 1, if the GPU 120 does not receive A' in time to begin processing frame N+1, conditional step 530 may determine that the auxiliary processor 115 is not improving performance. On the other hand, if the GPU 120 receives A' in time to begin processing frame N+2, conditional step 530 may determine that the auxiliary processor is improving performance. Referring to FIG. 2, the auxiliary processor 205 may poll a status register of the GPU 210 to determine the earliest point at which the GPU 210 can begin processing data. When the GPU 210 can begin processing but the auxiliary processor 205 has not completed its computation, the auxiliary processor can send A, rather than A', to the GPU 210. Referring to FIG. 3, the normal operating mode of the GPU 310 may be to retrieve A' from the auxiliary processor 305 when the GPU 310 begins processing A. When the auxiliary processor 305 receives the retrieval command from the GPU 310 but has not finished computing A', the auxiliary processor 305 sends a zero to the GPU 310 in response to the retrieval command. When the GPU 310 receives the zero, the result of conditional step 530 is negative (no), and the GPU 310 processes A alone (step 540).

As described above, depending on design requirements, the operation of conditional step 530 may be performed in the graphics driver, the auxiliary processor, and/or the GPU.

Auxiliary Processor Examples

FIGS. 6 and 7 provide more detailed diagrams of the functional architectures described above. Any of the functional architectures of the preceding section may be implemented as described with reference to FIG. 6 or FIG. 7. Other implementations are also possible.

FIG. 6 is a block diagram of a functional system architecture illustrating an example auxiliary processor according to an embodiment of the present invention.
As shown, the CPU 605 hosts application software 610 and a graphics driver 615. Core logic 620 includes an integrated auxiliary processor 625. The core logic 620 may be or include a chipset, such as a north bridge and/or a south bridge. A north bridge chipset typically connects the CPU to the PCI bus and/or to system memory; a south bridge chipset typically controls the Universal Serial Bus (USB) and/or the Integrated Drive Electronics (IDE) bus, and/or performs power management, keyboard/mouse control, or other functions. The core logic 620 is coupled to memory 630 and to the GPU 635. The memory 630 may be system memory or local memory. The integrated auxiliary processor 625 accelerates the GPU 635 or another special-purpose processor.

FIG. 7 is a block diagram of a functional system architecture illustrating an example auxiliary processor according to another embodiment of the present invention. As shown, the CPU 705 hosts application software 710 and a graphics driver 715. The CPU 705 is coupled to core logic 720. The core logic 720 may be or include a chipset, such as a north bridge and/or a south bridge. The core logic 720 is coupled to memory 725, to an auxiliary processor 730, and to the GPU 735. The coupling between the core logic 720 and the auxiliary processor 730 may conform to PCI or another communication protocol. The memory 725 may be system memory or local memory. The auxiliary processor 730 accelerates the GPU 735 or another special-purpose processor.

In FIGS. 1-7, depending on design choices, the CPU (605, 705) may be or include an Intel Pentium III Xeon, Intel Pentium 4, Intel Pentium M, AMD Athlon, or other CPU. The GPU (120, 210, 310, 410, 635, 735) may be or include an NVIDIA GeForce 256 GPU, NVIDIA Quadro FX 500, NVIDIA GeForce FX Go5200, NVIDIA GeForce FX Go5600, or other GPU. In applications unrelated to graphics processing, a special-purpose processor other than a GPU may be used.
Exemplary Applications

FIGS. 8 and 9 provide exemplary applications of the invention in the field of graphics processing. Other applications, unrelated to graphics processing, may also benefit from an auxiliary processor that accelerates a special-purpose processor.

FIG. 8 is a flowchart of a method of performing vertex shading according to an embodiment of the present invention. The illustrated method preprocesses a vertex buffer so that it can be rendered more quickly. As shown, vertex buffer A is generated at step 805, the vertices are filtered or shaded at step 810, and vertex buffer A is rendered at step 815. Because vertex buffer A is preprocessed at step 810, it can be rendered more quickly. Steps 810 and 815 optionally use shader programs (not shown) to perform their respective processing. Step 805 may be performed by the graphics driver 110, step 810 may be performed by the auxiliary processor (115, 205, 305, 405, 625, 730), and step 815 may be performed by the GPU (120, 210, 310, 410, 635, 735).

FIG. 9 is a flowchart of a method of performing two-pass Z-cull according to an embodiment of the present invention. In 3D imaging, the z-axis is the axis extending out of the screen toward the viewer's eye. Z-axis culling (Z-cull, a/k/a occlusion culling) is generally the process of discarding a first set of primitives when another primitive would be drawn at a position on the z-axis between the first set of primitives and the viewer's eye. In other words, Z-cull is the process of discarding primitives that are occluded in the displayed image. In operation, z-value comparisons are typically performed on objects that share the same x and y space in the same frame to determine which are visible and which should be culled.

In two-pass Z-cull, culling is performed in two passes. Thus, as shown in FIG. 9, primitives are received at step 905 and then rendered in a first Z-cull pass at step 910 to produce Z-cull information. Then, in a second Z-cull pass at step 915, the first-pass Z-cull information can be used to cull more primitives than a single-pass approach would.
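The two passes of FIG. 9 can be sketched as follows. For brevity, primitives are modeled as single (x, y, z) points with smaller z meaning closer to the viewer; real primitives are polygons covering many pixels, but the first-pass/second-pass structure is the same. This is an illustrative sketch, not the patent's implementation.

```python
# Sketch of two-pass Z-cull (FIG. 9): the first pass builds per-location
# nearest-z information; the second pass discards primitives that lie
# behind the recorded nearest z at their location.

def first_pass(primitives):
    # Step 910: record the nearest z seen at each (x, y) location.
    nearest = {}
    for x, y, z in primitives:
        if (x, y) not in nearest or z < nearest[(x, y)]:
            nearest[(x, y)] = z
    return nearest

def second_pass(primitives, nearest):
    # Step 915: keep only primitives not occluded by a nearer primitive;
    # these survive culling and are fully rendered.
    return [(x, y, z) for x, y, z in primitives if z <= nearest[(x, y)]]

prims = [(0, 0, 5.0), (0, 0, 2.0), (1, 1, 3.0)]
visible = second_pass(prims, first_pass(prims))
print(visible)  # → [(0, 0, 2.0), (1, 1, 3.0)]; (0, 0, 5.0) is occluded
```

In the division of labor described in the text, the first pass is exactly the "relatively simple task" suited to the auxiliary processor, while the GPU performs the full second-pass rendering.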
Step 905 may be performed by the graphics driver 110, step 910 may be performed by the auxiliary processor (115, 205, 305, 405, 625, 730), and step 915 may be performed by the GPU (120, 210, 310, 410, 635, 735).

In other applications, the auxiliary processor (115, 205, 305, 405, 625, 730) performs other functions. For example, in graphics applications, the auxiliary processor (115, 205, 305, 405, 625, 730) may perform the first pass of a two-pass stencil shadow volume algorithm for GPU acceleration; perform memory copies on behalf of the driver (so that the copies do not involve the CPU); further accelerate network packet processing performed by a network controller; compress input A to produce a smaller input, thereby saving bandwidth; and/or manage data placement for faster access by the special-purpose processor.

A more complete understanding of the embodiments described above may be had by reference to U.S. Patent Application Nos. 09/585,810 (filed 5/31/00), 09/885,665 (filed 6/19/01), and 10/230,124 (filed 8/27/02), which are incorporated herein by reference.

Conclusion

The embodiments of the invention described above thus overcome the disadvantages of known systems and methods by using one or more auxiliary processors to accelerate a special-purpose processor, or a driver that manages the special-purpose processor. In addition, the disclosed approaches are flexible and scalable, and can be implemented in a fault-tolerant and/or selective manner.

Those skilled in the art may modify the invention in many ways without departing from the scope of the appended claims. For example, embodiments describing the use of a single auxiliary processor may be modified to use multiple auxiliary processors. Likewise, embodiments describing the use of a GPU may be modified to use a different type of special-purpose processor, for example in applications unrelated to graphics processing.
[Brief Description of the Drawings]

FIG. 1 is a block diagram of a functional system architecture according to an embodiment of the present invention;
FIG. 2 is a block diagram of a functional system architecture according to an embodiment of the present invention;
FIG. 3 is a block diagram of a functional system architecture according to an embodiment of the present invention;
FIG. 4 is a block diagram of a functional system architecture according to an embodiment of the present invention;
FIG. 5A is a flowchart of a fault-tolerance method according to an embodiment of the present invention;
FIG. 5B is a flowchart of a method of selectively using an auxiliary processor according to an embodiment of the present invention;
FIG. 6 is a block diagram of a functional system architecture illustrating an example auxiliary processor according to an embodiment of the present invention;
FIG. 7 is a block diagram of a functional system architecture illustrating an example auxiliary processor according to another embodiment of the present invention;
FIG. 8 is a flowchart of a method of performing vertex shading according to an embodiment of the present invention;
FIG. 9 is a flowchart of a method of performing two-pass Z-cull according to an embodiment of the present invention.

[Description of Reference Numerals]

105 application software
110 graphics driver
115 auxiliary processor
120 graphics processing unit
205 auxiliary processor
210 graphics processing unit
305 auxiliary processor
310 graphics processing unit
405 auxiliary processor
410 graphics processing unit
605 central processing unit
610 application software
615 graphics driver
620 core logic
625 integrated auxiliary processor
630 memory
635 graphics processing unit
705 central processing unit
710 application software
715 graphics driver
720 core logic
725 memory
730 auxiliary processor
735 graphics processing unit