TWI715613B - Apparatus, system and method for performing GPU-CPU two-path memory copy

Info

Publication number: TWI715613B
Application number: TW105126630A
Authority: TW
Inventors: 孔令一; 沈磊; 李源源; 楊宇艇; 奎元路
Original assignee: 美商英特爾股份有限公司
Priority date: 2015-09-25
Filing date: 2016-08-19
Publication date: 2021-01-11
Also published as: TW201729111A; WO2017049583A1

Abstract

Methods and apparatus relating to GPU-CPU (Graphics Processing Unit-Central Processing Unit) two-path memory copy are described. In an embodiment, a Graphics Processing Unit (GPU) copies at least a first portion of a data block from a source to a buffer. The GPU also copies a second portion of the data block from the source to a destination. A Central Processing Unit (CPU) copies the first portion of the data block from the first buffer to one or more corresponding locations in the destination. Other embodiments are also disclosed and claimed.

Description

Device, system and method for implementing GPU-CPU dual-path memory copy

The present disclosure generally relates to the field of electronics. More particularly, an embodiment relates to GPU-CPU (Graphics Processing Unit-Central Processing Unit) dual-path memory copy.

Some processors include both a CPU and a GPU. When processing image and/or video data, the data usually needs to be copied back and forth between host memory and video memory. These copy operations are very expensive in terms of compute cycles and memory bandwidth usage.

100‧‧‧Computing system

102, 102-1 to 102-N, 702, 800, 1630‧‧‧Processor

104, 112‧‧‧Interconnect

106, 106-1 to 106-M, 707, 802A-802N‧‧‧Processor core

108, 704‧‧‧Cache

110‧‧‧Router

112‧‧‧Bus

114, 660‧‧‧Memory

116, 116-1, 1451‧‧‧L1 cache

140‧‧‧Graphics logic

150‧‧‧Video memory

202‧‧‧Storage device

400‧‧‧Method

402, 404, 406, 408, 410‧‧‧Operation

602‧‧‧System on chip (SoC)

620‧‧‧Central processing unit (CPU) core

630‧‧‧Graphics processor unit (GPU) core

640‧‧‧Input/output (I/O) interface

642, 814, 1865‧‧‧Memory controller

670‧‧‧I/O device

700‧‧‧Processing system

706‧‧‧Register file

708, 712, 808, 900, 1100, 1400, 1632, 1810‧‧‧Graphics processor

709‧‧‧Instruction set

710‧‧‧Processor bus

716‧‧‧Memory controller hub

720‧‧‧Memory device

721‧‧‧Instructions

722‧‧‧Data

724‧‧‧Data storage device

726‧‧‧Wireless transceiver

728‧‧‧Firmware interface

730‧‧‧Input/output (I/O) controller hub

734‧‧‧Network controller

740‧‧‧I/O controller

742‧‧‧Universal Serial Bus (USB) controller

744‧‧‧Keyboard and mouse

746‧‧‧Audio controller

804A-804N‧‧‧Cache unit

806‧‧‧Shared cache unit

810‧‧‧System agent core

811, 902, 1443‧‧‧Display controller

812‧‧‧Ring interconnect unit

813‧‧‧I/O link

816‧‧‧Bus controller unit

818‧‧‧Embedded memory module

904‧‧‧Block image transfer (BLIT) engine

906‧‧‧Video codec engine

910, 1010‧‧‧Graphics processing engine

912, 1012, 1522‧‧‧3D pipeline

914‧‧‧Memory interface

915‧‧‧3D/media subsystem

916, 1016, 1430, 1524‧‧‧Media pipeline

920, 1845‧‧‧Display device

1003, 1103, 1403‧‧‧Command streamer

1014‧‧‧Execution unit array

1030‧‧‧Sampling engine

1032‧‧‧De-noise/de-interlace module

1034‧‧‧Motion estimation module

1036‧‧‧Image scaling and filtering module

1044, 1214, 1456‧‧‧Data port

1102, 1402‧‧‧Ring interconnect

1104‧‧‧Pipeline front end

1130‧‧‧Video quality engine (VQE)

1133‧‧‧Multi-format encode/decode (MFX) engine

1134, 1434‧‧‧Video front end

1136‧‧‧Geometry pipeline

1137, 1437‧‧‧Media engine

1150A-1150N, 1160A-1160N‧‧‧Sub-core

1152A-1152N‧‧‧First set of execution units

1154A-1154N‧‧‧Media/texture sampler

1162A-1162N‧‧‧Second set of execution units

1164A-1164N, 1210‧‧‧Sampler

1170A-1170N‧‧‧Set of shared resources

1180A-1180N‧‧‧Graphics core

1200, 1450‧‧‧Thread execution logic

1202‧‧‧Pixel shader

1204, 1431‧‧‧Thread dispatcher

1206‧‧‧Instruction cache

1208A-1208N, 1452A, 1452B‧‧‧Execution unit

1212‧‧‧Data cache

1300‧‧‧Graphics processor instruction format

1310‧‧‧128-bit format

1312‧‧‧Instruction opcode

1313‧‧‧Index field

1314‧‧‧Instruction control field

1316‧‧‧Size field

1318‧‧‧Destination

1320‧‧‧Source operand SRC0

1322‧‧‧Source operand SRC1

1324‧‧‧Source operand SRC2

1326‧‧‧Access/address mode information

1330‧‧‧64-bit compact instruction format

1340‧‧‧Opcode decode

1342‧‧‧Move and logic opcode group

1344‧‧‧Flow control instruction group

1346‧‧‧Miscellaneous instruction group

1348‧‧‧Parallel math instruction group

1350‧‧‧Vector math group

1405‧‧‧Vertex fetcher

1407‧‧‧Vertex shader

1411‧‧‧Programmable hull shader

1413‧‧‧Tessellator

1417‧‧‧Programmable domain shader

1419‧‧‧Geometry shader

1420‧‧‧Graphics pipeline

1423‧‧‧Stream-out unit

1429‧‧‧Clipper

1440‧‧‧Display engine

1441‧‧‧2D engine

1454‧‧‧Texture and media sampler

1458‧‧‧Texture/sampler cache

1470‧‧‧Render output pipeline

1473‧‧‧Rasterizer and depth test component

1475‧‧‧Shared L3 cache

1477‧‧‧Pixel operations component

1478‧‧‧Render cache

1479‧‧‧Depth cache

1500‧‧‧Graphics processor command format

1502‧‧‧Target client

1504‧‧‧Command operation code (opcode)

1505‧‧‧Sub-opcode

1506‧‧‧Data field

1508‧‧‧Command size

1510‧‧‧Graphics processor command sequence

1512‧‧‧Pipeline flush command

1513‧‧‧Pipeline select command

1514‧‧‧Pipeline control command

1516‧‧‧Return buffer state

1520‧‧‧Pipeline determination

1530‧‧‧3D pipeline state

1532‧‧‧3D primitive

1534‧‧‧Execute

1540‧‧‧Media pipeline state

1542‧‧‧Media object command

1544‧‧‧Execute command

1600‧‧‧Data processing system

1610‧‧‧3D graphics application

1612‧‧‧Shader instructions

1614‧‧‧Executable instructions

1616‧‧‧Graphics objects

1620‧‧‧Operating system

1622‧‧‧Graphics application programming interface (API)

1624‧‧‧Front-end shader compiler

1626‧‧‧User mode graphics driver

1627‧‧‧Back-end shader compiler

1628‧‧‧Operating system kernel mode functions

1629‧‧‧Kernel mode graphics driver

1634‧‧‧General-purpose processor core

1650‧‧‧System memory

1700‧‧‧IP core development system

1710‧‧‧Software simulation

1715‧‧‧Register transfer level (RTL) design

1720‧‧‧Hardware model

1730‧‧‧Design facility

1740‧‧‧Non-volatile memory

1750‧‧‧Wired connection

1760‧‧‧Wireless connection

1765‧‧‧Fabrication facility

1800‧‧‧System-on-chip integrated circuit

1805‧‧‧Application processor

1815‧‧‧Image processor

1820‧‧‧Video processor

1825‧‧‧USB controller

1830‧‧‧UART controller

1835‧‧‧SPI/SDIO controller

1840‧‧‧I2S/I2C controller

1850‧‧‧High-definition multimedia interface (HDMI) controller

1855‧‧‧Mobile Industry Processor Interface (MIPI) display interface

1860‧‧‧Flash memory subsystem

1870‧‧‧Embedded security engine

The detailed description is provided with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.

FIGS. 1, 6-7, 16, and 18 illustrate block diagrams of embodiments of computing systems, which may be used to implement the various embodiments discussed herein.

FIG. 2 illustrates the data flow associated with a pure CPU copy versus a hybrid copy, according to an embodiment.

FIG. 3 illustrates a block diagram of an implementation of a dual-path memory copy operation, according to an embodiment.

FIG. 4 illustrates a flow diagram of a method for performing a dual-path memory copy operation, according to an embodiment.

FIG. 5 illustrates a sample graph of the throughput performance achievable with dual-path memory copy, according to an embodiment.

FIGS. 8-12 and 14 illustrate various components of processors in accordance with some embodiments.

FIG. 13 illustrates graphics core instruction formats, according to some embodiments.

FIGS. 15A and 15B illustrate a graphics processor command format and command sequence, respectively, according to some embodiments.

FIG. 17 illustrates a diagram of IP core development, according to an embodiment.

[Content and Implementation of the Invention]

In the following description, numerous specific details are set forth in order to provide a thorough understanding of various embodiments. However, various embodiments may be practiced without the specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to obscure the particular embodiments. Further, various aspects of embodiments may be performed using various means, such as integrated semiconductor circuits ("hardware"), computer-readable instructions organized into one or more programs ("software"), or some combination of hardware and software. For the purposes of this disclosure, reference to "logic" shall mean hardware, software, firmware, or some combination thereof.

Generally, image and/or video frame data needs to be copied from system or host memory (which may also be referred to as CPU memory) to video memory (which may also be referred to as GPU memory) so that the GPU can access it. Once processing is finished, the data is copied back to system/host memory for the next processing operation or, for example, for display on a screen. Copy-in (from memory associated with the CPU to memory associated with the GPU) and copy-out (from memory associated with the GPU to memory associated with the CPU) are expensive in terms of compute cycles and/or memory bandwidth usage. Moreover, a conversion of the frame layout is sometimes necessary as part of the copy operation, such as tiled to linear or linear to tiled. The linear format is generally suited to the one-dimensional, row-sequential access pattern of system memory, in which each row of operands is stored in sequentially increasing memory locations. The tiled format divides the bounded area of an image/video frame into an array of smaller rectangular regions, which improves access performance for two-dimensional (2D) sub-regions in video memory. As a result, the cost of memory copy or transfer is one of the most common performance bottlenecks in image and/or video processing applications.
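The difference between the two layouts can be sketched with a pair of address functions. This is a simplified illustration only: the 16-byte by 4-row tile size, the 64-byte cache line, and the function names are assumptions chosen for the example, and the sketch does not reproduce any particular hardware tiling format.

```python
CACHE_LINE = 64  # bytes; assumed cache line size for this illustration


def linear_offset(x, y, pitch):
    """Byte offset of (x, y) in a linear surface: rows are stored back to back."""
    return y * pitch + x


def tiled_offset(x, y, pitch, tile_w=16, tile_h=4):
    """Byte offset of (x, y) when the surface is stored one rectangular tile at a time.

    tile_w and tile_h are illustrative values, not taken from the source.
    pitch must be a multiple of tile_w.
    """
    tiles_per_row = pitch // tile_w
    tile_index = (y // tile_h) * tiles_per_row + (x // tile_w)
    within_tile = (y % tile_h) * tile_w + (x % tile_w)
    return tile_index * (tile_w * tile_h) + within_tile


def cache_lines_touched(offsets):
    """Set of cache line indices covered by a list of byte offsets."""
    return {off // CACHE_LINE for off in offsets}


# Four vertically adjacent bytes (a small 2D sub-region) on a 64-byte-pitch surface.
column_linear = [linear_offset(0, y, 64) for y in range(4)]  # [0, 64, 128, 192]
column_tiled = [tiled_offset(0, y, 64) for y in range(4)]    # [0, 16, 32, 48]
```

In this sketch the vertical run touches four cache lines in the linear layout but only one in the tiled layout, which is the 2D locality benefit the paragraph above describes.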

Furthermore, there are generally three possible solutions for exchanging data between GPU memory and CPU host memory in a processor in which the CPU and GPU are integrated on the same integrated circuit device: (1) zero copy is the fastest solution but has a relatively large number of restrictions; (2) pure GPU copy is slower than zero copy, but is still the fastest of the solutions with fewer restrictions; and (3) pure CPU copy (e.g., using SSE/AVX, i.e., Streaming SIMD (Single Instruction, Multiple Data) Extensions/Advanced Vector Extensions) is the slowest solution and has no restrictions.

To this end, some embodiments provide a dual-path memory copy technique for GPU and CPU memory copy operations that leverages both devices to transfer memory data. In an embodiment, most of the data is copied by the GPU and the remaining portion is copied by the CPU. This hybrid approach removes the barriers of pure GPU copy with only a minimal reduction in performance. Furthermore, the disclosed hybrid approach is faster than pure CPU copy operations (e.g., using SSE/AVX).

Moreover, some embodiments may be applied in computing systems that include one or more processors (e.g., with one or more processor cores), such as those discussed with reference to FIGS. 1-9, including, for example, mobile computing devices such as a smartphone, tablet, UMPC (Ultra-Mobile Personal Computer), laptop computer, Ultrabook™ computing device, smart watch, smart glasses, etc. More particularly, FIG. 1 illustrates a block diagram of a computing system 100, according to an embodiment. The system 100 may include one or more processors 102-1 through 102-N (collectively referred to herein as "processors 102" or "processor 102"). In various embodiments, the processors 102 may include general-purpose CPUs and/or GPUs. The processors 102 may communicate via an interconnect or bus 104. Each processor may include various components, some of which are discussed only with reference to processor 102-1 for clarity. Accordingly, each of the remaining processors 102-2 through 102-N may include the same or similar components discussed with reference to processor 102-1.

In an embodiment, the processor 102-1 may include one or more processor cores 106-1 through 106-M (collectively referred to herein as "cores 106" or "core 106"), a cache 108, and/or a router 110. The processor cores 106 may be implemented on a single integrated circuit (IC) chip. Moreover, the chip may include one or more shared and/or private caches (such as cache 108), buses or interconnects (such as bus or interconnect 112), graphics and/or memory controllers (such as those discussed with reference to FIGS. 6-9), or other components.

In one embodiment, the router 110 may be used to communicate between various components of the processor 102-1 and/or the system 100. Moreover, the processor 102-1 may include more than one router 110. Furthermore, the multiple routers 110 may be in communication to enable data routing between various components inside or outside of the processor 102-1.

The cache 108 may store data (e.g., including instructions) that is used by one or more components of the processor 102-1, such as the cores 106. For example, the cache 108 may locally cache data stored in a memory 114 for faster access by the components of the processor 102 (e.g., faster access by the cores 106). As shown in FIG. 1, the memory 114 may communicate with the processors 102 via the interconnect 104. In an embodiment, the cache 108 (which may be shared) may be a mid-level cache (MLC), a last-level cache (LLC), etc. Also, each of the cores 106 may include a level 1 (L1) cache (116-1) (collectively referred to herein as "L1 cache 116") or other levels of cache, such as a level 2 (L2) cache. Moreover, various components of the processor 102-1 may communicate with the cache 108 directly, through a bus (e.g., the bus 112), and/or through a memory controller or hub.

As shown in FIG. 1, the processor 102 may further include graphics logic 140 (e.g., which may include one or more graphics processing unit (GPU) cores, such as those discussed with reference to FIGS. 6-9) to perform various graphics and/or general-purpose computation related operations, such as those discussed herein. The logic 140 may have access to one or more storage devices discussed herein (such as video (or image, graphics, etc.) memory 150, cache 108, L1 cache 116, memory 114, register(s), or another memory in the system 100) to store information relating to operations of the logic 140, such as information communicated with various components of the system 100, as discussed herein. Also, while the logic 140 and the video memory 150 are shown inside the processor 102 (or coupled to the interconnect 104), they may be located elsewhere in the system 100 in various embodiments. For example, the logic 140 may replace one of the cores 106, may be coupled directly to the interconnect 112, etc. Likewise, the video memory 150 may be coupled directly to the interconnect 112, etc.

FIG. 2 illustrates the data flow associated with a pure CPU copy versus a hybrid copy, according to an embodiment. Referring to FIGS. 1-2, for a pure CPU copy operation (shown at the top of FIG. 2), the CPU locks the data (e.g., to avoid problems caused by multiple accesses to the same data) and then copies the locked data from the video memory 150 to the system memory 114. In an embodiment, for a hybrid copy operation (shown at the bottom of FIG. 2), the GPU 140 copies at least a portion of the data from the video memory 150 to an auxiliary/secondary buffer/storage device 202. As will be further discussed with reference to FIG. 3, the CPU then copies the data from the storage device 202 to the system memory 114.

Moreover, as mentioned above, there are three possible solutions for transferring data from the GPU to the CPU. Zero copy is the fastest approach but has a relatively large number of restrictions, such as: (a) the memory needs to be allocated in the system memory space and mapped into the video memory address space; (b) the system memory storage layout needs to be linear (which favors CPU access but not GPU access; for faster GPU access, video memory generally uses the Y-tiled 2D surface storage format); and (c) to be mapped into the GPU address space, the system memory needs to be page aligned (e.g., 4 KB), which can be a hard restriction in many existing software products. Apart from the zero copy solution, pure GPU copy is the fastest copy solution, but it still has several restrictions (such as requiring the data to be aligned in a certain way), which may prevent wide adoption by many developers. CPU copy has no restrictions, but it is the slowest copy solution, e.g., because of the line interleaving in the Y-tiled surface storage format. Because four cache line reads from video memory are needed to build one fully converted 64-byte cache line in system memory, the memory read bandwidth efficiency is only 25%.
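The 25% figure follows directly from the numbers given above: four 64-byte cache lines are read to produce 64 useful bytes. A minimal sketch of that arithmetic (the function name and constant are illustrative, not from the source):

```python
CACHE_LINE_BYTES = 64  # cache line size stated in the text


def read_efficiency(lines_read, useful_bytes=CACHE_LINE_BYTES):
    """Fraction of the bytes fetched from memory that end up in the output line."""
    return useful_bytes / (lines_read * CACHE_LINE_BYTES)


# Pure CPU copy from a Y-tiled surface: 4 lines read per converted 64-byte line.
cpu_copy_efficiency = read_efficiency(4)  # 0.25
```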

FIG. 3 illustrates a block diagram of an MDF implementation of a dual-path memory copy operation, according to an embodiment. As discussed herein, MDF (Media Development Framework) generally refers to a high-level programming framework that exposes the general-purpose compute/processing capabilities of the GPU, and it can significantly improve the performance of highly parallel or compute-intensive workloads. Typically, an MDF application has two components: a kernel and a host program. The host program creates and launches the kernel. The kernel then executes on the GPU hardware.

Consider transferring data from the video memory 150 to the system memory 114 with dual-path copy when the destination is system memory that is not 16-byte aligned. Because of the unaligned destination memory address, the GPU 140 cannot directly copy the first and last several bytes to the destination in the system memory 114. The reason is that the OWord (oct-word) block write command is used to write data to the destination, and it requires the system memory address and the surface width to be at least 16-byte aligned. To address this restriction, an embodiment introduces (e.g., pre-allocates) an auxiliary buffer 202 as a temporary buffer, which may be a user-provided memory buffer in page-aligned system memory. As discussed herein, the auxiliary buffer 202 may be a zero-copy buffer, also referred to as BufferUP.
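The alignment conditions described above can be expressed as simple predicates. This is a hedged sketch: the function names are hypothetical and are not part of MDF or any driver API; only the 16-byte and page-alignment requirements come from the text, and the 4 KB page size is the example value given earlier.

```python
OWORD_BYTES = 16    # OWord block writes need 16-byte alignment (per the text)
PAGE_BYTES = 4096   # example page size from the text (4 KB)


def is_aligned(value, boundary):
    """True if value is a multiple of boundary."""
    return value % boundary == 0


def gpu_can_write_directly(dst_addr, surface_width_bytes):
    """Hypothetical check: can a pure GPU copy target this destination as is?"""
    return is_aligned(dst_addr, OWORD_BYTES) and is_aligned(surface_width_bytes, OWORD_BYTES)


def aux_buffer_is_usable(aux_addr):
    """Hypothetical check: the auxiliary (zero-copy) buffer must be page aligned."""
    return is_aligned(aux_addr, PAGE_BYTES)
```

For a destination address of 40, `gpu_can_write_directly` returns `False`, which is the case the auxiliary buffer is meant to handle.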

Referring to FIGS. 1-3, an embodiment uses a solution with the following two steps. First (labeled as circle 1 in FIG. 3), the GPU part, i.e., the MDF copy kernel: (a) for the interior of the image, the GPU 140 copies full cache lines (in one embodiment, the cache line width is 64 bytes) from the video memory 150 to the (e.g., 16-byte) aligned system memory at the destination (system memory 114). These full, 64-byte-wide cache lines represent pixels in the image. For this part of the GPU copy, an embodiment uses media block read and OWord block write commands, with a transpose to map the tiled layout in video memory to the linear layout in system memory; and (b) the GPU 140 copies the partial cache lines at the start and end of each row into the auxiliary buffer 202, also with media block read and OWord block write commands. This part of the GPU copy performs the same function as described in (a) above, because the same read/write commands are issued to perform the data transfer. Second (labeled as circle 2 in FIG. 3), the CPU part: (a) after the GPU finishes the above execution, the CPU can access the data in the auxiliary buffer 202 because it is zero copy; and (b) as shown in FIG. 3, the CPU copies this data to the corresponding places (i.e., the start and end) in the destination system memory 114.

Furthermore, the dual-path solution shown in FIG. 3 uses the auxiliary buffer 202 (e.g., a user-provided memory buffer) to copy the unaligned data, and provides higher performance than the CPU copy directly from video memory to system memory shown at the top of FIG. 2. The reason is that the dual-path copy operation uses the same fast GPU copy mechanism as pure GPU copy. Because the buffer 202 leverages zero copy, the data is already in system memory after the GPU copy. The cost of copying these few data items (e.g., at the start and end of the data block) within system memory is relatively low. By contrast, as discussed with reference to FIG. 2, pure CPU copy requires a lock operation on the surface to avoid resource conflicts, which is more expensive. See the data transfer pipeline in FIG. 1.

Referring to FIG. 3, assume the start address of the destination is at byte 40 (which is not 16-byte aligned); the dual-path copy solution is used to handle this misalignment. In an embodiment, the kernel processes an 8×128-byte block in each thread, as will be further discussed with reference to FIG. 4.
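The 40-byte example can be worked through numerically. The sketch below splits one destination row into a head (up to the next 64-byte boundary), a body of full cache lines, and a tail. The function name and the use of a single 128-byte row are illustrative assumptions; the 64-byte cache line and the start address of 40 come from the text.

```python
CACHE_LINE = 64  # bytes, per the embodiment described in the text


def split_row(dst_addr, length):
    """Return (head, body, tail) byte counts for a row written at dst_addr.

    head: bytes before the first 64-byte boundary (staged in the auxiliary buffer)
    body: whole cache lines (written directly to the destination by the GPU)
    tail: leftover bytes after the last full line (staged in the auxiliary buffer)
    """
    head = min((-dst_addr) % CACHE_LINE, length)
    body = ((length - head) // CACHE_LINE) * CACHE_LINE
    tail = length - head - body
    return head, body, tail


# Destination starts at byte 40, row is 128 bytes wide.
head, body, tail = split_row(40, 128)  # (24, 64, 40)
```

With these inputs the head is 64 - 40 = 24 bytes, the aligned body starts at address 64, and the final 40 bytes of the row form the tail.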

More particularly, FIG. 4 illustrates a flow diagram of a method 400 to perform a dual-path memory copy operation, according to an embodiment. In one embodiment, method 400 shows operations performed by a software thread that uses both the CPU and the GPU to copy a block of data from a source (e.g., the video memory 150) to a destination (e.g., the system memory 114). In an embodiment, various components discussed with reference to the other figures may be used to perform one or more operations of method 400.

Referring to FIGS. 1-4, at operation 402, a beginning portion of the data block (e.g., as provided in a row of memory or cache lines, such as bytes 0 to 24 of the video memory 150 shown in FIG. 3) is copied (e.g., by the graphics logic/GPU 140) from the source (e.g., the video memory 150) to an auxiliary buffer (e.g., the buffer 202). The beginning portion of the data block for operation 402 may be calculated (e.g., as the offset to 64-byte alignment) from the start address in system memory (e.g., 64 - 40, or 24, in the example of FIG. 3).

At operation 404, the remaining portion of the data block (i.e., what follows the beginning portion of operation 402, which may be the remaining full cache lines of 64 bytes each) is copied (e.g., by the graphics logic/GPU 140) from the source to the (e.g., aligned system memory) destination (e.g., the system memory 114). The start address of the aligned system memory for operation 404 may be calculated by rounding up the start address (e.g., based on 64-byte alignment, so that 40 is rounded up to 64 in the example of FIG. 3). Operation 404 is repeated until the end of the data block is reached, as determined at operation 406 (e.g., on reaching the last 40 bytes at the end of the row in the example of FIG. 3).

At operation 408, the last portion of the data block (e.g., the last 40 bytes of the row in the example of FIG. 3) is copied (e.g., by the graphics logic/GPU 140) from the source to the auxiliary buffer. At operation 410, the data stored in the auxiliary buffer is copied (e.g., by the CPU or one of the cores 106 of FIG. 1) to the correct places in the destination (e.g., the start and end as shown in FIG. 3). Method 400 may then be repeated for another copy thread/operation.
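Operations 402 through 410 can be simulated on plain byte buffers to show how the pieces fit together. This is a host-only sketch under stated assumptions: it models both "GPU" and "CPU" steps as ordinary Python slice copies, ignores tiling and the transpose, and uses a hypothetical function name; it is not the MDF kernel described in the text.

```python
CACHE_LINE = 64  # bytes, per the embodiment described in the text


def dual_path_copy(src, dst, dst_off):
    """Copy bytes `src` into bytearray `dst` at offset dst_off in two paths.

    Returns the (head, body, tail) sizes that were used.
    """
    n = len(src)
    assert dst_off + n <= len(dst), "destination too small"

    head = min((-dst_off) % CACHE_LINE, n)            # bytes before first boundary
    body = ((n - head) // CACHE_LINE) * CACHE_LINE    # whole cache lines
    tail = n - head - body
    aux = bytearray(head + tail)                      # stands in for buffer 202

    # Operation 402 ("GPU"): beginning portion -> auxiliary buffer.
    aux[:head] = src[:head]

    # Operations 404/406 ("GPU"): full cache lines -> aligned destination,
    # repeated until the end of the aligned region is reached.
    for off in range(head, head + body, CACHE_LINE):
        dst[dst_off + off:dst_off + off + CACHE_LINE] = src[off:off + CACHE_LINE]

    # Operation 408 ("GPU"): last portion -> auxiliary buffer.
    aux[head:] = src[head + body:]

    # Operation 410 ("CPU"): auxiliary buffer -> start and end of destination.
    dst[dst_off:dst_off + head] = aux[:head]
    dst[dst_off + head + body:dst_off + n] = aux[head:]
    return head, body, tail


source_row = bytes(range(128))
destination = bytearray(256)
sizes = dual_path_copy(source_row, destination, 40)  # (24, 64, 40)
```

After the call, bytes 40 through 167 of `destination` hold the source row and the surrounding bytes are untouched, which mirrors the "start and end" fix-up performed by the CPU in FIG. 3.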

FIG. 5 illustrates a sample graph of the throughput performance achievable with dual-path memory copy, according to an embodiment. As shown, dual-path copy provides better performance than pure CPU copy (e.g., using SSE4) and is only slightly slower than pure GPU copy. Hence, some embodiments using dual-path memory copy achieve the expected functionality and good performance for both single-planar and multi-planar surfaces. To handle planar surfaces, the above case can be extended to cover both the UV plane and the Y plane. Also, as shown in FIG. 5, the dual-path solution is about 1.68 times faster than pure CPU copy (using SSE4), based on the depicted sample test results, and it has no alignment restriction. Furthermore, in the above test, the destination system memory address for dual-path copy is 1-byte aligned, whereas pure GPU copy requires the destination system memory to be 16-byte aligned.

As discussed herein, YUV (also known as YCbCr) is one of the two primary color spaces used to represent digital component video (the other being RGB). The difference between YCbCr and RGB is that YCbCr represents color as luminance plus two color-difference signals, whereas RGB represents color as red, green, and blue. In YCbCr, Y is the luminance, Cb is blue minus luminance (B-Y), and Cr is red minus luminance (R-Y). One YUV format places Y in one plane and UV in another plane.
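
As a small worked example of the color-difference relationship above, the following sketch computes Y, B-Y, and R-Y for one pixel. The luminance weights are an assumption (the common ITU-R BT.601 coefficients), since the text does not specify them, and the function name is hypothetical. Practical YCbCr encodings additionally scale and offset the differences, which is omitted here.

```python
def luma_and_color_differences(r, g, b):
    """Return (Y, B-Y, R-Y) for one RGB pixel.

    The weights below are the BT.601 luma coefficients, assumed here
    for illustration; the differences are left unscaled.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y, b - y, r - y
```

A neutral gray or white pixel has R = G = B, so both differences are zero and all of the information is carried in the Y plane.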

As discussed above, some embodiments use both the CPU and the GPU to perform a memory copy operation. Most of the data (e.g., the aligned portion) is copied using a GPU copy kernel, and the remaining data (e.g., the unaligned portion) is handed to a CPU copy operation. More specifically, one embodiment is implemented on the MDF GPU programming framework. Embodiments are not limited to the MDF framework, however; any runtime SDK (Software Development Kit) that exposes an API (Application Programming Interface) and uses both the CPU and the GPU to perform memory data transfers may be used to implement various embodiments.

Accordingly, the GPU/CPU dual-path copy embodiments discussed herein leverage both the CPU (for its lack of restrictions) and the GPU (for its high performance). The solution retains the high performance of pure GPU copy (with very little performance loss) and removes the obstacles associated with pure GPU copy operations. As a result, it also removes the obstacles that keep users from using GPU copy operations to accelerate memory transfers in their software products.

In some embodiments, one or more of the components discussed herein may be embodied as a System On Chip (SOC) device. FIG. 6 depicts a block diagram of an SOC package, according to an embodiment. As depicted in FIG. 6, the SOC package 602 includes one or more Central Processing Unit (CPU) cores 620 (which may be the same as or similar to the cores 106 of FIG. 1), one or more Graphics Processor Unit (GPU) cores 630 (which may be the same as or similar to the graphics logic 140 of FIG. 1), an Input/Output (I/O) interface 640, and a memory controller 642. Various components of the SOC package 602 may be coupled to an interconnect or bus, such as those discussed herein with reference to the other figures. Also, the SOC package 602 may include more or fewer components, such as those discussed herein with reference to the other figures. Further, each component of the SOC package 602 may include one or more other components, e.g., as discussed herein with reference to the other figures. In one embodiment, the SOC package 602 (and its components) is provided on one or more Integrated Circuit (IC) dies, e.g., packaged into a single semiconductor device.

As depicted in FIG. 6, the SOC package 602 is coupled via the memory controller 642 to a memory 660 (which may be similar to or the same as memory discussed herein with reference to the other figures, such as the system memory 114 of FIG. 1). In an embodiment, the memory 660 (or a portion of it) may be integrated on the SOC package 602.

The I/O interface 640 may be coupled to one or more I/O devices 670, e.g., via an interconnect and/or bus such as those discussed herein with reference to the other figures. The I/O devices 670 may include one or more of a keyboard, a mouse, a touchpad, a display, an image/video capture device (such as a camera or camcorder/video recorder), a touch screen, a speaker, or the like. Furthermore, in an embodiment, the SOC package 602 may include/integrate the logic 140 and/or the video memory 150 (or a portion of the video memory 150). Alternatively, the logic 140 and/or the video memory 150 (or a portion of the video memory 150) may be provided outside the SOC package 602 (i.e., as discrete logic).

FIG. 7 is a block diagram of a processing system 700, according to an embodiment. In various embodiments, the system 700 includes one or more processors 702 and one or more graphics processors 708 (such as the graphics logic 140 of FIG. 1), and may be a single-processor desktop system, a multiprocessor workstation system, or a server system having a large number of processors 702 (such as the processor 102 of FIG. 1) or processor cores 707 (such as the cores 106 of FIG. 1). In one embodiment, the system 700 is a processing platform incorporated within a system-on-a-chip (SoC) integrated circuit for use in mobile, handheld, or embedded devices.

Embodiments of the system 700 may include, or be incorporated within, a server-based gaming platform or a game console, including a game and media console, a mobile gaming console, a handheld game console, or an online game console. In some embodiments, the system 700 is a mobile phone, a smartphone, a tablet computing device, or a mobile Internet device. The data processing system 700 may also include, couple with, or be integrated within a wearable device, such as a smart watch wearable device, a smart eyewear device, an augmented reality device, or a virtual reality device. In some embodiments, the data processing system 700 is a television or set-top box device having one or more processors 702 and a graphical interface generated by one or more graphics processors 708.

In some embodiments, the one or more processors 702 each include one or more processor cores 707 to process instructions which, when executed, perform operations for system and user software. In some embodiments, each of the one or more processor cores 707 is configured to process a specific instruction set 709. In some embodiments, the instruction set 709 may facilitate Complex Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), or computing via a Very Long Instruction Word (VLIW). Multiple processor cores 707 may each process a different instruction set 709, which may include instructions to facilitate the emulation of other instruction sets. A processor core 707 may also include other processing devices, such as a Digital Signal Processor (DSP).

In some embodiments, the processor 702 includes a cache memory 704. Depending on the architecture, the processor 702 may have a single internal cache or multiple levels of internal cache. In some embodiments, the cache memory is shared among various components of the processor 702. In some embodiments, the processor 702 also uses an external cache (e.g., a Level 3 (L3) cache or Last Level Cache (LLC)) (not shown), which may be shared among the processor cores 707 using known cache coherency techniques. A register file 706 is additionally included in the processor 702 and may include different types of registers for storing different types of data (e.g., integer registers, floating-point registers, status registers, and an instruction pointer register). Some registers may be general-purpose registers, while other registers may be specific to the design of the processor 702.

In some embodiments, the processor 702 is coupled to a processor bus 710 to transmit communication signals, such as address, data, or control signals, between the processor 702 and other components in the system 700. In one embodiment, the system 700 uses an exemplary "hub" system architecture, including a memory controller hub 716 and an Input/Output (I/O) controller hub 730. The memory controller hub 716 facilitates communication between a memory device and other components of the system 700, while the I/O controller hub (ICH) 730 provides connections to I/O devices via a local I/O bus. In one embodiment, the logic of the memory controller hub 716 is integrated within the processor.

The memory device 720 may be a Dynamic Random Access Memory (DRAM) device, a Static Random Access Memory (SRAM) device, a flash memory device, a phase-change memory device, or some other memory device having suitable performance to serve as process memory. In one embodiment, the memory device 720 may operate as system memory for the system 700, to store data 722 and instructions 721 for use when the one or more processors 702 execute an application or process. The memory controller hub 716 also couples with an optional external graphics processor 712, which may communicate with the one or more graphics processors 708 in the processors 702 to perform graphics and media operations.

In some embodiments, the ICH 730 enables peripherals to connect to the memory device 720 and the processor 702 via a high-speed I/O bus. The I/O peripherals include, but are not limited to, an audio controller 746, a firmware interface 728, a wireless transceiver 726 (e.g., Wi-Fi, Bluetooth), a data storage device 724 (e.g., hard disk drive, flash memory, etc.), and a legacy I/O controller 740 for coupling legacy (e.g., Personal System 2 (PS/2)) devices to the system. One or more Universal Serial Bus (USB) controllers 742 connect input devices, such as keyboard and mouse 744 combinations. A network controller 734 may also couple to the ICH 730. In some embodiments, a high-performance network controller (not shown) couples to the processor bus 710. It will be appreciated that the system 700 shown is exemplary and not limiting, as other types of data processing systems that are differently configured may also be used. For example, the I/O controller hub 730 may be integrated within the one or more processors 702, or the memory controller hub 716 and the I/O controller hub 730 may be integrated into a discrete external graphics processor, such as the external graphics processor 712.

FIG. 8 is a block diagram of an embodiment of a processor 800 having one or more processor cores 802A-802N, an integrated memory controller 814, and an integrated graphics processor 808. The processor 800 may be similar to or the same as the processor 102 discussed with reference to FIG. 1. Those elements of FIG. 8 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such. The processor 800 may include additional cores up to and including the additional core 802N represented by the dashed-line boxes. Each of the processor cores 802A-802N includes one or more internal cache units 804A-804N. In some embodiments, each processor core also has access to one or more shared cache units 806.

The internal cache units 804A-804N and the shared cache units 806 represent a cache memory hierarchy within the processor 800. The cache memory hierarchy may include at least one level of instruction and data cache within each processor core and one or more levels of shared mid-level cache, such as a Level 2 (L2), Level 3 (L3), Level 4 (L4), or other level of cache, where the highest level of cache before external memory is classified as the LLC. In some embodiments, cache coherency logic maintains coherency between the various cache units 806 and 804A-804N.

In some embodiments, the processor 800 may also include a set of one or more bus controller units 816 and a system agent core 810. The one or more bus controller units 816 manage a set of peripheral buses, such as one or more Peripheral Component Interconnect buses (e.g., PCI, PCI Express). The system agent core 810 provides management functionality for the various processor components. In some embodiments, the system agent core 810 includes one or more integrated memory controllers 814 to manage access to various external memory devices (not shown).

In some embodiments, one or more of the processor cores 802A-802N include support for simultaneous multi-threading. In such embodiments, the system agent core 810 includes components for coordinating and operating the cores 802A-802N during multi-threaded processing. The system agent core 810 may additionally include a power control unit (PCU), which includes logic and components to regulate the power state of the processor cores 802A-802N and the graphics processor 808.

In some embodiments, the processor 800 additionally includes a graphics processor 808 to execute graphics processing operations. In some embodiments, the graphics processor 808 couples with the set of shared cache units 806 and the system agent core 810, including the one or more integrated memory controllers 814. In some embodiments, a display controller 811 is coupled with the graphics processor 808 to drive graphics processor output to one or more coupled displays. In some embodiments, the display controller 811 may be a separate module coupled with the graphics processor via at least one interconnect, or may be integrated within the graphics processor 808 or the system agent core 810.

In some embodiments, a ring-based interconnect unit 812 is used to couple the internal components of the processor 800. However, an alternative interconnect unit may be used, such as a point-to-point interconnect, a switched interconnect, or other techniques, including techniques well known in the art. In some embodiments, the graphics processor 808 couples with the ring interconnect 812 via an I/O link 813.

The exemplary I/O link 813 represents at least one of multiple varieties of I/O interconnects, including an on-package I/O interconnect that facilitates communication between various processor components and a high-performance embedded memory module 818, such as an eDRAM (embedded DRAM) module. In some embodiments, each of the processor cores 802A-802N and the graphics processor 808 use the embedded memory module 818 as a shared Last Level Cache.

In some embodiments, the processor cores 802A-802N are homogeneous cores executing the same instruction set architecture. In another embodiment, the processor cores 802A-802N are heterogeneous in terms of instruction set architecture (ISA), where one or more of the processor cores 802A-802N execute a first instruction set, while at least one of the other cores executes a subset of the first instruction set or a different instruction set. In one embodiment, the processor cores 802A-802N are heterogeneous in terms of microarchitecture, where one or more cores having a relatively higher power consumption couple with one or more power cores having a lower power consumption. Additionally, the processor 800 may be implemented on one or more chips or as an SoC integrated circuit having the illustrated components, in addition to other components.

FIG. 9 is a block diagram of a graphics processor 900, which may be a discrete graphics processing unit or may be a graphics processor integrated with a plurality of processing cores. The graphics processor 900 may be similar to or the same as the graphics logic 140 discussed with reference to FIG. 1. In some embodiments, the graphics processor communicates via a memory-mapped I/O interface to registers on the graphics processor and with commands placed into the processor memory. In some embodiments, the graphics processor 900 includes a memory interface 914 to access memory. The memory interface 914 may be an interface to local memory, one or more internal caches, one or more shared external caches, and/or system memory.

In some embodiments, the graphics processor 900 also includes a display controller 902 to drive display output data to a display device 920. The display controller 902 includes hardware for one or more overlay planes for the display and composition of multiple layers of video or user interface elements. In some embodiments, the graphics processor 900 includes a video codec engine 906 to encode, decode, or transcode media to, from, or between one or more media encoding formats, including, but not limited to, Moving Picture Experts Group (MPEG) formats such as MPEG-2, Advanced Video Coding (AVC) formats such as H.264/MPEG-4 AVC, as well as the Society of Motion Picture & Television Engineers (SMPTE) 421M/VC-1, and Joint Photographic Experts Group (JPEG) formats such as JPEG and Motion JPEG (MJPEG) formats.

In some embodiments, the graphics processor 900 includes a block image transfer (BLIT) engine 904 to perform two-dimensional (2D) rasterizer operations including, for example, bit-boundary block transfers. However, in one embodiment, 2D graphics operations are performed using one or more components of a graphics processing engine (GPE) 910. In some embodiments, the graphics processing engine 910 is a compute engine for performing graphics operations, including three-dimensional (3D) graphics operations and media operations.

In some embodiments, the GPE 910 includes a 3D pipeline 912 for performing 3D operations, such as rendering three-dimensional images and scenes using processing functions that act upon 3D primitive shapes (e.g., rectangle, triangle, etc.). The 3D pipeline 912 includes programmable and fixed-function elements that perform various tasks within the element and/or spawn execution threads to a 3D/media subsystem 915. While the 3D pipeline 912 can be used to perform media operations, an embodiment of the GPE 910 also includes a media pipeline 916 that is specifically used to perform media operations, such as video post-processing and image enhancement.

In some embodiments, the media pipeline 916 includes fixed-function or programmable logic units to perform one or more specialized media operations, such as video decode acceleration, video de-interlacing, and video encode acceleration, in place of or on behalf of the video codec engine 906. In some embodiments, the media pipeline 916 additionally includes a thread spawning unit to spawn threads for execution on the 3D/media subsystem 915. The spawned threads perform computations for the media operations on one or more graphics execution units included in the 3D/media subsystem 915.

In some embodiments, the 3D/media subsystem 915 includes logic for executing threads spawned by the 3D pipeline 912 and the media pipeline 916. In one embodiment, the pipelines send thread execution requests to the 3D/media subsystem 915, which includes thread dispatch logic for arbitrating and dispatching the various requests to available thread execution resources. The execution resources include an array of graphics execution units to process the 3D and media threads. In some embodiments, the 3D/media subsystem 915 includes one or more internal caches for thread instructions and data. In some embodiments, the subsystem also includes shared memory, including registers and addressable memory, to share data between threads and to store output data.

FIG. 10 is a block diagram of a graphics processing engine 1010 of a graphics processor, in accordance with some embodiments. In one embodiment, the GPE 1010 is a version of the GPE 910 shown in FIG. 9. Elements of FIG. 10 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such.

In some embodiments, the GPE 1010 couples with a command streamer 1003, which provides a command stream to the GPE 3D and media pipelines 1012, 1016. In some embodiments, the command streamer 1003 is coupled to memory, which may be system memory or one or more of internal cache memory and shared cache memory. In some embodiments, the command streamer 1003 receives commands from the memory and sends the commands to the 3D pipeline 1012 and/or the media pipeline 1016. The commands are directives fetched from a ring buffer, which stores commands for the 3D and media pipelines 1012, 1016. In one embodiment, the ring buffer may additionally include batch command buffers storing batches of multiple commands. The 3D and media pipelines 1012, 1016 process the commands by performing operations via logic within the respective pipelines or by dispatching one or more execution threads to an execution unit array 1014. In some embodiments, the execution unit array 1014 is scalable, such that the array includes a variable number of execution units based on the target power and performance level of the GPE 1010.
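
The fetch-and-route behavior of a command streamer can be illustrated with a toy software model. Everything named here (`RingBuffer`, `stream_commands`, the `"3d"` and `"media"` tags) is hypothetical and chosen for illustration; the sketch does not describe the actual hardware command format.

```python
class RingBuffer:
    """Fixed-capacity FIFO that wraps around its backing storage."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0    # index of the next command to read
        self.count = 0   # number of commands currently stored

    def push(self, command):
        """Append a command; return False if the buffer is full."""
        if self.count == len(self.slots):
            return False
        tail = (self.head + self.count) % len(self.slots)
        self.slots[tail] = command
        self.count += 1
        return True

    def pop(self):
        """Remove and return the oldest command, or None if empty."""
        if self.count == 0:
            return None
        command = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return command


def stream_commands(ring, pipelines):
    """Drain the ring buffer, routing each command to its pipeline queue."""
    while (command := ring.pop()) is not None:
        target, payload = command
        pipelines[target].append(payload)
```

In this model a producer pushes `(target, payload)` tuples, and the streamer drains them in order into per-pipeline queues; a full buffer rejects new commands until the streamer has consumed some.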

In some embodiments, a sampling engine 1030 couples with memory (e.g., cache memory or system memory) and the execution unit array 1014. In some embodiments, the sampling engine 1030 provides a memory access mechanism for the execution unit array 1014 that allows the execution unit array 1014 to read graphics and media data from memory. In some embodiments, the sampling engine 1030 includes logic to perform specialized image sampling operations for media.

In some embodiments, the specialized media sampling logic in the sampling engine 1030 includes a de-noise/de-interlace module 1032, a motion estimation module 1034, and an image scaling and filtering module 1036. In some embodiments, the de-noise/de-interlace module 1032 includes logic to perform one or more of a de-noise or a de-interlace algorithm on decoded video data. The de-interlace logic combines alternating fields of interlaced video content into a single video frame. The de-noise logic reduces or removes data noise from video and image data. In some embodiments, the de-noise logic and the de-interlace logic are motion adaptive and use spatial or temporal filtering based on the amount of motion detected in the video data. In some embodiments, the de-noise/de-interlace module 1032 includes dedicated motion detection logic (e.g., within the motion estimation engine 1034).
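
The idea of combining alternating fields into a single frame can be shown with the simplest de-interlacing method, a "weave". This is only one well-known approach, picked here for illustration; the text describes motion-adaptive logic, which this sketch does not attempt to model. The function name is hypothetical.

```python
def weave_fields(top_field, bottom_field):
    """Interleave two fields row by row into one progressive frame.

    top_field holds the even-numbered rows (0, 2, 4, ...) and
    bottom_field holds the odd-numbered rows (1, 3, 5, ...).
    """
    if len(top_field) != len(bottom_field):
        raise ValueError("fields must have the same number of rows")
    frame = []
    for top_row, bottom_row in zip(top_field, bottom_field):
        frame.append(top_row)
        frame.append(bottom_row)
    return frame
```

Weaving works well for static content but produces combing artifacts when objects move between the two fields, which is one reason motion-adaptive filtering is used in practice.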

In some embodiments, the motion estimation engine 1034 provides hardware acceleration for video operations by performing video acceleration functions, such as motion vector estimation and prediction, on video data. The motion estimation engine determines motion vectors that describe the transformation of image data between successive video frames. In some embodiments, a graphics processor media codec uses the video motion estimation engine 1034 to perform operations on video at the macroblock level that may otherwise be too computationally intensive to perform with a general-purpose processor. In some embodiments, the motion estimation engine 1034 is generally available to graphics processor components to assist with video decode and processing functions that are sensitive or adaptive to the direction or magnitude of the motion within the video data.
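
To make the notion of a motion vector concrete, the sketch below performs exhaustive block matching with a sum-of-absolute-differences (SAD) cost. Block matching with SAD is a common textbook technique and is used here as an assumption; the text does not state which algorithm the motion estimation engine uses. Frames are plain lists of rows, and the function name is hypothetical.

```python
def best_motion_vector(prev, curr, top, left, size, radius):
    """Find where a block of `curr` came from in `prev`.

    Returns (dx, dy, cost): the offset into `prev` with the lowest SAD
    for the size-by-size block at (top, left) in `curr`, or None if no
    candidate position fits inside the frame.
    """
    height, width = len(prev), len(prev[0])
    block = [row[left:left + size] for row in curr[top:top + size]]
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue  # candidate block would fall outside the frame
            cost = sum(
                abs(p - q)
                for block_row, prev_row in zip(block, prev[y:y + size])
                for p, q in zip(block_row, prev_row[x:x + size])
            )
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best
```

The nested loops hint at why this work is costly on a general-purpose processor: every block is compared against every candidate offset in the search window.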

In some embodiments, the image scaling and filtering module 1036 performs image processing operations to enhance the visual quality of generated images and video. In some embodiments, the scaling and filtering module 1036 processes image and video data during the sampling operation before providing the data to the execution unit array 1014.

In some embodiments, the GPE 1010 includes a data port 1044, which provides an additional mechanism for graphics subsystems to access memory. In some embodiments, the data port 1044 facilitates memory access for operations including render target writes, constant buffer reads, scratch memory space reads/writes, and media surface accesses. In some embodiments, the data port 1044 includes cache memory space to cache accesses to memory. The cache memory may be a single data cache or may be separated into multiple caches for the multiple subsystems that access memory via the data port (e.g., a render buffer cache, a constant buffer cache, etc.). In some embodiments, threads executing on an execution unit in the execution unit array 1014 communicate with the data port by exchanging messages via a data distribution interconnect that couples each of the subsystems of the GPE 1010.

FIG. 11 is a block diagram of another embodiment of a graphics processor 1100. Those elements of FIG. 11 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such.

In some embodiments, the graphics processor 1100 includes a ring interconnect 1102, a pipeline front end 1104, a media engine 1137, and graphics cores 1180A-1180N. In some embodiments, the ring interconnect 1102 couples the graphics processor to other processing units, including other graphics processors or one or more general-purpose processor cores. In some embodiments, the graphics processor is one of many processors integrated within a multi-core processing system.

In some embodiments, the graphics processor 1100 receives batches of commands via the ring interconnect 1102. The incoming commands are interpreted by a command streamer 1103 in the pipeline front end 1104. In some embodiments, the graphics processor 1100 includes scalable execution logic to perform 3D geometry processing and media processing via the graphics cores 1180A-1180N. For 3D geometry processing commands, the command streamer 1103 supplies commands to the geometry pipeline 1136. For at least some media processing commands, the command streamer 1103 supplies the commands to a video front end 1134, which couples with the media engine 1137. In some embodiments, the media engine 1137 includes a video quality engine (VQE) 1130 for video and image post-processing and a multi-format encode/decode (MFX) engine 1133 to provide hardware-accelerated media data encoding and decoding. In some embodiments, the geometry pipeline 1136 and the media engine 1137 each generate execution threads for the thread execution resources provided by at least one graphics core 1180A.

In some embodiments, the graphics processor 1100 includes scalable thread execution resources featuring modular cores 1180A-1180N (sometimes referred to as core slices), each having multiple sub-cores 1150A-1150N, 1160A-1160N (sometimes referred to as core sub-slices). In some embodiments, the graphics processor 1100 can have any number of graphics cores 1180A through 1180N. In some embodiments, the graphics processor 1100 includes a graphics core 1180A having at least a first sub-core 1150A and a second sub-core 1160A. In other embodiments, the graphics processor is a low-power processor with a single sub-core (e.g., 1150A). In some embodiments, the graphics processor 1100 includes multiple graphics cores 1180A-1180N, each including a set of first sub-cores 1150A-1150N and a set of second sub-cores 1160A-1160N. Each sub-core in the set of first sub-cores 1150A-1150N includes at least a first set of execution units 1152A-1152N and media/texture samplers 1154A-1154N. Each sub-core in the set of second sub-cores 1160A-1160N includes at least a second set of execution units 1162A-1162N and samplers 1164A-1164N. In some embodiments, each sub-core 1150A-1150N, 1160A-1160N shares a set of shared resources 1170A-1170N. In some embodiments, the shared resources include shared cache memory and pixel operation logic. Other shared resources may also be included in the various embodiments of the graphics processor.

FIG. 12 illustrates thread execution logic 1200, which includes an array of processing elements employed in some embodiments of a GPE. Elements of FIG. 12 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such.

In some embodiments, the thread execution logic 1200 includes a pixel shader 1202, a thread dispatcher 1204, an instruction cache 1206, a scalable execution unit array including a plurality of execution units 1208A-1208N, a sampler 1210, a data cache 1212, and a data port 1214. In one embodiment, the included components are interconnected via an interconnect fabric that links to each of the components. In some embodiments, the thread execution logic 1200 includes one or more connections to memory, such as system memory or cache memory, through one or more of the instruction cache 1206, the data port 1214, the sampler 1210, and the execution unit array 1208A-1208N. In some embodiments, each execution unit (e.g., 1208A) is an individual vector processor capable of executing multiple simultaneous threads and processing multiple data elements in parallel for each thread. In some embodiments, the execution unit array 1208A-1208N includes any number of individual execution units.

In some embodiments, the execution unit array 1208A-1208N is primarily used to execute "shader" programs. In some embodiments, the execution units in the array 1208A-1208N execute an instruction set that includes native support for many standard 3D graphics shader instructions, such that shader programs from graphics libraries (e.g., Direct3D and OpenGL) are executed with minimal translation. The execution units support vertex and geometry processing (e.g., vertex programs, geometry programs, vertex shaders), pixel processing (e.g., pixel shaders, fragment shaders), and general-purpose processing (e.g., compute and media shaders).

Each execution unit in the execution unit array 1208A-1208N operates on arrays of data elements. The number of data elements is the "execution size", or the number of channels for the instruction. An execution channel is a logical unit of execution for data element access, masking, and flow control within instructions. The number of channels may be independent of the number of physical arithmetic logic units (ALUs) or floating point units (FPUs) of a particular graphics processor. In some embodiments, the execution units 1208A-1208N support integer and floating-point data types.

The execution unit instruction set includes single instruction multiple data (SIMD) instructions. The various data elements can be stored as a packed data type in a register, and the execution unit will process the various elements based on the data size of the elements. For example, when operating on a 256-bit wide vector, the 256 bits of the vector are stored in a register, and the execution unit operates on the vector as four separate 64-bit packed data elements (quad-word (QW) size data elements), eight separate 32-bit packed data elements (double-word (DW) size data elements), sixteen separate 16-bit packed data elements (word (W) size data elements), or thirty-two separate 8-bit data elements (byte (B) size data elements). However, different vector widths and register sizes are possible.
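The packed-element arithmetic in the paragraph above can be sketched in a few lines of Python. This is only an illustration: the names `ELEMENT_BITS`, `lanes`, and `unpack` are hypothetical, and placing element 0 in the least significant bits is an assumption that the text does not state.

```python
# Element widths in bits for the packed data types named in the text.
ELEMENT_BITS = {"QW": 64, "DW": 32, "W": 16, "B": 8}


def lanes(vector_bits, element):
    """Number of packed elements of the given type that fit in the vector."""
    width = ELEMENT_BITS[element]
    if vector_bits % width != 0:
        raise ValueError("vector width is not a multiple of the element width")
    return vector_bits // width


def unpack(vector, vector_bits, element):
    """Split an integer register image into its packed elements.

    Assumption: element 0 occupies the least significant bits.
    """
    width = ELEMENT_BITS[element]
    mask = (1 << width) - 1
    return [(vector >> (i * width)) & mask
            for i in range(lanes(vector_bits, element))]


if __name__ == "__main__":
    for name in ("QW", "DW", "W", "B"):
        print(name, lanes(256, name))
```

For a 256-bit vector the helper yields the same four lane counts listed in the text (4, 8, 16, and 32).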

One or more internal instruction caches (e.g., 1206) are included in the thread execution logic 1200 to cache thread instructions for the execution units. In some embodiments, one or more data caches (e.g., 1212) are included to cache thread data during thread execution. In some embodiments, a sampler 1210 is included to provide texture sampling for 3D operations and media sampling for media operations. In some embodiments, the sampler 1210 includes specialized texture or media sampling functionality to process texture or media data during the sampling process before providing the sampled data to an execution unit.

During execution, the graphics and media pipelines send thread initiation requests to the thread execution logic 1200 via thread spawning and dispatch logic. In some embodiments, the thread execution logic 1200 includes a local thread dispatcher 1204 that arbitrates thread initiation requests from the graphics and media pipelines and instantiates the requested threads on one or more execution units 1208A-1208N. For example, the geometry pipeline (e.g., 1136 of FIG. 11) dispatches vertex processing, tessellation, or geometry processing threads to the thread execution logic 1200 (FIG. 12). In some embodiments, the thread dispatcher 1204 can also process runtime thread spawning requests from the executing shader programs.

Once a group of geometric objects has been processed and rasterized into pixel data, the pixel shader 1202 is invoked to further compute output information and cause results to be written to output surfaces (e.g., color buffers, depth buffers, stencil buffers, etc.). In some embodiments, the pixel shader 1202 calculates the values of the various vertex attributes that are to be interpolated across the rasterized object. In some embodiments, the pixel shader 1202 then executes an application programming interface (API)-supplied pixel shader program. To execute the pixel shader program, the pixel shader 1202 dispatches threads to an execution unit (e.g., 1208A) via the thread dispatcher 1204. In some embodiments, the pixel shader 1202 uses texture sampling logic in the sampler 1210 to access texture data in texture maps stored in memory. Arithmetic operations on the texture data and the input geometry data compute pixel color data for each geometric fragment, or discard one or more pixels from further processing.

In some embodiments, the data port 1214 provides a memory access mechanism for the thread execution logic 1200 to output processed data to memory for processing on a graphics processor output pipeline. In some embodiments, the data port 1214 includes or couples to one or more cache memories (e.g., the data cache 1212) to cache data for memory access via the data port.

FIG. 13 is a block diagram illustrating a graphics processor instruction format 1300 according to some embodiments. In one or more embodiments, the graphics processor execution units support an instruction set having instructions in multiple formats. The solid-lined boxes illustrate the components that are generally included in an execution unit instruction, while the dashed lines include components that are optional or that are only included in a subset of the instructions. In some embodiments, the instruction format 1300 described and illustrated consists of macro-instructions, in that they are instructions supplied to the execution unit, as opposed to micro-operations resulting from instruction decode once the instruction is processed.

In some embodiments, the graphics processor execution units natively support instructions in a 128-bit format 1310. A 64-bit compacted instruction format 1330 is available for some instructions based on the selected instruction, instruction options, and number of operands. The native 128-bit format 1310 provides access to all instruction options, while some options and operations are restricted in the 64-bit format 1330. The native instructions available in the 64-bit format 1330 vary by embodiment. In some embodiments, the instruction is compacted in part using a set of index values in an index field 1313. The execution unit hardware references a set of compaction tables based on the index values and uses the compaction table outputs to reconstruct a native instruction in the 128-bit format 1310.

For each format, an instruction opcode 1312 defines the operation that the execution unit is to perform. The execution units execute each instruction in parallel across the multiple data elements of each operand. For example, in response to an add instruction, the execution unit performs a simultaneous add operation across each color channel representing a texture element or picture element. By default, the execution unit performs each instruction across all data channels of the operands. In some embodiments, an instruction control field 1314 enables control over certain execution options, such as channel selection (e.g., predication) and data channel order (e.g., swizzle). For 128-bit instructions 1310, an execution size field 1316 limits the number of data channels that will be executed in parallel. In some embodiments, the execution size field 1316 is not available for use in the 64-bit compact instruction format 1330.
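The interaction of the default all-channel behavior, the execution size, and channel selection can be illustrated with a small software model. This is a sketch under stated assumptions, not the hardware behavior: the function name `simd_add`, its parameters, and the convention that bit i of the predicate mask enables channel i are all hypothetical.

```python
def simd_add(src0, src1, dst, exec_size=None, pred_mask=None):
    """Model a channel-wise add over packed operands.

    exec_size limits how many channels execute (default: all channels).
    pred_mask is a hypothetical per-channel enable bitmask; bit i set means
    channel i executes. Disabled channels keep their old destination value.
    """
    channels = len(src0) if exec_size is None else exec_size
    out = list(dst)
    for ch in range(channels):
        if pred_mask is None or (pred_mask >> ch) & 1:
            out[ch] = src0[ch] + src1[ch]
    return out


if __name__ == "__main__":
    a, b, old = [1, 2, 3, 4], [10, 20, 30, 40], [0, 0, 0, 0]
    print(simd_add(a, b, old))                    # all channels
    print(simd_add(a, b, old, exec_size=2))       # first two channels only
    print(simd_add(a, b, old, pred_mask=0b0101))  # channels 0 and 2 only
```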

Some execution unit instructions have up to three operands, including two source operands, SRC0 1320 and SRC1 1322, and one destination 1318. In some embodiments, the execution units support dual-destination instructions, where one of the destinations is implied. Data manipulation instructions can have a third source operand (e.g., SRC2 1324), where the instruction opcode 1312 determines the number of source operands. An instruction's last source operand can be an immediate (e.g., hard-coded) value passed with the instruction.

In some embodiments, the 128-bit instruction format 1310 includes access/address mode information 1326 specifying, for example, whether a direct register addressing mode or an indirect register addressing mode is used. When the direct register addressing mode is used, the register addresses of one or more operands are provided directly by bits in the instruction 1310.

In some embodiments, the 128-bit instruction format 1310 includes an access/address mode field 1326, which specifies an address mode and/or an access mode for the instruction. In one embodiment, the access mode defines a data access alignment for the instruction. Some embodiments support access modes including a 16-byte aligned access mode and a 1-byte aligned access mode, where the byte alignment of the access mode determines the access alignment of the instruction operands. For example, when in a first mode, the instruction 1310 may use byte-aligned addressing for source and destination operands, and when in a second mode, the instruction 1310 may use 16-byte-aligned addressing for all source and destination operands.
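As a rough illustration of the two access modes described above, the check below tests whether an operand's byte address satisfies the alignment a mode requires. The mode names `align1` and `align16` and the helper `is_aligned` are hypothetical labels chosen for this sketch; the text does not name the modes.

```python
# Required operand alignment, in bytes, for each (hypothetically named) mode.
ACCESS_MODE_ALIGNMENT = {"align1": 1, "align16": 16}


def is_aligned(byte_address, access_mode):
    """Return True if the address meets the alignment of the access mode."""
    return byte_address % ACCESS_MODE_ALIGNMENT[access_mode] == 0


if __name__ == "__main__":
    print(is_aligned(48, "align16"))  # 48 is a multiple of 16
    print(is_aligned(40, "align16"))  # 40 is not
    print(is_aligned(41, "align1"))   # any byte address is 1-byte aligned
```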

In one embodiment, the address mode portion of the access/address mode field 1326 determines whether the instruction is to use direct or indirect addressing. When the direct register addressing mode is used, bits in the instruction 1310 directly provide the register addresses of one or more operands. When the indirect register addressing mode is used, the register addresses of one or more operands may be computed based on an address register value and an address immediate field in the instruction.
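The difference between the two addressing modes can be sketched as follows. The text says only that the indirect address is computed from the address register value and the address immediate field; treating that computation as a simple sum is an assumption made for this illustration, and the function and parameter names are hypothetical.

```python
def operand_register(mode, instr_reg_bits=0, addr_reg_value=0, addr_immediate=0):
    """Resolve the register address of an operand.

    direct:   the register address comes straight from bits in the instruction.
    indirect: the register address is derived from an address register value
              and an immediate field (assumed here to be their sum).
    """
    if mode == "direct":
        return instr_reg_bits
    if mode == "indirect":
        return addr_reg_value + addr_immediate
    raise ValueError("unknown addressing mode: " + repr(mode))


if __name__ == "__main__":
    print(operand_register("direct", instr_reg_bits=5))
    print(operand_register("indirect", addr_reg_value=16, addr_immediate=3))
```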

In some embodiments, instructions are grouped based on the bit fields of the opcode 1312 to simplify opcode decode 1340. For an 8-bit opcode, bits 10, 11, and 12 allow the execution unit to determine the type of opcode. The precise opcode grouping shown is merely an example. In some embodiments, a move and logic opcode group 1342 includes data movement and logic instructions (e.g., move (mov), compare (cmp)). In some embodiments, the move and logic group 1342 shares the five most significant bits (MSB), where move (mov) instructions are in the form of 0000xxxxb and logic instructions are in the form of 0001xxxxb. A flow control instruction group 1344 (e.g., call, jump (jmp)) includes instructions in the form of 0010xxxxb (e.g., 0x20). A miscellaneous instruction group 1346 includes a mix of instructions, including synchronization instructions (e.g., wait, send) in the form of 0011xxxxb (e.g., 0x30). A parallel math instruction group 1348 includes component-wise arithmetic instructions (e.g., add, multiply (mul)) in the form of 0100xxxxb (e.g., 0x40). The parallel math group 1348 performs the arithmetic operations in parallel across data channels. A vector math group 1350 includes arithmetic instructions (e.g., dp4) in the form of 0101xxxxb (e.g., 0x50). The vector math group performs arithmetic, such as dot product calculations, on vector operands.
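The example grouping above lends itself to a small table-driven decoder. The sketch below classifies an 8-bit opcode by the four leading bits shown in the patterns (0000xxxxb through 0101xxxxb); the names `OPCODE_GROUPS` and `opcode_group` are hypothetical, and the code follows the listed bit patterns rather than any particular hardware decode logic.

```python
# Upper four bits of the 8-bit opcode -> example group from the text.
OPCODE_GROUPS = {
    0b0000: "move",           # 0000xxxxb, move and logic group 1342
    0b0001: "logic",          # 0001xxxxb, move and logic group 1342
    0b0010: "flow control",   # 0010xxxxb, group 1344
    0b0011: "miscellaneous",  # 0011xxxxb, group 1346
    0b0100: "parallel math",  # 0100xxxxb, group 1348
    0b0101: "vector math",    # 0101xxxxb, group 1350
}


def opcode_group(opcode):
    """Classify an 8-bit opcode by its upper four bits."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in 8 bits")
    return OPCODE_GROUPS.get(opcode >> 4, "unknown")


if __name__ == "__main__":
    for op in (0x01, 0x1F, 0x20, 0x30, 0x40, 0x50, 0xF0):
        print(hex(op), opcode_group(op))
```

Keeping the mapping in a dictionary mirrors the point made in the text that the exact grouping is only an example and may differ between embodiments.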

FIG. 14 is a block diagram of another embodiment of a graphics processor 1400. Elements of FIG. 14 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such.

In some embodiments, the graphics processor 1400 includes a graphics pipeline 1420, a media pipeline 1430, a display engine 1440, thread execution logic 1450, and a render output pipeline 1470. In some embodiments, the graphics processor 1400 is a graphics processor within a multi-core processing system that includes one or more general-purpose processing cores. The graphics processor is controlled by register writes to one or more control registers (not shown) or via commands issued to the graphics processor 1400 over a ring interconnect 1402. In some embodiments, the ring interconnect 1402 couples the graphics processor 1400 to other processing components, such as other graphics processors or general-purpose processors. Commands from the ring interconnect 1402 are interpreted by a command streamer 1403, which supplies instructions to individual components of the graphics pipeline 1420 or the media pipeline 1430.

In some embodiments, the command streamer 1403 directs the operation of a vertex fetcher 1405, which reads vertex data from memory and executes vertex processing commands provided by the command streamer 1403. In some embodiments, the vertex fetcher 1405 provides vertex data to a vertex shader 1407, which performs coordinate space transformation and lighting operations on each vertex. In some embodiments, the vertex fetcher 1405 and the vertex shader 1407 execute vertex processing instructions by dispatching execution threads to the execution units 1452A, 1452B via a thread dispatcher 1431.

In some embodiments, the execution units 1452A, 1452B are an array of vector processors having an instruction set for performing graphics and media operations. In some embodiments, the execution units 1452A, 1452B have an attached L1 cache 1451 that is specific to each array or shared between the arrays. The cache can be configured as a data cache, an instruction cache, or a single cache that is partitioned to contain data and instructions in different partitions.

In some embodiments, the graphics pipeline 1420 includes tessellation components to perform hardware-accelerated tessellation of 3D objects. In some embodiments, a programmable hull shader 1411 configures the tessellation operations. A programmable domain shader 1417 provides back-end evaluation of the tessellation output. A tessellator 1413 operates at the direction of the hull shader 1411 and contains special-purpose logic to generate a set of detailed geometric objects based on a coarse geometric model that is provided as input to the graphics pipeline 1420. In some embodiments, if tessellation is not used, the tessellation components 1411, 1413, and 1417 can be bypassed.

In some embodiments, complete geometric objects can be processed by a geometry shader 1419 via one or more threads dispatched to the execution units 1452A, 1452B, or can proceed directly to the clipper 1429. In some embodiments, the geometry shader operates on entire geometric objects, rather than on vertices or patches of vertices as in previous stages of the graphics pipeline. If tessellation is disabled, the geometry shader 1419 receives input from the vertex shader 1407. In some embodiments, the geometry shader 1419 is programmable by a geometry shader program to perform geometry tessellation if the tessellation units are disabled.

Before rasterization, a clipper 1429 processes the vertex data. The clipper 1429 may be a fixed-function clipper or a programmable clipper having clipping and geometry shader functions. In some embodiments, a rasterizer/depth 1473 in the render output pipeline 1470 dispatches pixel shaders to convert the geometric objects into their per-pixel representations. In some embodiments, pixel shader logic is included in the thread execution logic 1450. In some embodiments, an application can bypass the rasterizer 1473 and access un-rasterized vertex data via a stream out unit 1423.

The graphics processor 1400 has an interconnect bus, an interconnect fabric, or some other interconnect mechanism that allows data and messages to pass among the major components of the processor. In some embodiments, the execution units 1452A, 1452B and associated cache(s) 1451, the texture and media sampler 1454, and the texture/sampler cache 1458 interconnect via a data port 1456 to perform memory access and communicate with the render output pipeline components of the processor. In some embodiments, the sampler 1454, the caches 1451, 1458, and the execution units 1452A, 1452B each have separate memory access paths.

In some embodiments, the render output pipeline 1470 contains a rasterizer and depth test component 1473 that converts vertex-based objects into an associated pixel-based representation. In some embodiments, the rasterizer logic includes a windower/masker unit to perform fixed-function triangle and line rasterization. In some embodiments, an associated render cache 1478 and depth cache 1479 are also available. A pixel operations component 1477 performs pixel-based operations on the data, although in some instances, pixel operations associated with 2D operations (e.g., bit block image transfers with blending) are performed by the 2D engine 1441, or substituted at display time by the display controller 1443 using overlay display planes. In some embodiments, a shared L3 cache 1475 is available to all graphics components, allowing data to be shared without the use of main system memory.

In some embodiments, the graphics processor media pipeline 1430 includes a media engine 1437 and a video front end 1434. In some embodiments, the video front end 1434 receives pipeline commands from the command streamer 1403. In some embodiments, the media pipeline 1430 includes a separate command stream. In some embodiments, the video front end 1434 processes media commands before sending the commands to the media engine 1437. In some embodiments, the media engine 1437 includes thread spawning functionality to spawn threads for dispatch to the thread execution logic 1450 via the thread dispatcher 1431.

In some embodiments, the graphics processor 1400 includes a display engine 1440. In some embodiments, the display engine 1440 is external to the processor 1400 and couples with the graphics processor via the ring interconnect 1402 or some other interconnect bus or fabric. In some embodiments, the display engine 1440 includes a 2D engine 1441 and a display controller 1443. In some embodiments, the display engine 1440 contains special-purpose logic capable of operating independently of the 3D pipeline. In some embodiments, the display controller 1443 couples with a display device (not shown), which may be a system-integrated display device, as in a laptop computer, or an external display device attached via a display device connector.

In some embodiments, the graphics pipeline 1420 and the media pipeline 1430 are configurable to perform operations based on multiple graphics and media programming interfaces and are not specific to any one application programming interface (API). In some embodiments, driver software for the graphics processor translates API calls that are specific to a particular graphics or media library into commands that can be processed by the graphics processor. In some embodiments, support is provided for the Open Graphics Library (OpenGL) and Open Computing Language (OpenCL) from the Khronos Group, the Direct3D library from Microsoft Corporation, or support may be provided for both OpenGL and D3D. Support may also be provided for the Open Source Computer Vision Library (OpenCV). A future API with a compatible 3D pipeline would also be supported if a mapping can be made from the pipeline of the future API to the pipeline of the graphics processor.

FIG. 15A is a block diagram illustrating a graphics processor command format 1500 according to some embodiments. FIG. 15B is a block diagram illustrating a graphics processor command sequence 1510 according to an embodiment. The solid-lined boxes in FIG. 15A illustrate the components that are generally included in a graphics command, while the dashed lines include components that are optional or that are only included in a subset of the graphics commands. The exemplary graphics processor command format 1500 of FIG. 15A includes data fields to identify a target client 1502 of the command, a command operation code (opcode) 1504, and the relevant data 1506 for the command. A sub-opcode 1505 and a command size 1508 are also included in some commands.

In some embodiments, the client 1502 specifies the client unit of the graphics device that processes the command data. In some embodiments, a graphics processor command parser examines the client field of each command to condition the further processing of the command and route the command data to the appropriate client unit. In some embodiments, the graphics processor client units include a memory interface unit, a render unit, a 2D unit, a 3D unit, and a media unit. Each client unit has a corresponding processing pipeline that processes the commands. Once the command is received by the client unit, the client unit reads the opcode 1504 and, if present, the sub-opcode 1505 to determine the operation to perform. The client unit performs the command using the information in the data field 1506. For some commands, an explicit command size 1508 is expected to specify the size of the command. In some embodiments, the command parser automatically determines the size of at least some of the commands based on the command opcode. In some embodiments, commands are aligned to multiples of a double word.
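The routing described above can be sketched as a small parser. The text names the fields (client, opcode, sub-opcode, data, size) but gives no bit positions or client encodings, so everything concrete below is a hypothetical choice made purely for illustration: a 32-bit header double word with the client in bits 31:29, the opcode in bits 28:24, the sub-opcode in bits 23:16, and the total command length in double words in bits 7:0.

```python
from dataclasses import dataclass

# Hypothetical client encoding; the text lists the units, not their codes.
CLIENT_UNITS = {0: "memory interface", 1: "render", 2: "2D", 3: "3D", 4: "media"}


@dataclass
class Command:
    client: str
    opcode: int
    sub_opcode: int
    size_dwords: int
    data: list


def parse_command(dwords):
    """Decode one command from a list of 32-bit double words.

    The header layout used here is an assumption, not the patent's format.
    """
    header = dwords[0]
    client = CLIENT_UNITS[(header >> 29) & 0x7]
    opcode = (header >> 24) & 0x1F
    sub_opcode = (header >> 16) & 0xFF
    size = header & 0xFF  # total length in double words, header included
    if size < 1 or size > len(dwords):
        raise ValueError("command size does not match the available data")
    return Command(client, opcode, sub_opcode, size, list(dwords[1:size]))


if __name__ == "__main__":
    header = (3 << 29) | (0x1A << 24) | (0x05 << 16) | 3
    print(parse_command([header, 0xAA, 0xBB, 0xCC]))
```

Because the size is counted in whole double words, the sketch also reflects the statement that commands are aligned to multiples of a double word.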

The flow diagram in FIG. 15B shows an example graphics processor command sequence 1510. In some embodiments, software or firmware of a data processing system that features an embodiment of a graphics processor uses a version of the command sequence shown to set up, execute, and terminate a set of graphics operations. The sample command sequence is shown and described for purposes of example only, as embodiments are not limited to these specific commands or to this command sequence. Moreover, the commands may be issued as a batch of commands in a command sequence, such that the graphics processor processes the sequence of commands at least partially concurrently.

In some embodiments, the graphics processor command sequence 1510 may begin with a pipeline flush command 1512 to cause any active graphics pipeline to complete the currently pending commands for the pipeline. In some embodiments, the 3D pipeline 1522 and the media pipeline 1524 do not operate concurrently. The pipeline flush is performed to cause the active graphics pipeline to complete any pending commands. In response to a pipeline flush, the command parser for the graphics processor pauses command processing until the active drawing engines complete pending operations and the relevant read caches are invalidated. Optionally, any data in the render cache that is marked "dirty" can be flushed to memory. In some embodiments, the pipeline flush command 1512 can be used for pipeline synchronization before the graphics processor is placed into a low power state.

In some embodiments, a pipeline select command 1513 is used when a command sequence requires the graphics processor to explicitly switch between pipelines. In some embodiments, a pipeline select command 1513 is required only once within an execution context before issuing pipeline commands, unless the context is to issue commands for both pipelines. In some embodiments, a pipeline flush command 1512 is required immediately before a pipeline switch via the pipeline select command 1513.

In some embodiments, a pipeline control command 1514 configures a graphics pipeline for operation and is used to program the 3D pipeline 1522 and the media pipeline 1524. In some embodiments, the pipeline control command 1514 configures the pipeline state for the active pipeline. In one embodiment, the pipeline control command 1514 is used for pipeline synchronization and to clear data from one or more cache memories within the active pipeline before processing a batch of commands.

In some embodiments, return buffer state commands 1516 are used to configure a set of return buffers for the respective pipelines to write data. Some pipeline operations require the allocation, selection, or configuration of one or more return buffers into which the operations write intermediate data during processing. In some embodiments, the graphics processor also uses one or more return buffers to store output data and to perform cross-thread communication. In some embodiments, the return buffer state 1516 includes selecting the size and number of return buffers to use for a set of pipeline operations.

The remaining commands in the command sequence differ based on the active pipeline for operations. Based on a pipeline determination 1520, the command sequence is tailored either to the 3D pipeline 1522, beginning with the 3D pipeline state 1530, or to the media pipeline 1524, beginning with the media pipeline state 1540.

The commands for the 3D pipeline state 1530 include 3D state setting commands for vertex buffer state, vertex element state, constant color state, depth buffer state, and other state variables that are to be configured before 3D primitive commands are processed. The values of these commands are determined at least in part by the particular 3D API in use. In some embodiments, 3D pipeline state 1530 commands can also selectively disable or bypass certain pipeline elements if those elements will not be used.

In some embodiments, the 3D primitive 1532 command is used to submit 3D primitives to be processed by the 3D pipeline. Commands and associated parameters that are passed to the graphics processor via the 3D primitive 1532 command are forwarded to the vertex fetch function in the graphics pipeline. The vertex fetch function uses the 3D primitive 1532 command data to generate vertex data structures. The vertex data structures are stored in one or more return buffers. In some embodiments, the 3D primitive 1532 command is used to perform vertex operations on 3D primitives via vertex shaders. To process vertex shaders, the 3D pipeline 1522 dispatches shader execution threads to graphics processor execution units.

In some embodiments, the 3D pipeline 1522 is triggered via an execute 1534 command or event. In some embodiments, a register write triggers command execution. In some embodiments, execution is triggered via a "go" or "kick" command in the command sequence. In one embodiment, command execution is triggered using a pipeline synchronization command to flush the command sequence through the graphics pipeline. The 3D pipeline then performs geometry processing for the 3D primitives. Once the operations are complete, the resulting geometric objects are rasterized and the pixel engine colors the resulting pixels. Additional commands to control pixel shading and pixel back-end operations may also be included for those operations.

In some embodiments, the graphics processor command sequence 1510 follows the media pipeline 1524 path when performing media operations. In general, the specific use and manner of programming for the media pipeline 1524 depend on the media or compute operations to be performed. Specific media decode operations may be offloaded to the media pipeline during media decode. In some embodiments, the media pipeline can also be bypassed, and media decode can be performed in whole or in part using resources provided by one or more general-purpose processing cores. In one embodiment, the media pipeline also includes elements for general-purpose graphics processor unit (GPGPU) operations, where the graphics processor is used to perform SIMD vector operations using computational shader programs that are not explicitly related to the rendering of graphics primitives.

In some embodiments, the media pipeline 1524 is configured in a manner similar to the 3D pipeline 1522. A set of media pipeline state commands 1540 is dispatched or placed into a command queue before the media object commands 1542. In some embodiments, the media pipeline state commands 1540 include data to configure the media pipeline elements that will be used to process the media objects. This includes data to configure the video decode and video encode logic within the media pipeline, such as the encode or decode format. In some embodiments, the media pipeline state commands 1540 also support the use of one or more pointers to "indirect" state elements that contain a batch of state settings.

In some embodiments, media object commands 1542 supply pointers to media objects for processing by the media pipeline. The media objects include memory buffers containing the video data to be processed. In some embodiments, all media pipeline states must be valid before a media object command 1542 is issued. Once the pipeline state is configured and the media object commands 1542 are queued, the media pipeline 1524 is triggered via an execute command 1544 or an equivalent execute event (e.g., a register write). Output from the media pipeline 1524 may then be post-processed by operations provided by the 3D pipeline 1522 or the media pipeline 1524. In some embodiments, GPGPU operations are configured and executed in a manner similar to media operations.
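The ordering of command sequence 1510 described above can be summarized in a short Python sketch. The command labels and the `build_sequence` helper are hypothetical names chosen for illustration, and the sketch assumes that every sequence starts with a flush, which the description presents only as one possible embodiment.

```python
from typing import List


def build_sequence(pipeline: str, switch_pipeline: bool = True) -> List[str]:
    """Return an ordered list of command labels for one pass through sequence 1510.

    All labels are illustrative stand-ins, not real command encodings.
    """
    if pipeline not in ("3d", "media"):
        raise ValueError("pipeline must be '3d' or 'media'")

    # Pipeline flush 1512: let the active pipeline finish pending commands.
    sequence = ["PIPELINE_FLUSH"]
    if switch_pipeline:
        # Pipeline select 1513 follows immediately after the flush.
        sequence.append("PIPELINE_SELECT")
    # Pipeline control 1514 and return buffer state 1516 are common to both paths.
    sequence += ["PIPELINE_CONTROL", "RETURN_BUFFER_STATE"]

    # Pipeline determination 1520: the rest depends on the active pipeline.
    if pipeline == "3d":
        sequence += ["3D_PIPELINE_STATE", "3D_PRIMITIVE", "EXECUTE"]
    else:
        # Media pipeline state must be valid before any media object command.
        sequence += ["MEDIA_PIPELINE_STATE", "MEDIA_OBJECT", "EXECUTE"]
    return sequence


print(build_sequence("3d"))
print(build_sequence("media", switch_pipeline=False))
```

The two branches mirror each other on purpose: state commands come first, then the work items (primitives or media objects), then the execute trigger.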

FIG. 16 illustrates an example graphics software architecture for a data processing system 1600 according to some embodiments. In some embodiments, the software architecture includes a 3D graphics application 1610, an operating system 1620, and at least one processor 1630. In some embodiments, the processor 1630 includes a graphics processor 1632 and one or more general-purpose processor cores 1634. The graphics application 1610 and the operating system 1620 each execute in the system memory 1650 of the data processing system.

In some embodiments, the 3D graphics application 1610 contains one or more shader programs including shader instructions 1612. The shader language instructions may be in a high-level shader language, such as the High Level Shader Language (HLSL) or the OpenGL Shader Language (GLSL). The application also includes executable instructions 1614 in a machine language suitable for execution by the general-purpose processor core 1634. The application also includes graphics objects 1616 defined by vertex data.

In some embodiments, the operating system 1620 is a Microsoft® Windows® operating system from the Microsoft Corporation, a proprietary UNIX-like operating system, or an open source UNIX-like operating system using a variant of the Linux kernel. When the Direct3D API is in use, the operating system 1620 uses a front-end shader compiler 1624 to compile any shader instructions 1612 in HLSL into a lower-level shader language. The compilation may be a just-in-time (JIT) compilation, or the application can perform shader pre-compilation. In some embodiments, high-level shaders are compiled into low-level shaders during the compilation of the 3D graphics application 1610.

In some embodiments, the user mode graphics driver 1626 contains a back-end shader compiler 1627 to convert the shader instructions 1612 into a hardware-specific representation. When the OpenGL API is in use, shader instructions 1612 in the GLSL high-level language are passed to the user mode graphics driver 1626 for compilation. In some embodiments, the user mode graphics driver 1626 uses operating system kernel mode functions 1628 to communicate with a kernel mode graphics driver 1629. In some embodiments, the kernel mode graphics driver 1629 communicates with the graphics processor 1632 to dispatch commands and instructions.

One or more aspects of at least one embodiment may be implemented by representative code stored on a machine-readable medium, which represents and/or defines logic within an integrated circuit such as a processor. For example, the machine-readable medium may include instructions that represent various logic within the processor. When read by a machine, the instructions may cause the machine to fabricate the logic to perform the techniques described herein. Such representations, known as "IP cores," are reusable units of logic for an integrated circuit that may be stored on a tangible, machine-readable medium as a hardware model that describes the structure of the integrated circuit. The hardware model may be supplied to various customers or manufacturers, which load the hardware model on fabrication machines that manufacture the integrated circuit. The integrated circuit may be fabricated such that the circuit performs the operations described in association with any of the embodiments described herein.

FIG. 17 is a block diagram illustrating an IP core development system 1700 that may be used to manufacture an integrated circuit to perform operations according to an embodiment. The IP core development system 1700 may be used to generate modular, reusable designs that can be incorporated into a larger design or used to construct an entire integrated circuit (e.g., an SOC integrated circuit). A design facility 1730 can generate a software simulation 1710 of an IP core design in a high-level programming language (e.g., C/C++). The software simulation 1710 can be used to design, test, and verify the behavior of the IP core. A register transfer level (RTL) design can then be created or synthesized from the simulation model 1700. The RTL design 1715 is an abstraction of the behavior of the integrated circuit that models the flow of digital signals between hardware registers, including the associated logic performed using the modeled digital signals. In addition to the RTL design 1715, lower-level designs at the logic level or transistor level may also be created, designed, or synthesized. Thus, the particular details of the initial design and simulation may vary.

The RTL design 1715 or an equivalent may be further synthesized by the design facility into a hardware model 1720, which may be in a hardware description language (HDL) or some other representation of physical design data. The HDL may be further simulated or tested to verify the IP core design. The IP core design can be stored for delivery to a third-party fabrication facility 1765 using non-volatile memory 1740 (e.g., a hard disk, flash memory, or any non-volatile storage medium). Alternatively, the IP core design may be transmitted (e.g., via the Internet) over a wired connection 1750 or a wireless connection 1760. The fabrication facility 1765 may then fabricate an integrated circuit that is based at least in part on the IP core design. The fabricated integrated circuit can be configured to perform operations in accordance with at least one embodiment described herein.

FIG. 18 is a block diagram illustrating an example system-on-chip integrated circuit 1800 that may be fabricated using one or more IP cores, according to an embodiment. The example integrated circuit includes one or more application processors 1805 (e.g., CPUs) and at least one graphics processor 1810, and may additionally include an image processor 1815 and/or a video processor 1820, any of which may be a modular IP core from the same or multiple different design facilities. The integrated circuit includes peripheral or bus logic including a USB controller 1825, a UART controller 1830, an SPI/SDIO controller 1835, and an I2S/I2C controller 1840. Additionally, the integrated circuit can include a display device 1845 coupled to one or more of a high-definition multimedia interface (HDMI) controller 1850 and a mobile industry processor interface (MIPI) display interface 1855. Storage may be provided by a flash memory subsystem 1860 including flash memory and a flash memory controller. A memory interface may be provided via a memory controller 1865 for access to SDRAM or SRAM memory devices. Some integrated circuits additionally include an embedded security engine 1870.

Additionally, other logic and circuits may be included in the processor of the integrated circuit 1800, including additional graphics processors/cores, peripheral interface controllers, or general-purpose processor cores.

The following examples pertain to further embodiments. Example 1 includes a device comprising: a graphics processing unit (GPU) to copy at least a first portion of a data block from a source to a buffer and to copy a second portion of the data block from the source to a destination; and a central processing unit (CPU) to copy the first portion of the data block from the buffer to one or more corresponding locations in the destination. Example 2 includes the device of Example 1, wherein the first portion of the data block comprises first data at a beginning of the data block or second data at an end of the data block. Example 3 includes the device of Example 2, wherein the first data and the second data each have a size smaller than a memory unit. Example 4 includes the device of Example 3, wherein the memory unit comprises a cache line or 64 bytes. Example 5 includes the device of Example 1, wherein the second portion of the data block comprises one or more complete memory units. Example 6 includes the device of Example 5, wherein each of the one or more complete memory units comprises 64 bytes. Example 7 includes the device of Example 1, wherein the source comprises video memory coupled to the GPU. Example 8 includes the device of Example 1, wherein the destination comprises system memory coupled to the CPU. Example 9 includes the device of Example 1, wherein the GPU comprises one or more graphics processing cores. Example 10 includes the device of Example 1, wherein the CPU comprises one or more processor cores. Example 11 includes the device of Example 1, wherein one or more of the GPU, the CPU, or memory are on a single integrated circuit die.
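The partitioning implied by Examples 2 through 6 can be sketched in Python as follows. The sketch assumes that alignment is judged against 64-byte boundaries of a single byte offset shared by source and destination, and that a block containing no complete 64-byte unit is treated entirely as the first (unaligned) portion; neither detail is stated in the examples, and `split_block` is a hypothetical helper name.

```python
CACHE_LINE = 64  # memory unit size named in Examples 4 and 6


def split_block(offset: int, length: int, unit: int = CACHE_LINE):
    """Return (head, middle, tail) byte counts for a block at the given offset.

    head   - unaligned bytes before the first unit boundary
    middle - bytes covered by complete, aligned units
    tail   - unaligned bytes after the last unit boundary
    """
    if offset < 0 or length < 0 or unit <= 0:
        raise ValueError("offset and length must be non-negative, unit positive")
    end = offset + length
    first_boundary = -(-offset // unit) * unit  # round offset up to a boundary
    last_boundary = (end // unit) * unit        # round end down to a boundary
    if first_boundary >= last_boundary:
        # No complete unit fits: assumption, treat the whole block as unaligned.
        return length, 0, 0
    return (
        first_boundary - offset,
        last_boundary - first_boundary,
        end - last_boundary,
    )


print(split_block(10, 200))
```

For a 200-byte block starting at offset 10, the head runs up to offset 64, the middle covers the two complete units from 64 to 192, and the tail covers the remaining bytes up to 210.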

Example 12 includes a system comprising: a processor coupled to system memory, the system memory to store a data block copied from video memory, the processor comprising: a graphics processing unit (GPU) to copy: a first portion of the data block from the video memory to a first buffer; a second portion of the data block from the video memory to the system memory; and a third portion of the data block from the video memory to a second buffer; and a central processing unit (CPU) to copy: the first portion of the data block from the first buffer to a corresponding location in the system memory; and the second portion of the data block from the second buffer to a corresponding location in the system memory. Example 13 includes the system of Example 12, wherein the data block consists of the first portion, the second portion, and the third portion, contiguously and in that order. Example 14 includes the system of Example 12, wherein the first portion and the third portion of the data block each comprise fewer bytes than a memory unit. Example 15 includes the system of Example 14, wherein the memory unit comprises a cache line or 64 bytes. Example 16 includes the system of Example 12, wherein the second portion of the data block comprises one or more complete memory units. Example 17 includes the system of Example 16, wherein each of the one or more complete memory units comprises 64 bytes. Example 18 includes the system of Example 12, wherein the first buffer and the second buffer are combined into a single buffer. Example 19 includes the system of Example 12, wherein one or more of the GPU having one or more graphics processing cores, the CPU having one or more processor cores, at least a portion of the system memory, or at least a portion of the video memory are on a single integrated circuit die.
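The two copy paths of the system can be simulated with plain byte arrays, as in the Python sketch below. It is only a model of the data movement: `two_path_copy` is a hypothetical name, the "GPU" and "CPU" steps are ordinary slice copies, alignment is assumed to be judged against 64-byte boundaries of a shared offset, and, following the wording of the system claim, it is the third portion that the CPU copies out of the second buffer.

```python
def two_path_copy(src: bytes, dst: bytearray, offset: int, length: int, unit: int = 64):
    """Copy src[offset:offset+length] into dst at the same offset via two paths.

    Returns the (head, middle, tail) sizes actually used.
    """
    end = offset + length
    if offset < 0 or length < 0 or end > len(src) or end > len(dst):
        raise ValueError("block does not fit inside source and destination")

    # Boundaries of the aligned middle, clamped so that short blocks degrade
    # to "everything goes through the first buffer".
    first = min(-(-offset // unit) * unit, end)
    last = max((end // unit) * unit, first)

    # GPU path (simulated): unaligned head and tail go to staging buffers,
    # the aligned middle goes straight to the destination.
    first_buffer = bytes(src[offset:first])
    dst[first:last] = src[first:last]
    second_buffer = bytes(src[last:end])

    # CPU path (simulated): move the staged unaligned pieces into place.
    dst[offset:first] = first_buffer
    dst[last:end] = second_buffer

    return len(first_buffer), last - first, len(second_buffer)


video_memory = bytes(range(256))
system_memory = bytearray(256)
print(two_path_copy(video_memory, system_memory, 10, 200))
```

Because the staged pieces are each shorter than one memory unit, the CPU's share of the copy stays small regardless of the block size, which is the point of splitting the work this way.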

Example 20 includes a computer-readable medium comprising one or more instructions that, when executed on a processor, configure the processor to perform one or more operations to: cause a graphics processing unit (GPU) to copy at least a first portion of a data block from a source to a buffer and to copy a second portion of the data block from the source to a destination; and cause a central processing unit (CPU) to copy the first portion of the data block from the first buffer to one or more corresponding locations in the destination. Example 21 includes the computer-readable medium of Example 20, wherein the first portion of the data block comprises first data at a beginning of the data block or second data at an end of the data block. Example 22 includes the computer-readable medium of Example 21, wherein the first data and the second data each have a size smaller than a memory unit. Example 23 includes the computer-readable medium of Example 22, wherein the memory unit comprises a cache line or 64 bytes. Example 24 includes the computer-readable medium of Example 20, wherein the second portion of the data block comprises one or more complete memory units. Example 25 includes the computer-readable medium of Example 20, wherein each of the one or more complete memory units comprises 64 bytes.

Example 26 includes a method comprising: causing a graphics processing unit (GPU) to copy at least a first portion of a data block from a source to a buffer and to copy a second portion of the data block from the source to a destination; and causing a central processing unit (CPU) to copy the first portion of the data block from the first buffer to one or more corresponding locations in the destination. Example 27 includes the method of Example 26, wherein the first portion of the data block comprises first data at a beginning of the data block or second data at an end of the data block, or wherein the second portion of the data block comprises one or more complete memory units, or wherein each of the one or more complete memory units comprises 64 bytes. Example 28 includes the method of Example 27, wherein the first data and the second data each have a size smaller than a memory unit. Example 29 includes the method of Example 28, wherein the memory unit comprises a cache line or 64 bytes.

Example 30 includes an apparatus comprising means to perform a method as set forth in any preceding example. Example 31 includes machine-readable storage including machine-readable instructions that, when executed, implement a method or realize an apparatus as set forth in any preceding example.

In various embodiments, the operations discussed herein, e.g., with reference to FIGS. 1-18, may be implemented as hardware (e.g., logic circuitry), software, firmware, or combinations thereof, which may be provided as a computer program product, e.g., including a tangible (e.g., non-transitory) machine-readable or computer-readable medium having stored thereon instructions (or software procedures) used to program a computer to perform a process discussed herein. The machine-readable medium may include a storage device such as those discussed with respect to FIGS. 1-18.

Additionally, such computer-readable media may be downloaded as a computer program product, wherein the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals provided in a carrier wave or other propagation medium via a communication link (e.g., a bus, a modem, or a network connection).

Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, and/or characteristic described in connection with the embodiment may be included in at least an implementation. The appearances of the phrase "in one embodiment" in various places in the specification may or may not all refer to the same embodiment.

Also, in the description and claims, the terms "coupled" and "connected," along with their derivatives, may be used. In some embodiments, "connected" may be used to indicate that two or more elements are in direct physical or electrical contact with each other. "Coupled" may mean that two or more elements are in direct physical or electrical contact. However, "coupled" may also mean that two or more elements are not in direct contact with each other but still cooperate or interact with each other.

Thus, although embodiments have been described in language specific to structural features and/or methodological acts, it is to be understood that the claimed subject matter may not be limited to the specific features or acts described. Rather, the specific features and acts are disclosed as sample forms of implementing the claimed subject matter.

114: memory

150: video memory

202: storage device

Claims

A device for performing GPU-CPU two-path memory copy, comprising: a graphics processing unit (GPU) to copy at least a first unaligned portion of a data block from a source to a buffer and to copy a second aligned portion of the data block from the source to a destination; and a central processing unit (CPU) to copy the first unaligned portion of the data block from the buffer to one or more corresponding locations in the destination.

The device of claim 1, wherein the first unaligned portion of the data block comprises first data at a beginning of the data block or second data at an end of the data block.

The device of claim 2, wherein the first data and the second data each have a size smaller than a memory unit.

The device of claim 3, wherein the memory unit comprises a cache line or 64 bytes.

The device of claim 1, wherein the second aligned portion of the data block comprises one or more complete memory units.

The device of claim 5, wherein each of the one or more complete memory units comprises 64 bytes.

The device of claim 1, wherein the source comprises video memory coupled to the GPU.

The device of claim 1, wherein the destination comprises system memory coupled to the CPU.

The device of claim 1, wherein the GPU comprises one or more graphics processing cores.

The device of claim 1, wherein the CPU comprises one or more processor cores.

The device of claim 1, wherein one or more of the GPU, the CPU, or memory are on a single integrated circuit die.

A system for implementing GPU-CPU dual-path memory copying, comprising: a processor coupled to a system memory, the system memory storing data blocks copied from the video memory, the processor including: a graphics processing unit (GPU) for copying: the first unaligned part of the data block from the video memory to the first buffer; the second aligned part of the data block from the video memory to the system memory And the third unaligned portion of the data block from the video memory to the second buffer; and a central processing unit (CPU) for copying: the first unaligned portion of the data block from the The first buffer to the corresponding position in the system memory; and the third unaligned portion of the data block from the second buffer to the corresponding position in the system memory.

The system of claim 12, wherein the data block consists, consecutively and in order, of the first unaligned portion, the second aligned portion, and the third unaligned portion.

The system of claim 12, wherein the first unaligned portion and the third unaligned portion of the data block each include fewer bytes than a memory unit.
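The three-way partition recited in the system claims can be illustrated with a short sketch. It is not taken from the patent: the function name `split_block`, the choice to measure alignment against the block's start address, and the default 64-byte unit are assumptions made for illustration only.

```python
def split_block(start: int, length: int, unit: int = 64) -> tuple[int, int, int]:
    """Return (head, middle, tail) byte counts for the block [start, start + length).

    Hypothetical helper: 'head' and 'tail' are the unaligned portions,
    'middle' is the aligned portion made of complete memory units.
    Alignment is measured against 'start' (an assumption, not from the claims).
    """
    if length < 0 or unit <= 0:
        raise ValueError("length must be >= 0 and unit must be > 0")
    end = start + length
    # First unit boundary at or after 'start', clamped to the block end.
    aligned_start = min(-(-start // unit) * unit, end)
    # Last unit boundary at or before 'end', never before aligned_start.
    aligned_end = max((end // unit) * unit, aligned_start)
    head = aligned_start - start          # always fewer than 'unit' bytes
    middle = aligned_end - aligned_start  # a whole number of units
    tail = end - aligned_end              # always fewer than 'unit' bytes
    return head, middle, tail
```

Under these assumptions, a 200-byte block starting at offset 10 splits into a 54-byte head, a 128-byte middle (two complete 64-byte units), and an 18-byte tail. A block that contains no complete unit yields a zero-length middle.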

A computer-readable medium comprising one or more instructions that, when executed on a processor, configure the processor to perform one or more operations to: cause a graphics processing unit (GPU) to copy at least a first unaligned portion of a data block from a source to a buffer, and to copy a second aligned portion of the data block from the source to a destination; and cause a central processing unit (CPU) to copy the first unaligned portion of the data block from the first buffer to one or more corresponding locations in the destination.

The computer-readable medium of claim 15, wherein the first unaligned portion of the data block includes first data at the beginning of the data block or second data at the end of the data block.

The computer-readable medium of claim 16, wherein the first data and the second data each have a size smaller than a memory unit.

The computer-readable medium of claim 17, wherein the memory unit includes a cache line or 64 bytes.

The computer-readable medium of claim 15, wherein the second aligned portion of the data block includes one or more complete memory units.

The computer-readable medium of claim 15, wherein each of the one or more complete memory units includes 64 bytes.

A method for implementing GPU-CPU dual-path memory copy, comprising: causing a graphics processing unit (GPU) to copy at least a first unaligned portion of a data block from a source to a buffer, and to copy a second aligned portion of the data block from the source to a destination; and causing a central processing unit (CPU) to copy the first unaligned portion of the data block from the first buffer to one or more corresponding locations in the destination.

The method of claim 21, wherein the first unaligned portion of the data block includes first data at the beginning of the data block or second data at the end of the data block, or wherein the second aligned portion of the data block includes one or more complete memory units, or wherein each of the one or more complete memory units includes 64 bytes.

The method of claim 22, wherein the first data and the second data each have a size smaller than a memory unit.

The method of claim 23, wherein the memory unit includes a cache line or 64 bytes.
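The data flow of the claimed method can be simulated on the host as a rough sketch. Nothing here touches a real GPU: the "GPU path" and "CPU path" are ordinary Python statements, `two_path_copy` is a hypothetical name, and the use of the same offsets in source and destination, a 64-byte unit, and byte buffers for the staging areas are all assumptions made for illustration.

```python
def two_path_copy(src: bytes, dst: bytearray, start: int, length: int,
                  unit: int = 64) -> tuple[int, int, int]:
    """Simulate the two-path copy of src[start:start+length] into dst.

    Hypothetical sketch: returns the (head, middle, tail) byte counts.
    """
    end = start + length
    if start < 0 or length < 0 or end > len(src) or end > len(dst):
        raise ValueError("block does not fit in source or destination")
    # Unit boundaries that delimit the aligned middle portion.
    a0 = min(-(-start // unit) * unit, end)
    a1 = max((end // unit) * unit, a0)

    # "GPU path" (simulated): unaligned ends go to staging buffers,
    # the aligned middle goes straight to the destination.
    head_buf = bytes(src[start:a0])
    tail_buf = bytes(src[a1:end])
    dst[a0:a1] = src[a0:a1]

    # "CPU path" (simulated): move the staged ends to their final locations.
    dst[start:a0] = head_buf
    dst[a1:end] = tail_buf
    return len(head_buf), a1 - a0, len(tail_buf)


if __name__ == "__main__":
    source = bytes(range(256))
    target = bytearray(256)
    print(two_path_copy(source, target, 10, 200))
    print(target[10:210] == source[10:210])
```

The sketch only shows which bytes travel by which path. In the claimed arrangement the direct path handles complete memory units and the staged path handles the partial units at either end of the block.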