TW201729111A - GPU-CPU two-path memory copy - Google Patents

Info

Publication number
TW201729111A
Authority
TW
Taiwan
Prior art keywords
memory
data
data block
processor
graphics
Prior art date
Application number
TW105126630A
Other languages
Chinese (zh)
Other versions
TWI715613B (en)
Inventor
孔令一
沈磊
李源源
楊宇艇
奎元 路
Original Assignee
Intel Corporation (英特爾股份有限公司)
Priority date
Filing date
Publication date
Application filed by Intel Corporation (英特爾股份有限公司)
Publication of TW201729111A
Application granted
Publication of TWI715613B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/16 Handling requests for interconnection or transfer for access to memory bus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/60 Memory management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Controls And Circuits For Display Device (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Methods and apparatus relating to GPU-CPU (Graphics Processing Unit-Central Processing Unit) two-path memory copy are described. In an embodiment, a Graphics Processing Unit (GPU) copies at least a first portion of a data block from a source to a buffer. The GPU also copies a second portion of the data block from the source to a destination. A Central Processing Unit (CPU) copies the first portion of the data block from the buffer to one or more corresponding locations in the destination. Other embodiments are also disclosed and claimed.

Description

GPU-CPU dual-path memory copy

The present disclosure generally relates to the field of electronics. More particularly, embodiments relate to GPU-CPU (Graphics Processing Unit-Central Processing Unit) dual-path memory copy.

Some processors include both a CPU and a GPU. When image and/or video data is processed, the data often has to be copied back and forth between host memory and video memory. These copy operations are very expensive in terms of compute cycles and memory bandwidth usage.

100‧‧‧computing system
102, 102-1 to 102-N, 702, 800, 1630‧‧‧processor
104, 112‧‧‧interconnection
106, 106-1 to 106-M, 707, 802A-802N‧‧‧processor core
108, 704‧‧‧cache
110‧‧‧router
112‧‧‧bus
114, 660‧‧‧memory
116, 116-1, 1451‧‧‧L1 cache
140‧‧‧graphics logic
150‧‧‧video memory
202‧‧‧storage device
400‧‧‧method
402, 404, 406, 408, 410‧‧‧operation
602‧‧‧system on chip (SoC)
620‧‧‧central processing unit (CPU) core
630‧‧‧graphics processor unit (GPU) core
640‧‧‧input/output (I/O) interface
642, 814, 1865‧‧‧memory controller
670‧‧‧I/O device

700‧‧‧processing system
706‧‧‧register file
708, 712, 808, 900, 1100, 1400, 1632, 1810‧‧‧graphics processor
709‧‧‧instruction set
710‧‧‧processor bus
716‧‧‧memory controller hub
720‧‧‧memory device
721‧‧‧instructions
722‧‧‧data
724‧‧‧data storage device
726‧‧‧wireless transceiver
728‧‧‧firmware interface
730‧‧‧input/output (I/O) controller hub
734‧‧‧network controller
740‧‧‧I/O controller
742‧‧‧Universal Serial Bus (USB) controller
744‧‧‧keyboard and mouse
746‧‧‧audio controller
804A-804N‧‧‧cache unit
806‧‧‧shared cache unit
810‧‧‧system agent core
811, 902, 1443‧‧‧display controller
812‧‧‧ring interconnect unit
813‧‧‧I/O link
816‧‧‧bus controller unit
818‧‧‧embedded memory module
904‧‧‧block image transfer (BLIT) engine
906‧‧‧video codec engine
910, 1010‧‧‧graphics processing engine
912, 1012, 1522‧‧‧3D pipeline
914‧‧‧memory interface
915‧‧‧3D/media subsystem
916, 1016, 1430, 1524‧‧‧media pipeline
920, 1845‧‧‧display device

1003, 1103, 1403‧‧‧command streamer
1014‧‧‧execution unit array
1030‧‧‧sampling engine
1032‧‧‧de-noise/de-interlace module
1034‧‧‧motion estimation module
1036‧‧‧image scaling and filtering module
1044, 1214, 1456‧‧‧data port
1102, 1402‧‧‧ring interconnect
1104‧‧‧pipeline front end
1130‧‧‧video quality engine (VQE)
1133‧‧‧multi-format encode/decode (MFX) engine
1134, 1434‧‧‧video front end
1136‧‧‧geometry pipeline
1137, 1437‧‧‧media engine
1150A-1150N, 1160A-1160N‧‧‧sub-core
1152A-1152N‧‧‧first set of execution units
1154A-1154N‧‧‧media/texture sampler
1162A-1162N‧‧‧second set of execution units
1164A-1164N, 1210‧‧‧sampler
1170A-1170N‧‧‧shared resources
1180A-1180N‧‧‧graphics core
1200, 1450‧‧‧thread execution logic
1202‧‧‧pixel shader
1204, 1431‧‧‧thread dispatcher
1206‧‧‧instruction cache
1208A-1208N, 1452A, 1452B‧‧‧execution unit
1212‧‧‧data cache

1300‧‧‧graphics processor instruction format
1310‧‧‧128-bit format
1312‧‧‧instruction opcode
1313‧‧‧index field
1314‧‧‧instruction control field
1316‧‧‧size field
1318‧‧‧destination
1320‧‧‧source operand SRC0
1322‧‧‧source operand SRC1
1324‧‧‧source operand SRC2
1326‧‧‧access/address mode information
1330‧‧‧64-bit compact instruction format
1340‧‧‧opcode decode
1342‧‧‧move and logic opcode group
1344‧‧‧flow control instruction group
1346‧‧‧miscellaneous instruction group
1348‧‧‧parallel math instruction group
1350‧‧‧vector math group
1405‧‧‧vertex fetcher
1407‧‧‧vertex shader
1411‧‧‧programmable hull shader
1413‧‧‧tessellator
1417‧‧‧programmable domain shader
1419‧‧‧geometry shader
1420‧‧‧graphics pipeline
1423‧‧‧stream-out unit
1429‧‧‧clipper
1440‧‧‧display engine
1441‧‧‧2D engine
1454‧‧‧texture and media sampler
1458‧‧‧texture/sampler cache
1470‧‧‧render output pipeline
1473‧‧‧rasterizer and depth test component
1475‧‧‧shared L3 cache
1477‧‧‧pixel operations component
1478‧‧‧render cache
1479‧‧‧depth cache

1500‧‧‧graphics processor command format
1502‧‧‧target client
1504‧‧‧command operation code (opcode)
1505‧‧‧sub-opcode
1506‧‧‧data field
1508‧‧‧command size
1510‧‧‧graphics processor command sequence
1512‧‧‧pipeline flush command
1513‧‧‧pipeline select command
1514‧‧‧pipeline control command
1516‧‧‧return buffer state
1520‧‧‧pipeline determination
1530‧‧‧3D pipeline state
1532‧‧‧3D primitive
1534‧‧‧execute
1540‧‧‧media pipeline state
1542‧‧‧media object command
1544‧‧‧execute command
1600‧‧‧data processing system
1610‧‧‧3D graphics application
1612‧‧‧shader instructions
1614‧‧‧executable instructions
1616‧‧‧graphics objects
1620‧‧‧operating system
1622‧‧‧graphics application programming interface (API)
1624‧‧‧front-end shader compiler
1626‧‧‧user mode graphics driver
1627‧‧‧back-end shader compiler
1628‧‧‧operating system kernel mode functions
1629‧‧‧kernel mode graphics driver
1634‧‧‧general-purpose processor core
1650‧‧‧system memory
1700‧‧‧IP core development system
1710‧‧‧software simulation
1715‧‧‧register transfer level (RTL) design
1720‧‧‧hardware model
1730‧‧‧design facility
1740‧‧‧non-volatile memory
1750‧‧‧wired connection
1760‧‧‧wireless connection
1765‧‧‧fabrication facility
1800‧‧‧system-on-chip integrated circuit
1805‧‧‧application processor
1815‧‧‧image processor
1820‧‧‧video processor
1825‧‧‧USB controller
1830‧‧‧UART controller
1835‧‧‧SPI/SDIO controller
1840‧‧‧I2S/I2C controller
1850‧‧‧High-Definition Multimedia Interface (HDMI) controller
1855‧‧‧Mobile Industry Processor Interface (MIPI) display interface
1860‧‧‧flash memory subsystem
1870‧‧‧embedded security engine

The detailed description is provided with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.

FIGS. 1, 6-7, 16, and 18 illustrate block diagrams of embodiments of computing systems, which may be used to implement the various embodiments discussed herein.

FIG. 2 illustrates the data flows associated with a pure CPU copy versus a hybrid copy, according to an embodiment.

FIG. 3 illustrates a block diagram of an implementation of a dual-path memory copy operation, according to an embodiment.

FIG. 4 illustrates a flow diagram of a method to perform a dual-path memory copy operation, according to an embodiment.

FIG. 5 illustrates a sample graph of the throughput performance achievable with dual-path memory copy, according to an embodiment.

FIGS. 8-12 and 14 illustrate various components of processors, according to some embodiments.

FIG. 13 illustrates graphics core instruction formats, according to some embodiments.

FIGS. 15A and 15B illustrate a graphics processor command format and command sequence, respectively, according to some embodiments.

FIG. 17 illustrates a diagram of IP core development, according to an embodiment.

SUMMARY OF THE INVENTION AND EMBODIMENTS

In the following description, numerous specific details are set forth in order to provide a thorough understanding of various embodiments. However, various embodiments may be practiced without the specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to obscure the particular embodiments. Further, various aspects of embodiments may be performed using various means, such as integrated semiconductor circuits ("hardware"), computer-readable instructions organized into one or more programs ("software"), or some combination of hardware and software. For the purposes of this disclosure, reference to "logic" shall mean hardware, software, firmware, or some combination thereof.

Generally, image and/or video frame data needs to be copied from system or host memory (also referred to as CPU memory) to video memory (also referred to as GPU memory) so that the GPU can access it. Once processing is finished, the data is copied back to system/host memory for the next processing operation or, for example, for display on a screen. Copy-in (from memory associated with the CPU to memory associated with the GPU) and copy-out (from memory associated with the GPU to memory associated with the CPU) are expensive in terms of compute cycles and/or memory bandwidth usage. Moreover, the copy operation sometimes has to convert the frame layout, e.g., tiled to linear or linear to tiled. The linear format generally suits the one-dimensional, row-sequential access pattern of system memory, in which each row of elements is stored in sequentially increasing memory locations. The tiled format divides the enclosed area of an image/video frame into an array of smaller rectangular regions, which improves two-dimensional (2D) sub-region access performance in video memory. As a result, the cost of memory copy or transfer is one of the most common performance bottlenecks in image and/or video processing applications.
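The difference between the two layouts can be made concrete with a small address calculation. The following Python sketch is illustrative only: the function names, the 16x4-byte tile size, and the row-major ordering inside a tile are assumptions chosen for readability, not the Y-tiled hardware format referred to elsewhere in this description.

```python
def linear_offset(x, y, pitch):
    """Byte offset of byte (x, y) in a row-major (linear) surface."""
    return y * pitch + x


def tiled_offset(x, y, pitch, tile_w=16, tile_h=4):
    """Byte offset of byte (x, y) when the surface is stored tile by tile.

    Tiles are ordered left to right, top to bottom; bytes inside a tile are
    row-major. tile_w and tile_h are illustrative, not hardware values.
    """
    if pitch % tile_w:
        raise ValueError("pitch must be a multiple of tile_w")
    tiles_per_row = pitch // tile_w
    tile_index = (y // tile_h) * tiles_per_row + (x // tile_w)
    within_tile = (y % tile_h) * tile_w + (x % tile_w)
    return tile_index * tile_w * tile_h + within_tile


def block_offsets(offset_fn, x0, y0, w, h, pitch):
    """Sorted offsets touched by the w-by-h sub-region whose corner is (x0, y0)."""
    return sorted(offset_fn(x, y, pitch)
                  for y in range(y0, y0 + h)
                  for x in range(x0, x0 + w))
```

With a 64-byte pitch, the 16x4 sub-region at (16, 4) occupies one contiguous 64-byte range (offsets 320 to 383) in the tiled layout, whereas in the linear layout the same sub-region is spread over four separate rows between offsets 272 and 479.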

Furthermore, there are generally three possible solutions for exchanging data between GPU memory and CPU host memory in a processor in which the CPU and GPU are integrated on the same integrated circuit device: (1) zero copy is the fastest solution, but has a relatively large number of restrictions; (2) pure GPU copy is slower than zero copy, but is still the fastest copy solution and has fewer restrictions; and (3) pure CPU copy (e.g., using SSE/AVX, i.e., Streaming SIMD (Single Instruction, Multiple Data) Extensions/Advanced Vector Extensions) is the slowest solution, with no restrictions.

To this end, some embodiments provide a dual-path memory copy technique for GPU and CPU memory copy operations that leverages both devices to transfer memory data. In an embodiment, the majority of the data is copied by the GPU and the remainder is copied by the CPU. This hybrid approach removes the barriers of pure GPU copy with minimal performance reduction. Moreover, the disclosed hybrid approach is faster than pure CPU copy operations (e.g., using SSE/AVX).

Further, some embodiments may be applied in computing systems that include one or more processors (e.g., with one or more processor cores), such as those discussed with reference to FIGS. 1-9, including, for example, mobile computing devices such as smartphones, tablets, UMPCs (Ultra-Mobile Personal Computers), laptop computers, Ultrabook™ computing devices, smart watches, smart glasses, etc. More particularly, FIG. 1 illustrates a block diagram of a computing system 100, according to an embodiment. The system 100 may include one or more processors 102-1 through 102-N (collectively referred to herein as "processors 102"). The processors 102 may include general-purpose CPUs and/or GPUs in various embodiments. The processors 102 may communicate via an interconnection or bus 104. Each processor may include various components, which for clarity are discussed only with reference to processor 102-1. Accordingly, each of the remaining processors 102-2 through 102-N may include the same or similar components discussed with reference to processor 102-1.

In an embodiment, the processor 102-1 may include one or more processor cores 106-1 through 106-M (collectively referred to herein as "cores 106"), a cache 108, and/or a router 110. The processor cores 106 may be implemented on a single integrated circuit (IC) chip. Moreover, the chip may include one or more shared and/or private caches (such as cache 108), buses or interconnections (such as a bus or interconnection 112), graphics and/or memory controllers (such as those discussed with reference to FIGS. 6-9), or other components.

In one embodiment, the router 110 may be used for communication between various components of the processor 102-1 and/or system 100. Moreover, the processor 102-1 may include more than one router 110. Furthermore, multiple routers 110 may be in communication to enable data routing between various components inside or outside of the processor 102-1.

The cache 108 may store data (e.g., including instructions) that is utilized by one or more components of the processor 102-1, such as the cores 106. For example, the cache 108 may locally cache data stored in a memory 114 for faster access by the components of the processor 102 (e.g., faster access by cores 106). As shown in FIG. 1, the memory 114 may communicate with the processors 102 via the interconnection 104. In an embodiment, the cache 108 (which may be shared) may be a mid-level cache (MLC), a last-level cache (LLC), etc. Also, each of the cores 106 may include a level 1 (L1) cache (116-1) (collectively referred to herein as "L1 cache 116") or other levels of cache, such as a level 2 (L2) cache. Moreover, various components of the processor 102-1 may communicate with the cache 108 directly, through a bus (e.g., the bus 112), and/or through a memory controller or hub.

As shown in FIG. 1, the processor 102 may further include graphics logic 140 (e.g., which may include one or more Graphics Processing Unit (GPU) cores, such as those discussed with reference to FIGS. 6-9) to perform various graphics and/or general-purpose computation related operations, such as those discussed herein. Logic 140 may have access to one or more storage devices discussed herein (such as video (or image, graphics, etc.) memory 150, cache 108, L1 cache 116, memory 114, registers, or another memory in system 100) to store information relating to operations of the logic 140, such as information communicated with various components of system 100, as discussed herein. Also, while logic 140 and video memory 150 are shown inside the processor 102 (or coupled to interconnection 104), they may be located elsewhere in the system 100 in various embodiments. For example, logic 140 may replace one of the cores 106, may be coupled directly to interconnection 112, etc. Also, video memory 150 may be coupled directly to interconnection 112, etc.

FIG. 2 illustrates the data flows associated with a pure CPU copy versus a hybrid copy, according to an embodiment. Referring to FIGS. 1-2, for a pure CPU copy operation (shown at the top of FIG. 2), the CPU locks the data (e.g., to avoid problems caused by multiple accesses to the same data) and then copies the locked data from the video memory 150 to the system memory 114. In an embodiment, for a hybrid copy operation (shown at the bottom of FIG. 2), the GPU 140 copies at least a portion of the data from video memory 150 to an auxiliary/secondary buffer or storage device 202. As will be further discussed with reference to FIG. 3, the CPU then copies the data from the storage device 202 to the system memory 114.

Moreover, as mentioned above, there are three possible solutions for transferring data from the GPU to the CPU. Zero copy is the fastest method but has a relatively large number of restrictions, such as: (a) the memory needs to be allocated in system memory space and mapped into the video memory address space; (b) the system memory buffer needs to have a linear layout (which is favorable for CPU access but unfavorable for GPU access; for faster GPU access, video memory generally uses the Y-tiled 2D surface format); and (c) to be mapped into the GPU address space, the system memory needs to be page aligned (e.g., to 4 KB), which can be a hard restriction in many existing software products. Apart from zero copy, pure GPU copy is the fastest copy solution, but it still has some restrictions (such as requiring the data to be aligned in a certain way), which may keep many developers from adopting it widely. The CPU copy operation has no restrictions, but it is the slowest copy solution, e.g., because of the partial-line interleaving in the Y-tiled surface format. Because four cache line reads from video memory are needed to build one fully converted 64-byte cache line in system memory, the memory read bandwidth efficiency is only 25%.
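The restrictions above can be summarized as a simple decision rule, and the 25% figure follows from a one-line calculation. The Python sketch below is a hedged illustration: `choose_copy_path` and `read_efficiency` are hypothetical helpers, the `mapped_to_gpu` flag stands in for restrictions (a) and (b), and the 16-byte requirement for a pure GPU copy is taken from the alignment discussion elsewhere in this description.

```python
PAGE_SIZE = 4096      # page alignment the text gives as an example for zero copy
GPU_ALIGNMENT = 16    # destination alignment the text gives for a pure GPU copy


def choose_copy_path(dst_address, width_bytes, mapped_to_gpu=False):
    """Pick a copy path from the restrictions listed above (hypothetical helper)."""
    if mapped_to_gpu and dst_address % PAGE_SIZE == 0:
        return "zero-copy"
    if dst_address % GPU_ALIGNMENT == 0 and width_bytes % GPU_ALIGNMENT == 0:
        return "gpu-copy"
    # A pure CPU copy would also work here; the disclosure proposes dual-path.
    return "dual-path"


def read_efficiency(lines_read, line_bytes=64, useful_bytes=64):
    """Fraction of the bytes fetched that end up in the converted cache line."""
    return useful_bytes / (lines_read * line_bytes)
```

Here `read_efficiency(4)` evaluates to 0.25: four 64-byte reads (256 bytes fetched) yield one 64-byte converted line, which is the 25% read bandwidth efficiency stated for the CPU copy.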

FIG. 3 illustrates a block diagram of an MDF implementation of a dual-path memory copy operation, according to an embodiment. As discussed herein, MDF (Media Development Framework) generally refers to a high-level programming framework that exposes the general-purpose compute/processing capabilities of the GPU and can significantly improve the performance of highly parallel or compute-intensive workloads. Generally, an MDF application has two components: a kernel and a host program. The host program creates and launches the kernel. The kernel then executes on the GPU hardware.

Consider the case in which the dual-path copy transfers data from video memory 150 to system memory 114 and the destination system memory is not 16-byte aligned. Because the destination memory address is unaligned, the GPU 140 cannot directly copy the first and last several bytes to the destination in system memory 114. The reason is that the Oword (octword) block write command is used to write the data to the destination, and it requires the system memory and the surface width to be at least 16-byte aligned. To address this restriction, an embodiment introduces (e.g., pre-allocates) an auxiliary buffer 202 as a temporary buffer, which may be a user-provided memory buffer in page-aligned system memory. As discussed herein, the auxiliary buffer 202 may be a zero-copy buffer, also referred to as a buffer UP.

Referring to FIGS. 1-3, an embodiment uses a solution with the following two steps. First (labeled with a circled 1 in FIG. 3), the GPU part, which is an MDF copy kernel: (a) within the image, the GPU 140 copies the full cache lines (in one embodiment, a cache line is 64 bytes wide) from video memory 150 to the (e.g., 16-byte) aligned system memory at the destination (system memory 114). These 64-byte-wide full cache lines represent pixels in the image. For this part of the GPU copy, an embodiment uses media block read and Oword block write commands, with a transpose to map the tiled layout in video memory to the linear layout in system memory; and (b) the GPU 140 copies the partial cache lines at the start and end of each row into the auxiliary buffer 202 with media block read and Oword block write commands. This part of the GPU copy performs the same function as described in (a) above, since the same read/write commands are issued to transfer the data. Second (labeled with a circled 2 in FIG. 3), the CPU part: (a) after the GPU finishes the above execution, the CPU can access the data in the auxiliary buffer 202 because it is a zero-copy buffer; and (b) as shown in FIG. 3, the CPU copies this data to the corresponding locations (i.e., the start and end) in the destination system memory 114.
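The two steps can be modeled in software to show which bytes travel along which path. The following Python sketch is a simulation under stated assumptions, not the MDF kernel: byte arrays stand in for video memory, system memory, and the auxiliary buffer 202, slice assignment stands in for the media block read and Oword block write commands, the tiled-to-linear transpose is omitted, and the name `dual_path_copy_row` is hypothetical.

```python
CACHE_LINE = 64  # full cache line width in bytes, as in the described embodiment


def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return (value + multiple - 1) // multiple * multiple


def dual_path_copy_row(src_row, dst, dst_start):
    """Copy src_row into dst[dst_start:] using the two-step scheme.

    Returns (bytes written directly, bytes routed through the auxiliary buffer).
    """
    n = len(src_row)
    dst_end = dst_start + n
    # Cache-line-aligned window inside the destination range.
    aligned_start = min(round_up(dst_start, CACHE_LINE), dst_end)
    aligned_end = max(dst_end // CACHE_LINE * CACHE_LINE, aligned_start)
    head = aligned_start - dst_start   # partial line at the start of the row
    tail = dst_end - aligned_end       # partial line at the end of the row

    # Step 1(a): the "GPU" writes the full cache lines straight to the destination.
    dst[aligned_start:aligned_end] = src_row[head:n - tail]

    # Step 1(b): the "GPU" writes the partial head and tail to the auxiliary buffer.
    aux = bytearray(src_row[:head]) + bytearray(src_row[n - tail:])

    # Step 2: the "CPU" copies the auxiliary data to the start and end of the row.
    dst[dst_start:aligned_start] = aux[:head]
    dst[aligned_end:dst_end] = aux[head:]
    return aligned_end - aligned_start, head + tail
```

For a 128-byte row written at destination offset 40, this model sends 64 bytes along the direct path and 64 bytes (24 at the start and 40 at the end) through the auxiliary buffer. In the described embodiment the auxiliary buffer already resides in system memory, so the second step is a short CPU copy within system memory.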

Moreover, the dual-path solution shown in FIG. 3 uses the auxiliary buffer 202 (e.g., a user-provided memory buffer) to copy the unaligned data, and it provides higher performance than the CPU copy directly from video memory to system memory shown at the top of FIG. 2. The reason is that the dual-path copy operation uses the same fast GPU copy mechanism as the pure GPU copy. Because the buffer 202 leverages zero copy, the data is already in system memory after the GPU copy. The cost of copying these few data items (e.g., at the start and end of the data block) to system memory is relatively low. By contrast, as discussed with reference to FIG. 2, a pure CPU copy requires a lock operation on the surface to avoid resource conflicts, which is more expensive. See the data transfer pipeline in FIG. 1.

Referring to FIG. 3, assume that the start address of the destination is at byte offset 40 (i.e., it is not 16-byte aligned); the dual-path copy solution is used to handle this misalignment. In an embodiment, the kernel processes 8*128 bytes in each thread, as will be further discussed with reference to FIG. 4.

More particularly, FIG. 4 illustrates a flow diagram of a method 400 to perform a dual-path memory copy operation, according to an embodiment. In one embodiment, method 400 shows the operations performed by a software thread that uses both the CPU and the GPU to copy a block of data from a source (e.g., video memory 150) to a destination (e.g., system memory 114). In an embodiment, various components discussed with reference to the other figures may be used to perform one or more of the operations of method 400.

Referring to FIGS. 1-4, at operation 402, the starting portion of the data block (e.g., provided in a row of memory or a cache line, such as bytes 0 to 24 of the video memory 150 shown in FIG. 3) is copied (e.g., by graphics logic/GPU 140) from the source (e.g., video memory 150) to the auxiliary buffer (e.g., buffer 202). The starting portion of the data block for operation 402 may be computed from the start address in system memory (e.g., based on the offset from 64-byte alignment: 64-40, or 24, in the example of FIG. 3).

At operation 404, the remaining portion of the data block (i.e., the data after the starting portion of operation 402, which may be the remaining full cache lines of 64 bytes each) is copied (e.g., by graphics logic/GPU 140) from the source to the (e.g., aligned system memory) destination (e.g., system memory 114). The aligned system memory start address for operation 404 may be computed by rounding the start address up (e.g., to the next 64-byte boundary: offset 40 rounds up to 64 in the example of FIG. 3). Operation 404 repeats until the end of the data block is reached, as determined at operation 406 (e.g., on encountering the last 40 bytes at the end of the row in the example of FIG. 3).

At operation 408, the tail portion of the data block (e.g., the last 40 bytes of the row, as shown in the example of FIG. 3) is copied (e.g., by graphics logic/GPU 140) from the source to the auxiliary buffer. At operation 410, the data stored in the auxiliary buffer is copied (e.g., by the CPU or one of the cores 106 of FIG. 1) to the correct locations in the destination (e.g., the start and the end, as shown in FIG. 3). Method 400 may then repeat for another copy thread/operation.
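The head/body/tail split of operations 402 through 410 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `split_block` and `two_path_copy` are hypothetical, plain byte slicing stands in for both the GPU copy kernel and the CPU copy, and the 64-byte line size is taken from the example of FIG. 3.

```python
ALIGN = 64  # line size in bytes, following the 64-byte alignment in the text


def split_block(dst_start, size, align=ALIGN):
    """Return (head, body, tail) byte counts for a copy landing at dst_start."""
    head = min((-dst_start) % align, size)   # bytes up to the next aligned address
    body = (size - head) // align * align    # whole aligned lines
    tail = size - head - body                # leftover bytes after the last full line
    return head, body, tail


def two_path_copy(src, dst, dst_start):
    """Copy src into dst at dst_start; slicing stands in for the GPU and CPU paths."""
    head, body, tail = split_block(dst_start, len(src))
    aux = bytearray()                        # auxiliary (staging) buffer

    aux += src[:head]                        # operation 402: head -> auxiliary buffer
    aligned = dst_start + head               # rounded-up destination address
    for off in range(0, body, ALIGN):        # operations 404/406: one full line per pass
        dst[aligned + off:aligned + off + ALIGN] = src[head + off:head + off + ALIGN]
    aux += src[head + body:]                 # operation 408: tail -> auxiliary buffer

    # operation 410: the "CPU" places the staged head and tail at their final locations
    dst[dst_start:dst_start + head] = aux[:head]
    end = aligned + body
    dst[end:end + tail] = aux[head:]
```

With a destination start address of 40 and a 192-byte block, `split_block(40, 192)` yields a 24-byte head, a 128-byte aligned body, and a 40-byte tail, matching the 64-40 = 24 and last-40-bytes figures quoted for FIG. 3.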

FIG. 5 illustrates a sample graph of the throughput performance achievable with two-path memory copy, according to an embodiment. As shown, two-path copy provides better performance than pure CPU copy (e.g., using SSE4) and trails pure GPU copy by less than 1 bit. Hence, some embodiments that use two-path memory copy achieve the expected functionality and good performance for both single-planar and multi-planar surfaces. To handle planar surfaces, the above scheme can be extended to cover the UV plane and the Y plane. Moreover, as shown in FIG. 5, the two-path solution is about 1.68 times faster than pure CPU copy (using SSE4), according to the illustrated sample test results, and has no alignment restriction. Furthermore, in the above test, the destination system memory address for two-path copy is 1-byte aligned, whereas pure GPU copy requires the destination system memory to be 16-byte aligned.

As discussed herein, YUV (also referred to as YCbCr) is one of the two primary color spaces used to represent digital component video (the other being RGB). The difference between YCbCr and RGB is that YCbCr represents color as luma and two color-difference signals, whereas RGB represents color as red, green, and blue. In YCbCr, Y is luma, Cb is blue minus luma (B-Y), and Cr is red minus luma (R-Y). One YUV format places Y in one plane and UV in another plane.
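The luma and color-difference relationship, and the two-plane layout, can be illustrated with a short Python sketch. Several things here are assumptions rather than part of the description: the BT.601 luma weights, the unscaled differences (real YCbCr encodings also scale and offset them), the absence of chroma subsampling, and the hypothetical function names.

```python
def luma_and_differences(r, g, b):
    """Return (Y, B-Y, R-Y) for one RGB pixel, using BT.601 luma weights (an assumption)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return y, b - y, r - y


def to_two_planes(pixels):
    """Split (r, g, b) pixels into a Y plane and an interleaved UV plane.

    No chroma subsampling is applied, so the UV plane holds one (B-Y, R-Y)
    pair per pixel; this keeps the sketch minimal.
    """
    y_plane, uv_plane = [], []
    for r, g, b in pixels:
        y, cb, cr = luma_and_differences(r, g, b)
        y_plane.append(y)
        uv_plane.extend((cb, cr))
    return y_plane, uv_plane
```

For a neutral gray pixel both differences are zero (up to floating-point rounding), which is why a gray image carries all of its information in the Y plane.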

As discussed above, some embodiments use both the CPU and the GPU to perform the memory copy operation. Most of the data (e.g., the aligned portion) is copied using a GPU copy kernel, and the remaining (e.g., unaligned) data is handed to a CPU copy operation. More specifically, one embodiment is implemented on the MDF GPU programming framework. However, embodiments are not limited to the MDF framework; any runtime SDK (software development kit) that exposes an API (application programming interface) using both the CPU and the GPU to perform memory data transfers may be used to implement various embodiments.

Accordingly, the GPU/CPU two-path copy embodiments discussed herein leverage both the CPU (for its lack of restrictions) and the GPU (for its high performance). The solution retains the high performance of pure GPU copy (with very little performance loss) and removes the barriers associated with pure GPU copy operations. As a result, it also removes the barriers to users adopting GPU copy operations to accelerate memory transfers in their software products.

In some embodiments, one or more of the components discussed herein can be embodied as a system-on-chip (SOC) device. FIG. 6 illustrates a block diagram of an SOC package, according to an embodiment. As illustrated in FIG. 6, SOC package 602 includes one or more central processing unit (CPU) cores 620 (which may be the same as or similar to cores 106 of FIG. 1), one or more graphics processor unit (GPU) cores 630 (which may be the same as or similar to graphics logic 140 of FIG. 1), an input/output (I/O) interface 640, and a memory controller 642. Various components of SOC package 602 may be coupled to an interconnect or bus such as those discussed herein with reference to the other figures. Also, SOC package 602 may include more or fewer components, such as those discussed herein with reference to the other figures. Further, each component of SOC package 602 may include one or more other components, e.g., as discussed herein with reference to the other figures. In one embodiment, SOC package 602 (and its components) is provided on one or more integrated circuit (IC) dies, e.g., packaged in a single semiconductor device.

As illustrated in FIG. 6, SOC package 602 is coupled via memory controller 642 to a memory 660 (which may be similar to or the same as memory discussed herein with reference to the other figures, such as system memory 114 of FIG. 1). In an embodiment, memory 660 (or a portion of it) can be integrated on SOC package 602.

I/O interface 640 may be coupled to one or more I/O devices 670, e.g., via an interconnect and/or bus such as those discussed herein with reference to the other figures. I/O device(s) 670 may include one or more of a keyboard, a mouse, a touchpad, a display, an image/video capture device (such as a camera or camcorder/video recorder), a touch screen, a speaker, or the like. Furthermore, in an embodiment, SOC package 602 may include/integrate logic 140 and/or video memory 150 (or a portion of video memory 150). Alternatively, logic 140 and/or video memory 150 (or a portion of video memory 150) may be provided outside of SOC package 602 (i.e., as discrete logic).

FIG. 7 is a block diagram of a processing system 700, according to an embodiment. In various embodiments, system 700 includes one or more processors 702 and one or more graphics processors 708 (such as graphics logic 140 of FIG. 1), and may be a single-processor desktop system, a multiprocessor workstation system, or a server system having a large number of processors 702 (such as processor 102 of FIG. 1) or processor cores 707 (such as cores 106 of FIG. 1). In one embodiment, system 700 is a processing platform incorporated within a system-on-chip (SoC) integrated circuit for use in mobile, handheld, or embedded devices.

An embodiment of system 700 can include, or be incorporated within, a server-based gaming platform or a game console, including a game and media console, a mobile gaming console, a handheld game console, or an online game console. In some embodiments, system 700 is a mobile phone, smartphone, tablet computing device, or mobile Internet device. Data processing system 700 can also include, couple with, or be integrated within a wearable device, such as a smart watch wearable device, smart eyewear device, augmented reality device, or virtual reality device. In some embodiments, data processing system 700 is a television or set-top box device having one or more processors 702 and a graphical interface generated by one or more graphics processors 708.

In some embodiments, the one or more processors 702 each include one or more processor cores 707 to process instructions which, when executed, perform operations for system and user software. In some embodiments, each of the one or more processor cores 707 is configured to process a specific instruction set 709. In some embodiments, instruction set 709 may facilitate Complex Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), or computing via a Very Long Instruction Word (VLIW). Multiple processor cores 707 may each process a different instruction set 709, which may include instructions to facilitate the emulation of other instruction sets. Processor core 707 may also include other processing devices, such as a Digital Signal Processor (DSP).

In some embodiments, processor 702 includes cache memory 704. Depending on the architecture, processor 702 can have a single internal cache or multiple levels of internal cache. In some embodiments, the cache memory is shared among various components of processor 702. In some embodiments, processor 702 also uses an external cache (e.g., a Level-3 (L3) cache or Last Level Cache (LLC)) (not shown), which may be shared among processor cores 707 using known cache coherency techniques. A register file 706 is additionally included in processor 702 and may include different types of registers for storing different types of data (e.g., integer registers, floating-point registers, status registers, and an instruction pointer register). Some registers may be general-purpose registers, while other registers may be specific to the design of processor 702.

In some embodiments, processor 702 is coupled to a processor bus 710 to transmit communication signals, such as address, data, or control signals, between processor 702 and other components in system 700. In one embodiment, system 700 uses an exemplary "hub" system architecture, including a memory controller hub 716 and an input/output (I/O) controller hub 730. Memory controller hub 716 facilitates communication between a memory device and other components of system 700, while I/O controller hub (ICH) 730 provides connections to I/O devices via a local I/O bus. In one embodiment, the logic of memory controller hub 716 is integrated within the processor.

Memory device 720 can be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device, a phase-change memory device, or some other memory device having suitable performance to serve as process memory. In one embodiment, memory device 720 can operate as system memory for system 700, to store data 722 and instructions 721 for use when the one or more processors 702 execute an application or process. Memory controller hub 716 also couples with an optional external graphics processor 712, which may communicate with the one or more graphics processors 708 in processors 702 to perform graphics and media operations.

In some embodiments, ICH 730 enables peripherals to connect to memory device 720 and processor 702 via a high-speed I/O bus. The I/O peripherals include, but are not limited to, an audio controller 746, a firmware interface 728, a wireless transceiver 726 (e.g., Wi-Fi, Bluetooth), a data storage device 724 (e.g., hard disk drive, flash memory, etc.), and a legacy I/O controller 740 for coupling legacy (e.g., Personal System 2 (PS/2)) devices to the system. One or more Universal Serial Bus (USB) controllers 742 connect input devices, such as keyboard and mouse 744 combinations. A network controller 734 may also couple to ICH 730. In some embodiments, a high-performance network controller (not shown) couples to processor bus 710. It will be appreciated that the system 700 shown is exemplary and not limiting, as other types of data processing systems that are differently configured may also be used. For example, I/O controller hub 730 may be integrated within the one or more processors 702, or memory controller hub 716 and I/O controller hub 730 may be integrated into a discrete external graphics processor, such as external graphics processor 712.

FIG. 8 is a block diagram of an embodiment of a processor 800 having one or more processor cores 802A-802N, an integrated memory controller 814, and an integrated graphics processor 808. Processor 800 may be similar to or the same as processor 102 discussed with reference to FIG. 1. Those elements of FIG. 8 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such. Processor 800 can include additional cores up to and including additional core 802N, represented by the dashed-line boxes. Each of processor cores 802A-802N includes one or more internal cache units 804A-804N. In some embodiments, each processor core also has access to one or more shared cache units 806.

Internal cache units 804A-804N and shared cache units 806 represent a cache memory hierarchy within processor 800. The cache memory hierarchy may include at least one level of instruction and data cache within each processor core and one or more levels of shared mid-level cache, such as a Level 2 (L2), Level 3 (L3), Level 4 (L4), or other level of cache, where the highest level of cache before external memory is classified as the LLC. In some embodiments, cache coherency logic maintains coherency between the various cache units 806 and 804A-804N.

In some embodiments, processor 800 may also include a set of one or more bus controller units 816 and a system agent core 810. The one or more bus controller units 816 manage a set of peripheral buses, such as one or more Peripheral Component Interconnect buses (e.g., PCI, PCI Express). System agent core 810 provides management functionality for the various processor components. In some embodiments, system agent core 810 includes one or more integrated memory controllers 814 to manage access to various external memory devices (not shown).

In some embodiments, one or more of processor cores 802A-802N include support for simultaneous multi-threading. In such an embodiment, system agent core 810 includes components for coordinating and operating cores 802A-802N during multi-threaded processing. System agent core 810 may additionally include a power control unit (PCU), which includes logic and components to regulate the power state of processor cores 802A-802N and graphics processor 808.

In some embodiments, processor 800 additionally includes graphics processor 808 to execute graphics processing operations. In some embodiments, graphics processor 808 couples with the set of shared cache units 806 and with system agent core 810, including the one or more integrated memory controllers 814. In some embodiments, a display controller 811 is coupled with graphics processor 808 to drive graphics processor output to one or more coupled displays. In some embodiments, display controller 811 may be a separate module coupled with the graphics processor via at least one interconnect, or may be integrated within graphics processor 808 or system agent core 810.

In some embodiments, a ring-based interconnect unit 812 is used to couple the internal components of processor 800. However, an alternative interconnect unit may be used, such as a point-to-point interconnect, a switched interconnect, or other techniques, including techniques well known in the art. In some embodiments, graphics processor 808 couples with ring interconnect 812 via an I/O link 813.

The exemplary I/O link 813 represents at least one of multiple varieties of I/O interconnects, including an on-package I/O interconnect that facilitates communication between various processor components and a high-performance embedded memory module 818, such as an eDRAM (or embedded DRAM) module. In some embodiments, each of processor cores 802A-802N and graphics processor 808 uses embedded memory module 818 as a shared Last Level Cache.

In some embodiments, processor cores 802A-802N are homogeneous cores executing the same instruction set architecture. In another embodiment, processor cores 802A-802N are heterogeneous in terms of instruction set architecture (ISA), where one or more of processor cores 802A-802N execute a first instruction set, while at least one of the other cores executes a subset of the first instruction set or a different instruction set. In one embodiment, processor cores 802A-802N are heterogeneous in terms of microarchitecture, where one or more cores having relatively higher power consumption couple with one or more power cores having lower power consumption. Additionally, processor 800 can be implemented on one or more chips or as an SoC integrated circuit having the illustrated components, in addition to other components.

FIG. 9 is a block diagram of a graphics processor 900, which may be a discrete graphics processing unit or may be a graphics processor integrated with a plurality of processing cores. Graphics processor 900 may be similar to or the same as graphics logic 140 discussed with reference to FIG. 1. In some embodiments, the graphics processor communicates via a memory-mapped I/O interface to registers on the graphics processor and with commands placed into processor memory. In some embodiments, graphics processor 900 includes a memory interface 914 to access memory. Memory interface 914 can be an interface to local memory, one or more internal caches, one or more shared external caches, and/or system memory.

In some embodiments, graphics processor 900 also includes a display controller 902 to drive display output data to a display device 920. Display controller 902 includes hardware for one or more overlay planes for the display and for the composition of multiple layers of video or user interface elements. In some embodiments, graphics processor 900 includes a video codec engine 906 to encode, decode, or transcode media to, from, or between one or more media encoding formats, including, but not limited to, Moving Picture Experts Group (MPEG) formats such as MPEG-2; Advanced Video Coding (AVC) formats such as H.264/MPEG-4 AVC; Society of Motion Picture & Television Engineers (SMPTE) 421M/VC-1; and Joint Photographic Experts Group (JPEG) formats such as JPEG and Motion JPEG (MJPEG).

In some embodiments, graphics processor 900 includes a block image transfer (BLIT) engine 904 to perform two-dimensional (2D) rasterizer operations, including, for example, bit-boundary block transfers. However, in one embodiment, 2D graphics operations are performed using one or more components of a graphics processing engine (GPE) 910. In some embodiments, graphics processing engine 910 is a compute engine for performing graphics operations, including three-dimensional (3D) graphics operations and media operations.

In some embodiments, GPE 910 includes a 3D pipeline 912 for performing 3D operations, such as rendering three-dimensional images and scenes using processing functions that act upon 3D primitive shapes (e.g., rectangle, triangle, etc.). 3D pipeline 912 includes programmable and fixed-function elements that perform various tasks within the element and/or spawn execution threads to a 3D/media subsystem 915. While 3D pipeline 912 can be used to perform media operations, an embodiment of GPE 910 also includes a media pipeline 916 that is specifically used to perform media operations, such as video post-processing and image enhancement.

In some embodiments, media pipeline 916 includes fixed-function or programmable logic units to perform one or more specialized media operations, such as video decode acceleration, video de-interlacing, and video encode acceleration, in place of or on behalf of video codec engine 906. In some embodiments, media pipeline 916 additionally includes a thread spawning unit to spawn threads for execution on 3D/media subsystem 915. The spawned threads perform computations for the media operations on one or more graphics execution units included in 3D/media subsystem 915.

In some embodiments, 3D/media subsystem 915 includes logic for executing threads spawned by 3D pipeline 912 and media pipeline 916. In one embodiment, the pipelines send thread execution requests to 3D/media subsystem 915, which includes thread dispatch logic for arbitrating and dispatching the various requests to available thread execution resources. The execution resources include an array of graphics execution units to process the 3D and media threads. In some embodiments, 3D/media subsystem 915 includes one or more internal caches for thread instructions and data. In some embodiments, the subsystem also includes shared memory, including registers and addressable memory, to share data between threads and to store output data.

FIG. 10 is a block diagram of a graphics processing engine 1010 of a graphics processor, in accordance with some embodiments. In one embodiment, GPE 1010 is a version of GPE 910 shown in FIG. 9. Elements of FIG. 10 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such.

In some embodiments, GPE 1010 couples with a command streamer 1003, which provides a command stream to the GPE 3D and media pipelines 1012, 1016. In some embodiments, command streamer 1003 is coupled to memory, which can be system memory or one or more of internal cache memory and shared cache memory. In some embodiments, command streamer 1003 receives commands from the memory and sends the commands to 3D pipeline 1012 and/or media pipeline 1016. The commands are directives fetched from a ring buffer, which stores commands for the 3D and media pipelines 1012, 1016. In one embodiment, the ring buffer can additionally include batch command buffers storing batches of multiple commands. The 3D and media pipelines 1012, 1016 process the commands by performing operations via logic within the respective pipelines or by dispatching one or more execution threads to an execution unit array 1014. In some embodiments, execution unit array 1014 is scalable, such that the array includes a variable number of execution units based on the target power and performance level of GPE 1010.
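The ring-buffer arrangement can be pictured with a small software model. The following Python sketch is only an analogy under stated assumptions: `Command`, `RingBuffer`, and `stream_commands` are hypothetical names, the pipelines are modeled as plain lists, and nothing here reflects the actual hardware command format.

```python
from collections import namedtuple

# Hypothetical command record: which pipeline it targets and an opaque payload.
Command = namedtuple("Command", ["target", "payload"])


class RingBuffer:
    """Fixed-capacity circular queue; head is the read index, tail the write index."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0
        self.tail = 0
        self.count = 0

    def push(self, cmd):
        if self.count == len(self.slots):
            raise BufferError("ring buffer full")
        self.slots[self.tail] = cmd
        self.tail = (self.tail + 1) % len(self.slots)  # wrap around at the end
        self.count += 1

    def pop(self):
        if self.count == 0:
            return None
        cmd = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return cmd


def stream_commands(ring, pipelines):
    """Drain the ring and hand each command to the pipeline named by its target."""
    while (cmd := ring.pop()) is not None:
        pipelines[cmd.target].append(cmd.payload)
```

Because the read and write indices wrap modulo the capacity, the same storage is reused indefinitely as long as the consumer keeps up with the producer.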

In some embodiments, a sampling engine 1030 couples with memory (e.g., cache memory or system memory) and execution unit array 1014. In some embodiments, sampling engine 1030 provides a memory access mechanism for execution unit array 1014 that allows execution unit array 1014 to read graphics and media data from memory. In some embodiments, sampling engine 1030 includes logic to perform specialized image sampling operations for media.

In some embodiments, the specialized media sampling logic in sampling engine 1030 includes a de-noise/de-interlace module 1032, a motion estimation module 1034, and an image scaling and filtering module 1036. In some embodiments, de-noise/de-interlace module 1032 includes logic to perform one or more de-noise or de-interlace algorithms on decoded video data. The de-interlace logic combines alternating fields of interlaced video content into a single frame of video. The de-noise logic reduces or removes data noise from video and image data. In some embodiments, the de-noise logic and de-interlace logic are motion adaptive and use spatial or temporal filtering based on the amount of motion detected in the video data. In some embodiments, de-noise/de-interlace module 1032 includes dedicated motion detection logic (e.g., within motion estimation engine 1034).
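Combining alternating fields into a single frame can be shown with the simplest de-interlacing scheme, usually called weave. The Python sketch below is an illustrative assumption: the description does not say which algorithm module 1032 uses, the function name `weave` is hypothetical, and a motion-adaptive implementation would add filtering on top of this.

```python
def weave(top_field, bottom_field):
    """Interleave two fields (lists of rows) into one frame, top field first."""
    if len(top_field) != len(bottom_field):
        raise ValueError("fields must have the same number of rows")
    frame = []
    for top_row, bottom_row in zip(top_field, bottom_field):
        frame.append(top_row)     # even frame rows come from the top field
        frame.append(bottom_row)  # odd frame rows come from the bottom field
    return frame
```

Weave reproduces a static scene exactly, but produces combing artifacts when the two fields were captured at different instants of a moving scene, which is the motivation for motion-adaptive logic.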

In some embodiments, motion estimation engine 1034 provides hardware acceleration for video operations by performing video acceleration functions such as motion vector estimation and prediction on video data. The motion estimation engine determines motion vectors that describe the transformation of image data between successive video frames. In some embodiments, a graphics processor media codec uses video motion estimation engine 1034 to perform operations on video at the macro-block level that may otherwise be too computationally intensive to perform with a general-purpose processor. In some embodiments, motion estimation engine 1034 is generally available to graphics processor components to assist with video decode and processing functions that are sensitive or adaptive to the direction or magnitude of the motion within video data.
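To make the idea of a motion vector concrete, the following Python sketch performs exhaustive block matching with a sum-of-absolute-differences (SAD) cost. This is a common textbook technique chosen purely for illustration; the description does not state which search engine 1034 uses, and the names `sad` and `estimate_motion_vector` are hypothetical. Frames are lists of rows of integer samples, and the block at (x, y) is assumed to lie inside the current frame.

```python
def sad(frame_a, frame_b, ax, ay, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(abs(frame_a[ay + j][ax + i] - frame_b[by + j][bx + i])
               for j in range(size) for i in range(size))


def estimate_motion_vector(prev, curr, x, y, size, radius):
    """Return (dx, dy) so the block at (x, y) in curr best matches prev at (x+dx, y+dy)."""
    height, width = len(prev), len(prev[0])
    best = (0, 0)
    best_cost = sad(curr, prev, x, y, x, y, size)  # zero-motion candidate
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            px, py = x + dx, y + dy
            # Skip candidates that fall outside the previous frame.
            if px < 0 or py < 0 or px + size > width or py + size > height:
                continue
            cost = sad(curr, prev, x, y, px, py, size)
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best
```

The search visits (2 * radius + 1) squared candidates per block and compares size squared samples for each, which hints at why this work is attractive to offload to dedicated hardware.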

In some embodiments, image scaling and filtering module 1036 performs image processing operations to enhance the visual quality of generated images and video. In some embodiments, scaling and filtering module 1036 processes image and video data during the sampling operation before providing the data to execution unit array 1014.

In some embodiments, GPE 1010 includes a data port 1044, which provides an additional mechanism for graphics subsystems to access memory. In some embodiments, data port 1044 facilitates memory access for operations including render target writes, constant buffer reads, scratch memory space reads/writes, and media surface accesses. In some embodiments, data port 1044 includes cache memory space to cache accesses to memory. The cache memory can be a single data cache or can be separated into multiple caches for the multiple subsystems that access memory via the data port (e.g., a render buffer cache, a constant buffer cache, etc.). In some embodiments, threads executing on an execution unit in execution unit array 1014 communicate with the data port by exchanging messages via a data distribution interconnect that couples each of the subsystems of GPE 1010.

FIG. 11 is a block diagram of another embodiment of a graphics processor 1100. Elements of FIG. 11 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such.

In some embodiments, graphics processor 1100 includes a ring interconnect 1102, a pipeline front end 1104, a media engine 1137, and graphics cores 1180A-1180N. In some embodiments, ring interconnect 1102 couples the graphics processor to other processing units, including other graphics processors or one or more general-purpose processor cores. In some embodiments, the graphics processor is one of many processors integrated within a multi-core processing system.

In some embodiments, graphics processor 1100 receives batches of commands via ring interconnect 1102. The incoming commands are interpreted by a command streamer 1103 in pipeline front end 1104. In some embodiments, graphics processor 1100 includes scalable execution logic to perform 3D geometry processing and media processing via graphics cores 1180A-1180N. For 3D geometry processing commands, command streamer 1103 supplies the commands to geometry pipeline 1136. For at least some media processing commands, command streamer 1103 supplies the commands to a video front end 1134, which couples with media engine 1137. In some embodiments, media engine 1137 includes a video quality engine (VQE) 1130 for video and image post-processing and a multi-format encode/decode (MFX) engine 1133 to provide hardware-accelerated media data encode and decode. In some embodiments, geometry pipeline 1136 and media engine 1137 each generate execution threads for the thread execution resources provided by at least one graphics core 1180A.

In some embodiments, graphics processor 1100 includes scalable thread execution resources featuring modular cores 1180A-1180N (sometimes referred to as core slices), each having multiple sub-cores 1150A-1150N, 1160A-1160N (sometimes referred to as core sub-slices). In some embodiments, graphics processor 1100 can have any number of graphics cores 1180A through 1180N. In some embodiments, graphics processor 1100 includes a graphics core 1180A having at least a first sub-core 1150A and a second sub-core 1160A. In other embodiments, the graphics processor is a low-power processor with a single sub-core (e.g., 1150A). In some embodiments, graphics processor 1100 includes multiple graphics cores 1180A-1180N, each including a set of first sub-cores 1150A-1150N and a set of second sub-cores 1160A-1160N. Each sub-core in the set of first sub-cores 1150A-1150N includes at least a first set of execution units 1152A-1152N and media/texture samplers 1154A-1154N. Each sub-core in the set of second sub-cores 1160A-1160N includes at least a second set of execution units 1162A-1162N and samplers 1164A-1164N. In some embodiments, each sub-core 1150A-1150N, 1160A-1160N shares a set of shared resources 1170A-1170N. In some embodiments, the shared resources include shared cache memory and pixel operation logic. Other shared resources may also be included in the various embodiments of the graphics processor.

FIG. 12 illustrates thread execution logic 1200 including an array of processing elements employed in some embodiments of a GPE. Elements of FIG. 12 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such.

In some embodiments, thread execution logic 1200 includes a pixel shader 1202, a thread dispatcher 1204, an instruction cache 1206, a scalable execution unit array including a plurality of execution units 1208A-1208N, a sampler 1210, a data cache 1212, and a data port 1214. In one embodiment, the included components are interconnected via an interconnect fabric that links to each of the components. In some embodiments, thread execution logic 1200 includes one or more connections to memory, such as system memory or cache memory, through one or more of instruction cache 1206, data port 1214, sampler 1210, and execution unit array 1208A-1208N. In some embodiments, each execution unit (e.g., 1208A) is an individual vector processor capable of executing multiple simultaneous threads and processing multiple data elements in parallel for each thread. In some embodiments, execution unit array 1208A-1208N includes any number of individual execution units.

In some embodiments, execution unit array 1208A-1208N is primarily used to execute "shader" programs. In some embodiments, the execution units in array 1208A-1208N execute an instruction set that includes native support for many standard 3D graphics shader instructions, such that shader programs from graphics libraries (e.g., Direct 3D and OpenGL) are executed with minimal translation. The execution units support vertex and geometry processing (e.g., vertex programs, geometry programs, vertex shaders), pixel processing (e.g., pixel shaders, fragment shaders), and general-purpose processing (e.g., compute and media shaders).

Each execution unit in execution unit array 1208A-1208N operates on arrays of data elements. The number of data elements is the "execution size," or the number of channels for the instruction. An execution channel is a logical unit of execution for data element access, masking, and flow control within instructions. The number of channels may be independent of the number of physical arithmetic logic units (ALUs) or floating-point units (FPUs) for a particular graphics processor. In some embodiments, execution units 1208A-1208N support integer and floating-point data types.

The execution unit instruction set includes single-instruction, multiple-data (SIMD) instructions. The various data elements can be stored as a packed data type in a register, and the execution unit will process the various elements based on the data size of the elements. For example, when operating on a 256-bit wide vector, the 256 bits of the vector are stored in a register, and the execution unit operates on the vector as four separate 64-bit packed data elements (quad-word (QW) size data elements), eight separate 32-bit packed data elements (double-word (DW) size data elements), sixteen separate 16-bit packed data elements (word (W) size data elements), or thirty-two separate 8-bit data elements (byte (B) size data elements). However, different vector widths and register sizes are possible.
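The relationship between vector width, element size, and lane count in the example above can be illustrated with a minimal Python sketch. The function name `split_packed` and the choice to place lane 0 in the least significant bits are assumptions made for illustration; the description does not specify a lane ordering.

```python
# Illustrative sketch only: models a packed register as a Python integer.
# split_packed and the lane ordering (lane 0 = least significant bits)
# are assumptions, not part of the described hardware.

ELEMENT_BITS = {"QW": 64, "DW": 32, "W": 16, "B": 8}


def split_packed(vector, vector_bits, element_bits):
    """Split a packed vector into equally sized data elements (lanes)."""
    if vector_bits % element_bits != 0:
        raise ValueError("element size must divide the vector width")
    mask = (1 << element_bits) - 1
    lanes = vector_bits // element_bits
    return [(vector >> (i * element_bits)) & mask for i in range(lanes)]


# Number of lanes a 256-bit vector holds for each element size.
lane_counts = {
    name: len(split_packed(0, 256, bits)) for name, bits in ELEMENT_BITS.items()
}
```

For a 256-bit vector this yields 4, 8, 16, and 32 lanes for QW, DW, W, and B elements respectively, matching the counts given in the text.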

One or more internal instruction caches (e.g., 1206) are included in thread execution logic 1200 to cache thread instructions for the execution units. In some embodiments, one or more data caches (e.g., 1212) are included to cache thread data during thread execution. In some embodiments, sampler 1210 is included to provide texture sampling for 3D operations and media sampling for media operations. In some embodiments, sampler 1210 includes specialized texture or media sampling functionality to process texture or media data during the sampling process before providing the sampled data to an execution unit.

During execution, the graphics and media pipelines send thread initiation requests to thread execution logic 1200 via thread spawning and dispatch logic. In some embodiments, thread execution logic 1200 includes a local thread dispatcher 1204 that arbitrates thread initiation requests from the graphics and media pipelines and instantiates the requested threads on one or more execution units 1208A-1208N. For example, the geometry pipeline (e.g., 1136 of FIG. 11) dispatches vertex processing, tessellation, or geometry processing threads to thread execution logic 1200 (FIG. 12). In some embodiments, thread dispatcher 1204 can also process runtime thread spawning requests from the executing shader programs.

Once a group of geometric objects has been processed and rasterized into pixel data, pixel shader 1202 is invoked to further compute output information and cause results to be written to output surfaces (e.g., color buffers, depth buffers, stencil buffers, etc.). In some embodiments, pixel shader 1202 calculates the values of the various vertex attributes that are to be interpolated across the rasterized object. In some embodiments, pixel shader 1202 then executes a pixel shader program supplied through an application programming interface (API). To execute the pixel shader program, pixel shader 1202 dispatches threads to an execution unit (e.g., 1208A) via thread dispatcher 1204. In some embodiments, pixel shader 1202 uses texture sampling logic in sampler 1210 to access texture data in texture maps stored in memory. Arithmetic operations on the texture data and the input geometry data compute pixel color data for each geometric fragment, or discard one or more pixels from further processing.

In some embodiments, data port 1214 provides a memory access mechanism for thread execution logic 1200 to output processed data to memory for processing on a graphics processor output pipeline. In some embodiments, data port 1214 includes or couples to one or more cache memories (e.g., data cache 1212) to cache data for memory access via the data port.

FIG. 13 is a block diagram illustrating a graphics processor instruction format 1300 according to some embodiments. In one or more embodiments, the graphics processor execution units support an instruction set having instructions in multiple formats. The solid-lined boxes illustrate the components that are generally included in an execution unit instruction, while the dashed lines include components that are optional or that are only included in a subset of the instructions. In some embodiments, the instruction format 1300 described and illustrated comprises macro-instructions, in that they are instructions supplied to the execution unit, as opposed to micro-operations resulting from instruction decode once the instruction is processed.

In some embodiments, the graphics processor execution units natively support instructions in a 128-bit format 1310. A 64-bit compacted instruction format 1330 is available for some instructions based on the selected instruction, instruction options, and number of operands. The native 128-bit format 1310 provides access to all instruction options, while some options and operations are restricted in the 64-bit format 1330. The native instructions available in the 64-bit format 1330 vary by embodiment. In some embodiments, the instruction is compacted in part using a set of index values in an index field 1313. The execution unit hardware references a set of compaction tables based on the index values and uses the compaction table outputs to reconstruct a native instruction in the 128-bit format 1310.

For each format, instruction opcode 1312 defines the operation that the execution unit is to perform. The execution units execute each instruction in parallel across the multiple data elements of each operand. For example, in response to an add instruction, the execution unit performs a simultaneous add operation across each color channel representing a texture element or picture element. By default, the execution unit performs each instruction across all data channels of the operands. In some embodiments, instruction control field 1314 enables control over certain execution options, such as channel selection (e.g., predication) and data channel order (e.g., swizzle). For 128-bit instructions 1310, an exec-size field 1316 limits the number of data channels that will be executed in parallel. In some embodiments, exec-size field 1316 is not available for use in the 64-bit compact instruction format 1330.
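As a rough software model of the behavior just described, the following Python sketch applies an add across data channels, with an optional execution size and an optional predication mask. The function `simd_add`, its parameters, and the rule that disabled channels keep the previous destination value are illustrative assumptions; the description does not define how inactive channels are treated.

```python
# Illustrative sketch only: a software model of a per-channel add.
# simd_add, exec_size, and pred_mask are hypothetical names; leaving
# inactive channels unchanged is an assumption.

def simd_add(src0, src1, dst, exec_size=None, pred_mask=None):
    """Return dst with src0 + src1 written to every enabled channel."""
    if not (len(src0) == len(src1) == len(dst)):
        raise ValueError("operands must have the same number of channels")
    # By default every channel executes; exec_size limits the count.
    active = len(dst) if exec_size is None else min(exec_size, len(dst))
    result = list(dst)
    for ch in range(active):
        # Bit ch of pred_mask enables channel ch when a mask is given.
        if pred_mask is None or (pred_mask >> ch) & 1:
            result[ch] = src0[ch] + src1[ch]
    return result


rgba_sum = simd_add([1, 2, 3, 4], [10, 20, 30, 40], [0, 0, 0, 0])
```

Passing `exec_size=2` would update only the first two channels, and `pred_mask=0b0101` would update only channels 0 and 2.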

Some execution unit instructions have up to three operands, including two source operands, SRC0 1320 and SRC1 1322, and one destination 1318. In some embodiments, the execution units support dual-destination instructions, where one of the destinations is implied. Data manipulation instructions can have a third source operand (e.g., SRC2 1324), where the instruction opcode 1312 determines the number of source operands. An instruction's last source operand can be an immediate (e.g., hard-coded) value passed with the instruction.

In some embodiments, the 128-bit instruction format 1310 includes access/address mode information 1326 specifying, for example, whether direct register addressing mode or indirect register addressing mode is used. When direct register addressing mode is used, the register address of one or more operands is provided directly by bits in the instruction 1310.

In some embodiments, the 128-bit instruction format 1310 includes an access/address mode field 1326, which specifies an address mode and/or an access mode for the instruction. In one embodiment, the access mode defines a data access alignment for the instruction. Some embodiments support access modes including a 16-byte aligned access mode and a 1-byte aligned access mode, where the byte alignment of the access mode determines the access alignment of the instruction operands. For example, when in a first mode, the instruction 1310 may use byte-aligned addressing for source and destination operands, and when in a second mode, the instruction 1310 may use 16-byte-aligned addressing for all source and destination operands.
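The effect of the two access modes on operand addresses can be sketched as a simple alignment check. The mode names `align1` and `align16` and the helper `is_operand_aligned` are hypothetical labels chosen for this illustration.

```python
# Illustrative sketch only: mode names and the helper are assumptions.
ACCESS_ALIGNMENT = {
    "align1": 1,    # first mode: byte-aligned addressing
    "align16": 16,  # second mode: 16-byte-aligned addressing
}


def is_operand_aligned(address, access_mode):
    """Check whether an operand address satisfies the access mode's alignment."""
    alignment = ACCESS_ALIGNMENT[access_mode]
    return address % alignment == 0


byte_mode_ok = is_operand_aligned(0x23, "align1")
wide_mode_ok = is_operand_aligned(0x23, "align16")
```

Under these assumptions any address is acceptable in the byte-aligned mode, while the 16-byte mode accepts only addresses that are multiples of 16.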

In one embodiment, the address mode portion of the access/address mode field 1326 determines whether the instruction is to use direct or indirect addressing. When direct register addressing mode is used, bits in the instruction 1310 directly provide the register address of one or more operands. When indirect register addressing mode is used, the register address of one or more operands may be computed based on an address register value and an address immediate field in the instruction.
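A minimal sketch of the two addressing modes follows. The description says only that the indirect address is computed from an address register value and an address immediate field; combining them by addition, as done here, is an assumption, and `resolve_register` is a hypothetical helper.

```python
# Illustrative sketch only: addition of the address register value and the
# immediate is an assumed way of combining them, not confirmed by the text.

def resolve_register(indirect, register_bits=0,
                     address_register_value=0, address_immediate=0):
    """Return the operand register address for direct or indirect mode."""
    if not indirect:
        # Direct mode: the instruction bits are the register address.
        return register_bits
    # Indirect mode: derive the address from the address register and
    # the immediate field carried in the instruction.
    return address_register_value + address_immediate


direct_address = resolve_register(False, register_bits=12)
indirect_address = resolve_register(True, address_register_value=40,
                                    address_immediate=8)
```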

In some embodiments, instructions are grouped based on opcode 1312 bit-fields to simplify opcode decode 1340. For an 8-bit opcode, bits 10, 11, and 12 allow the execution unit to determine the type of opcode. The precise opcode grouping shown is merely an example. In some embodiments, a move and logic opcode group 1342 includes data movement and logic instructions (e.g., move (mov), compare (cmp)). In some embodiments, move and logic group 1342 shares the five most significant bits (MSB), where move (mov) instructions are in the form of 0000xxxxb and logic instructions are in the form of 0001xxxxb. A flow control instruction group 1344 (e.g., call, jump (jmp)) includes instructions in the form of 0010xxxxb (e.g., 0x20). A miscellaneous instruction group 1346 includes a mix of instructions, including synchronization instructions (e.g., wait, send) in the form of 0011xxxxb (e.g., 0x30). A parallel math instruction group 1348 includes component-wise arithmetic instructions (e.g., add, multiply (mul)) in the form of 0100xxxxb (e.g., 0x40). The parallel math group 1348 performs the arithmetic operations in parallel across data channels. The vector math group 1350 includes arithmetic instructions (e.g., dp4) in the form of 0101xxxxb (e.g., 0x50). The vector math group performs arithmetic such as dot product calculations on vector operands.
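The grouping by bit pattern can be sketched as a lookup on the upper four bits of an 8-bit opcode, which is what the forms 0000xxxxb through 0101xxxxb listed above distinguish. The table and function names are illustrative, and the text itself notes that the grouping shown is only an example.

```python
# Illustrative sketch only: maps the upper four bits of an 8-bit opcode to
# the example groups named in the text. Names are hypothetical.
OPCODE_GROUPS = {
    0b0000: "move",           # 0000xxxxb, part of move and logic group 1342
    0b0001: "logic",          # 0001xxxxb, part of move and logic group 1342
    0b0010: "flow control",   # 0010xxxxb, group 1344
    0b0011: "miscellaneous",  # 0011xxxxb, group 1346
    0b0100: "parallel math",  # 0100xxxxb, group 1348
    0b0101: "vector math",    # 0101xxxxb, group 1350
}


def opcode_group(opcode):
    """Classify an 8-bit opcode by its upper four bits."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in 8 bits")
    return OPCODE_GROUPS.get(opcode >> 4, "unknown")


example_groups = [opcode_group(op) for op in (0x20, 0x30, 0x40, 0x50)]
```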

FIG. 14 is a block diagram of another embodiment of a graphics processor 1400. Elements of FIG. 14 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such.

In some embodiments, graphics processor 1400 includes a graphics pipeline 1420, a media pipeline 1430, a display engine 1440, thread execution logic 1450, and a render output pipeline 1470. In some embodiments, graphics processor 1400 is a graphics processor within a multi-core processing system that includes one or more general-purpose processing cores. The graphics processor is controlled by register writes to one or more control registers (not shown) or via commands issued to graphics processor 1400 over a ring interconnect 1402. In some embodiments, ring interconnect 1402 couples graphics processor 1400 to other processing components, such as other graphics processors or general-purpose processors. Commands from ring interconnect 1402 are interpreted by a command streamer 1403, which supplies instructions to individual components of graphics pipeline 1420 or media pipeline 1430.

In some embodiments, command streamer 1403 directs the operation of a vertex fetcher 1405 that reads vertex data from memory and executes vertex-processing commands provided by command streamer 1403. In some embodiments, vertex fetcher 1405 provides vertex data to a vertex shader 1407, which performs coordinate space transformation and lighting operations on each vertex. In some embodiments, vertex fetcher 1405 and vertex shader 1407 execute vertex-processing instructions by dispatching execution threads to execution units 1452A, 1452B via a thread dispatcher 1431.

In some embodiments, execution units 1452A, 1452B are an array of vector processors having an instruction set for performing graphics and media operations. In some embodiments, execution units 1452A, 1452B have an attached L1 cache 1451 that is specific to each array or shared between the arrays. The cache can be configured as a data cache, an instruction cache, or a single cache that is partitioned to contain data and instructions in different partitions.

In some embodiments, graphics pipeline 1420 includes tessellation components to perform hardware-accelerated tessellation of 3D objects. In some embodiments, a programmable hull shader 1411 configures the tessellation operations. A programmable domain shader 1417 provides back-end evaluation of the tessellation output. A tessellator 1413 operates at the direction of hull shader 1411 and contains special-purpose logic to generate a set of detailed geometric objects based on a coarse geometric model that is provided as input to graphics pipeline 1420. In some embodiments, if tessellation is not used, tessellation components 1411, 1413, 1417 can be bypassed.

In some embodiments, complete geometric objects can be processed by a geometry shader 1419 via one or more threads dispatched to execution units 1452A, 1452B, or can proceed directly to a clipper 1429. In some embodiments, the geometry shader operates on entire geometric objects, rather than on vertices or patches of vertices as in previous stages of the graphics pipeline. If tessellation is disabled, geometry shader 1419 receives input from vertex shader 1407. In some embodiments, geometry shader 1419 is programmable by a geometry shader program to perform geometry tessellation if the tessellation units are disabled.

Before rasterization, clipper 1429 processes vertex data. Clipper 1429 may be a fixed-function clipper or a programmable clipper having clipping and geometry shader functions. In some embodiments, a rasterizer/depth component 1473 in render output pipeline 1470 dispatches pixel shaders to convert the geometric objects into their per-pixel representations. In some embodiments, pixel shader logic is included in thread execution logic 1450. In some embodiments, an application can bypass rasterizer 1473 and access un-rasterized vertex data via a stream-out unit 1423.

Graphics processor 1400 has an interconnect bus, an interconnect fabric, or some other interconnect mechanism that allows data and messages to pass among the major components of the processor. In some embodiments, execution units 1452A, 1452B and associated cache 1451, texture and media sampler 1454, and texture/sampler cache 1458 interconnect via a data port 1456 to perform memory access and communicate with the render output pipeline components of the processor. In some embodiments, sampler 1454, caches 1451, 1458, and execution units 1452A, 1452B each have separate memory access paths.

In some embodiments, render output pipeline 1470 contains a rasterizer and depth test component 1473 that converts vertex-based objects into an associated pixel-based representation. In some embodiments, the rasterizer logic includes a windower/masker unit to perform fixed-function triangle and line rasterization. In some embodiments, an associated render cache 1478 and depth cache 1479 are also available. A pixel operations component 1477 performs pixel-based operations on the data, though in some instances pixel operations associated with 2D operations (e.g., bit block image transfers with blending) are performed by 2D engine 1441, or substituted at display time by display controller 1443 using overlay display planes. In some embodiments, a shared L3 cache 1475 is available to all graphics components, allowing data to be shared without the use of main system memory.

In some embodiments, graphics processor media pipeline 1430 includes a media engine 1437 and a video front end 1434. In some embodiments, video front end 1434 receives pipeline commands from command streamer 1403. In some embodiments, media pipeline 1430 includes a separate command streamer. In some embodiments, video front end 1434 processes media commands before sending the commands to media engine 1437. In some embodiments, media engine 1437 includes thread spawning functionality to spawn threads for dispatch to thread execution logic 1450 via thread dispatcher 1431.

In some embodiments, graphics processor 1400 includes a display engine 1440. In some embodiments, display engine 1440 is external to processor 1400 and couples with the graphics processor via ring interconnect 1402 or some other interconnect bus or fabric. In some embodiments, display engine 1440 includes a 2D engine 1441 and a display controller 1443. In some embodiments, display engine 1440 contains special-purpose logic capable of operating independently of the 3D pipeline. In some embodiments, display controller 1443 couples with a display device (not shown), which may be a system-integrated display device, as in a laptop computer, or an external display device attached via a display device connector.

In some embodiments, graphics pipeline 1420 and media pipeline 1430 are configurable to perform operations based on multiple graphics and media programming interfaces and are not specific to any one application programming interface (API). In some embodiments, driver software for the graphics processor translates API calls that are specific to a particular graphics or media library into commands that can be processed by the graphics processor. In some embodiments, support is provided for the Open Graphics Library (OpenGL) and Open Computing Language (OpenCL) from the Khronos Group and the Direct3D library from Microsoft Corporation, or support may be provided for both OpenGL and D3D. Support may also be provided for the Open Source Computer Vision Library (OpenCV). A future API with a compatible 3D pipeline would also be supported if a mapping can be made from the pipeline of the future API to the pipeline of the graphics processor.

FIG. 15A is a block diagram illustrating a graphics processor command format 1500 according to some embodiments. FIG. 15B is a block diagram illustrating a graphics processor command sequence 1510 according to an embodiment. The solid-lined boxes in FIG. 15A illustrate the components that are generally included in a graphics command, while the dashed lines include components that are optional or that are only included in a subset of the graphics commands. The exemplary graphics processor command format 1500 of FIG. 15A includes data fields to identify a target client 1502 of the command, a command operation code (opcode) 1504, and the relevant data 1506 for the command. A sub-opcode 1505 and a command size 1508 are also included in some commands.

In some embodiments, client 1502 specifies the client unit of the graphics device that processes the command data. In some embodiments, a graphics processor command parser examines the client field of each command to determine the further processing of the command and to route the command data to the appropriate client unit. In some embodiments, the graphics processor client units include a memory interface unit, a render unit, a 2D unit, a 3D unit, and a media unit. Each client unit has a corresponding processing pipeline that processes the commands. Once the command is received by the client unit, the client unit reads the opcode 1504 and, if present, the sub-opcode 1505 to determine the operation to perform. The client unit performs the command using the information in data field 1506. For some commands, an explicit command size 1508 is expected to specify the size of the command. In some embodiments, the command parser automatically determines the size of at least some of the commands based on the command opcode. In some embodiments, commands are aligned to multiples of a double word.
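The routing behavior of the command parser can be sketched in Python as follows. The `Command` class, the `IMPLICIT_SIZES` table and its values, and the use of strings for client identifiers are all hypothetical; the description does not give a bit-level encoding for command format 1500.

```python
# Illustrative sketch only: field encodings, opcode values, and the implicit
# size table are invented for the example and are not from the description.
from dataclasses import dataclass, field
from typing import List, Optional

CLIENT_UNITS = ("memory interface", "render", "2D", "3D", "media")

# Hypothetical opcodes whose size (in double words) the parser can infer.
IMPLICIT_SIZES = {0x01: 1, 0x02: 3}


@dataclass
class Command:
    client: str                       # client 1502
    opcode: int                       # opcode 1504
    data: List[int] = field(default_factory=list)  # data 1506
    sub_opcode: Optional[int] = None  # sub-opcode 1505, if present
    size: Optional[int] = None        # explicit command size 1508, if present


def command_size(cmd):
    """Use the explicit size when given, else infer it from the opcode."""
    if cmd.size is not None:
        return cmd.size
    return IMPLICIT_SIZES[cmd.opcode]


def route(commands):
    """Examine each command's client field and queue it for that unit."""
    queues = {unit: [] for unit in CLIENT_UNITS}
    for cmd in commands:
        if cmd.client not in queues:
            raise ValueError("unknown client unit: " + cmd.client)
        queues[cmd.client].append((cmd.opcode, cmd.sub_opcode, command_size(cmd)))
    return queues


queues = route([
    Command("3D", 0x02, [1, 2]),
    Command("media", 0x10, [], sub_opcode=4, size=2),
])
```

In this sketch the first command relies on an inferred size, while the second carries an explicit size and a sub-opcode, mirroring the optional fields described above.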

The flow diagram in FIG. 15B shows an exemplary graphics processor command sequence 1510. In some embodiments, software or firmware of a data processing system that features an embodiment of a graphics processor uses a version of the command sequence shown to set up, execute, and terminate a set of graphics operations. The sample command sequence is shown and described for purposes of example only, as embodiments are not limited to these specific commands or to this command sequence. Moreover, the commands may be issued as a batch of commands in a command sequence, such that the graphics processor will process the sequence of commands at least partially concurrently.

In some embodiments, the graphics processor command sequence 1510 may begin with a pipeline flush command 1512 to cause any active graphics pipeline to complete the currently pending commands for the pipeline. In some embodiments, the 3D pipeline 1522 and the media pipeline 1524 do not operate concurrently. The pipeline flush is performed to cause the active graphics pipeline to complete any pending commands. In response to a pipeline flush, the command parser for the graphics processor will pause command processing until the active drawing engines complete pending operations and the relevant read caches are invalidated. Optionally, any data in the render cache that is marked "dirty" can be flushed to memory. In some embodiments, the pipeline flush command 1512 can be used for pipeline synchronization before the graphics processor is placed into a low power state.

In some embodiments, a pipeline select command 1513 is used when a command sequence requires the graphics processor to explicitly switch between pipelines. In some embodiments, a pipeline select command 1513 is required only once within an execution context before issuing pipeline commands, unless the context is to issue commands for both pipelines. In some embodiments, a pipeline flush command 1512 is required immediately before a pipeline switch via the pipeline select command 1513.

In some embodiments, a pipeline control command 1514 configures a graphics pipeline for operation and is used to program the 3D pipeline 1522 and the media pipeline 1524. In some embodiments, the pipeline control command 1514 configures the pipeline state for the active pipeline. In one embodiment, the pipeline control command 1514 is used for pipeline synchronization and to clear data from one or more cache memories within the active pipeline before processing a batch of commands.

In some embodiments, return buffer state commands 1516 are used to configure a set of return buffers for the respective pipelines to write data. Some pipeline operations require the allocation, selection, or configuration of one or more return buffers into which the operations write intermediate data during processing. In some embodiments, the graphics processor also uses one or more return buffers to store output data and to perform cross-thread communication. In some embodiments, the return buffer state 1516 includes selecting the size and number of return buffers to use for a set of pipeline operations.

The remaining commands in the command sequence differ based on the active pipeline for operations. Based on a pipeline determination 1520, the command sequence is tailored either to the 3D pipeline 1522, beginning with the 3D pipeline state 1530, or to the media pipeline 1524, beginning with the media pipeline state 1540.

The commands for the 3D pipeline state 1530 include 3D state setting commands for vertex buffer state, vertex element state, constant color state, depth buffer state, and other state variables that are to be configured before 3D primitive commands are processed. The values of these commands are determined at least in part by the particular 3D API in use. In some embodiments, 3D pipeline state 1530 commands are also able to selectively disable or bypass certain pipeline elements if those elements will not be used.

In some embodiments, the 3D primitive 1532 command is used to submit 3D primitives to be processed by the 3D pipeline. Commands and associated parameters that are passed to the graphics processor via the 3D primitive 1532 command are forwarded to the vertex fetch function in the graphics pipeline. The vertex fetch function uses the 3D primitive 1532 command data to generate vertex data structures. The vertex data structures are stored in one or more return buffers. In some embodiments, the 3D primitive 1532 command is used to perform vertex operations on 3D primitives via vertex shaders. To process vertex shaders, the 3D pipeline 1522 dispatches shader execution threads to graphics processor execution units.

In some embodiments, the 3D pipeline 1522 is triggered via an execute 1534 command or event. In some embodiments, a register write triggers command execution. In some embodiments, execution is triggered via a "go" or "kick" command in the command sequence. In one embodiment, command execution is triggered using a pipeline synchronization command to flush the command sequence through the graphics pipeline. The 3D pipeline will perform geometry processing for the 3D primitives. Once operations are complete, the resulting geometric objects are rasterized and the pixel engine colors the resulting pixels. Additional commands to control pixel shading and pixel back-end operations may also be included for those operations.

In some embodiments, the graphics processor command sequence 1510 follows the media pipeline 1524 path when performing media operations. In general, the specific use and manner of programming for the media pipeline 1524 depends on the media or compute operations to be performed. Specific media decode operations may be offloaded to the media pipeline during media decode. In some embodiments, the media pipeline can also be bypassed, and media decode can be performed in whole or in part using resources provided by one or more general-purpose processing cores. In one embodiment, the media pipeline also includes elements for general-purpose graphics processor unit (GPGPU) operations, where the graphics processor is used to perform SIMD vector operations using compute shader programs that are not explicitly related to the rendering of graphics primitives.

In some embodiments, the media pipeline 1524 is configured in a similar manner to the 3D pipeline 1522. A set of media pipeline state commands 1540 is dispatched or placed into a command queue before the media object commands 1542. In some embodiments, the media pipeline state commands 1540 include data to configure the media pipeline elements that will be used to process the media objects. This includes data to configure the video decode and video encode logic within the media pipeline, such as the encode or decode format. In some embodiments, the media pipeline state commands 1540 also support the use of one or more pointers to "indirect" state elements that contain a batch of state settings.

In some embodiments, media object commands 1542 supply pointers to media objects for processing by the media pipeline. The media objects include memory buffers containing video data to be processed. In some embodiments, all media pipeline states must be valid before a media object command 1542 is issued. Once the pipeline state is configured and the media object commands 1542 are queued, the media pipeline 1524 is triggered via an execute command 1544 or an equivalent execute event (e.g., a register write). Output from the media pipeline 1524 may then be post-processed by operations provided by the 3D pipeline 1522 or the media pipeline 1524. In some embodiments, GPGPU operations are configured and executed in a similar manner to media operations.
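The ordering of the command sequence of FIG. 15B can be summarized with a short sketch. It assumes a simplified policy: a flush followed by a pipeline select is emitted only when the target pipeline differs from the currently active one, and the common configuration commands always precede the pipeline-specific tail. The function name `build_sequence` and the command strings are hypothetical labels built from the reference numerals, not identifiers from any real driver.

```python
# Commands shared by both paths, in the order described in the text.
COMMON = ["PIPELINE_CONTROL_1514", "RETURN_BUFFER_STATE_1516"]

# Pipeline-specific tails chosen at the pipeline determination 1520.
TAILS = {
    "3d": ["3D_PIPELINE_STATE_1530", "3D_PRIMITIVE_1532", "EXECUTE_1534"],
    "media": ["MEDIA_PIPELINE_STATE_1540", "MEDIA_OBJECT_1542", "EXECUTE_1544"],
}

def build_sequence(target, active=None):
    """Return an ordered command list for the requested pipeline.

    `active` is the pipeline currently selected in this execution context
    (None if nothing has been selected yet).
    """
    if target not in TAILS:
        raise ValueError(f"unknown pipeline: {target!r}")
    sequence = []
    if active != target:
        # A flush is required immediately before switching pipelines.
        sequence += ["PIPELINE_FLUSH_1512", "PIPELINE_SELECT_1513"]
    return sequence + COMMON + TAILS[target]
```

Calling `build_sequence("media", active="3d")` produces a list that starts with the flush and select commands, whereas `build_sequence("3d", active="3d")` starts directly with the pipeline control command.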

FIG. 16 illustrates an exemplary graphics software architecture for a data processing system 1600 according to some embodiments. In some embodiments, the software architecture includes a 3D graphics application 1610, an operating system 1620, and at least one processor 1630. In some embodiments, the processor 1630 includes a graphics processor 1632 and one or more general-purpose processor cores 1634. The graphics application 1610 and the operating system 1620 each execute in the system memory 1650 of the data processing system.

In some embodiments, the 3D graphics application 1610 contains one or more shader programs including shader instructions 1612. The shader language instructions may be in a high-level shader language, such as the High Level Shader Language (HLSL) or the OpenGL Shader Language (GLSL). The application also includes executable instructions 1614 in a machine language suitable for execution by the general-purpose processor core 1634. The application also includes graphics objects 1616 defined by vertex data.

In some embodiments, the operating system 1620 is a Microsoft® Windows® operating system from the Microsoft Corporation, a proprietary UNIX-like operating system, or an open source UNIX-like operating system using a variant of the Linux kernel. When the Direct3D API is in use, the operating system 1620 uses a front-end shader compiler 1624 to compile any shader instructions 1612 in HLSL into a lower-level shader language. The compilation may be a just-in-time (JIT) compilation, or the application can perform shader pre-compilation. In some embodiments, high-level shaders are compiled into low-level shaders during the compilation of the 3D graphics application 1610.

In some embodiments, the user mode graphics driver 1626 contains a back-end shader compiler 1627 to convert the shader instructions 1612 into a hardware-specific representation. When the OpenGL API is in use, shader instructions 1612 in the GLSL high-level language are passed to the user mode graphics driver 1626 for compilation. In some embodiments, the user mode graphics driver 1626 uses operating system kernel mode functions 1628 to communicate with a kernel mode graphics driver 1629. In some embodiments, the kernel mode graphics driver 1629 communicates with the graphics processor 1632 to dispatch commands and instructions.

One or more aspects of at least one embodiment may be implemented by representative code stored on a machine-readable medium which represents and/or defines logic within an integrated circuit such as a processor. For example, the machine-readable medium may include instructions which represent various logic within the processor. When read by a machine, the instructions may cause the machine to fabricate the logic to perform the techniques described herein. Such representations, known as "IP cores", are reusable units of logic for an integrated circuit that may be stored on a tangible, machine-readable medium as a hardware model that describes the structure of the integrated circuit. The hardware model may be supplied to various customers or manufacturing facilities, which load the hardware model on fabrication machines that manufacture the integrated circuit. The integrated circuit may be fabricated such that the circuit performs operations described in association with any of the embodiments described herein.

FIG. 17 is a block diagram illustrating an IP core development system 1700 that may be used to manufacture an integrated circuit to perform operations according to an embodiment. The IP core development system 1700 may be used to generate modular, reusable designs that can be incorporated into a larger design or used to construct an entire integrated circuit (e.g., an SOC integrated circuit). A design facility 1730 can generate a software simulation 1710 of an IP core design in a high-level programming language (e.g., C/C++). The software simulation 1710 can be used to design, test, and verify the behavior of the IP core. A register transfer level (RTL) design can then be created or synthesized from the simulation model 1700. The RTL design 1715 is an abstraction of the behavior of the integrated circuit that models the flow of digital signals between hardware registers, including the associated logic performed using the modeled digital signals. In addition to an RTL design 1715, lower-level designs at the logic level or transistor level may also be created, designed, or synthesized. Thus, the particular details of the initial design and simulation may vary.

The RTL design 1715 or an equivalent may be further synthesized by the design facility into a hardware model 1720, which may be in a hardware description language (HDL) or some other representation of physical design data. The HDL may be further simulated or tested to verify the IP core design. The IP core design can be stored for delivery to a third-party fabrication facility 1765 using non-volatile memory 1740 (e.g., a hard disk, flash memory, or any non-volatile storage medium). Alternatively, the IP core design may be transmitted (e.g., via the Internet) over a wired connection 1750 or a wireless connection 1760. The fabrication facility 1765 may then fabricate an integrated circuit that is based at least in part on the IP core design. The fabricated integrated circuit can be configured to perform operations in accordance with at least one embodiment described herein.

FIG. 18 is a block diagram illustrating an exemplary system on a chip integrated circuit 1800 that may be fabricated using one or more IP cores, according to an embodiment. The exemplary integrated circuit includes one or more application processors 1805 (e.g., CPUs) and at least one graphics processor 1810, and may additionally include an image processor 1815 and/or a video processor 1820, any of which may be a modular IP core from the same or multiple different design facilities. The integrated circuit includes peripheral or bus logic including a USB controller 1825, a UART controller 1830, an SPI/SDIO controller 1835, and an I2S/I2C controller 1840. Additionally, the integrated circuit can include a display device 1845 coupled to one or more of a high-definition multimedia interface (HDMI) controller 1850 and a mobile industry processor interface (MIPI) display interface 1855. Storage may be provided by a flash memory subsystem 1860 including flash memory and a flash memory controller. A memory interface may be provided via a memory controller 1865 for access to SDRAM or SRAM memory devices. Some integrated circuits additionally include an embedded security engine 1870.

Additionally, other logic and circuits may be included in the processor of the integrated circuit 1800, including additional graphics processors/cores, peripheral interface controllers, or general-purpose processor cores.

The following examples pertain to further embodiments. Example 1 includes an apparatus comprising: a Graphics Processing Unit (GPU) to copy at least a first portion of a data block from a source to a buffer and to copy a second portion of the data block from the source to a destination; and a Central Processing Unit (CPU) to copy the first portion of the data block from the buffer to one or more corresponding locations in the destination. Example 2 includes the apparatus of example 1, wherein the first portion of the data block comprises first data at the beginning of the data block or second data at the end of the data block. Example 3 includes the apparatus of example 2, wherein the first data and the second data each have a size smaller than a memory unit. Example 4 includes the apparatus of example 3, wherein the memory unit comprises a cache line or 64 bytes. Example 5 includes the apparatus of example 1, wherein the second portion of the data block comprises one or more complete memory units. Example 6 includes the apparatus of example 5, wherein each of the one or more complete memory units comprises 64 bytes. Example 7 includes the apparatus of example 1, wherein the source comprises video memory coupled to the GPU. Example 8 includes the apparatus of example 1, wherein the destination comprises system memory coupled to the CPU. Example 9 includes the apparatus of example 1, wherein the GPU comprises one or more graphics processing cores. Example 10 includes the apparatus of example 1, wherein the CPU comprises one or more processor cores. Example 11 includes the apparatus of example 1, wherein one or more of the GPU, the CPU, or memory are on a single integrated circuit die.
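To make the division into portions in examples 1 to 6 concrete, here is a minimal sketch that splits a block into a leading partial unit, a run of complete memory units, and a trailing partial unit. It assumes a 64-byte memory unit and measures alignment against a single byte offset; the function name `split_block` and its parameters are hypothetical, and how a real implementation chooses the reference address is not stated in the text.

```python
def split_block(start, length, unit=64):
    """Split [start, start + length) into (head, body, tail) byte counts.

    head: bytes before the first unit boundary (smaller than one unit)
    body: bytes covered by complete memory units
    tail: bytes after the last unit boundary (smaller than one unit)
    """
    if start < 0 or length < 0 or unit <= 0:
        raise ValueError("start and length must be non-negative, unit positive")
    end = start + length
    first_boundary = -(-start // unit) * unit  # start rounded up to a boundary
    last_boundary = (end // unit) * unit       # end rounded down to a boundary
    if first_boundary > end:
        # The block lies inside a single unit: no complete unit to copy directly.
        return (length, 0, 0)
    return (first_boundary - start, last_boundary - first_boundary, end - last_boundary)
```

With these assumptions, `split_block(10, 200)` returns `(54, 128, 18)`: 54 and 18 bytes would take the buffered path, and the two complete 64-byte units would be copied directly to the destination.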

Example 12 includes a system comprising: a processor coupled to system memory, the system memory to store a data block copied from video memory, the processor comprising: a Graphics Processing Unit (GPU) to copy: a first portion of the data block from the video memory to a first buffer; a second portion of the data block from the video memory to the system memory; and a third portion of the data block from the video memory to a second buffer; and a Central Processing Unit (CPU) to copy: the first portion of the data block from the first buffer to a corresponding location in the system memory; and the second portion of the data block from the second buffer to a corresponding location in the system memory. Example 13 includes the system of example 12, wherein the data block consists of the first portion, the second portion, and the third portion in consecutive order. Example 14 includes the system of example 12, wherein the first portion and the third portion of the data block each comprise fewer bytes than a memory unit. Example 15 includes the system of example 14, wherein the memory unit comprises a cache line or 64 bytes. Example 16 includes the system of example 12, wherein the second portion of the data block comprises one or more complete memory units. Example 17 includes the system of example 16, wherein each of the one or more complete memory units comprises 64 bytes. Example 18 includes the system of example 12, wherein the first buffer and the second buffer are combined into a single buffer. Example 19 includes the system of example 12, wherein one or more of the GPU having one or more graphics processing cores, the CPU having one or more processor cores, at least a portion of the system memory, or at least a portion of the video memory are on a single integrated circuit die.
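The two copy paths of example 12 can be simulated in a few lines. This is only a host-side sketch: the "GPU" and "CPU" steps are ordinary Python slice copies, `video` and `system` are plain byte buffers standing in for video memory and system memory, and the name `two_path_copy` is hypothetical. It assumes a 64-byte memory unit, assumes the source and destination offsets are congruent modulo the unit so one split serves both sides, and has the CPU copy whatever the GPU staged in each buffer.

```python
UNIT = 64  # assumed memory unit (cache line) size in bytes

def two_path_copy(video, src, system, dst, length):
    """Copy video[src:src+length] into system[dst:dst+length] via two paths.

    Returns the (first, second, third) portion sizes in bytes.
    """
    if min(src, dst, length) < 0:
        raise ValueError("offsets and length must be non-negative")
    if src + length > len(video) or dst + length > len(system):
        raise ValueError("copy range exceeds a buffer")
    if (src - dst) % UNIT:
        raise ValueError("sketch assumes equal source/destination alignment")
    end = src + length
    head_end = min(end, -(-src // UNIT) * UNIT)   # end of the leading partial unit
    body_end = max(head_end, end // UNIT * UNIT)  # end of the complete units
    shift = dst - src

    # "GPU" path: stage the partial units, copy complete units directly.
    first_buffer = bytes(video[src:head_end])
    system[head_end + shift:body_end + shift] = video[head_end:body_end]
    second_buffer = bytes(video[body_end:end])

    # "CPU" path: move the staged partial units to their final locations.
    system[dst:dst + len(first_buffer)] = first_buffer
    system[body_end + shift:end + shift] = second_buffer
    return (len(first_buffer), body_end - head_end, len(second_buffer))
```

A call such as `two_path_copy(video, 10, system, 74, 200)` on 512-byte buffers would send 54 bytes through the first buffer, 128 bytes directly, and 18 bytes through the second buffer, leaving bytes outside the destination range untouched.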

Example 20 includes a computer-readable medium comprising one or more instructions that, when executed on a processor, configure the processor to perform one or more operations to: cause a Graphics Processing Unit (GPU) to copy at least a first portion of a data block from a source to a buffer and to copy a second portion of the data block from the source to a destination; and cause a Central Processing Unit (CPU) to copy the first portion of the data block from the first buffer to one or more corresponding locations in the destination. Example 21 includes the computer-readable medium of example 20, wherein the first portion of the data block comprises first data at the beginning of the data block or second data at the end of the data block. Example 22 includes the computer-readable medium of example 21, wherein the first data and the second data each have a size smaller than a memory unit. Example 23 includes the computer-readable medium of example 22, wherein the memory unit comprises a cache line or 64 bytes. Example 24 includes the computer-readable medium of example 20, wherein the second portion of the data block comprises one or more complete memory units. Example 25 includes the computer-readable medium of example 20, wherein each of the one or more complete memory units comprises 64 bytes.

Example 26 includes a method comprising: causing a Graphics Processing Unit (GPU) to copy at least a first portion of a data block from a source to a buffer and to copy a second portion of the data block from the source to a destination; and causing a Central Processing Unit (CPU) to copy the first portion of the data block from the first buffer to one or more corresponding locations in the destination. Example 27 includes the method of example 26, wherein the first portion of the data block comprises first data at the beginning of the data block or second data at the end of the data block, or wherein the second portion of the data block comprises one or more complete memory units, or wherein each of the one or more complete memory units comprises 64 bytes. Example 28 includes the method of example 27, wherein the first data and the second data each have a size smaller than a memory unit. Example 29 includes the method of example 28, wherein the memory unit comprises a cache line or 64 bytes.

Example 30 includes an apparatus comprising means to perform a method as set forth in any preceding example. Example 31 includes machine-readable storage including machine-readable instructions that, when executed, implement a method or realize an apparatus as set forth in any preceding example.

In various embodiments, the operations discussed herein, e.g., with reference to FIGS. 1-18, may be implemented as hardware (e.g., logic circuitry), software, firmware, or combinations thereof, which may be provided as a computer program product, e.g., including a tangible (e.g., non-transitory) machine-readable or computer-readable medium having stored thereon instructions (or software procedures) used to program a computer to perform a process discussed herein. The machine-readable medium may include a storage device such as those discussed with respect to FIGS. 1-18.

Additionally, such computer-readable media may be downloaded as a computer program product, wherein the program may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals provided in a carrier wave or other propagation medium via a communication link (e.g., a bus, a modem, or a network connection).

Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, and/or characteristic described in connection with the embodiment may be included in at least one implementation. The appearances of the phrase "in one embodiment" in various places in the specification may or may not all refer to the same embodiment.

Also, in the description and claims, the terms "coupled" and "connected", along with their derivatives, may be used. In some embodiments, "connected" may be used to indicate that two or more elements are in direct physical or electrical contact with each other. "Coupled" may mean that two or more elements are in direct physical or electrical contact. However, "coupled" may also mean that two or more elements are not in direct contact with each other, but still cooperate or interact with each other.

Thus, although embodiments have been described in language specific to structural features and/or methodological acts, it is to be understood that the claimed subject matter is not limited to the specific features or acts described. Rather, the specific features and acts are disclosed as sample forms of implementing the claimed subject matter.

114‧‧‧Memory

150‧‧‧Video memory

202‧‧‧Storage device

Claims (24)

1. An apparatus comprising: a Graphics Processing Unit (GPU) to copy at least a first portion of a data block from a source to a buffer and to copy a second portion of the data block from the source to a destination; and a Central Processing Unit (CPU) to copy the first portion of the data block from the buffer to one or more corresponding locations in the destination.

2. The apparatus of claim 1, wherein the first portion of the data block comprises first data at the beginning of the data block or second data at the end of the data block.

3. The apparatus of claim 2, wherein the first data and the second data each have a size smaller than a memory unit.

4. The apparatus of claim 3, wherein the memory unit comprises a cache line or 64 bytes.

5. The apparatus of claim 1, wherein the second portion of the data block comprises one or more complete memory units.

6. The apparatus of claim 5, wherein each of the one or more complete memory units comprises 64 bytes.

7. The apparatus of claim 1, wherein the source comprises video memory coupled to the GPU.

8. The apparatus of claim 1, wherein the destination comprises system memory coupled to the CPU.

9. The apparatus of claim 1, wherein the GPU comprises one or more graphics processing cores.

10. The apparatus of claim 1, wherein the CPU comprises one or more processor cores.

11. The apparatus of claim 1, wherein one or more of the GPU, the CPU, or memory are on a single integrated circuit die.
12. A system comprising: a processor coupled to system memory, the system memory to store a data block copied from video memory, the processor comprising: a Graphics Processing Unit (GPU) to copy: a first portion of the data block from the video memory to a first buffer; a second portion of the data block from the video memory to the system memory; and a third portion of the data block from the video memory to a second buffer; and a Central Processing Unit (CPU) to copy: the first portion of the data block from the first buffer to a corresponding location in the system memory; and the second portion of the data block from the second buffer to a corresponding location in the system memory.

13. The system of claim 12, wherein the data block consists of the first portion, the second portion, and the third portion in consecutive order.

14. The system of claim 12, wherein the first portion and the third portion of the data block each comprise fewer bytes than a memory unit.

15. A computer-readable medium comprising one or more instructions that, when executed on a processor, configure the processor to perform one or more operations to: cause a Graphics Processing Unit (GPU) to copy at least a first portion of a data block from a source to a buffer and to copy a second portion of the data block from the source to a destination; and cause a Central Processing Unit (CPU) to copy the first portion of the data block from the first buffer to one or more corresponding locations in the destination.
16. The computer-readable medium of claim 15, wherein the first portion of the data block comprises first data at the beginning of the data block or second data at the end of the data block.
17. The computer-readable medium of claim 16, wherein the first data and the second data each have a size smaller than a memory unit.
18. The computer-readable medium of claim 17, wherein the memory unit comprises a cache line or 64 bytes.
19. The computer-readable medium of claim 15, wherein the second portion of the data block comprises one or more full memory units.
20. The computer-readable medium of claim 15, wherein each of the one or more full memory units comprises 64 bytes.
21. A method comprising: causing a graphics processing unit (GPU) to copy at least a first portion of a data block from a source to a buffer and to copy a second portion of the data block from the source to a destination; and causing a central processing unit (CPU) to copy the first portion of the data block from the first buffer to one or more corresponding locations in the destination.
22. The method of claim 21, wherein the first portion of the data block comprises first data at the beginning of the data block or second data at the end of the data block, or wherein the second portion of the data block comprises one or more full memory units, or wherein each of the one or more full memory units comprises 64 bytes.
23. The method of claim 22, wherein the first data and the second data each have a size smaller than a memory unit.
24. The method of claim 23, wherein the memory unit comprises a cache line or 64 bytes.
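The split that the claims describe (a partial memory unit at the beginning, whole memory units in the middle, a partial memory unit at the end) can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions, not the patented implementation: the names `split_block` and `two_path_copy` are hypothetical, alignment is assumed to be measured against the destination offset, and both copy paths are modeled with ordinary Python byte slices rather than real GPU or CPU transfers.

```python
CACHE_LINE = 64  # bytes; the claims name a cache line or 64 bytes as the memory unit


def split_block(dst_offset: int, length: int, unit: int = CACHE_LINE):
    """Return (head, body, tail) byte counts for a block written at dst_offset.

    head: bytes before the first unit boundary (less than one unit)
    body: bytes covered by whole units
    tail: bytes after the last whole unit (less than one unit)
    Assumption: alignment is judged on the destination offset.
    """
    if dst_offset < 0 or length < 0 or unit <= 0:
        raise ValueError("offset and length must be >= 0 and unit must be > 0")
    head = min((-dst_offset) % unit, length)
    body = ((length - head) // unit) * unit
    tail = length - head - body
    return head, body, tail


def two_path_copy(src: bytes, dst: bytearray, dst_offset: int, unit: int = CACHE_LINE):
    """Simulate the two-path copy; returns the (head, body, tail) sizes used."""
    if dst_offset + len(src) > len(dst):
        raise ValueError("destination too small for the data block")
    head, body, tail = split_block(dst_offset, len(src), unit)

    # Simulated GPU path: stage the partial head and tail in side buffers
    # and write the whole units straight to the destination.
    first_buffer = bytes(src[:head])
    second_buffer = bytes(src[head + body:])
    body_start = dst_offset + head
    dst[body_start:body_start + body] = src[head:head + body]

    # Simulated CPU path: move the staged pieces to their corresponding
    # locations in the destination.
    dst[dst_offset:dst_offset + head] = first_buffer
    tail_start = body_start + body
    dst[tail_start:tail_start + tail] = second_buffer
    return head, body, tail
```

For example, copying a 200-byte block to destination offset 10 gives `split_block(10, 200) == (54, 128, 18)`: 54 bytes reach the first 64-byte boundary, two whole 64-byte units follow, and 18 bytes remain at the end.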
TW105126630A 2015-09-25 2016-08-19 Apparatus, system and method for performing gpu-cpu two-path memory copy TWI715613B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
PCT/CN2015/090701 WO2017049583A1 (en) 2015-09-25 2015-09-25 Gpu-cpu two-path memory copy
WOPCT/CN2015/090701 2015-09-25

Publications (2)

Publication Number Publication Date
TW201729111A true TW201729111A (en) 2017-08-16
TWI715613B TWI715613B (en) 2021-01-11

Family

ID=58385703

Family Applications (1)

Application Number Title Priority Date Filing Date
TW105126630A TWI715613B (en) 2015-09-25 2016-08-19 Apparatus, system and method for performing gpu-cpu two-path memory copy

Country Status (2)

Country Link
TW (1) TWI715613B (en)
WO (1) WO2017049583A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI734151B (en) * 2019-06-28 2021-07-21 鴻齡科技股份有限公司 Parameter synchronization method, device, and storage medium

Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
CN110134370B (en) * 2018-02-08 2023-09-12 龙芯中科技术股份有限公司 Graph drawing method and device, electronic equipment and storage medium
CN118520482B (en) * 2024-07-17 2024-09-27 湖北芯擎科技有限公司 Mailbox implementation method and device supporting information and function security

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
US7671864B2 (en) * 2000-01-14 2010-03-02 Roman Kendyl A Faster image processing
US7768517B2 (en) * 2006-02-21 2010-08-03 Nvidia Corporation Asymmetric multi-GPU processing
US8648867B2 (en) * 2006-09-25 2014-02-11 Neurala Llc Graphic processor based accelerator system and method
US8918552B2 (en) * 2008-10-24 2014-12-23 International Business Machines Corporation Managing misaligned DMA addresses
US9990287B2 (en) * 2011-12-21 2018-06-05 Intel Corporation Apparatus and method for memory-hierarchy aware producer-consumer instruction
WO2013097098A1 (en) * 2011-12-27 2013-07-04 华为技术有限公司 Data processing method, graphics processing unit (gpu) and first node device
US9164690B2 (en) * 2012-07-27 2015-10-20 Nvidia Corporation System, method, and computer program product for copying data between memory locations
KR101710001B1 (en) * 2012-08-10 2017-02-27 한국전자통신연구원 Apparatus and Method for JPEG2000 Encoding/Decoding based on GPU
US9245496B2 (en) * 2012-12-21 2016-01-26 Qualcomm Incorporated Multi-mode memory access techniques for performing graphics processing unit-based memory transfer operations

Also Published As

Publication number Publication date
TWI715613B (en) 2021-01-11
WO2017049583A1 (en) 2017-03-30

Similar Documents

Publication Publication Date Title
US10796401B2 (en) Efficient merging of atomic operations at computing devices
US20180307533A1 (en) Faciltating multi-level microcontroller scheduling for efficient computing microarchitecture
US10572288B2 (en) Apparatus and method for efficient communication between virtual machines
US20200371804A1 (en) Boosting local memory performance in processor graphics
US20190236758A1 (en) Apparatus and method for temporally stable conservative morphological anti-aliasing
US10037625B2 (en) Load-balanced tessellation distribution for parallel architectures
US11068401B2 (en) Method and apparatus to improve shared memory efficiency
US20170083450A1 (en) Supporting Data Conversion and Meta-Data in a Paging System
US11610564B2 (en) Consolidation of data compression using common sectored cache for graphics streams
US20180121202A1 (en) Simd channel utilization under divergent control flow
TWI715613B (en) Apparatus, system and method for performing gpu-cpu two-path memory copy
US20180075573A1 (en) Minimum/maximum and bitwise and/or based coarse stencil test
US11508338B2 (en) Register spill/fill using shared local memory space
US11150943B2 (en) Enabling a single context hardware system to operate as a multi-context system
US10430229B2 (en) Multiple-patch SIMD dispatch mode for domain shaders
US20180308215A1 (en) Dynamic allocation of cache based on instantaneous bandwidth consumption at computing devices
US10884932B2 (en) Independent and separate entity-based cache
US10902546B2 (en) Efficient skipping of data compression processes at computing devices
US10332278B2 (en) Multi-format range detect YCoCg compression
US10402345B2 (en) Deferred discard in tile-based rendering