KR101009557B1

KR101009557B1 - Hybrid multisample/supersample antialiasing

Info

Publication number: KR101009557B1
Application number: KR1020090060607A
Authority: KR
Inventors: Cass W. Everitt; Steven E. Molnar
Original assignee: NVIDIA Corporation
Priority date: 2008-07-03
Filing date: 2009-07-03
Publication date: 2011-01-18
Also published as: TWI425440B; TW201007610A; KR20100004890A; JP2010020764A; JP4744624B2

Abstract

Systems and methods for dynamically adjusting the pixel sampling rate during primitive shading can improve image quality or increase shading performance. Hybrid antialiasing is performed by selecting the number of samples shaded per pixel fragment. A combination of supersample and multisample antialiasing is used, in which a cluster of subpixel samples (multisamples) is processed for each pass through the fragment shader pipeline. The number of multisamples in each cluster and the number of shader passes can be determined dynamically for each primitive based on rendering state.

Shading, primitives, clusters, multisamples, supersamples

Description

Hybrid multisample/supersample antialiasing

Embodiments of the present invention relate generally to antialiasing techniques for graphics processing and, more specifically, to dynamically adjusting the number of samples shaded per pixel fragment.

Conventionally, graphics processors are configured to perform antialiasing by either multisampling or supersampling. In multisampling, each pixel fragment is shaded once, and the resulting color value is replicated to all covered subpixel samples. In supersampling, each pixel fragment is shaded N times, once for each covered subpixel sample.
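By way of illustration only, the difference in shading cost between the two conventional modes can be sketched as follows. The function name `shader_invocations` and the bitmask encoding of sample coverage are assumptions introduced for this example and do not appear in the disclosure.

```python
def shader_invocations(coverage_mask: int, mode: str) -> int:
    """Count fragment shader runs for one pixel fragment.

    coverage_mask: bit i is set when subpixel sample i is covered
                   (an assumed encoding, chosen for this sketch).
    mode: "multisample" shades once per fragment and replicates the color;
          "supersample" shades once per covered sample.
    """
    covered = bin(coverage_mask).count("1")
    if covered == 0:
        return 0  # no covered samples, so nothing is shaded
    if mode == "multisample":
        return 1
    if mode == "supersample":
        return covered
    raise ValueError(f"unknown mode: {mode}")
```

For a fully covered pixel with four samples (mask `0b1111`), this gives one shader run under multisampling and four under supersampling.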

Multisampling is well suited to antialiasing primitive edges, because what matters there is which samples are covered by the incoming primitive. Textures are typically prefiltered so that shaded color values have a spatial frequency low enough that shading once per pixel is adequate. However, some effects, such as textured alpha transparency and high-frequency specular highlights, can have frequencies higher than the pixel frequency and can require shading to be performed at a rate higher than the pixel frequency in order to avoid aliasing artifacts. Supersampling is typically required to avoid these types of aliasing. However, shading at every sample in a pixel is prohibitively expensive, since shading is typically the most expensive operation in rendering. In addition, some supersampling implementations require the input primitives to be processed multiple times, once for each subpixel sample, which introduces further inefficiency. A shading rate of more than once per pixel but less than once per sample may be sufficient to mitigate these causes of aliasing.

Accordingly, what is needed in the art is a system and method for using a pixel shading rate that is appropriate for the geometry currently being rendered. The shading rate may be increased to improve image quality or reduced to improve shading performance.

A system and method for dynamically adjusting the pixel sampling rate during primitive shading can improve image quality or increase shading performance. The shading rate can vary from once per pixel (multisampling) to once per sample (supersampling), or anywhere in between, to improve image quality or increase shading performance. Given a specified number of samples per pixel for the render target (image buffer), the number of shader passes is selected dynamically. A combination of supersample and multisample antialiasing is used, in which a cluster of subpixel samples (multisamples) is processed for each pass through the fragment shader. The supersample clusters are combined for each pixel to produce an antialiased pixel.
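A minimal sketch of the hybrid scheme summarized above follows. It assumes, purely for illustration, that the samples of a pixel are split into equally sized contiguous clusters, that the pixel is fully covered, and that the clusters are combined by an unweighted average; the names `make_clusters` and `shade_pixel` are hypothetical.

```python
from typing import Callable, List, Tuple

Color = Tuple[float, float, float]


def make_clusters(samples_per_pixel: int, num_clusters: int) -> List[List[int]]:
    """Partition sample indices 0..N-1 into equally sized contiguous clusters.

    Equal, contiguous clusters are an assumption made for this sketch.
    """
    if num_clusters < 1 or samples_per_pixel % num_clusters != 0:
        raise ValueError("num_clusters must evenly divide samples_per_pixel")
    size = samples_per_pixel // num_clusters
    return [list(range(c * size, (c + 1) * size)) for c in range(num_clusters)]


def shade_pixel(samples_per_pixel: int, num_clusters: int,
                shade: Callable[[int], Color]) -> Color:
    """Shade one fully covered pixel using one shader pass per cluster."""
    sample_colors: List[Color] = [(0.0, 0.0, 0.0)] * samples_per_pixel
    clusters = make_clusters(samples_per_pixel, num_clusters)
    for pass_index, cluster in enumerate(clusters):
        color = shade(pass_index)          # one fragment shader pass per cluster
        for sample in cluster:
            sample_colors[sample] = color  # replicated within the cluster
    # Assumed resolve step: unweighted average over all samples.
    return tuple(
        sum(c[i] for c in sample_colors) / samples_per_pixel for i in range(3)
    )
```

With one cluster the sketch reduces to multisampling (one pass per pixel), and with as many clusters as samples it reduces to supersampling (one pass per sample); intermediate cluster counts give the in-between shading rates.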

Various embodiments of a method of the invention for shading primitives using hybrid antialiasing, in a computing device configured to generate multiple samples per pixel, include receiving a graphics primitive and determining the number of supersample clusters used to antialias each pixel intersected by the graphics primitive. The graphics primitive is shaded using multiple passes through a fragment shading unit in the computing device, where the number of passes used to produce each hybrid antialiased pixel intersected by the graphics primitive is less than or equal to the number of supersample clusters.

Various embodiments of the invention include a computing device configured to shade graphics primitives using hybrid antialiasing. The computing device includes a rasterizer coupled to a fragment shading unit. The rasterizer includes a hybrid antialiasing control unit configured to receive a graphics primitive and to determine the number of supersample clusters used to antialias each pixel intersected by the graphics primitive. The fragment shading unit is configured to shade the graphics primitive using multiple passes, where the number of passes used to produce each hybrid antialiased pixel intersected by the graphics primitive is less than or equal to the number of supersample clusters.
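The summary states that the number of passes is at most the number of supersample clusters but does not say here when it is smaller. One plausible reading, offered only as an assumption, is that a cluster containing no covered samples needs no shader pass. The sketch below counts passes under that assumption; `passes_needed` and the bitmask coverage encoding are hypothetical.

```python
def passes_needed(coverage_mask: int, samples_per_pixel: int,
                  num_clusters: int) -> int:
    """Count shader passes for one pixel, skipping uncovered clusters.

    Assumptions for this sketch: bit i of coverage_mask marks sample i as
    covered, clusters are contiguous and equally sized, and a cluster with
    no covered samples is not shaded.
    """
    if num_clusters < 1 or samples_per_pixel % num_clusters != 0:
        raise ValueError("num_clusters must evenly divide samples_per_pixel")
    size = samples_per_pixel // num_clusters
    cluster_bits = (1 << size) - 1
    passes = 0
    for c in range(num_clusters):
        # Extract the coverage bits that belong to cluster c.
        if (coverage_mask >> (c * size)) & cluster_bits:
            passes += 1
    return passes
```

Under these assumptions a fully covered pixel uses exactly `num_clusters` passes, while a pixel on a primitive edge may use fewer.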

So that the manner in which the above-recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of the invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.

In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one of skill in the art that the present invention may be practiced without one or more of these specific details. In other instances, well-known features have not been described in order to avoid obscuring the present invention.

System overview

FIG. 1 is a block diagram illustrating a computer system 100 configured to implement one or more aspects of the present invention. Computer system 100 includes a central processing unit (CPU) 102 and a system memory 104 communicating via a bus path that includes a memory bridge 105. Memory bridge 105, which may be, e.g., a Northbridge chip, is connected via a bus or other communication path 106 (e.g., a HyperTransport link) to an I/O (input/output) bridge 107. I/O bridge 107, which may be, e.g., a Southbridge chip, receives user input from one or more user input devices 108 (e.g., keyboard, mouse) and forwards the input to CPU 102 via path 106 and memory bridge 105. A parallel processing subsystem 112 is coupled to memory bridge 105 via a bus or other communication path 113 (e.g., a PCI Express, Accelerated Graphics Port, or HyperTransport link); in one embodiment, parallel processing subsystem 112 is a graphics subsystem that delivers pixels to a display device 110 (e.g., a conventional CRT or LCD based monitor). A device driver 103, stored in system memory 104, interfaces between processes executed by CPU 102, such as application programs, and parallel processing subsystem 112, translating program instructions as needed for execution by parallel processing subsystem 112.

A system disk 114 is also connected to I/O bridge 107. A switch 116 provides connections between I/O bridge 107 and other components such as a network adapter 118 and various add-in cards 120 and 121. Other components (not explicitly shown), including USB or other port connections, CD drives, DVD drives, film recording devices, and the like, may also be connected to I/O bridge 107. The communication paths interconnecting the various components in FIG. 1 may be implemented using any suitable protocols, such as PCI (Peripheral Component Interconnect), PCI Express (PCI-E), AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s), and connections between different devices may use different protocols as is known in the art.

FIG. 2 illustrates an embodiment of parallel processing subsystem 112. Parallel processing subsystem 112 includes one or more parallel processing units (PPUs) 202, each of which is coupled to a local parallel processing (PP) memory 204. In general, a parallel processing subsystem includes a number U of PPUs, where U ≥ 1. (Herein, multiple instances of like objects are denoted, where needed, with reference numbers identifying the object and parenthetical numbers identifying the instance.) PPUs 202 and PP memories 204 may be implemented using one or more integrated circuit devices, such as programmable processors, application specific integrated circuits (ASICs), and memory devices.

As shown in detail for PPU 202(0), each PPU 202 includes a host interface 206 that communicates with the rest of system 100 via communication path 113, which connects to memory bridge 105 (or, in one alternative embodiment, directly to CPU 102). In one embodiment, communication path 113 is a PCI-E link, in which dedicated lanes are allocated to each PPU 202 as is known in the art. Other communication paths may also be used. Host interface 206 generates packets (or other signals) for transmission on communication path 113 and also receives all incoming packets (or other signals) from communication path 113 and directs them to appropriate components of PPU 202. For example, commands related to processing tasks may be directed to a front end unit 212, while commands related to memory operations (e.g., reading from or writing to PP memory 204) may be directed to a memory interface 214. Host interface 206, front end unit 212, and memory interface 214 may be of generally conventional design, and a detailed description is omitted as not being critical to the present invention.

Each PPU 202 advantageously implements a highly parallel processor. As shown in detail for PPU 202(0), a PPU 202 includes a number C of cores 208, where C ≥ 1. Each processing core 208 is capable of executing a large number (e.g., tens or hundreds) of threads concurrently, where each thread is an instance of a program; one embodiment of a multithreaded processing core 208 is described below. Cores 208 receive processing tasks to be executed via a work distribution unit 210, which receives commands defining processing tasks from front end unit 212. Work distribution unit 210 can implement a variety of algorithms for distributing work. For instance, in one embodiment, work distribution unit 210 receives a "ready" signal from each core 208 indicating whether that core has sufficient resources to accept a new processing task. When a new processing task arrives, work distribution unit 210 assigns the task to a core 208 that is asserting the ready signal; if no core 208 is asserting the ready signal, work distribution unit 210 holds the new processing task until a ready signal is asserted by a core 208. Those skilled in the art will recognize that other algorithms may also be used and that the particular manner in which work distribution unit 210 distributes incoming processing tasks is not critical to the present invention.
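The ready-signal scheme described for work distribution unit 210 can be sketched as follows. This is an illustrative model only: the class `WorkDistributor`, the choice of the lowest-numbered ready core, and the assumption that a core deasserts its ready signal after accepting one task are all introduced for the example and are not taken from the disclosure.

```python
from collections import deque
from typing import Deque, List, Tuple


class WorkDistributor:
    """Toy model of assigning tasks to cores that assert a ready signal."""

    def __init__(self, num_cores: int) -> None:
        self.ready: List[bool] = [True] * num_cores   # per-core ready signal
        self.pending: Deque[str] = deque()            # tasks being held
        self.assignments: List[Tuple[int, str]] = []  # (core, task) log

    def submit(self, task: str) -> None:
        """A new processing task arrives from the front end."""
        self.pending.append(task)
        self._dispatch()

    def set_ready(self, core: int, ready: bool) -> None:
        """A core asserts or deasserts its ready signal."""
        self.ready[core] = ready
        self._dispatch()

    def _dispatch(self) -> None:
        while self.pending:
            # Assumed policy: pick the lowest-numbered ready core.
            core = next((i for i, r in enumerate(self.ready) if r), None)
            if core is None:
                return  # hold the task until some core asserts ready
            self.assignments.append((core, self.pending.popleft()))
            # Assumption: a core deasserts ready after accepting a task.
            self.ready[core] = False
```

With two cores and three submitted tasks, the third task is held in `pending` until one of the cores asserts its ready signal again.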

Cores 208 communicate with memory interface 214 to read from or write to various external memory devices. In one embodiment, memory interface 214 includes an interface adapted to communicate with local PP memory 204, as well as a connection to host interface 206, thereby enabling the cores 208 to communicate with system memory 104 or other memory that is not local to PPU 202. Memory interface 214 can be of generally conventional design, and a detailed description is omitted.

Cores 208 can be programmed to execute processing tasks relating to a wide variety of applications, including but not limited to linear and nonlinear data transforms, filtering of video and/or audio data, modeling operations (e.g., applying laws of physics to determine position, velocity, and other attributes of objects), image rendering operations (e.g., vertex shader, geometry shader, and/or pixel shader programs), and so on. PPUs 202 may transfer data from system memory 104 and/or local PP memories 204 into internal (on-chip) memory, process the data, and write result data back to system memory 104 and/or local PP memories 204, where such data can be accessed by other system components, including, e.g., CPU 102 or another parallel processing subsystem 112.

Referring again to FIG. 1, in some embodiments, some or all of the PPUs 202 in parallel processing subsystem 112 are graphics processors with rendering pipelines that can be configured to perform various tasks related to generating pixel data from graphics data supplied by CPU 102 and/or system memory 104 via memory bridge 105 and bus 113, interacting with local PP memory 204 (which can be used as graphics memory including, e.g., a conventional frame buffer) to store and update pixel data, delivering pixel data to display device 110, and the like. In some embodiments, parallel processing subsystem 112 may include one or more PPUs 202 that operate as graphics processors and one or more other PPUs 202 that are used for general-purpose computations. The PPUs 202 may be identical or different, and each PPU 202 may have its own dedicated PP memory device(s) 204 or no dedicated PP memory device(s).

In operation, CPU 102 is the master processor of system 100, controlling and coordinating the operations of other system components. In particular, CPU 102 issues commands that control the operation of PPUs 202. In some embodiments, CPU 102 writes a stream of commands for each PPU 202 to a pushbuffer (not explicitly shown in FIG. 1), which may be located in system memory 104, PP memory 204, or another storage location accessible to both CPU 102 and PPU 202. PPU 202 reads the command stream from the pushbuffer and executes commands asynchronously with the operation of CPU 102. PPUs 202 can therefore be configured to offload processing from CPU 102 in order to increase the processing throughput and/or performance of system 100.
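The decoupling provided by a pushbuffer can be illustrated with a simple first-in, first-out queue: the CPU side appends commands without waiting, and the PPU side consumes them later in order. The class `PushBuffer`, the function `ppu_drain`, and the command strings below are hypothetical and model only the ordering behavior, not any actual command format or memory layout.

```python
from collections import deque
from typing import Deque, List, Optional


class PushBuffer:
    """Toy FIFO command buffer shared by a producer (CPU) and consumer (PPU)."""

    def __init__(self) -> None:
        self._commands: Deque[str] = deque()

    def write(self, command: str) -> None:
        """CPU side: append a command and return immediately."""
        self._commands.append(command)

    def read(self) -> Optional[str]:
        """PPU side: fetch the oldest command, or None if the buffer is empty."""
        return self._commands.popleft() if self._commands else None


def ppu_drain(pushbuffer: PushBuffer) -> List[str]:
    """PPU side: execute (here, just record) all commands currently queued."""
    executed: List[str] = []
    command = pushbuffer.read()
    while command is not None:
        executed.append(command)
        command = pushbuffer.read()
    return executed
```

Because `write` never waits for `read`, the producer can continue with other work while queued commands are consumed later, which is the asynchronous behavior described above.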

It will be appreciated that the system shown herein is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, may be modified as desired. For instance, in some embodiments, system memory 104 is connected to CPU 102 directly rather than through a bridge, and other devices communicate with system memory 104 via memory bridge 105 and CPU 102. In other alternative topologies, parallel processing subsystem 112 is connected to I/O bridge 107 or directly to CPU 102, rather than to memory bridge 105. In still other embodiments, I/O bridge 107 and memory bridge 105 may be integrated into a single chip. The particular components shown herein are optional; for instance, any number of add-in cards or peripheral devices may be supported. In some embodiments, switch 116 is eliminated, and network adapter 118 and add-in cards 120, 121 connect directly to I/O bridge 107.

The connection of PPU 202 to the rest of system 100 may also be varied. In some embodiments, PP system 112 is implemented as an add-in card that can be inserted into an expansion slot of system 100. In other embodiments, a PPU 202 can be integrated on a single chip with a bus bridge, such as memory bridge 105 or I/O bridge 107. In still other embodiments, some or all elements of PPU 202 may be integrated on a single chip with CPU 102.

A PPU may be provided with any amount of local PP memory, including no local memory, and may use local memory and system memory in any combination. For instance, a PPU 202 can be a graphics processor in a unified memory architecture (UMA) embodiment; in such embodiments, little or no dedicated graphics (PP) memory is provided, and PPU 202 uses system memory exclusively or almost exclusively. In UMA embodiments, a PPU 202 may be integrated into a bridge chip or processor chip, or provided as a discrete chip with a high-speed link (e.g., PCI-E) connecting the PPU to system memory, e.g., via a bridge chip.

As noted above, any number of PPUs 202 can be included in a parallel processing subsystem. For instance, multiple PPUs 202 can be provided on a single add-in card, multiple add-in cards can be connected to communication path 113, or one or more of the PPUs 202 can be integrated into a bridge chip. The PPUs in a multi-PPU system may be identical to or different from one another; for instance, different PPUs may have different numbers of cores, different amounts of local PP memory, and so on. Where multiple PPUs 202 are present, they may be operated in parallel to process data at a higher throughput than is possible with a single PPU 202. Systems incorporating one or more PPUs 202 may be implemented in a variety of configurations and form factors, including desktop, laptop, or handheld personal computers, servers, workstations, game consoles, embedded systems, and the like.

Core overview

FIG. 3 is a block diagram of a core 208 for the parallel processing subsystem 112 of FIG. 2, in accordance with one or more aspects of the present invention. PPU 202 includes a core 208 (or multiple cores 208) configured to execute multiple threads in parallel, where the term "thread" refers to an instance of a context, i.e., a particular program executing on a particular set of input data. In some embodiments, single-instruction, multiple-data (SIMD) instruction issue techniques are used to support parallel execution of multiple threads without providing multiple independent instruction units.

In one embodiment, each core 208 includes an array of P (e.g., 8, 16, etc.) parallel processing engines 302 configured to receive SIMD instructions from a single instruction unit 312. Each processing engine 302 advantageously includes an identical set of functional units (e.g., arithmetic logic units, etc.). The functional units may be pipelined, allowing a new instruction to be issued before a previous instruction has finished, as is known in the art. Any combination of functional units may be provided. In one embodiment, the functional units support a variety of operations including integer and floating-point arithmetic (e.g., addition and multiplication), comparison operations, Boolean operations (AND, OR, XOR), bit-shifting, and computation of various algebraic functions (e.g., planar interpolation, trigonometric, exponential, and logarithmic functions, etc.); and the same functional-unit hardware can be leveraged to perform different operations.

Each processing engine 302 uses space in a local register file (LRF) 304 for storing its local input data, intermediate results, and the like. In one embodiment, local register file 304 is physically or logically divided into P lanes, each having some number of entries (where each entry might store, e.g., a 32-bit word). One lane is assigned to each processing engine 302, and corresponding entries in different lanes can be populated with data for different threads executing the same program to facilitate SIMD execution. In some embodiments, each processing engine 302 can only access LRF entries in the lane assigned to it. The total number of entries in local register file 304 is advantageously large enough to support multiple concurrent threads per processing engine 302.
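The lane organization of local register file 304 can be pictured with a small model in which one instruction names entry indices and every lane applies it to its own copy of those entries. The class `LocalRegisterFile`, its method names, and the 32-bit wraparound add are assumptions made for this sketch.

```python
from typing import List


class LocalRegisterFile:
    """Toy register file with one lane of 32-bit entries per processing engine."""

    MASK32 = 0xFFFFFFFF

    def __init__(self, num_lanes: int, entries_per_lane: int) -> None:
        self._lanes: List[List[int]] = [
            [0] * entries_per_lane for _ in range(num_lanes)
        ]

    def write(self, engine: int, entry: int, value: int) -> None:
        """An engine writes only to the lane assigned to it."""
        self._lanes[engine][entry] = value & self.MASK32

    def read(self, engine: int, entry: int) -> int:
        """An engine reads only from the lane assigned to it."""
        return self._lanes[engine][entry]

    def simd_add(self, dst: int, src_a: int, src_b: int) -> None:
        """One instruction, applied to the corresponding entries of every lane."""
        for lane in self._lanes:
            lane[dst] = (lane[src_a] + lane[src_b]) & self.MASK32
```

Because each lane holds data for a different thread at the same entry indices, a single `simd_add` produces a different result per lane, which is the SIMD behavior the lane layout is meant to facilitate.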

Each processing engine 302 also has access to an on-chip shared memory 306 that is shared among all of the processing engines 302 in core 208. Shared memory 306 may be as large as desired, and in some embodiments, any processing engine 302 can read from or write to any location in shared memory 306 with equally low latency (e.g., comparable to accessing local register file 304). In some embodiments, shared memory 306 is implemented as a shared register file; in other embodiments, shared memory 306 can be implemented using shared cache memory.

In addition to shared memory 306, some embodiments also provide additional on-chip parameter memory and/or cache(s) 308, which may be implemented, e.g., as a conventional RAM or cache. Parameter memory/cache 308 can be used, e.g., to hold state parameters and/or other data (e.g., various constants) that may be needed by multiple threads. Processing engines 302 also have access via memory interface 214 to off-chip "global" memory, which can include, e.g., PP memory 204 and/or system memory 104, with system memory 104 being accessible via host interface 206. It is to be understood that any memory external to PPU 202 may be used as global memory.

In one embodiment, each processing engine 302 is multithreaded and can execute up to some number G (e.g., 24) of threads concurrently, e.g., by maintaining current state information associated with each thread in a different portion of its assigned lane in local register file 304. Processing engines 302 are advantageously designed to switch rapidly from one thread to another so that instructions from different threads can be issued in any sequence without loss of efficiency. Because each thread may correspond to a different context, multiple contexts may be processed over multiple cycles as different threads are issued in each cycle.

Instruction unit 312 is configured such that, for any given processing cycle, an instruction (INSTR) is issued to each of the P processing engines 302. Each processing engine 302 may receive a different instruction in any given processing cycle when multiple contexts are being processed simultaneously. When all P processing engines 302 process a single context, core 208 implements a P-way SIMD microarchitecture. Because each processing engine 302 is also multithreaded, supporting up to G threads concurrently, core 208 in this embodiment can have up to P*G threads executing concurrently. For instance, if P = 16 and G = 24, core 208 supports up to 384 concurrent threads for a single context, or N*24 concurrent threads for each context, where N is the number of processing engines 302 allocated to the context.
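The thread-capacity arithmetic in the preceding paragraph can be checked with a minimal sketch. The function name and the choice of N = 4 are illustrative assumptions and do not appear in the description.

```python
def max_concurrent_threads(engines: int, threads_per_engine: int) -> int:
    """Upper bound on concurrent threads: engines times threads per engine."""
    return engines * threads_per_engine


P = 16  # processing engines in the core (example value from the text)
G = 24  # threads supported per processing engine (example value from the text)

# Single context occupying all P engines: P * G threads.
single_context = max_concurrent_threads(P, G)

# A context that has been allocated N engines: N * G threads.
N = 4  # hypothetical allocation, chosen only for illustration
per_context = max_concurrent_threads(N, G)

print(single_context, per_context)
```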

Operation of core 208 is advantageously controlled via work distribution unit 210. In some embodiments, work distribution unit 210 receives pointers to data to be processed (e.g., primitive data, vertex data, and/or pixel data) as well as the locations of pushbuffers containing data or instructions that define how the data is to be processed (e.g., what program is to be executed). Work distribution unit 210 can load the data to be processed into shared memory 306 and load parameters into parameter memory 308. Work distribution unit 210 also initializes each new context in instruction unit 312 and then signals instruction unit 312 to begin executing the context. Instruction unit 312 reads the instruction pushbuffers and executes the instructions to produce processed data. When execution of a context is completed, core 208 advantageously notifies work distribution unit 210. Work distribution unit 210 can then initiate other processes, e.g., to retrieve output data from shared memory 306 and/or to prepare core 208 for execution of additional contexts.

It will be appreciated that the parallel processing unit and core architecture described herein are illustrative and that variations and modifications are possible. Any number of processing engines may be included. In some embodiments, each processing engine 302 has its own local register file, and the allocation of local register file entries per thread can be fixed or configurable as desired. In particular, entries of local register file 304 may be allocated for processing each context. Further, while only one core 208 is shown, a PPU 202 may include any number of cores 208, which are advantageously of identical design to each other so that execution behavior does not depend on which core 208 receives a particular processing task. Each core 208 advantageously operates independently of the other cores 208 and has its own processing engines, shared memory, and so on.

Graphics Pipeline Architecture

FIG. 4 is a conceptual diagram of a graphics processing pipeline 400, in accordance with one or more aspects of the present invention. PPU 202 may be configured to form graphics processing pipeline 400. For example, core 208 may be configured to perform the functions of one or more of vertex processing unit 444, geometry processing unit 448, and fragment processing unit 460. The functions of data assembler 442, primitive assembler 446, rasterizer 455, and raster operations unit 465 may also be performed by core 208. Alternatively, graphics processing pipeline 400 may be implemented using dedicated processing units for one or more of vertex processing unit 444, geometry processing unit 448, fragment processing unit 460, data assembler 442, primitive assembler 446, rasterizer 455, and raster operations unit 465.

Data assembler 442 is a processing unit that collects vertex data for high-order surfaces, primitives, and the like, and outputs the vertex data to vertex processing unit 444. Vertex processing unit 444 is a programmable execution unit that is configured to execute vertex shader programs, transforming vertex data as specified by the vertex shader programs. For example, vertex processing unit 444 may be programmed to transform the vertex data from an object-based coordinate representation (object space) to an alternatively based coordinate system such as world space or normalized device coordinates (NDC) space. Vertex processing unit 444 may read data that is stored in PP memory 204 or system memory 104 for use in processing the vertex data.
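As an illustration of the kind of coordinate transformation a vertex shader program might perform, the following sketch moves a vertex from object space to world space with a 4x4 matrix and then maps a clip-space position to NDC by dividing by w. The helper names, the translation matrix, and the sample values are assumptions made for this example; the description does not prescribe any particular transform.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def to_ndc(clip):
    """Perspective divide: map a clip-space position (x, y, z, w) to NDC."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)


# Hypothetical model matrix: translate by +1 along x (object space to world space).
translate_x = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

object_vertex = [1, 2, 3, 1]                 # homogeneous object-space position
world_vertex = mat_vec(translate_x, object_vertex)

# A clip-space position is assumed here directly; a projection matrix
# would normally produce it from the world-space or view-space position.
ndc_vertex = to_ndc((4, 2, 6, 2))
```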

Primitive assembler 446 receives processed vertex data from vertex processing unit 444 and constructs graphics primitives, e.g., points, lines, triangles, or the like, for processing by geometry processing unit 448. Geometry processing unit 448 is a programmable execution unit that is configured to execute geometry shader programs, transforming graphics primitives received from primitive assembler 446 as specified by the geometry shader programs. For example, geometry processing unit 448 may be programmed to subdivide the graphics primitives into one or more new graphics primitives and to calculate parameters, such as plane equation coefficients, that are used to rasterize the new graphics primitives. In some embodiments of the present invention, geometry processing unit 448 may also add or delete elements in the geometry stream. Geometry processing unit 448 outputs the parameters and vertices specifying the new graphics primitives to rasterizer 455 or to memory interface 214. Geometry processing unit 448 may read data that is stored in PP memory 204 or system memory 104 for use in processing the geometry data.

Rasterizer 455 scan converts the new graphics primitives and outputs fragments and coverage data to fragment processing unit 460. When antialiasing is used to produce image data, rasterizer 455 is configured to generate sub-pixel sample coverage data. When hybrid antialiasing is used, hybrid antialias control unit 500, which may reside within rasterizer 455, is configured to determine the number of passes through fragment processing unit 460 that are used to process each primitive, as described in conjunction with FIGS. 5C and 6.

Fragment processing unit 460 is a programmable execution unit that is configured to execute fragment shader programs, transforming fragments received from rasterizer 455 as specified by the fragment shader programs. For example, fragment processing unit 460 may be programmed to perform operations such as perspective correction, texture mapping, shading, blending, and the like, to produce shaded fragments that are output to raster operations unit 465. Fragment processing unit 460 may read data that is stored in PP memory 204 or system memory 104 for use in processing the fragment data. Fragments may be shaded at pixel, sample, or supersample cluster granularity, depending on the sampling rate selected by the hybrid antialias control unit.

Memory interface 214 produces read requests for data stored in graphics memory and performs texture filtering operations, e.g., bilinear, trilinear, anisotropic, and the like. In some embodiments of the present invention, memory interface 214 may be configured to decompress data. In particular, memory interface 214 may be configured to decompress fixed-length block-encoded data, such as compressed data represented in a DXT format. Raster operations unit 465 is a processing unit that performs raster operations, such as stencil, z test, and the like, and outputs pixel data as processed graphics data for storage in graphics memory. The processed graphics data may be stored in graphics memory, e.g., PP memory 204 and/or system memory 104, for display on display device 110 or for further processing by CPU 102 or parallel processing subsystem 112. In some embodiments of the present invention, raster operations unit 465 is configured to compress z or color data that is written to memory and to decompress z or color data that is read from memory.

Hybrid Antialiasing

As previously described, PPU 202 may be configured to perform shading at various sampling rates in order to improve image quality or to improve shading performance. The hybrid antialias control unit determines the number of shader passes that are used to shade each pixel within a primitive. For each pass, a supersample cluster of one or more multisamples (sub-pixel samples) per pixel is processed by core 208, configured as fragment processing unit 460, to produce a single shaded color value that is replicated for all of the multisamples in the supersample cluster. After a scene is rendered, the samples for the supersample clusters are combined to produce an antialiased image.

The number of sub-pixel samples and the number of shader passes for each primitive may be increased to improve image quality. The number of sub-pixel samples is determined when the application is launched and is the same for every pixel of the render target (image buffer). The hybrid antialias control unit may dynamically determine the number of shading passes based on rendering state, e.g., alpha test enable/disable, texture map content, user-provided quality/performance controls, and the like.

FIG. 5A illustrates supersample clusters 503 and 511 and multisamples 502, 504, and 513 within a pixel 501, in accordance with one or more aspects of the present invention. When eight sub-pixel sample antialiasing is used, a variety of different combinations of multisamples and supersample clusters may be used to produce the eight sub-pixel samples. In the example shown in FIG. 5A, the three supersample clusters 503 and supersample cluster 511 each include two multisamples, such as multisamples 502 and 504 in supersample cluster 511, for a total of eight sub-pixel sample positions within pixel 501. Other eight sub-pixel sample configurations range from as many as eight supersample clusters with one multisample each to as few as one supersample cluster with eight multisamples. Shading is performed once for each supersample cluster, and the shaded value, e.g., color, is stored for all of the multisamples in the supersample cluster.
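The ways of dividing a fixed sample count into supersample clusters can be enumerated with a short sketch. It assumes, purely for illustration, that all clusters in a pixel are the same size and that each cluster owns a contiguous run of sample indices; the description requires neither, and both function names are hypothetical.

```python
def hybrid_configurations(samples_per_pixel: int):
    """List (shader_passes, multisamples_per_cluster) pairs with equal-size clusters."""
    return [
        (passes, samples_per_pixel // passes)
        for passes in range(1, samples_per_pixel + 1)
        if samples_per_pixel % passes == 0
    ]


def cluster_masks(samples_per_pixel: int, passes: int):
    """Coverage bitmask owned by each cluster, assuming contiguous sample indices."""
    size = samples_per_pixel // passes
    return [((1 << size) - 1) << (p * size) for p in range(passes)]


# Eight sub-pixel samples: pure multisampling (1 pass x 8 samples) through
# pure supersampling (8 passes x 1 sample), with hybrids in between.
configs = hybrid_configurations(8)

# Four clusters of two multisamples each, matching the layout of FIG. 5A.
masks = cluster_masks(8, 4)
```

Under these assumptions the four-cluster case assigns samples 0-1, 2-3, 4-5, and 6-7 to successive shader passes.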

Shader attributes may be sampled at the position of a particular multisample within the supersample cluster, or at some other position within or near the supersample cluster. For example, in FIG. 5A the fragment attributes (color, texture coordinates, and the like) may be sampled at the solid multisample positions, such as multisample 502 in supersample cluster 511. In addition, when a fragment only partially covers a supersample cluster, it is advantageous to adjust the position at which attributes are sampled so that it lies within the region of the covered multisamples in the supersample cluster. This is commonly known as centroid sampling, a term that is applied herein to supersample clusters rather than to entire pixel fragments.

FIG. 5B illustrates a fragment 509 and a centroid position 517 within supersample cluster 511, in accordance with one or more aspects of the present invention. In some embodiments of the present invention, centroid sampling is used to modify the position at which attributes are evaluated so that it corresponds better to the screen area actually covered by the fragment. In some embodiments of the present invention, sample interpolation unit 510 may be configured to sample each supersample cluster at a particular multisample position or at an approximated centroid position.

The centroid may be the geometric centroid of the covered multisamples, or it may be approximated, e.g., by selecting the covered multisample in the supersample cluster that is closest to the centroid of the fully covered supersample cluster. For example, centroid position 517 is a multisample position computed at the geometric center of supersample cluster 511 and is used to represent the sampled color for supersample cluster 511, because multisample 502 lies near the edge of fragment 509 rather than near its center. The shaded value is computed at centroid position 517 so that it represents the fragment color more accurately than a value computed at multisample 502 would.
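Both centroid choices described above can be sketched as follows. The sample coordinates, the function names, and the use of squared Euclidean distance are assumptions made for illustration only.

```python
def geometric_centroid(points):
    """Average position of a non-empty list of (x, y) sample positions."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def approximate_centroid(cluster_positions, covered):
    """Pick the covered multisample nearest the centroid of the fully covered cluster.

    Returns None when no multisample in the cluster is covered.
    """
    cx, cy = geometric_centroid(cluster_positions)
    candidates = [p for p, c in zip(cluster_positions, covered) if c]
    if not candidates:
        return None
    return min(candidates, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)


# Hypothetical three-sample cluster in pixel-relative coordinates.
positions = [(0.0, 0.0), (0.5, 0.0), (1.0, 1.0)]
coverage = [True, False, True]

exact = geometric_centroid([p for p, c in zip(positions, coverage) if c])
approx = approximate_centroid(positions, coverage)
```

The approximation trades accuracy for a position that is already one of the programmed multisample locations, which may be cheaper to select in hardware.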

FIG. 5C is a block diagram of a portion of graphics processing pipeline 400 that includes rasterizer 455, fragment processing unit 460, and raster operations unit 465, in accordance with one or more aspects of the present invention. Other processing units may be included within rasterizer 455, fragment processing unit 460, and raster operations unit 465. These other processing units are not shown in FIG. 5C because they may be of generally conventional design, and a detailed description that is not critical to the present invention is omitted.

Rasterizer 455 receives primitives from geometry processing unit 448 and generates a fragment for each pixel that a primitive intersects. Hybrid antialias control unit 500 (optionally within rasterizer 455) may be configured to dynamically determine the number of shader passes used to process the fragments of each primitive based on rendering state, e.g., alpha test enable/disable, texture map content, user-provided quality/performance controls, and the like.

Hybrid antialias control unit 500 improves antialiasing efficiency by performing more shading passes for primitives that benefit from a higher shading rate and by reducing the shading rate for other primitives. Hybrid antialias control unit 500 may be configured by a user, an application, or device driver 103 to operate at various quality settings. These may range from "multisample-always," the lowest quality setting, to "supersample-always," the highest quality setting. Intermediate quality settings may take render pipeline state into account when determining the number of shading passes. For example, when alpha test or shader pixel kill is enabled, more shading passes may be desirable. Conversely, when high performance is specified and alpha test and shader pixel kill are disabled, the sampling rate may be reduced by hybrid antialias control unit 500. Hybrid antialias control unit 500 may also consider characteristics of the pixel shader or the texture sampler settings when determining the number of shading passes. Persons skilled in the art will recognize that a variety of criteria may be used by hybrid antialias control unit 500 to determine the number of shading passes. In conventional graphics systems, the sampling rate is determined for all primitives of a scene based on user-provided or fixed settings. Moreover, sampling in conventional systems is limited to multisampling or supersampling, with no intermediate alternatives.
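A per-primitive pass-count decision of this kind might be sketched as below. Only the two endpoint settings, "multisample-always" and "supersample-always," come from the description; the "intermediate" branch, its halving rule, and the function and parameter names are hypothetical placeholders for whatever criteria an implementation actually uses.

```python
def choose_shader_passes(quality: str, samples_per_pixel: int,
                         alpha_test: bool = False,
                         shader_kill: bool = False) -> int:
    """Return the number of shader passes (supersample clusters) for a primitive."""
    if quality == "multisample-always":
        return 1                      # shade once per pixel fragment
    if quality == "supersample-always":
        return samples_per_pixel      # shade once per sub-pixel sample
    # Intermediate setting (hypothetical rule): spend more passes only when
    # per-cluster alpha test or pixel kill results can change coverage.
    if alpha_test or shader_kill:
        return max(1, samples_per_pixel // 2)
    return 1


passes_foliage = choose_shader_passes("intermediate", 8, alpha_test=True)
passes_opaque = choose_shader_passes("intermediate", 8)
```

Because the decision is made per primitive, an alpha-tested primitive and an opaque primitive in the same scene can be shaded at different rates.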

In one embodiment, rasterizer 455 generates 2x2 quads of pixel fragments that are received by hybrid antialias iterator unit 515. When hybrid antialias control unit 500 sets passes = 1 (i.e., multisampling), hybrid antialias iterator unit 515 passes these quads unmodified to fragment processing unit 460. When hybrid antialias control unit 500 sets passes to N > 1, however, hybrid antialias iterator unit 515 outputs each quad to fragment processing unit 460 multiple times, together with the pass number corresponding to the shader pass. Hybrid antialias iterator unit 515 may mask the coverage sent to fragment processing unit 460 so that only the multisamples in the supersample cluster corresponding to the current pass are enabled. In other embodiments, fragment processing unit 460 may mask the coverage based on the pass number provided to it by hybrid antialias iterator unit 515. Note that other embodiments may iterate over regions other than a 2x2 fragment quad, such as a single pixel, a 4x4 fragment tile, and the like. Iterating over regions of pixels (quads) rather than over primitives can be advantageous because texture map data is likely to be reused across the successive shader passes for a particular quad, whereas iterating over primitives, which can be large, may cause texture data to be re-fetched from memory, e.g., PP memory 204 or system memory 104.
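The replay of a quad with per-pass coverage masking can be sketched as follows. The sketch assumes equal-size clusters with contiguous sample indices, per-pixel coverage stored as an integer bitmask, and zero-based pass numbers; these representation choices and the function name are illustrative and are not taken from the description.

```python
def iterate_quad(quad_coverage, samples_per_pixel: int, passes: int):
    """Yield (pass_number, masked_coverage) for one 2x2 quad.

    quad_coverage holds one sample-coverage bitmask per pixel in the quad.
    """
    if passes == 1:
        # Multisampling: forward the quad unmodified.
        yield 0, list(quad_coverage)
        return
    size = samples_per_pixel // passes
    for p in range(passes):
        # Enable only the multisamples of the cluster shaded in this pass.
        mask = ((1 << size) - 1) << (p * size)
        yield p, [cov & mask for cov in quad_coverage]


# Four pixels, eight samples each, shaded in four passes of two multisamples.
replayed = list(iterate_quad([0xFF, 0x0F, 0x00, 0x81], 8, 4))
```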

Importantly, the geometry computations needed to generate the fragments are not repeated for each shader pass. In contrast, conventional systems that supersample into a multisample buffer using a sample mask typically repeat the geometry computations for each shader pass. Note that the primitive attributes sampled in fragment processing unit 460 need to be computed only once, regardless of the number of hybrid antialiasing passes, because they are referenced by the subsequent iterated quads and can then be discarded.

Sample lookup table 505 in fragment processing unit 460 uses the hybrid antialiasing parameters and the pass number to determine the positions at which interpolated fragment parameters are sampled. Sample lookup table 505 may select a centroid position or a multisample position for each supersample cluster. The multisample positions are output to sample interpolation unit 510, which computes one or more interpolated parameters, e.g., color channels (red, green, blue, alpha), texture coordinates, and the like, for each supersample cluster (i.e., one set of interpolated parameters for each pixel in the pixel quad). Shader 520 processes the set of interpolated parameters for each pixel in the pixel quad, using techniques known to those skilled in the art for executing fragment shader programs and the like, to produce a shaded pixel value, e.g., color, for each supersample cluster.

During shading, the sub-pixel samples for each supersample cluster may be eliminated (culled or killed) as a result of shader pixel kill or alpha testing, so that the raster-generated coverage is modified to produce post-shader coverage based on the pixel kill or alpha test results. Because the supersample clusters are processed in separate passes through shader 520, supersample clusters can be eliminated individually during alpha testing. In contrast, when all of the sub-pixel samples are processed in a single shading pass using conventional multisampling, all of the sub-pixel samples are either kept or eliminated together, resulting in a coarser alpha testing granularity that produces a lower quality image.
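The difference in alpha testing granularity can be illustrated with a small sketch. It assumes a "greater than reference" alpha comparison, bitmask coverage, and made-up alpha values; none of these specifics, nor the function name, come from the description.

```python
def post_shader_coverage(raster_coverage: int, cluster_masks, cluster_alphas,
                         alpha_ref: float) -> int:
    """Clear the coverage bits of every cluster whose shaded alpha fails the test."""
    coverage = raster_coverage
    for mask, alpha in zip(cluster_masks, cluster_alphas):
        if not alpha > alpha_ref:      # assumed comparison: pass when alpha > ref
            coverage &= ~mask
    return coverage


# Hybrid: four clusters shaded separately, so each can pass or fail on its own.
hybrid = post_shader_coverage(0xFF, [0x03, 0x0C, 0x30, 0xC0],
                              [0.9, 0.6, 0.4, 0.1], alpha_ref=0.5)

# Conventional multisampling: one shaded alpha decides all eight samples.
multisample = post_shader_coverage(0xFF, [0xFF], [0.4], alpha_ref=0.5)
```

In this example the hybrid case keeps the four samples of the two clusters that pass, while the single-pass case discards the whole pixel.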

Shader 520 outputs the shaded pixel values and the sub-pixel coverage (possibly modified relative to the coverage provided by rasterizer 455) to color buffer 535 and coverage collector 530, respectively. Coverage collector 530 accumulates the post-shader coverage for each shader pass to produce collected coverage information for each pixel. Color buffer 535 accumulates the shaded values for each pixel. When the shaded values for the last shader pass are received, the collected coverage information is output to raster operations unit 465. The shaded values for the pixel quad may be output along with the collected coverage information, or they may be output later, e.g., after z testing has been completed by raster operations unit 465. In other embodiments of the present invention, coverage collector 530 and color buffer 535 may be omitted.
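The accumulation behavior described for the coverage collector can be sketched for a single pixel as below. The class name, the bitmask representation, and the assumption that every pass reports exactly once (no skipped passes) are simplifications introduced for this example.

```python
class CoverageCollector:
    """Accumulate post-shader coverage across shader passes for one pixel."""

    def __init__(self, passes: int):
        self.passes = passes
        self.received = 0
        self.coverage = 0

    def add(self, post_shader_coverage: int):
        """OR in one pass's coverage; return the total after the last pass, else None."""
        self.coverage |= post_shader_coverage
        self.received += 1
        if self.received == self.passes:
            result = self.coverage
            self.received = 0      # reset for the next pixel
            self.coverage = 0
            return result
        return None


collector = CoverageCollector(passes=4)
outputs = [collector.add(c) for c in (0x03, 0x00, 0x30, 0xC0)]
```

Nothing is emitted until the final pass arrives, at which point the merged mask for the whole pixel is released.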

Coverage collection and the coalescing of color values into the color buffer are advantageous in systems that pack the samples of each pixel together in memory, so that multiple samples can be written or read using a single memory transaction. Other embodiments may omit coverage collector 530. Coverage collector 530 is less advantageous in systems that do not store the sample values for a pixel contiguously in memory.

Optional z/color compression unit 550 within raster operations unit 465 receives the collected coverage information and the z values, or another representation of z or depth values for the fragments (following z testing), and produces compressed z values for a region of pixels. Z/color compression unit 550 may also receive the collected color values for the fragments and produce compressed color values for a region of pixels. Compression can improve when it is applied to a larger group of pixels. Therefore, several pixel quads may be collected together and z tested before the result is compressed. Importantly, hybrid antialiasing does not interfere with or reduce the effectiveness of z compression. Z compression is advantageously used to reduce the memory bandwidth required to access the z buffer and, in some embodiments, the memory footprint as well.

FIG. 6 is a flow diagram of method steps for performing hybrid antialiasing, in accordance with one or more aspects of the present invention. In step 610, hybrid antialias control unit 500 receives a primitive. In step 615, hybrid antialias control unit 500 determines whether hybrid antialiasing is enabled and, if it is not, the fragments are processed using conventional antialiasing. If, in step 615, hybrid antialiasing is enabled, then in step 635 hybrid antialias control unit 500 determines the hybrid antialias parameters for the primitive. Specifically, hybrid antialias control unit 500 determines the number of supersample clusters (shader passes) to use when shading each pixel that the primitive intersects.

In step 640, rasterizer 455 generates sample-level coverage for the covered portions of the primitive. The granularity of this coverage may be coarse or fine, but is at least the size of a pixel quad. Rasterizer 455 outputs coverage information for each quad that the primitive intersects to hybrid antialias iterator unit 515. Based on the hybrid antialias parameters, hybrid antialias iterator unit 515 expands each quad so that the quad is shaded in multiple passes. Hybrid antialias iterator unit 515 may be configured to skip a shader pass when, according to the coverage information, none of the multisamples in the corresponding supersample cluster is covered. In step 643, hybrid antialias iterator unit 515 determines the pass number (first, second, and so on) and outputs the pixel quad and the pass number to fragment processing unit 460. As described above, when the number of passes is greater than one, hybrid antialias iterator unit 515 may mask the coverage information. Sample lookup table 505 is indexed using the pass number and the number of multisamples to read the programmed values for the multisample positions, including an indication of the position within the supersample cluster that is used to interpolate the fragment parameters. The interpolated parameters are computed for the supersample cluster by sample interpolation unit 510.
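
The quad expansion, coverage masking, and pass skipping attributed to the iterator unit can be sketched in Python as follows. The sketch assumes that coverage is a bitmask per pixel and that supersample cluster k owns a contiguous run of bits. Neither the bit layout nor the function name `iterate_passes` comes from the specification, and sample lookup table 505 is not modeled.

```python
def iterate_passes(quad_coverage, num_clusters, multisamples_per_cluster):
    """Yield (pass_number, masked_coverage) for each shader pass worth issuing.

    quad_coverage: four per-pixel bitmasks from the rasterizer, one bit per
    multisample. Cluster k is assumed to own bits [k*m, (k+1)*m).
    """
    m = multisamples_per_cluster
    cluster_bits = (1 << m) - 1
    for pass_number in range(num_clusters):
        cluster_mask = cluster_bits << (pass_number * m)
        # Mask the rasterizer coverage down to the multisamples of this cluster.
        masked = [cov & cluster_mask for cov in quad_coverage]
        if not any(masked):
            continue  # no multisample of this cluster is covered in the quad: skip the pass
        yield pass_number, masked
```

For a quad in which only the low-order multisamples are covered, the generator yields fewer passes than there are clusters, so the number of passes per pixel never exceeds the number of supersample clusters.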

In step 645, fragment processing unit 460 shades the pixel quad to produce a shaded value for each supersample cluster, that is, one shaded value for each pixel in the pixel quad. Within a supersample cluster, the shaded value is used for every multisample covered by the primitive. Fragment processing unit 460 also outputs post-shader coverage for the pixel quad. The post-shader coverage may differ from the rasterized pixel coverage information because, as described above, multisamples may be eliminated during shading.
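
A minimal Python sketch of one shader pass for a single pixel is given below. It assumes that the shader returns a shaded value together with a bitmask of the multisamples it keeps, which is one hypothetical way to model multisamples being eliminated during shading. The function name `shade_pass` and the shape of the `shader` callable are illustrative assumptions.

```python
def shade_pass(masked_coverage, shader, num_samples):
    """Shade one supersample cluster of a pixel.

    masked_coverage: rasterizer coverage restricted to the cluster (bitmask).
    shader: callable returning (shaded_value, keep_mask); keep_mask models
            multisamples removed during shading (an assumption of this sketch).
    Returns (per-sample values, post-shader coverage bitmask).
    """
    value, keep_mask = shader()
    post_coverage = masked_coverage & keep_mask
    # The single shaded value is replicated to every multisample still covered.
    samples = [value if (post_coverage >> s) & 1 else None for s in range(num_samples)]
    return samples, post_coverage
```

The shader runs once per cluster regardless of how many multisamples the cluster contains, so the shading cost scales with the number of clusters while the coverage resolution scales with the number of multisamples.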

In step 650, hybrid antialias iterator unit 515 determines whether another shader pass is to be used to process the pixel quad and, if so, steps 643 and 645 are repeated for the next shader pass (second, third, and so on). If, in step 650, hybrid antialias iterator unit 515 determines that no further shader pass is needed to process the pixel quad, then in step 660 coverage collector 530 combines the post-shader coverage for each shader pass to produce collected coverage information for the pixel quad. In step 660, coverage collector 530 also combines the post-shader color values for each shader pass to produce collected color values for the pixel quad. Coverage collector 530 may also be configured to collect post-shader color and coverage information at a multi-quad level. In step 665, raster operations unit 465 performs raster operations to determine which shaded values are written to the frame buffer. Raster operations may be performed at the quad or multi-quad level. Z/color compression unit 550 in raster operations unit 465 may be used to compress the z and/or color data for the pixel quad before that data is stored in the z buffer and/or color buffer.
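
The raster operation of step 665 can be illustrated with a per-multisample depth test in Python. The sketch assumes a less-than depth comparison and simple Python lists for the z buffer and color buffer of one pixel. The function name `raster_op` and the choice of depth function are assumptions, since the specification does not fix them.

```python
def raster_op(coverage, frag_z, frag_color, z_buffer, color_buffer):
    """Per-multisample depth test for one pixel; returns the mask of samples written.

    coverage:   collected post-shader coverage bitmask.
    frag_z:     fragment depth at each multisample.
    frag_color: shaded value at each multisample.
    The less-than depth function is an assumption of this sketch.
    """
    written = 0
    for s in range(len(z_buffer)):
        if (coverage >> s) & 1 and frag_z[s] < z_buffer[s]:
            z_buffer[s] = frag_z[s]
            color_buffer[s] = frag_color[s]
            written |= 1 << s
    return written
```

Only multisamples that are both covered and pass the depth test update the buffers, which is what decides which shaded values reach the frame buffer.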

In step 670, rasterizer 455 determines whether another pixel quad intersects the primitive and, if so, returns to step 640 to process a different pixel quad covered by the primitive. If, in step 670, rasterizer 455 determines that all pixel quads intersected by the primitive have been shaded, primitive processing is complete at step 675. In a pipelined system, one or more of the steps shown in FIG. 6 may be performed in parallel for different quads.

Hybrid antialias control unit 500 may dynamically determine the hybrid antialiasing parameters for each primitive, for example the number of supersample clusters per pixel, based on the rendering state, that is, alpha test enable/disable, texture map content, user-provided quality/performance controls, and the like. Adapting antialiasing to the rendering state improves efficiency because primitives that benefit from high-quality antialiasing are shaded with more samples while other primitives are shaded with fewer samples, optimizing both image quality and performance.
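
One way such a per-primitive decision could look is sketched below in Python. The `RenderState` fields, the three quality modes, and the specific cluster counts are hypothetical policy choices made for illustration. The specification names the kinds of state that may be consulted but does not prescribe a particular mapping.

```python
from dataclasses import dataclass

@dataclass
class RenderState:
    # Hypothetical subset of the rendering state consulted per primitive.
    alpha_test: bool = False
    high_frequency_texture: bool = False
    quality_mode: str = "balanced"   # "performance", "balanced", or "quality"

def choose_hybrid_params(state, samples_per_pixel=8):
    """Return (num_clusters, multisamples_per_cluster) for one primitive."""
    if state.quality_mode == "performance":
        clusters = 1                       # one pass: pure multisampling
    elif state.quality_mode == "quality":
        clusters = samples_per_pixel       # one pass per sample: pure supersampling
    elif state.alpha_test or state.high_frequency_texture:
        clusters = samples_per_pixel // 2  # illustrative middle ground
    else:
        clusters = 1
    return clusters, samples_per_pixel // clusters
```

For example, `choose_hybrid_params(RenderState(alpha_test=True))` selects four clusters of two multisamples each under these assumed rules, while the default state falls back to a single shader pass.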

The present invention has been described above with reference to specific embodiments. Persons skilled in the art, however, will understand that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. One embodiment of the invention may be implemented as a program product for use with a computer system. The program(s) of the program product define the functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media on which information is permanently stored (for example, read-only memory devices within a computer, such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips, or any type of solid-state non-volatile semiconductor memory); and (ii) writable storage media on which alterable information is stored (for example, floppy disks within a diskette drive or hard-disk drive, or any type of solid-state random-access semiconductor memory). The foregoing description and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

FIG. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the present invention.

FIG. 2 is a block diagram of a parallel processing subsystem for the computer system of FIG. 1, in accordance with one or more aspects of the present invention.

FIG. 3 is a block diagram of a core for the parallel processing subsystem of FIG. 2, in accordance with one or more aspects of the present invention.

FIG. 4 is a conceptual diagram of a graphics processing pipeline, in accordance with one or more aspects of the present invention.

FIG. 5A illustrates supersample clusters and multisample positions within a pixel, in accordance with one or more aspects of the present invention.

FIG. 5B illustrates fragment and centroid positions within a multisample cluster, in accordance with one or more aspects of the present invention.

FIG. 5C is a block diagram of a portion of the graphics processing pipeline, in accordance with one or more aspects of the present invention.

FIG. 6 is a flow diagram of method steps for performing hybrid antialiasing, in accordance with one or more aspects of the present invention.

<Explanation of symbols for the main parts of the drawings>

400: graphics processing pipeline

442: data assembler

444: vertex processing unit

446: primitive assembler

448: geometry processing unit

455: rasterizer

460: fragment processing unit

465: raster operations unit

Claims

A computing device configured to shade graphics primitives using hybrid antialiasing, the computing device comprising:

a rasterizer comprising a hybrid antialias control unit, the hybrid antialias control unit configured to:

receive the graphics primitives;

determine a number of supersample clusters used for antialiasing each pixel intersected by the graphics primitives; and

determine a number of multisamples used to process the graphics primitives for each of the supersample clusters; and

a fragment shading unit coupled to the rasterizer, the fragment shading unit configured to shade the graphics primitives using multiple passes through the fragment shading unit,

wherein the number of the multiple passes used to generate each hybrid antialiased pixel intersected by the graphics primitives is less than or equal to the number of supersample clusters.

The computing device of claim 1,

wherein the number of supersample clusters is determined based on a rendering state of the computing device.

The computing device of claim 2,

wherein the rendering state comprises one or more of a high-quality mode setting, a high-performance setting, an alpha test setting, and use of a texture map comprising high-frequency content.

The computing device of claim 1,

wherein the fragment shading unit is further configured to generate post-shader coverage indicating which of the multisamples are covered by a graphics primitive for each of the supersample clusters.

The computing device of claim 4,

further comprising a raster operations unit coupled to the fragment shading unit and configured to z test the graphics primitives for each of the multisamples covered by a graphics primitive, in accordance with the post-shader coverage, to generate z-tested values.

The computing device of claim 5,

wherein the raster operations unit is further configured to compress the z-tested values for a portion of a z buffer intersected by each of the graphics primitives.

The computing device of claim 1,

wherein the number of supersample clusters used for antialiasing each pixel intersected by a first graphics primitive of the graphics primitives is different from the number of supersample clusters used for antialiasing each pixel intersected by a second graphics primitive of the graphics primitives.

The computing device of claim 1,

wherein the number of passes used to generate each hybrid antialiased pixel intersected by the graphics primitive does not include a pass for any supersample cluster that has no multisample covered by the graphics primitive.

The computing device of claim 1,

wherein the fragment shading unit is further configured to shade the graphics primitives by calculating a shaded value for only one of the multisamples in each of the supersample clusters and replicating the shaded value for the other multisamples in the same supersample cluster.

The computing device of claim 1,

wherein the fragment shading unit is further configured to calculate a shaded value for a first supersample cluster of the supersample clusters using:

a location of a first multisample in the first supersample cluster;

a centroid that is the geometric centroid of the multisamples in the first supersample cluster that are covered by a graphics primitive; or

an approximate centroid that is the multisample within the first supersample cluster that is covered by a graphics primitive and is closest to the geometric centroid of the first supersample cluster.
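
The three alternatives for choosing the evaluation position within a supersample cluster can be sketched in Python as follows. The function name `interpolation_position`, the mode strings, and the representation of sample positions as `(x, y)` tuples are assumptions made for illustration. The sketch is one reading of the alternatives recited above, not a definitive implementation.

```python
def interpolation_position(positions, covered, mode):
    """Pick the position at which a shaded value is computed for one cluster.

    positions: (x, y) of each multisample in the cluster.
    covered:   one boolean per multisample, True if the primitive covers it.
    mode:      "first", "centroid", or "approx_centroid" (hypothetical names).
    """
    if mode == "first":
        return positions[0]                       # location of the first multisample
    hit = [p for p, c in zip(positions, covered) if c]
    if not hit:
        return None                               # nothing covered: no pass is needed
    if mode == "centroid":
        # Geometric centroid of the covered multisamples.
        return (sum(x for x, _ in hit) / len(hit), sum(y for _, y in hit) / len(hit))
    if mode == "approx_centroid":
        # Covered multisample closest to the centroid of the whole cluster.
        cx = sum(x for x, _ in positions) / len(positions)
        cy = sum(y for _, y in positions) / len(positions)
        return min(hit, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
    raise ValueError(mode)
```

The approximate centroid always returns an actual covered sample position, which avoids evaluating fragment parameters at a point outside the primitive, whereas the exact centroid may fall between samples.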