KR20230007377A

KR20230007377A - GPR optimization of GPU based on GPR release mechanism

Info

Publication number: KR20230007377A
Application number: KR1020227039661A
Authority: KR
Inventors: Andrew Evan Gruber; Yun Du
Original assignee: Qualcomm Incorporated
Priority date: 2020-05-18
Filing date: 2021-04-14
Publication date: 2023-01-12
Also published as: EP4154101A1; US11763419B2; US20230113415A1; WO2021236263A1; BR112022022687A2; US11475533B2; CN115516421A; US20210358076A1; TW202145140A

Abstract

This disclosure provides systems, devices, apparatuses, and methods, including computer programs encoded on storage media, for GPR optimization in a GPU based on a GPR release mechanism. More specifically, a GPU may determine at least one unused branch within an executable shader based on constants defined for the executable shader. Based on the at least one unused branch, the GPU may further determine a number of GPRs that can be deallocated from previously allocated GPRs. For a subsequent thread within a draw call, the GPU may deallocate that number of GPRs from the previously allocated GPRs during execution of the executable shader.

Description

GPR optimization of GPU based on GPR release mechanism

This application claims the benefit of U.S. Patent Application No. 16/877,367, entitled "GPR OPTIMIZATION IN A GPU BASED ON A GPR RELEASE MECHANISM" and filed on May 18, 2020, which is expressly incorporated by reference herein in its entirety.

The present disclosure relates generally to processing systems and, more particularly, to register optimization in graphics processing using a release mechanism.

Computing devices often perform graphics processing (e.g., utilizing a graphics processing unit (GPU)) to render graphical data for display by the computing devices. Such computing devices may include, for example, computer workstations, mobile phones such as smartphones, embedded systems, personal computers, tablet computers, and video game consoles. GPUs are configured to execute a graphics processing pipeline that includes one or more processing stages, which operate together to execute graphics processing commands and output a frame. A central processing unit (CPU) may control the operation of the GPU by issuing one or more graphics processing commands to the GPU. Modern CPUs are typically capable of executing multiple applications concurrently, each of which may need to utilize the GPU during execution. A device that provides content for visual presentation on a display may utilize a GPU.

General purpose registers (GPRs) may be allocated to a program executing on a GPU to temporarily store information. However, as more GPRs are allocated to the program, fewer threads may be simultaneously resident in the GPU. Accordingly, there is a need to reduce the number of GPRs allocated to a program.
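The trade-off described above can be illustrated with a small arithmetic sketch. It assumes, purely for illustration, a fixed-size register file shared by all resident threads; the name `max_resident_threads` and the size of 1024 GPRs are hypothetical and do not come from the disclosure.

```python
def max_resident_threads(register_file_size: int, gprs_per_thread: int) -> int:
    """Upper bound on how many threads fit in the register file at once.

    Assumes (hypothetically) that every thread holds the same number of GPRs
    and that the register file is the only limiting resource.
    """
    if gprs_per_thread <= 0:
        raise ValueError("gprs_per_thread must be positive")
    return register_file_size // gprs_per_thread


# Hypothetical register file size, chosen only to make the arithmetic easy.
REGISTER_FILE_SIZE = 1024

# Doubling the per-thread footprint halves the number of resident threads.
small_footprint = max_resident_threads(REGISTER_FILE_SIZE, 16)
large_footprint = max_resident_threads(REGISTER_FILE_SIZE, 64)
```

Under these assumed numbers, a program that holds 64 GPRs per thread leaves room for a quarter as many resident threads as one that holds 16.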

The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended neither to identify key or critical elements of all aspects nor to delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.

GPRs may be variably allocated to a shader program following shader compilation, based on the number of possible branches within the compiled shader. More specifically, a compiler may determine at compile time that a variable of the shader may be constant over the duration of a draw call, even though the exact value of that variable may not be determinable by the compiler at compile time. The number of possible branches within the compiled shader may therefore be determined based on the different values that the constants may assume at runtime of the shader. As a result, an overabundance of GPRs may be allocated to the shader at compile time to ensure that the shader has enough GPRs to execute even its most complex branch, regardless of whether that branch is actually executed.
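A minimal sketch of the worst-case allocation described above follows. It assumes the compiler knows how many GPRs each possible branch needs but not which branch the runtime constants will select; the branch names and GPR counts are hypothetical.

```python
from typing import Mapping


def compile_time_allocation(gprs_needed_by_branch: Mapping[str, int]) -> int:
    """Allocate enough GPRs for the most demanding branch.

    The constant values are unknown at compile time, so every branch must be
    treated as reachable and the allocation is the maximum over all branches.
    """
    if not gprs_needed_by_branch:
        raise ValueError("a shader needs at least one branch")
    return max(gprs_needed_by_branch.values())


# Hypothetical per-branch requirements for one compiled shader.
BRANCH_GPRS = {"complex": 20, "simple": 8}

allocated = compile_time_allocation(BRANCH_GPRS)
# If the draw call only ever takes the simple branch, the difference is excess.
excess_if_simple_only = allocated - BRANCH_GPRS["simple"]
```

With these assumed figures the shader is allocated 20 GPRs even when only 8 are ever used, which is the overabundance the paragraph refers to.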

Accordingly, a programmable GPR release mechanism may be used to deallocate the excess GPRs that were allocated to the shader at compile time. The release mechanism may be executed at runtime of the shader, after the values of the constants have been determined by the shader. For example, a shader may be compiled with a complex branch that requires more GPRs for execution and a simple branch that requires fewer GPRs. Based on the values defined for the constants, the shader may determine that the complex branch will not be executed during the draw call and that some of the GPRs allocated to the shader at compile time exceed what is needed to execute the simpler branch. The release mechanism may therefore be configured to deallocate the excess or unneeded GPRs from subsequent shader threads so that more of the subsequent threads can be simultaneously resident in the GPU.
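The release calculation described above can be sketched as follows. This is an illustrative model only, not the mechanism's actual implementation: `releasable_gprs`, the branch names, and the GPR counts are hypothetical, and the sketch assumes the set of unused branches has already been derived from the constant values.

```python
from typing import Mapping, Set


def releasable_gprs(
    allocated: int,
    gprs_needed_by_branch: Mapping[str, int],
    unused_branches: Set[str],
) -> int:
    """Number of GPRs that can be released once unused branches are known.

    The shader still needs enough GPRs for the most demanding branch that
    remains reachable; anything allocated beyond that is excess.
    """
    reachable = [
        needed
        for name, needed in gprs_needed_by_branch.items()
        if name not in unused_branches
    ]
    if not reachable:
        raise ValueError("at least one branch must remain reachable")
    return max(0, allocated - max(reachable))


# Hypothetical shader: 20 GPRs allocated at compile time for the worst case.
release_count = releasable_gprs(
    allocated=20,
    gprs_needed_by_branch={"complex": 20, "simple": 8},
    unused_branches={"complex"},
)
```

If no branch is ruled out by the constants, the function returns zero and the compile-time allocation stays in place.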

In an aspect of the disclosure, a method, a computer-readable medium, and an apparatus are provided. The apparatus may include a memory and at least one processor coupled to the memory. The at least one processor may be configured to determine at least one unused branch within an executable shader based on constants defined for the executable shader, and to further determine, based on the at least one unused branch, a number of GPRs that can be deallocated from allocated GPRs. The at least one processor may deallocate that number of GPRs from the allocated GPRs during execution of the executable shader within a draw call, based on the determined number of GPRs.

To the accomplishment of the foregoing and related ends, the one or more aspects comprise the features hereinafter fully described and particularly pointed out in the claims. The following description and the accompanying drawings set forth in detail certain illustrative features of the one or more aspects. These features are indicative, however, of but a few of the various ways in which the principles of various aspects may be employed, and this description is intended to include all such aspects and their equivalents.

FIG. 1 is a block diagram illustrating an example content generation system in accordance with one or more techniques of this disclosure.
FIG. 2 is a block diagram illustrating example components for processing data in accordance with one or more techniques of this disclosure.
FIG. 3 is a block diagram corresponding to example instructions for executing a shader based on GPR allocation in accordance with one or more techniques of this disclosure.
FIG. 4 is a block diagram corresponding to example instructions for executing a shader based on a programmable GPR release mechanism in accordance with one or more techniques of this disclosure.
FIG. 5 is a flowchart of an example method in accordance with one or more techniques of this disclosure.
FIG. 6 is a conceptual data flow diagram illustrating the data flow between different means/components in an example apparatus.

Various aspects of systems, apparatuses, computer program products, and methods are described more fully hereinafter with reference to the accompanying drawings. This disclosure may, however, be embodied in many different forms and should not be construed as limited to any specific structure or function presented throughout this disclosure. Rather, these aspects are provided so that this disclosure will be thorough and complete, and will fully convey the scope of this disclosure to those skilled in the art. Based on the teachings herein, one skilled in the art should appreciate that the scope of this disclosure is intended to cover any aspect of the systems, apparatuses, computer program products, and methods disclosed herein, whether implemented independently of, or combined with, other aspects of the disclosure. For example, an apparatus may be implemented or a method may be practiced using any number of the aspects set forth herein. In addition, the scope of the disclosure is intended to cover such an apparatus or method that is practiced using other structure, functionality, or structure and functionality in addition to or other than the various aspects of the disclosure set forth herein. Any aspect disclosed herein may be embodied by one or more elements of a claim.

Although various aspects are described herein, many variations and permutations of these aspects fall within the scope of this disclosure. Although some potential benefits and advantages of aspects of this disclosure are mentioned, the scope of this disclosure is not intended to be limited to particular benefits, uses, or objectives. Rather, aspects of this disclosure are intended to be broadly applicable to different wireless technologies, system configurations, networks, and transmission protocols, some of which are illustrated by way of example in the figures and in the following description. The detailed description and drawings are merely illustrative of this disclosure rather than limiting, the scope of this disclosure being defined by the appended claims and equivalents thereof.

Several aspects are presented with reference to various apparatus and methods. These apparatus and methods are described in the following detailed description and illustrated in the accompanying drawings by various blocks, components, circuits, processes, algorithms, and the like (collectively referred to as "elements"). These elements may be implemented using electronic hardware, computer software, or any combination thereof. Whether such elements are implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system.

By way of example, an element, or any portion of an element, or any combination of elements may be implemented as a "processing system" that includes one or more processors (which may also be referred to as processing units). Examples of processors include microprocessors, microcontrollers, graphics processing units (GPUs), general purpose GPUs (GPGPUs), central processing units (CPUs), application processors, digital signal processors (DSPs), reduced instruction set computing (RISC) processors, systems-on-chip (SoCs), baseband processors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), programmable logic devices (PLDs), state machines, gated logic, discrete hardware circuits, and other suitable hardware configured to perform the various functionality described throughout this disclosure. One or more processors in the processing system may execute software. Software, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise, can be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software components, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, and the like.

The term application may refer to software. As described herein, one or more techniques may refer to an application (e.g., software) being configured to perform one or more functions. In such examples, the application may be stored in a memory (e.g., on-chip memory of a processor, system memory, or any other memory). Hardware described herein, such as a processor, may be configured to execute the application. For example, the application may be described as including code that, when executed by the hardware, causes the hardware to perform one or more techniques described herein. As an example, the hardware may access the code from a memory and execute the accessed code to perform one or more techniques described herein. In some examples, components are identified in this disclosure. In such examples, the components may be hardware, software, or a combination thereof. The components may be separate components or sub-components of a single component.

Accordingly, in one or more examples described herein, the functions described may be implemented in hardware, software, or any combination thereof. If implemented in software, the functions may be stored on or encoded as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media. Storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise random-access memory (RAM), read-only memory (ROM), electrically erasable programmable ROM (EEPROM), optical disk storage, magnetic disk storage, other magnetic storage devices, combinations of the aforementioned types of computer-readable media, or any other medium that can be used to store computer-executable code in the form of instructions or data structures that can be accessed by a computer.

In general, this disclosure describes techniques for graphics processing in a single device or in multiple devices that may improve the rendering of graphical content and/or reduce the load of a processing unit (e.g., any processing unit configured to perform one or more techniques described herein, such as a GPU). For example, this disclosure describes techniques applicable to graphics processing in any device that utilizes a graphics processor. Other potential advantages of such techniques are described throughout this disclosure.

As used herein, instances of the term "content" may refer to "graphical content," an "image," and the like, regardless of whether the term is used as an adjective, a noun, or another part of speech. In some examples, the term "graphical content," as used herein, may refer to content produced by one or more processes of a graphics processing pipeline. In further examples, the term "graphical content," as used herein, may refer to content produced by a processing unit configured to perform graphics processing. In still further examples, the term "graphical content," as used herein, may refer to content produced by a graphics processing unit.

Accordingly, a programmable GPR release mechanism may be used to deallocate the excess GPRs that were allocated to the shader at compile time. The release mechanism may be executed at runtime of the shader, after the values of the constants have been determined by the shader. For example, a shader may be compiled with a complex branch that requires more GPRs for execution and a simple branch that requires fewer GPRs. Based on the values defined for the constants, the shader may determine that the complex branch will not be executed during the draw call and that some of the GPRs allocated to the shader at compile time exceed what is needed to execute the simpler branch. The release mechanism may therefore be configured to deallocate the excess or unneeded GPRs from the shader so that they can be allocated to subsequent threads in the GPU, allowing more of the subsequent threads to be simultaneously resident in the GPU.

FIG. 1 is a block diagram illustrating an example content generation system 100 configured to implement one or more techniques of this disclosure. The content generation system 100 includes a device 104, which may be, but is not limited to, a video device (e.g., a media player), a set-top box, a wireless communication device (e.g., a smartphone), a personal digital assistant (PDA), a desktop or laptop computer, a gaming console, a video conferencing unit, a tablet computing device, and the like. The device 104 may include one or more components or circuits for performing various functions described herein. In some examples, one or more components of the device 104 may be components of an SoC. The device 104 may include one or more components configured to perform one or more techniques of this disclosure. In the example shown, the device 104 may include a processing unit 120 and a system memory 124. In some aspects, the device 104 may include a number of optional components (e.g., a communication interface 126, a transceiver 132, a receiver 128, a transmitter 130, a display processor 127, and one or more displays 131). Display(s) 131 may refer to one or more displays 131. For example, the display 131 may include a single display or multiple displays, which may include a first display and a second display. The first display may be a left-eye display and the second display may be a right-eye display. In some examples, the first and second displays may receive different frames for presentation thereon. In other examples, the first and second displays may receive the same frames for presentation thereon. In further examples, the results of the graphics processing may not be displayed on the device; for example, the first and second displays may not receive any frames for presentation thereon. Instead, the frames or graphics processing results may be transferred to another device. In some aspects, this may be referred to as split rendering.

The processing unit 120 may include an internal memory 121. The processing unit 120 may be configured to perform graphics processing using a graphics processing pipeline 107. In some examples, the device 104 may include a display processor, such as the display processor 127, to perform one or more display processing techniques on one or more frames generated by the processing unit 120 before the frames are displayed by the one or more displays 131. The display processor 127 may be configured to perform display processing. For example, the display processor 127 may be configured to perform one or more display processing techniques on one or more frames generated by the processing unit 120. The one or more displays 131 may be configured to display or otherwise present frames processed by the display processor 127. In some examples, the one or more displays 131 may include one or more of a liquid crystal display (LCD), a plasma display, an organic light emitting diode (OLED) display, a projection display device, an augmented reality display device, a virtual reality display device, a head-mounted display, or any other type of display device.

Memory external to the processing unit 120, such as the system memory 124, may be accessible to the processing unit 120. For example, the processing unit 120 may be configured to read from and/or write to external memory, such as the system memory 124. The processing unit 120 may be communicatively coupled to the system memory 124 over a bus. In some examples, the processing unit 120 may be communicatively coupled to the internal memory 121 over the bus or via a different connection. The internal memory 121 or the system memory 124 may include one or more volatile or non-volatile memories or storage devices. In some examples, the internal memory 121 or the system memory 124 may include RAM, static random access memory (SRAM), dynamic random access memory (DRAM), erasable programmable ROM (EPROM), EEPROM, flash memory, a magnetic data medium, an optical storage medium, or any other type of memory.

The internal memory 121 or the system memory 124 may be a non-transitory storage medium according to some examples. The term "non-transitory" may indicate that the storage medium is not embodied in a carrier wave or a propagated signal. However, the term "non-transitory" should not be interpreted to mean that the internal memory 121 or the system memory 124 is non-movable or that its contents are static. As one example, the system memory 124 may be removed from the device 104 and moved to another device. As another example, the system memory 124 may not be removable from the device 104.

The processing unit 120 may be a CPU, a GPU, a GPGPU, or any other processing unit that may be configured to perform graphics processing. In some examples, the processing unit 120 may be integrated into a motherboard of the device 104. In further examples, the processing unit 120 may be present on a graphics card that is installed in a port of the motherboard of the device 104, or may be otherwise incorporated within a peripheral device configured to interoperate with the device 104. The processing unit 120 may include one or more processors, such as one or more microprocessors, GPUs, ASICs, FPGAs, arithmetic logic units (ALUs), DSPs, discrete logic, software, hardware, firmware, other equivalent integrated or discrete logic circuitry, or any combination thereof. If the techniques are implemented partially in software, the processing unit 120 may store instructions for the software in a suitable non-transitory computer-readable storage medium, e.g., the internal memory 121, and may execute the instructions in hardware using one or more processors to perform the techniques of this disclosure. Any of the foregoing, including hardware, software, a combination of hardware and software, and the like, may be considered to be one or more processors.

In some aspects, the content generation system 100 may include an optional communication interface 126. The communication interface 126 may include a receiver 128 and a transmitter 130. The receiver 128 may be configured to perform any receiving function described herein with respect to the device 104. Additionally, the receiver 128 may be configured to receive information, e.g., eye or head position information, rendering commands, or location information, from another device. The transmitter 130 may be configured to perform any transmitting function described herein with respect to the device 104. For example, the transmitter 130 may be configured to transmit information to another device, which may include a request for content. The receiver 128 and the transmitter 130 may be combined into a transceiver 132. In such examples, the transceiver 132 may be configured to perform any receiving function and/or transmitting function described herein with respect to the device 104.

Referring again to FIG. 1, in certain aspects, graphics processing unit 120 may include a GPR deallocation component 198 configured to determine at least one unused branch within an executable shader based on constants defined for the executable shader; determine a number of GPRs that can be deallocated from allocated GPRs based on the at least one unused branch; and deallocate, for a subsequent thread within a draw call, the number of GPRs from the allocated GPRs during execution of the executable shader based on the determined number of GPRs. Describing and referring to deallocation component 198 as a "component" is for ease of explanation and does not necessarily correspond to a specific hardware component of processing unit 120. For example, deallocation component 198 may be configured as code, logic, etc.

A device, such as device 104, may refer to any device, apparatus, or system configured to perform one or more techniques described herein. For example, a device may be a server, a base station, user equipment, a client device, a station, an access point, a computer such as a personal computer, desktop computer, laptop computer, tablet computer, computer workstation, or mainframe computer, an end product, an apparatus, a phone, a smartphone, a server, a video game platform or console, a handheld device such as a portable video game device or a personal digital assistant (PDA), a wearable computing device such as a smart watch, an augmented reality device, or a virtual reality device, a non-wearable device, a display or display device, a television, a television set-top box, an intermediate network device, a digital media player, a video streaming device, a content streaming device, an in-vehicle computer, any mobile device, any device configured to generate graphical content, or any device configured to perform one or more techniques described herein. Processes herein may be described as being performed by a particular component (e.g., a GPU), but in other embodiments they may be performed using other components (e.g., a CPU), consistent with the disclosed embodiments.

FIG. 2 is a block diagram 200 illustrating example components, such as processing unit 120 and system memory 124, as may be identified in connection with example device 104 for processing data. In aspects, processing unit 120 may include a CPU 202 and a GPU 212. GPU 212 and CPU 202 may be formed as an integrated circuit (e.g., an SOC), and/or GPU 212 may be integrated on a motherboard together with CPU 202. Alternatively, CPU 202 and GPU 212 may be configured as separate processing units that are communicatively coupled to each other. For example, GPU 212 may be incorporated on a graphics card installed in a port of the motherboard that includes CPU 202.

CPU 202 may be configured to execute a software application that causes graphical content to be displayed (e.g., on display(s) 131 of device 104) based on one or more operations of GPU 212. The software application may issue instructions to a graphics application program interface (API) 204, which may be a runtime program that translates instructions received from the software application into a format readable by GPU driver 210. After receiving instructions from the software application via graphics API 204, GPU driver 210 may control the operation of GPU 212 based on those instructions. For example, GPU driver 210 may generate one or more command streams that are placed in system memory 124, where GPU 212 is instructed to execute the command streams (e.g., via one or more system calls). A command engine 214 included in GPU 212 is configured to retrieve one or more commands stored in a command stream. Command engine 214 may provide commands from the command stream for execution by GPU 212. Command engine 214 may be hardware of GPU 212, software/firmware executing on GPU 212, or a combination thereof.

GPU driver 210 is configured to implement graphics API 204, although GPU driver 210 is not limited to being configured in accordance with any particular API. System memory 124 may store code for GPU driver 210, which CPU 202 may retrieve for execution. In examples, GPU driver 210 may be configured to allow communication between CPU 202 and GPU 212, such as when CPU 202 offloads graphics or non-graphics processing tasks to GPU 212 via GPU driver 210.

System memory 124 may further store source code for one or more of a preamble shader 224 or a main shader 226. In such configurations, a shader compiler 208 executing on CPU 202 may compile the source code of shaders 224-226 during runtime (e.g., at the time when shaders 224-226 are to be executed on shader core 216) to generate object code or intermediate code executable by shader core 216 of GPU 212. In some examples, shader compiler 208 may pre-compile shaders 224-226 and store the object code or intermediate code of the shader programs in system memory 124.

Shader compiler 208 executing on CPU 202 (or, in another example, GPU driver 210) may build a shader program having multiple components, including preamble shader 224 and main shader 226. Main shader 226 may correspond to some or all of the shader program that does not include preamble shader 224. Shader compiler 208 may receive instructions to compile shader(s) 224-226 from a program executing on CPU 202. Shader compiler 208 may also identify constant load instructions and common operations in the shader program so that the common operations can be included in preamble shader 224 (rather than main shader 226). Shader compiler 208 may identify such common instructions, for example, based on (presently undetermined) constants 206 to be included in the common instructions. Constants 206 may be defined within graphics API 204 to be constant across an entire draw call. Shader compiler 208 may use instructions such as a preamble shader start, indicating the beginning of preamble shader 224, and a preamble shader end, indicating the end of preamble shader 224. Similar instructions may be used for main shader 226.

Shader core 216 included in GPU 212 may include GPRs 218 and constant memory 220. GPRs 218 may correspond to a single GPR, a GPR file, and/or a GPR bank. Each GPR in GPRs 218 may store data accessible to a single thread. Software and/or firmware executing on GPU 212 may be shader programs 224-226, which may execute on shader core 216 of GPU 212. Shader core 216 may be configured to execute many instances of the same instructions of the same shader program in parallel. For example, shader core 216 may execute main shader 226 for each pixel that defines a given shape.

Shader core 216 may transmit data to and receive data from applications executing on CPU 202. In examples, constants 206 used in the execution of shaders 224-226 may be stored in constant memory 220 (e.g., a read/write constant RAM) or in GPRs 218. Shader core 216 may load constants 206 into constant memory 220. In further examples, execution of preamble shader 224 may cause a constant value or a set of constant values to be stored in on-chip memory, such as constant memory 220 (e.g., constant RAM), GPU memory 222, or system memory 124. Constant memory 220 may include memory accessible by all aspects of shader core 216, rather than just a particular portion reserved for a particular thread, such as the values held in GPRs 218.

FIG. 3 is a block diagram 300 corresponding to example instructions 350 for executing a shader 302 based on a GPR allocation. GPRs may be variably allocated to shaders at shader compilation. However, as the number of GPRs 218 allocated to shader 302 increases, the corresponding number of threads that can be simultaneously resident in the GPU decreases. This effect of increasing the number of allocated GPRs 218 may not only limit latency hiding but may also decrease the overall performance of the GPU. To balance the tradeoff between increasing the number of GPRs 218 allocated to shader 302 and increasing the number of threads that can be simultaneously resident in the GPU, shader 302 may be executed based on only the minimum number of GPRs 218 needed for shader execution, so that no allocated GPR resources go unused by shader 302.
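The tradeoff described above can be illustrated with a minimal sketch. It assumes a simplified model in which a fixed-size GPR file is divided evenly among resident threads; the file size of 2048 registers and the function name `max_resident_threads` are hypothetical and are not taken from this disclosure.

```python
def max_resident_threads(gpr_file_size: int, gprs_per_thread: int) -> int:
    """Upper bound on simultaneously resident threads in a simplified model
    where every thread receives the same number of GPRs."""
    if gprs_per_thread <= 0:
        raise ValueError("gprs_per_thread must be positive")
    return gpr_file_size // gprs_per_thread


# Hypothetical GPR file size, chosen only for illustration.
GPR_FILE_SIZE = 2048

if __name__ == "__main__":
    for gprs in (4, 8, 20, 32):
        threads = max_resident_threads(GPR_FILE_SIZE, gprs)
        print(f"{gprs:2d} GPRs per thread: up to {threads} resident threads")
```

In this model, moving from 20 GPRs per thread to 4 GPRs per thread raises the bound on resident threads roughly fivefold, which is the kind of headroom that helps hide memory latency.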

The minimum number of GPRs 218 needed to execute shader 302 may be based on constant/uniform values that do not change during the runtime of shader 302 for a single draw call or kernel. Because the exact values of constants 206 may not be known by the compiler at the time shader 302 is compiled, an excess of GPRs 218 may be allocated to shader 302 to ensure sufficient availability of GPRs 218 for executing the more complex paths/branches of shader 302 (e.g., complex branch 304). If the values of constants 206 are known at the time shader 302 is compiled, the compiler may increase shader performance by eliminating from shader 302 certain branches that require more GPRs 218 for execution, thereby reducing the number of GPRs 218 that need to be allocated to shader 302 following compilation. Alternatively, if the GPU driver can determine the values of constants 206 at the time shader 302 is submitted (e.g., queued) to the GPU, the compiler may generate multiple versions of shader 302, each with a different GPR allocation, and allow the GPU driver to select the version of shader 302 to be used at submission time.
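The multi-version alternative can be sketched as follows. This is a hedged illustration only: the `ShaderVariant` type, the variant names, and the rule that a constant value of 0 means the complex branch is not taken are assumptions made for the example (the 20 and 4 GPR figures mirror the example values used elsewhere in this description).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ShaderVariant:
    """A hypothetical precompiled version of the same shader."""
    name: str
    gprs_per_thread: int
    handles_complex_branch: bool


# Two assumed variants: one that keeps the complex branch, one that drops it.
VARIANTS = (
    ShaderVariant("simple_only", 4, False),
    ShaderVariant("full", 20, True),
)


def select_variant(constant_value: Optional[int]) -> ShaderVariant:
    """Pick the cheapest variant that is still correct at submission time.

    None models a constant the driver cannot determine at submission,
    in which case the worst-case variant must be used.
    """
    needs_complex = constant_value is None or constant_value != 0
    candidates = [
        v for v in VARIANTS
        if v.handles_complex_branch or not needs_complex
    ]
    return min(candidates, key=lambda v: v.gprs_per_thread)


if __name__ == "__main__":
    for value in (0, 7, None):
        print(value, "->", select_variant(value).name)
```

The sketch also shows why this approach falls short when the constant is unknown at submission time: the driver has no choice but to select the worst-case variant.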

In general, the values of constants 206 cannot be determined by the compiler at compile time or by the GPU driver at submission time. The compiled shader 302 may be configured to identify the values of constants 206 at runtime, but by the time runtime occurs a number of GPRs 218 may already have been allocated to shader 302, perhaps in excess of the number of GPRs 218 needed to execute a particular branch of shader 302. Thus, even though the compiler may be configured to identify at compile time that a variable is a constant, the exact value of that constant 206 may remain unknown during shader compilation, so that the constant value cannot be used to reduce the GPR allocation.

A shader may have different flow control paths/branches that are based on some combination of constants 206. Constants 206 may be defined within the graphics API to remain the same across an entire draw call (e.g., for the entire lifetime of the corresponding shape). That is, a constant 206 of a given value does not change on a per-pixel basis from one pixel to the next across the draw call. Constants 206 remain unchanged throughout the shader lifetime for all pixels that execute over the corresponding shape. A constant buffer, which may also be referred to as a uniform buffer, may be managed by the graphics API and may reside in memory (e.g., similar to a texture buffer or a frame buffer), where the constant buffer can be accessed by shader 302 to provide constant/uniform values across the draw call.

An executable shader program may include a preamble portion of the shader program and a main portion of the shader program (or simply "preamble shader" 224 and "main shader" 226). Preamble shader 224 may be a portion of shader 302 that is executed only once per draw call or kernel. Preamble shader 224 may be executed before any threads are allowed to execute main shader 226. Preamble shader 224 may also preload constant values into the local memory of the GPU, where the constant values can be used by multiple threads executing within main shader 226. Accordingly, the constant values may be fetched by the preamble shader once per draw call, rather than being fetched by the main shader for each thread (e.g., pixel) within the draw call.
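The benefit of fetching once per draw call can be made concrete with a small counting sketch. The `ConstantBuffer` class, the constant name `"mode"`, and both `draw_*` functions are hypothetical stand-ins used only to count fetches; they do not model real GPU memory.

```python
class ConstantBuffer:
    """Stand-in for a constant/uniform buffer that counts how often it is read."""

    def __init__(self, values):
        self._values = dict(values)
        self.fetch_count = 0

    def fetch(self, name):
        self.fetch_count += 1
        return self._values[name]


def draw_without_preamble(buffer, num_threads):
    """Every thread of the main shader fetches the constant itself."""
    return [buffer.fetch("mode") for _ in range(num_threads)]


def draw_with_preamble(buffer, num_threads):
    """The preamble fetches once; main-shader threads read the local copy."""
    local_constant = buffer.fetch("mode")  # preamble: runs once per draw call
    return [local_constant for _ in range(num_threads)]


if __name__ == "__main__":
    per_thread = ConstantBuffer({"mode": 0})
    draw_without_preamble(per_thread, 1000)
    once = ConstantBuffer({"mode": 0})
    draw_with_preamble(once, 1000)
    print("fetches without preamble:", per_thread.fetch_count)
    print("fetches with preamble:", once.fetch_count)
```

Both functions deliver the same constant to every thread; only the number of buffer reads differs.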

In one example, preamble shader 224 may fetch a local constant 206 from the local constant buffer. When local constant 206 has a first value (e.g., constant value X), main shader 226 may execute complex branch 304 using a first number of GPRs 218 (e.g., 20 GPRs). When the local constant has a second value (e.g., constant value Y), main shader 226 may execute simple branch 306 using a second number of GPRs 218 (e.g., 4 GPRs). However, when the local constant for the draw call is 0 and execution of complex branch 304 is not needed, shader 302 may still be executed based on an allocation of, for example, 20 GPRs rather than an allocation of 4 GPRs. As a result, some of the GPRs 218 allocated to shader 302 may be unnecessary/excessive.

FIG. 4 is a block diagram 400 corresponding to example instructions 450 for executing shader 302 based on a programmable GPR release mechanism 402. GPR release mechanism 402 may be executed at runtime to allow the GPR "footprint" to be modified based on the local constant values and/or a determination of the number of allocated GPRs 218. For example, when local constant 206 is 0, shader 302 may execute simple branch 306 for the draw call (e.g., with 4 GPRs per thread) and may release any excess GPRs 218 associated with subsequent thread/pixel executions within the draw call (e.g., 16 GPRs may be released). The compiler may have to base the GPR allocation on complex branch 304, because it may have no way of determining the values of constants 206 at compile time. Shader 302, by contrast, can identify those values of constants 206 via GPR release mechanism 402 and then determine the number of GPRs 218 actually needed for its execution, so that any unnecessary GPRs 218 can be released/deallocated from shader 302.

Accordingly, based on the values of constants 206 determined by GPR release mechanism 402, shader 302 may be executed based on complex branch 304, which requires a larger number of GPRs 218, or based on simple branch 306, which requires a smaller number of GPRs 218. When the values of constants 206 are determined to correspond to execution of simple branch 306, so that complex branch 304 does not need to be executed, the GPRs 218 allocated to shader 302 in excess of those needed to execute simple branch 306 may be deallocated from shader 302 for subsequent execution instances. GPR release mechanism 402 may therefore be configured to provide a more efficient allocation of GPR resources in the GPU by increasing the availability of GPRs 218 for use by other threads of the GPU outside shader 302, thereby enabling more threads to be simultaneously resident.

Deallocation of GPRs 218 via GPR release mechanism 402 may not apply to preamble shader 224 (which executes only once per draw call) or to the first execution instance of main shader 226, but may apply only to subsequent threads within the draw call that have not yet been issued or have not yet received a GPR allocation. Because GPR release mechanism 402 is configured to modify the GPR allocation for subsequent threads, the approach for releasing GPRs 218 from subsequent threads may be simplified in comparison to an approach for releasing GPRs from a current thread that has already been allocated GPRs 218. The exact time at which GPRs 218 need to be deallocated for subsequent threads may be less important in some cases, since previously issued threads that execute with an excess GPR allocation will still execute correctly. As such, GPR release mechanism 402 may be incorporated into either or both of preamble shader 224 (e.g., so that GPRs are deallocated for all threads of the draw call) and main shader 226 (e.g., so that GPRs are deallocated for a subset of the threads of the draw call, such as in configurations that do not use preamble shader 224).
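The "subsequent threads only" behavior can be sketched with a simple per-thread allocation list. The function name and the parameter `issued_before_release` are hypothetical; the model assumes that threads issued before the release keep the compiled allocation and that all later threads receive the reduced one.

```python
def allocations_for_draw_call(num_threads: int,
                              compiled_gprs: int,
                              required_gprs: int,
                              issued_before_release: int) -> list:
    """Per-thread GPR allocation for one draw call (simplified model).

    issued_before_release == 0 models a release performed in the preamble,
    so every main-shader thread gets the reduced allocation. A positive
    value models a release performed from within the main shader after
    some threads were already issued with the compiled allocation.
    """
    if required_gprs > compiled_gprs:
        raise ValueError("required_gprs cannot exceed compiled_gprs")
    return [
        compiled_gprs if i < issued_before_release else required_gprs
        for i in range(num_threads)
    ]


if __name__ == "__main__":
    print(allocations_for_draw_call(6, 20, 4, 0))
    print(allocations_for_draw_call(6, 20, 4, 2))
```

Threads that were issued early simply run with more registers than they need, which is why the exact moment of release affects efficiency but not correctness.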

A command processor or other programmable hardware unit that has access to constants 206 may be configured to perform operations similar to those of GPR release mechanism 402. For example, the command processor may be configured to determine, based on constants 206, the number of GPRs 218 needed to execute shader 302 before shader 302 is launched by the command processor. The command processor may then cause GPRs 218 to be allocated to shader 302 based on the determined number of GPRs 218.

Preamble shader 224 and main shader 226 may be compiled at the same time. Preamble shader 224 may be used to fetch constants 206 from memory and store them in a local constant buffer, where they can be accessed more efficiently during the draw call. Because constants 206 remain unchanged throughout the draw call, preamble shader 224 provides a way for constants 206 to be fetched once at the start of the draw call, rather than being fetched for each pixel by main shader 226. Although preamble shader 224 may provide the benefit of managing local constant storage for more efficient access to constants 206, preamble shader 224 is not a requirement for implementing GPR release mechanism 402. For example, the compiler may instead compile main shader 226 to fetch constants 206 from memory, albeit on a per-pixel basis rather than once per draw call. GPR release mechanism 402 in main shader 226 may then determine, for each individual pixel/thread, the number of GPRs 218 needed to execute that pixel/thread, so that excess GPRs 218 can be deallocated for subsequent instances of shader execution within the draw call. Because the per-pixel determination is based on constants 206, the number of GPRs 218 determined for subsequent pixels/threads remains unchanged throughout the draw call. Preamble shader 224 may allow GPRs 218 to be deallocated as soon as execution of main shader 226 begins. However, GPR release mechanism 402 may similarly be executed by main shader 226 after a first portion of the draw call has executed with a less efficient GPR allocation, so that excess GPRs are deallocated and a second portion of the draw call executes with a more efficient GPR allocation.

When the compiler generates shader 302, a number of unique paths/branches through shader 302 may be identified based on the different possible constants 206 that can be input to the functions of shader 302. That is, the compiler may determine the number of GPRs 218 that should be allocated to shader 302 based on the functions of shader 302 and the different possible constants 206. The number of GPRs 218 determined by the compiler at compile time is generally large enough to satisfy even the most complex path/branch through shader 302 (e.g., complex branch 304) for any of the different possible constants 206. Preamble shader 224 may be a natural place to incorporate GPR release mechanism 402 for determining the number of GPRs 218 needed for shader execution, because the deallocation of unnecessary GPRs 218 then occurs at the initial execution time of main shader 226. However, GPR release mechanism 402 is not limited to incorporation within preamble shader 224 in order to reduce the GPR allocation and thereby allow more threads/pixels to execute simultaneously, improve latency hiding, and/or increase the efficiency of both shader 302 and system memory.

FIG. 5 is a flowchart of an example method 500 of GPR deallocation in accordance with one or more techniques of this disclosure. Method 500 may be performed by a GPU, a CPU, a command processor, an apparatus for graphics processing, a wireless communication device, or the like, as used in connection with the examples of FIGS. 1-4.

At 502, the preamble shader may fetch the defined constants (e.g., to perform the method of GPR deallocation). For example, referring to FIG. 2, preamble shader 224 executing on shader core 216 may fetch constants 206 from graphics API 204. The method of GPR deallocation may be performed by a preamble shader within the executable shader, before the main portion of the executable shader, when executed by the GPU. Additionally or alternatively, the method of GPR deallocation may be performed within the main portion of the executable shader when executed by the GPU and may be applied to subsequent invocations of the main portion of the executable shader within the draw call. For example, referring to FIG. 2, preamble shader 224 and/or main shader 226 may be executed on shader core 216 of GPU 212 to perform the method of GPR deallocation. In a further configuration, the method of GPR deallocation may be performed by a command processor within the GPU. For example, referring to FIG. 2, command engine 214 may execute commands from the command stream on GPU 212 to perform the method of GPR deallocation.

At 504, the defined constants may be stored in a local constant buffer that is accessible within the executable shader. For example, referring to FIG. 2, constants 206 may be stored in constant memory 220, which is accessible to shader core 216. In further aspects, constants 206 may be stored in GPU memory 222.

At 506, a number of branches within the executable shader may be determined. For example, referring to FIG. 2, the number of branches may be determined by the GPU 212 based on the different values of the constants 206 that may be used in executing the preamble shader 224 and/or the main shader 226. The constants may be undefined when the executable shader is compiled. Thus, the constants 206 that may be used by the executable shader may assume values that are not determined by the shader compiler 208 at compile time but that may be determined by the executable shader at runtime.

At 508, GPRs may be allocated based on the determined number of branches. For example, referring to FIG. 2, the GPRs 218 may be allocated to the shader core 216 for executing the branches of the preamble shader 224 and/or the main shader 226. When GPR deallocation is to be performed, a number of GPRs may be deallocated from the allocated GPRs.
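As a rough sketch of 506 and 508, the allocation can be modeled as a worst case over all branches, since the constant values are unknown at compile time. This worst-case policy, the function name `allocate_gprs`, and the per-branch figures (4 and 20, echoing the FIG. 4 discussion) are assumptions made for illustration, not details stated in the disclosure.

```python
def allocate_gprs(branch_gpr_needs):
    """Return a worst-case GPR allocation that covers every branch in the shader."""
    if not branch_gpr_needs:
        raise ValueError("shader must contain at least one branch")
    # With constants undefined at compile time, any branch might run,
    # so the allocation is sized for the most demanding one (an assumption).
    return max(branch_gpr_needs.values())


# Hypothetical per-branch GPR requirements.
branch_gpr_needs = {"simple": 4, "complex": 20}
num_branches = len(branch_gpr_needs)          # step 506
allocated = allocate_gprs(branch_gpr_needs)   # step 508
```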

At 510, at least one unused branch within the executable shader may be determined based on the constants defined for the executable shader. For example, referring to FIG. 2, the constants 206 fetched by the preamble shader 224 or the main shader 226 may be used to determine at least one branch that is unused in the main shader 226.
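One way to picture the determination at 510 is to treat each branch as guarded by a condition on the constants and to mark a branch as unused when its guard is false for the constants actually defined. The guard representation and the names `find_unused_branches` and `use_complex_branch` are hypothetical.

```python
def find_unused_branches(branch_conditions, constants):
    """Return the names of branches whose guard is false for the given constants."""
    return [name for name, guard in branch_conditions.items() if not guard(constants)]


# Hypothetical guards: the complex branch runs only when the constant is nonzero.
branch_conditions = {
    "complex": lambda c: c["use_complex_branch"] != 0,
    "simple": lambda c: c["use_complex_branch"] == 0,
}

constants = {"use_complex_branch": 0}
unused = find_unused_branches(branch_conditions, constants)
```

With the constant set to 0, only the complex branch is reported as unused, which matches the FIG. 4 scenario in which a local constant of 0 indicates an unused branch.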

At 512, a number of GPRs that can be deallocated from the allocated GPRs may be determined based on the at least one unused branch. For example, referring to FIG. 4, the instructions 400 indicate that 16 GPRs may be deallocated from 20 allocated GPRs based on, e.g., a local constant of 0 that indicates an unused branch. Determining the number of GPRs that can be deallocated based on the at least one unused branch may further include determining a total number of GPRs required for execution of the executable shader in the absence of the at least one unused branch, and determining the number of GPRs that can be deallocated based on the number of allocated GPRs in excess of the determined total number. For example, referring to FIG. 4, the instructions 400 indicate that 4 GPRs may be required to execute the executable shader when the complex branch is absent. As such, the instructions 400 may determine that 16 of the 20 allocated GPRs are in excess of the 4 required GPRs and can be deallocated.
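The arithmetic at 512 (20 allocated, 4 required, 16 releasable) can be written as a short sketch. The function name `releasable_gprs` and the dictionary of per-branch requirements are hypothetical; only the figures 20, 4, and 16 come from the FIG. 4 discussion.

```python
def releasable_gprs(allocated, branch_gpr_needs, unused_branches):
    """Return how many allocated GPRs exceed what the remaining branches require."""
    # GPR needs of the branches that can still execute.
    used_needs = [need for name, need in branch_gpr_needs.items()
                  if name not in unused_branches]
    # Total GPRs required once the unused branches are set aside.
    required = max(used_needs, default=0)
    # Only the excess over the requirement can be released; never a negative count.
    return max(allocated - required, 0)


branch_gpr_needs = {"simple": 4, "complex": 20}
count = releasable_gprs(20, branch_gpr_needs, unused_branches={"complex"})
```

If no branch is unused, the same function returns 0, so the full allocation is kept.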

At 514, for a subsequent thread within a draw call, the number of GPRs may be deallocated from the allocated GPRs during execution of the executable shader based on the determined number of GPRs. For example, referring to FIG. 4, the instructions 400 indicate that, for subsequent threads within the draw call, 16 GPRs may be deallocated from the 20 allocated GPRs during execution of the executable shader. In aspects, the number of GPRs may be deallocated before the executable shader executes. For example, referring to FIG. 2, the GPRs 218 may be deallocated by the preamble shader 224 before the main shader 226 executes.
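The effect of 514 on the threads of a draw call can be illustrated with a small model. It assumes, purely for illustration, that the first thread runs with the full allocation while the determination is made and that every subsequent thread runs with the reduced footprint. The 80-entry register file and the names `thread_footprints` and `resident_threads` are hypothetical; the resident-thread figure is one plausible consequence of a smaller footprint, not a result stated in the disclosure.

```python
def thread_footprints(num_threads, allocated, releasable):
    """Return the GPR footprint of each thread in a draw call."""
    if releasable > allocated:
        raise ValueError("cannot release more GPRs than were allocated")
    reduced = allocated - releasable
    # Thread 0 keeps the full allocation; subsequent threads use the reduced one.
    return [allocated if index == 0 else reduced for index in range(num_threads)]


def resident_threads(register_file_size, footprint):
    """Return how many threads of a given footprint fit in a register file."""
    return register_file_size // footprint


footprints = thread_footprints(4, allocated=20, releasable=16)
before = resident_threads(80, 20)  # threads that fit at the full footprint
after = resident_threads(80, 4)    # threads that fit at the reduced footprint
```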

FIG. 6 is a conceptual data flow diagram 600 illustrating the data flow between different means/components in an example apparatus 602. The apparatus 602 may be a GPU, a command processor, an apparatus having an executable shader, a wireless communication device, or another similar apparatus.

The apparatus 602 includes a determination component 604 that may determine, after compilation of an executable shader, that constants are undefined for branches of the executable shader. Based on the undefined constants, a fetcher component 606 included in the apparatus 602 executes a system call to fetch the defined constants from a CPU 650. For example, as described in connection with 502, the fetcher component 606 may fetch the defined constants via a preamble shader. The apparatus 602 includes a storage component 608 that receives the constants from the CPU 650 and stores them in a local constant buffer. For example, as described in connection with 504, the storage component 608 may store the defined constants in a local constant buffer accessible within the executable shader.

The determination component 604 may determine a total number of branches within the executable shader based on the constants retrieved from the local constant buffer. For example, as described in connection with 506, the determination component 604 may determine the number of branches within the executable shader. The apparatus 602 further includes an allocation component 610 that may allocate GPRs to the executable shader. For example, as described in connection with 508, the allocation component 610 may allocate the GPRs based on the determined number of branches.

After receiving the allocated GPRs, the determination component 604 may determine the unused branches of the executable shader and, based on the determined unused branches, a number of the allocated GPRs to be deallocated. For example, as described in connection with 510, the determination component 604 may determine at least one unused branch within the executable shader based on the constants defined for the executable shader. As described in connection with 512, the determination component 604 may further determine a number of GPRs that can be deallocated from the allocated GPRs based on the at least one unused branch.

The apparatus 602 includes a deallocation component 612 that deallocates GPRs from the executable shader for a subsequent thread within the draw call. For example, as described in connection with 514, the deallocation component 612 may, for the subsequent thread within the draw call, deallocate the number of GPRs from the allocated GPRs during execution of the executable shader based on the determined number of GPRs.
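The data flow among components 604 through 612 can be summarized as a single pass in which each stage hands its result to the next. This is a sketch under stated assumptions: the function `run_pipeline`, the `branch_table` layout (branch name mapped to a GPR requirement and a guard on the constants), and the worst-case allocation policy are hypothetical and are not taken from the disclosure.

```python
def run_pipeline(cpu_constants, branch_table):
    """Trace one pass through the components of apparatus 602 (hypothetical model)."""
    # Fetcher 606 and storage 608: fetch constants and keep a local copy.
    local_buffer = dict(cpu_constants)

    # Determination 604: count the branches in the shader.
    num_branches = len(branch_table)

    # Allocation 610: size the allocation for the most demanding branch (assumption).
    allocated = max(need for need, _guard in branch_table.values())

    # Determination 604: find unused branches, then the releasable GPR count.
    unused = [name for name, (_need, guard) in branch_table.items()
              if not guard(local_buffer)]
    required = max((need for name, (need, _guard) in branch_table.items()
                    if name not in unused), default=0)
    releasable = allocated - required

    # Deallocation 612: footprint applied to subsequent threads in the draw call.
    remaining = allocated - releasable
    return {"branches": num_branches, "allocated": allocated, "unused": unused,
            "releasable": releasable, "remaining": remaining}


branch_table = {
    "simple": (4, lambda c: c["use_complex_branch"] == 0),
    "complex": (20, lambda c: c["use_complex_branch"] != 0),
}
result = run_pipeline({"use_complex_branch": 0}, branch_table)
```

The sketch expects a non-empty `branch_table`; an empty table would make the allocation step raise `ValueError`.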

The apparatus 602 may include additional components that perform each of the blocks of the algorithm in the aforementioned flowchart of FIG. 5. As such, each block in the aforementioned flowchart of FIG. 5 may be performed by a component, and the apparatus 602 may include one or more of those components. The components may be one or more hardware components specifically configured to carry out the stated processes/algorithm, implemented by a processor configured to perform the stated processes/algorithm (e.g., logic and/or code executed by a processor), stored within a computer-readable medium for implementation by a processor, or some combination thereof.

It is understood that the specific order or hierarchy of blocks in the processes/flowcharts disclosed is an illustration of example approaches. Based upon design preferences, it is understood that the specific order or hierarchy of blocks in the processes/flowcharts may be rearranged. Further, some blocks may be combined or omitted. The accompanying method claims present elements of the various blocks in a sample order and are not meant to be limited to the specific order or hierarchy presented.

The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean "one and only one" unless specifically so stated, but rather "one or more." The word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any aspect described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects.

Unless specifically stated otherwise, the term "some" refers to one or more, and the term "or" may be interpreted as "and/or" where context does not dictate otherwise. Combinations such as "at least one of A, B, or C," "one or more of A, B, or C," "at least one of A, B, and C," "one or more of A, B, and C," and "A, B, C, or any combination thereof" include any combination of A, B, and/or C, and may include multiples of A, multiples of B, or multiples of C. Specifically, such combinations may be A only, B only, C only, A and B, A and C, B and C, or A and B and C, where any such combination may contain one or more members of A, B, or C. All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. The words "module," "mechanism," "element," "device," and the like may not be a substitute for the word "means." As such, no claim element is to be construed as a means plus function unless the element is expressly recited using the phrase "means for."

In one or more examples, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. For example, although the term "processing unit" has been used throughout this disclosure, such processing units may be implemented in hardware, software, firmware, or any combination thereof. If any function, processing unit, technique described herein, or other module is implemented in software, the function, processing unit, technique described herein, or other module may be stored on or transmitted over a computer-readable medium as one or more instructions or code.

Computer-readable media may include computer data storage media or communication media, including any medium that facilitates transfer of a computer program from one place to another. In this manner, computer-readable media generally may correspond to (1) tangible computer-readable storage media, which are non-transitory, or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code, and/or data structures for implementation of the techniques described in this disclosure. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, compact disc-read only memory (CD-ROM) or other optical disk storage, magnetic disk storage, or other magnetic storage devices. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs usually reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media. A computer program product may include a computer-readable medium.

The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC), or a set of ICs, e.g., a chip set. Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but they do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in any hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware. Accordingly, the term "processor," as used herein, may refer to any of the foregoing structures or to any other structure suitable for implementation of the techniques described herein. Also, the techniques may be fully implemented in one or more circuits or logic elements.

Various examples have been described. These and other examples are within the scope of the following claims.

Claims

A method of general purpose register (GPR) deallocation, comprising:
determining at least one unused branch within an executable shader based on constants defined for the executable shader;
determining a number of GPRs that can be deallocated from allocated GPRs based on the at least one unused branch; and
deallocating, for a subsequent thread within a draw call, the number of GPRs from the allocated GPRs during execution of the executable shader based on the determined number of GPRs.

The method of claim 1,
wherein determining the number of GPRs that can be deallocated based on the at least one unused branch further comprises:
determining a total number of GPRs required for execution of the executable shader in the absence of the at least one unused branch; and
determining the number of GPRs that can be deallocated based on a number of the allocated GPRs in excess of the determined total number of GPRs.

The method of claim 1,
wherein the number of GPRs is deallocated before execution of the executable shader.

The method of claim 1,
wherein the method is performed by a preamble shader within the executable shader, before a main portion of the executable shader, when executed by a graphics processing unit (GPU).

The method of claim 4, further comprising:
fetching the defined constants by the preamble shader; and
storing the defined constants in a local constant buffer accessible within the executable shader.

The method of claim 1,
wherein the method is performed within a main portion of the executable shader when executed by a graphics processing unit (GPU) and is applied to subsequent invocations of the main portion of the executable shader within the draw call.

The method of claim 1,
wherein the method is performed by a command processor within a graphics processing unit (GPU).

The method of claim 1, further comprising:
determining a number of branches within the executable shader; and
allocating GPRs based on the determined number of branches, wherein the number of GPRs is deallocated from the allocated GPRs.

The method of claim 1,
wherein the constants are undefined at compile time of the executable shader.

An apparatus for general purpose register (GPR) deallocation, comprising:
a memory; and
at least one processor coupled to the memory,
the at least one processor configured to:
determine at least one unused branch within an executable shader based on constants defined for the executable shader;
determine a number of GPRs that can be deallocated from allocated GPRs based on the at least one unused branch; and
deallocate, for a subsequent thread within a draw call, the number of GPRs from the allocated GPRs during execution of the executable shader based on the determined number of GPRs.

The apparatus of claim 10,
wherein the at least one processor configured to determine the number of GPRs that can be deallocated based on the at least one unused branch is further configured to:
determine a total number of GPRs required for execution of the executable shader in the absence of the at least one unused branch; and
determine the number of GPRs that can be deallocated based on a number of the allocated GPRs in excess of the determined total number of GPRs.

The apparatus of claim 10,
wherein the number of GPRs is deallocated before execution of the executable shader.

The apparatus of claim 10,
wherein the operations of the at least one processor are performed by a preamble shader within the executable shader, before a main portion of the executable shader, when executed by a graphics processing unit (GPU).

The apparatus of claim 13,
wherein the at least one processor is further configured to:
fetch the defined constants by the preamble shader; and
store the defined constants in a local constant buffer accessible within the executable shader.

The apparatus of claim 10,
wherein the operations of the at least one processor are performed within a main portion of the executable shader when executed by a graphics processing unit (GPU) and are applied to subsequent invocations of the main portion of the executable shader within the draw call.

The apparatus of claim 10,
wherein the operations of the at least one processor are performed by a command processor within a graphics processing unit (GPU).

The apparatus of claim 10,
wherein the at least one processor is further configured to:
determine a number of branches within the executable shader; and
allocate GPRs based on the determined number of branches, wherein the number of GPRs is deallocated from the allocated GPRs.

The apparatus of claim 10,
wherein the constants are undefined at compile time of the executable shader.

The apparatus of claim 10,
wherein the apparatus is a wireless communication device.

A computer-readable storage medium storing computer-executable code,
the code, when executed by at least one processor, causing the at least one processor to:
determine at least one unused branch within an executable shader based on constants defined for the executable shader;
determine a number of GPRs that can be deallocated from allocated GPRs based on the at least one unused branch; and
deallocate the number of GPRs from the allocated GPRs during execution of the executable shader within a draw call based on the determined number of GPRs.