KR20230160893A

KR20230160893A - Wavefront selection and execution

Info

Publication number: KR20230160893A
Application number: KR1020237036512A
Authority: KR
Inventors: Maxim V. Kazakov
Original assignee: Advanced Micro Devices, Inc.
Priority date: 2021-03-31
Filing date: 2022-03-02
Publication date: 2023-11-24
Also published as: US20230266975A1; US20220318021A1; WO2022211962A1; CN117099081A; US11656877B2; JP2024511764A; EP4315061A1

Abstract

Techniques for executing wavefronts are provided. The techniques include: at a first time for issuing instructions for execution, performing a first identification, including identifying that sufficient processing resources exist to execute a first set of instructions together within a processing lane; in response to the first identification, executing the first set of instructions together; at a second time for issuing instructions for execution, performing a second identification, including identifying that no instructions are available for which sufficient processing resources exist for execution together within the processing lane; and in response to the second identification, executing an instruction independently of any other instruction.

Description

Wavefront selection and execution

This application claims the benefit of pending U.S. Non-Provisional Patent Application No. 17/219,775, entitled "WAVEFRONT SELECTION AND EXECUTION," filed on March 31, 2021, the entire contents of which are hereby incorporated by reference herein.

Graphics processing units include parallel processing elements that execute shader programs in a highly parallel manner. Improvements to the efficiency of shader program execution are constantly being made.

A more detailed understanding can be had from the following description, given by way of example in conjunction with the accompanying drawings.
FIG. 1 is a block diagram of an example device in which one or more features of the present disclosure can be implemented.
FIG. 2 is a block diagram of the device of FIG. 1, illustrating additional detail.
FIG. 3 is a block diagram illustrating a graphics processing pipeline, according to an example.
FIG. 4 illustrates details of a SIMD unit, according to an example.
FIG. 5 is a flow diagram of a method for executing instructions, according to an example.

FIG. 1 is a block diagram of an example device 100 in which one or more features of the present disclosure can be implemented. The device 100 can include, for example, a computer, a gaming device, a handheld device, a set-top box, a television, a mobile phone, or a tablet computer. The device 100 includes a processor 102, a memory 104, a storage 106, one or more input devices 108, and one or more output devices 110. The device 100 can also optionally include an input driver 112 and an output driver 114. It is understood that the device 100 can include additional components not shown in FIG. 1.

In various alternatives, the processor 102 includes a central processing unit (CPU), a graphics processing unit (GPU), a CPU and a GPU located on the same die, or one or more processor cores, where each processor core can be a CPU or a GPU. In various alternatives, the memory 104 is located on the same die as the processor 102, or is located separately from the processor 102. The memory 104 includes a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache.

The storage 106 includes a fixed or removable storage, for example, a hard disk drive, a solid state drive, an optical disk, or a flash drive. The input devices 108 include, without limitation, a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 110 include, without limitation, a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).

The input driver 112 communicates with the processor 102 and the input devices 108, and permits the processor 102 to receive input from the input devices 108. The output driver 114 communicates with the processor 102 and the output devices 110, and permits the processor 102 to send output to the output devices 110. It is noted that the input driver 112 and the output driver 114 are optional components, and that the device 100 will operate in the same manner if the input driver 112 and the output driver 114 are not present. The output driver 114 includes an accelerated processing device ("APD") 116 which is coupled to a display device 118. The APD 116 accepts compute commands and graphics rendering commands from the processor 102, processes those compute and graphics rendering commands, and provides pixel output to the display device 118 for display. As described in further detail below, the APD 116 includes one or more parallel processing units to perform computations in accordance with a single-instruction-multiple-data ("SIMD") paradigm. Thus, although various functionality is described herein as being performed by or in conjunction with the APD 116, in various alternatives, the functionality described as being performed by the APD 116 is additionally or alternatively performed by other computing devices having similar capabilities that are not driven by a host processor (e.g., the processor 102) and that provide graphical output to the display device 118. For example, it is contemplated that any processing system that performs processing tasks in accordance with a SIMD paradigm can perform the functionality described herein. Alternatively, it is contemplated that computing systems that do not perform processing tasks in accordance with a SIMD paradigm perform the functionality described herein.

FIG. 2 is a block diagram of the device 100, illustrating additional details related to execution of processing tasks on the APD 116. The processor 102 maintains, in system memory 104, one or more control logic modules for execution by the processor 102. The control logic modules include an operating system 120, a driver 122, and applications 126. These control logic modules control various features of the operation of the processor 102 and the APD 116. For example, the operating system 120 communicates directly with hardware and provides an interface to the hardware for other software executing on the processor 102. The driver 122 controls operation of the APD 116 by, for example, providing an application programming interface ("API") to software (e.g., applications 126) executing on the processor 102 to access various functionality of the APD 116. The driver 122 also includes a just-in-time (JIT) compiler that compiles programs for execution by processing components of the APD 116 (such as the SIMD units 138 discussed in further detail below).

The APD 116 executes commands and programs for selected functions, such as graphics operations and non-graphics operations that may be suited for parallel processing. The APD 116 can be used for executing graphics pipeline operations, such as pixel operations and geometric computations, and for rendering an image to the display device 118 based on commands received from the processor 102. The APD 116 also executes compute processing operations that are not directly related to graphics operations, such as operations related to video, physics simulations, computational fluid dynamics, or other tasks, based on commands received from the processor 102.

The APD 116 includes compute units 132 that include one or more SIMD units 138 that perform operations at the request of the processor 102 in a parallel manner according to a SIMD paradigm. The SIMD paradigm is one in which multiple processing elements share a single program control flow unit and program counter and thus execute the same program, but are able to execute that program with different data. In one example, each SIMD unit 138 includes sixteen lanes, where each lane executes the same instruction at the same time as the other lanes in the SIMD unit 138 but can execute that instruction with different data. Lanes can be switched off with predication if not all lanes need to execute a given instruction. Predication can also be used to execute programs with divergent control flow. More specifically, for programs with conditional branches or other instructions where control flow is based on calculations performed by an individual lane, predication of lanes corresponding to control flow paths not currently being executed, and serial execution of different control flow paths, allows for arbitrary control flow.
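The predication scheme described above can be illustrated with a small software emulation. The following is only a sketch under stated assumptions and not the disclosed hardware: the names `NUM_LANES`, `execute_masked`, and `run_divergent_branch` are hypothetical, and the branch being emulated (halve even values, otherwise compute 3x + 1) is chosen purely for illustration. Each control flow path is executed serially, with the lanes that did not take that path switched off.

```python
from typing import Callable, List

NUM_LANES = 16  # lane count taken from the sixteen-lane example in the text


def execute_masked(op: Callable[[int], int], mask: List[bool],
                   values: List[int]) -> List[int]:
    """Apply one instruction to every lane whose predicate bit is set.

    Lanes that are switched off keep their previous value.
    """
    return [op(v) if on else v for v, on in zip(values, mask)]


def run_divergent_branch(data: List[int]) -> List[int]:
    """Emulate `x = x // 2 if x % 2 == 0 else 3 * x + 1` on all lanes."""
    assert len(data) == NUM_LANES
    # Every lane evaluates the branch condition on its own data.
    taken = [x % 2 == 0 for x in data]
    not_taken = [not t for t in taken]
    # Pass 1: the "then" path runs with the "else" lanes predicated off.
    result = execute_masked(lambda x: x // 2, taken, data)
    # Pass 2: the "else" path runs with the "then" lanes predicated off.
    result = execute_masked(lambda x: 3 * x + 1, not_taken, result)
    return result
```

Because the two passes use complementary masks, each lane is modified by exactly one of the two paths, which is what allows a single shared program counter to serve lanes that branch differently.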

The basic unit of execution in the compute units 132 is a work-item. Each work-item represents a single instantiation of a program that is to be executed in parallel in a particular lane. Work-items can be executed simultaneously as a "wavefront" on a single SIMD unit 138. One or more wavefronts are included in a "work group," which includes a collection of work-items designated to execute the same program. A work group can be executed by executing each of the wavefronts that make up the work group. In alternatives, the wavefronts are executed sequentially on a single SIMD unit 138, or partially or fully in parallel on different SIMD units 138.
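The relationship between work-items, wavefronts, and work groups can be sketched as a simple partitioning step. This is an illustrative assumption rather than the disclosed scheduling logic: the function name `split_into_wavefronts` and the default wavefront size of 16 (one work-item per lane of the sixteen-lane example) are hypothetical, and real hardware may use a different wavefront size.

```python
from typing import List


def split_into_wavefronts(work_group_size: int,
                          wavefront_size: int = 16) -> List[List[int]]:
    """Partition the work-item IDs of one work group into wavefronts.

    The last wavefront may be partially filled; its unused lanes would
    simply be predicated off during execution.
    """
    ids = list(range(work_group_size))
    return [ids[i:i + wavefront_size]
            for i in range(0, work_group_size, wavefront_size)]
```

For example, a work group of 40 work-items yields three wavefronts of 16, 16, and 8 work-items, which could then run sequentially on one SIMD unit or in parallel on several.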

The command processor 136 performs operations related to scheduling various work groups on different compute units 132 and SIMD units 138. In general, the command processor 136 receives commands from an entity such as the processor 102, where the commands instruct the APD 116 to perform tasks such as graphics rendering, executing general purpose shaders, or the like.

The parallelism afforded by the compute units 132 is suitable for graphics-related operations such as pixel value calculations, vertex transformations, and other graphics operations. Thus, in some instances, a graphics pipeline 134, which accepts graphics processing commands from the processor 102, provides computation tasks to the compute units 132 for execution in parallel.

The compute units 132 are also used to perform computation tasks that are not related to graphics or that are not performed as part of the "normal" operation of the graphics pipeline 134 (e.g., custom operations performed to supplement processing performed for operation of the graphics pipeline 134). An application 126 or other software executing on the processor 102 transmits programs that define such computation tasks to the APD 116 for execution.

FIG. 3 is a block diagram showing additional details of the graphics processing pipeline 134 illustrated in FIG. 2. The graphics processing pipeline 134 includes stages that each perform specific functionality. The stages represent subdivisions of the functionality of the graphics processing pipeline 134. Each stage is implemented partially or fully as shader programs executing in the programmable processing units 202, or partially or fully as fixed-function, non-programmable hardware external to the programmable processing units 202.

The input assembler stage 302 reads primitive data from user-filled buffers (e.g., buffers filled at the request of software executed by the processor 102, such as an application 126) and assembles the data into primitives for use by the remainder of the pipeline. The input assembler stage 302 can generate different types of primitives based on the primitive data included in the user-filled buffers. The input assembler stage 302 formats the assembled primitives for use by the rest of the pipeline.

The vertex shader stage 304 processes vertices of the primitives assembled by the input assembler stage 302. The vertex shader stage 304 performs various per-vertex operations such as transformations, skinning, morphing, and per-vertex lighting. Transformation operations include various operations to transform the coordinates of the vertices. These operations include one or more of modeling transformations, viewing transformations, projection transformations, perspective division, and viewport transformations. Herein, such transformations are considered to modify the coordinates or "position" of the vertices on which the transforms are performed. Other operations of the vertex shader stage 304 modify attributes other than the coordinates.
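The chain of coordinate transformations listed above can be illustrated as follows. This is a minimal sketch under stated assumptions, not the vertex shader of the disclosure: the function names are hypothetical, the modeling, viewing, and projection transformations are assumed to be pre-multiplied into a single 4x4 row-major matrix `mvp`, and the viewport convention (origin at the top-left, y pointing down, normalized device coordinates in the range -1 to 1) is one common choice among several.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def mat_vec(m: Matrix, v: Sequence[float]) -> List[float]:
    """Multiply a 4x4 row-major matrix by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def perspective_divide(clip: Sequence[float]) -> List[float]:
    """Convert clip-space (x, y, z, w) to normalized device coordinates."""
    x, y, z, w = clip
    return [x / w, y / w, z / w]


def viewport_transform(ndc: Sequence[float], width: int,
                       height: int) -> List[float]:
    """Map NDC in [-1, 1] to pixel coordinates (assumed top-left origin)."""
    x, y, z = ndc
    return [(x + 1.0) * 0.5 * width, (1.0 - y) * 0.5 * height, z]


def transform_vertex(position: Sequence[float], mvp: Matrix,
                     width: int, height: int) -> List[float]:
    """Apply model-view-projection, perspective division, then viewport."""
    clip = mat_vec(mvp, position)
    return viewport_transform(perspective_divide(clip), width, height)
```

With an identity matrix and an 800 by 600 viewport, the vertex (0, 0, 0, 1) lands at the screen center, (400, 300).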

The vertex shader stage 304 is implemented partially or fully as vertex shader programs to be executed on one or more compute units 132. The vertex shader programs are provided by the processor 102 and are based on programs that are pre-written by a computer programmer. The driver 122 compiles such computer programs to generate the vertex shader programs having a format suitable for execution within the compute units 132.

The hull shader stage 306, tessellator stage 308, and domain shader stage 310 work together to implement tessellation, which converts simple primitives into more complex primitives by subdividing the primitives. The hull shader stage 306 generates a patch for the tessellation based on an input primitive. The tessellator stage 308 generates a set of samples for the patch. The domain shader stage 310 calculates vertex positions for the vertices corresponding to the samples for the patch. The hull shader stage 306 and the domain shader stage 310 can be implemented as shader programs to be executed on the programmable processing units 202.

The geometry shader stage 312 performs vertex operations on a primitive-by-primitive basis. A variety of different types of operations can be performed by the geometry shader stage 312, including operations such as point sprite expansion, dynamic particle system operations, fur-fin generation, shadow volume generation, single pass render-to-cubemap, per-primitive material swapping, and per-primitive material setup. In some instances, a shader program that executes on the programmable processing units 202 performs operations for the geometry shader stage 312.

The rasterizer stage 314 accepts and rasterizes simple primitives generated upstream. Rasterization includes determining which screen pixels (or sub-pixel samples) are covered by a particular primitive. Rasterization is performed by fixed-function hardware.
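The coverage determination described above can be sketched in software, although the text states that the actual rasterizer is fixed-function hardware. The following uses the common edge-function test evaluated at pixel centers; this technique is an assumption chosen for illustration and is not stated in the source. The names `edge`, `covers`, and `rasterize` are hypothetical, and tie-breaking fill rules for pixels that lie exactly on a shared edge are deliberately omitted.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def edge(a: Point, b: Point, p: Point) -> float:
    """Signed area test: which side of the directed edge a->b is p on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covers(tri: Sequence[Point], p: Point) -> bool:
    """True if p is inside or on the triangle, for either winding order."""
    w0 = edge(tri[0], tri[1], p)
    w1 = edge(tri[1], tri[2], p)
    w2 = edge(tri[2], tri[0], p)
    all_non_negative = w0 >= 0 and w1 >= 0 and w2 >= 0
    all_non_positive = w0 <= 0 and w1 <= 0 and w2 <= 0
    return all_non_negative or all_non_positive


def rasterize(tri: Sequence[Point], width: int,
              height: int) -> List[Tuple[int, int]]:
    """Return the pixels whose centers (x + 0.5, y + 0.5) are covered."""
    return [(x, y)
            for y in range(height)
            for x in range(width)
            if covers(tri, (x + 0.5, y + 0.5))]
```

Each pixel is tested independently of every other pixel, which is one reason this kind of computation maps well to parallel hardware.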

The pixel shader stage 316 calculates output values for screen pixels based on the primitives generated upstream and the results of rasterization. The pixel shader stage 316 can apply textures from texture memory. Operations for the pixel shader stage 316 are performed by a shader program that executes on the programmable processing units 202.

The output merger stage 318 accepts output from the pixel shader stage 316 and merges those outputs, performing operations such as z-testing and alpha blending to determine the final color for a screen pixel.
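The two merge operations named above, z-testing and alpha blending, can be sketched for a single pixel as follows. This is a simplified illustration under stated assumptions rather than the disclosed output merger: the function name `merge_fragment` is hypothetical, the depth test is assumed to be "less than" (smaller depth is closer), the blend is the conventional "source over destination" formula, and the depth value is written even for translucent fragments for simplicity.

```python
from typing import Tuple

Color = Tuple[float, float, float]


def merge_fragment(dst_color: Color, dst_depth: float,
                   src_color: Color, src_alpha: float,
                   src_depth: float) -> Tuple[Color, float]:
    """Merge one incoming fragment into the stored pixel.

    Returns the new (color, depth) pair for the pixel.
    """
    # Z-test: discard the fragment if it is not closer than what is stored.
    if src_depth >= dst_depth:
        return dst_color, dst_depth
    # Alpha blending: weight the new color by its opacity.
    r, g, b = (src_alpha * s + (1.0 - src_alpha) * d
               for s, d in zip(src_color, dst_color))
    return (r, g, b), src_depth
```

A half-transparent orange fragment in front of a black background therefore yields a darker orange, while the same fragment behind the stored depth leaves the pixel unchanged.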

Although the graphics processing pipeline 134 is illustrated as being included within the APD 116, it should be understood that implementations of the APD 116 that do not include the graphics processing pipeline 134 (but that do include compute units 132 that execute shader programs, such as general purpose compute shader programs) are contemplated by the present disclosure.

As described elsewhere herein, wavefronts include a plurality of work-items. The SIMD units 138 execute wavefronts by executing the work-items of a wavefront together, in lockstep, across the lanes of the SIMD unit 138. The SIMD units 138 facilitate highly parallel execution in an efficient manner by utilizing the same instruction flow circuitry (e.g., instruction fetch and instruction pointer adjustment circuitry) for multiple instances of execution (the work-items). Each lane is provided with functional hardware for executing one instance of an instruction on the data for one work-item (e.g., arithmetic logic units and other circuitry for performing the instructions of the instruction set architecture of the APD 116, such as addition, subtraction, multiplication, division, matrix multiplication, transcendental functions, and other functions).

The hardware for executing all of these instruction types for one work-item per lane is, when certain conditions are met, sufficient to execute instructions together for multiple work-items per lane at the same time. For example, because the SIMD units 138 can perform a matrix multiplication operation for each work-item, and because matrix multiplication involves a relatively large number of multiply-add operations, the SIMD units 138 have the hardware to execute multiple "simple" instructions, such as multiply, add, or fused multiply-add instructions, at a higher rate than instructions for which less such functional hardware exists. Although an increased execution rate for some instructions can be achieved in some cases, the hardware of the SIMD units 138 is sufficient to unconditionally support only the "normal" execution rate. In other words, limitations on execution rate exist for other reasons, such as limitations related to register file bandwidth. Nevertheless, the SIMD unit 138 of the present disclosure attempts to execute instructions at a higher rate in a number of situations.

The heightened execution rate described above is sometimes referred to herein as an "increased execution rate." Execution of instructions without performing the techniques for increasing the execution rate is sometimes referred to herein as being performed at a "normal execution rate." In one implementation, the distinguishing feature between the increased execution rate and the normal execution rate is that at the increased execution rate, each lane executes multiple instructions, whereas at the normal execution rate, each lane executes one instruction. It is possible for the multiple instructions that execute in each lane at the increased execution rate to come from different wavefronts. Alternatively, it is possible for the multiple instructions to come from the same wavefront. For example, it is possible for the next instruction in program order to execute together with the instruction that follows it in program order, from the same wavefront.

FIG. 4 illustrates details of the SIMD unit 138, according to an example. In some examples, each of the illustrated blocks of the SIMD unit 138 (e.g., the arbiter 404, the functional units 406, the vector register file 412, or the scalar register file 414) is implementable as hard-wired circuitry configured to perform the operations described herein. In other examples, some or all of the illustrated blocks of the SIMD unit 138 are implemented as software executing on a processor, or as a combination of software and hardware circuitry.

The SIMD unit 138 of FIG. 4 includes functional units 406, an arbiter 404, a vector register file 412, and a scalar register file 414. The arbiter 404 is configured to select one or more pending instructions 403 from one or more pending work groups 402 for execution. Each pending wavefront 402 is a wavefront that is assigned to the SIMD unit 138 for execution. In various examples, this assignment is performed by a scheduler within the compute unit 132 or by another scheduler not described in detail. In some implementations, the SIMD unit 138 is "oversubscribed" for latency hiding purposes. The term "oversubscribed" means that several work groups are assigned to the SIMD unit 138, and that the SIMD unit 138 does not have sufficient hardware (such as functional units 406) to execute all of those assigned work groups simultaneously. Thus, the SIMD unit 138 takes turns scheduling instructions from the different work groups. Latency hiding arises because work groups sometimes stall for various reasons, such as waiting for memory accesses to complete or waiting for long instructions to complete, and instructions from other work groups can be issued in the meantime.
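The effect of oversubscription on latency hiding can be sketched with a toy issue loop. This is an illustration under stated assumptions and not the arbiter of the disclosure: the names `Pending` and `schedule` are hypothetical, each pending entry (item 402 in the text) is reduced to a count of remaining instructions plus the cycle at which a stall ends, and a simple round-robin policy with one issue slot per cycle is assumed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pending:
    """One pending entry assigned to the SIMD unit (hypothetical model)."""
    name: str
    remaining: int          # instructions still to be issued
    stalled_until: int = 0  # first cycle at which the entry is ready again


def schedule(entries: List[Pending], max_cycles: int) -> List[Optional[str]]:
    """Issue at most one instruction per cycle, skipping stalled entries."""
    trace: List[Optional[str]] = []
    start = 0  # round-robin starting position
    for cycle in range(max_cycles):
        issued = None
        for offset in range(len(entries)):
            index = (start + offset) % len(entries)
            entry = entries[index]
            if entry.remaining > 0 and entry.stalled_until <= cycle:
                entry.remaining -= 1
                issued = entry.name
                start = (index + 1) % len(entries)
                break
        trace.append(issued)  # None marks a cycle in which nothing was ready
    return trace
```

With entry A ready immediately and entry B stalled until cycle 2 (for example, on a memory access), the trace is A, A, B, B: the issue slots are kept busy while B waits, whereas B alone would have left two cycles idle.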

Each pending work group 402 has one or more pending instructions 403, which represent instructions to be executed by the functional units 406. In some examples, the pending instructions 403 include the next instruction in program order. In some examples, the pending instructions 403 include more than one instruction for one or more of the work groups 402. In an example, the pending instructions 403 include the next instruction in program order and the instruction that follows that instruction in program order.

The arbiter 404 examines the pending instructions 403 and decides whether to issue instructions at the normal execution rate or at the increased execution rate. As stated elsewhere herein, the functional units 406 generally have sufficient resources to execute instructions at least at the normal execution rate. However, the functional units 406 do not have sufficient resources to execute instructions at the increased execution rate in all situations. At least the following two resources sometimes limit execution to the normal execution rate: register file bandwidth and functional unit execution resources. Regarding register file bandwidth, it is possible that the pending instructions 403 are of a nature that requires more bandwidth than the register files provide. In such a situation, the SIMD unit 138 is unable to operate at the increased execution rate and operates at the normal execution rate. Regarding functional unit execution resources, it is possible that there are no pending instructions 403 that can be executed together because too few execution resources exist in the functional units 406. In such a situation, the SIMD unit 138 executes at the normal execution rate. Where sufficient resources exist to execute at least two pending instructions 403 together, the SIMD unit 138 executes at the increased execution rate. The term "execute together" means that the functional units 406 perform, in the same clock cycle(s), the operations for multiple instructions that would be performed in different clock cycle(s) if not executed together. In an example where the instructions are add instructions, executing two instructions together means performing the add operation for each instruction in a single clock cycle. In this example, executing the two instructions together contrasts with not executing them together, in which case the add operations for the two different instructions are performed in different clock cycles. It should be noted that instructions are pipelined, meaning that instruction execution involves multiple sub-operations, each of which is performed in one or more clock cycles. To execute instructions together, a lane of the SIMD unit 138 can execute multiple sub-operations for different instructions in the same cycle(s).
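The arbiter's choice between the two rates can be modeled as a resource check over the pending instructions. The sketch below is a hypothetical software model, not the circuitry of the arbiter 404: the names `Instr`, `fits_together`, and `select_issue`, the representation of register file bandwidth as a count of read ports, the per-kind functional unit budget, and the "first pair that fits" selection policy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Instr:
    """A pending instruction with its (hypothetical) resource needs."""
    name: str
    reg_reads: int                     # register file reads it requires
    units: Tuple[Tuple[str, int], ...]  # (functional unit kind, count) pairs


def fits_together(a: Instr, b: Instr, reg_ports: int,
                  unit_budget: Dict[str, int]) -> bool:
    """Check register file bandwidth and functional unit availability."""
    if a.reg_reads + b.reg_reads > reg_ports:
        return False  # not enough register file bandwidth
    need: Dict[str, int] = {}
    for instr in (a, b):
        for kind, count in instr.units:
            need[kind] = need.get(kind, 0) + count
    # Every required unit kind must be available in sufficient quantity.
    return all(count <= unit_budget.get(kind, 0)
               for kind, count in need.items())


def select_issue(pending: List[Instr], reg_ports: int,
                 unit_budget: Dict[str, int]) -> List[Instr]:
    """Return two instructions (increased rate) or at most one (normal)."""
    for a, b in combinations(pending, 2):
        if fits_together(a, b, reg_ports, unit_budget):
            return [a, b]
    return pending[:1]
```

In this model, two add instructions that each need one adder and two register reads can issue together on a lane with two adders and four read ports, while a matrix multiplication that occupies both adders and both multipliers issues alone.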

As described elsewhere herein, one reason the SIMD units 138 are able to execute at the higher execution rate is that, because some instructions are complex and require a large number of “simple” functional units, enough functional units 406 exist to execute multiple instructions together. Thus, in some examples, executing instructions together means executing multiple instructions in the same clock cycle(s), where at least one of those instructions is a “simple” instruction that uses a “simple” functional unit that a complex instruction uses in other cycles. In an example, two addition instructions are executed together, and at least one of them uses an adder that is used for a matrix multiplication instruction executing at a different time.

As described above, the number of functional units sometimes limits whether multiple instructions can be executed together. More specifically, because the lanes do not include duplicate hardware for executing multiple instances of every instruction in the instruction set architecture, the SIMD unit 138 is unable to execute certain combinations of instruction types together. Specifically, the SIMD unit 138 cannot execute together two instances of an instruction when the functional units 406 do not include two instances of the functional units required to execute that instruction. However, it is sometimes possible for one instruction of a very simple type and another instruction of a more complex type to be executed together, if sufficient functional units exist to do so. In an example, addition, multiplication, and fused multiply-add instructions can be executed together. Other examples of simple instructions include simple math instructions (e.g., subtraction), bit manipulation instructions (e.g., bitwise AND, OR, XOR, shift, and the like), or other instructions. Some instructions, referred to as “co-executable complex instructions,” are instructions for which insufficient hardware exists to execute multiple such instructions together. Accordingly, the arbiter 404 does not select multiple such instructions for execution together. However, the SIMD unit 138 does have sufficient hardware to execute a co-executable complex instruction together with a simple instruction. Accordingly, the arbiter 404 is permitted to select a co-executable complex instruction for execution together with a simple instruction. Finally, some instructions, referred to as “non-co-executable complex instructions,” consume too many cycles or too many functional units 406 to be executable together even with simple instructions. Accordingly, the SIMD unit 138 never executes a non-co-executable complex instruction together with any other instruction.
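The three instruction categories above imply a simple pairing rule, which the following Python sketch illustrates. The names `Kind` and `can_pair_by_kind` are hypothetical, and the sketch assumes that any two simple instructions can be paired; the source leaves that dependent on how many copies of each simple functional unit a lane actually contains.

```python
from enum import Enum


class Kind(Enum):
    """Hypothetical labels for the three instruction categories in the text."""
    SIMPLE = "simple"
    CO_EXECUTABLE_COMPLEX = "co-executable complex"
    NON_CO_EXECUTABLE_COMPLEX = "non-co-executable complex"


def can_pair_by_kind(a: Kind, b: Kind) -> bool:
    """Return True if the category rule alone allows a and b to execute together.

    Assumption: two simple instructions may always be paired. Register file
    bandwidth is a separate check and is not modeled here.
    """
    # A non-co-executable complex instruction is never paired with anything.
    if Kind.NON_CO_EXECUTABLE_COMPLEX in (a, b):
        return False
    # Two co-executable complex instructions would need duplicated complex hardware.
    if a is Kind.CO_EXECUTABLE_COMPLEX and b is Kind.CO_EXECUTABLE_COMPLEX:
        return False
    # Remaining cases: simple + simple, or simple + co-executable complex.
    return True
```

This category check is only a first filter; a pair that passes it would still have to fit within the available register file bandwidth.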

In FIG. 4, the functional units 406 include simple functional units 408 and additional logic for complex instructions 410. The simple functional units 408 are functional units for executing simple instructions. In various examples, the simple functional units 408 include adders, multipliers, logic for bitwise manipulation instructions, and the like. The additional logic for complex instructions 410 includes hardware for co-executable complex instructions and for non-co-executable complex instructions. The simple functional units 408 are shared between simple instructions and at least some of the additional logic for complex instructions 410. For example, adders and multipliers are used for addition and multiplication instructions, as well as for more complex instructions that use multiplication and addition.
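Because simple functional units are shared with complex instructions, one way to picture the functional unit check is as a comparison of combined unit requirements against a per-lane inventory. The Python sketch below does this with hypothetical numbers: the inventory `LANE_UNITS`, the `REQUIREMENTS` table, and the function name are all assumptions made for illustration, since the source gives no unit counts.

```python
from collections import Counter

# Hypothetical per-lane inventory of simple functional units (not from the source).
LANE_UNITS = Counter({"adder": 4, "multiplier": 4, "bitwise": 2})

# Hypothetical unit requirements per instruction type (not from the source).
REQUIREMENTS = {
    "add": Counter({"adder": 1}),
    "mul": Counter({"multiplier": 1}),
    "xor": Counter({"bitwise": 1}),
    "fma": Counter({"adder": 1, "multiplier": 1}),
    "matmul": Counter({"adder": 4, "multiplier": 4}),
}


def enough_functional_units(instr_types, lane_units=LANE_UNITS):
    """Return True if one lane has enough units to run all given instructions together."""
    needed = Counter()
    for instr_type in instr_types:
        needed += REQUIREMENTS[instr_type]
    # Every unit type must be available in at least the quantity requested.
    return all(count <= lane_units[unit] for unit, count in needed.items())
```

Under these assumed numbers, two additions fit in one lane, a matrix multiplication fits alongside a bitwise instruction, and a matrix multiplication does not fit alongside an addition because it already occupies every adder.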

As noted above, the arbiter 404 is able to select pending instructions 403 for execution together when those instructions have register access requirements that fit within the register file bandwidth. In some examples, register file bandwidth refers to the bandwidth available at the register file ports. A register file port is an interface that allows the functional units 406 to access the contents of registers, and only a limited number of register file ports exist. In an example, the vector register file 412 has four ports, each of which can provide a certain number of bits per clock cycle, such as 64 bits. If the arbiter 404 is unable to find multiple pending instructions 403 that fit within the bandwidth of the vector register file 412, the arbiter 404 determines that the SIMD unit 138 is to execute at the normal execution rate. Multiple instructions “fit within” the bandwidth if sufficient bandwidth exists to access all operands of all of the instructions to be executed together.

In some examples, register file bandwidth is further expanded by the fact that different types of register files exist, each with its own independent bandwidth. In an example, the SIMD unit 138 includes a vector register file 412 and a scalar register file 414. The vector register file 412 has vector register file ports, and the scalar register file 414 has scalar register file ports. The vector register file ports are accessed independently of the scalar register file ports. Thus, the available bandwidth is greater for sets of instructions that access different register files than for sets of instructions that access the same register file. In an example, a first instruction accesses three vector register file registers, a second instruction accesses two vector register file registers and one scalar register file register, and a third instruction accesses three vector register file registers. The first and second instructions are less likely to have a register file bandwidth conflict than the first and third instructions or the second and third instructions.
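To make the independent-bandwidth point concrete, the following Python sketch tallies how many register accesses a candidate pair places on each register file. The function name `demand_by_file` and the string labels are hypothetical; the operand mixes are the first, second, and third instructions from the example above.

```python
from collections import Counter


def demand_by_file(*instructions):
    """Count register accesses per register file for instructions issued together.

    Each instruction is a list of labels naming the register file of each operand.
    """
    demand = Counter()
    for operands in instructions:
        demand.update(operands)
    return demand


# Operand mixes taken from the example in the text.
first = ["vector", "vector", "vector"]
second = ["vector", "vector", "scalar"]
third = ["vector", "vector", "vector"]

pair_first_second = demand_by_file(first, second)  # vector: 5, scalar: 1
pair_first_third = demand_by_file(first, third)    # vector: 6
```

Pairing the first and second instructions moves one access onto the scalar ports, so it places less load on the vector register file than pairing the first and third instructions.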

In some examples, instructions access operands in the register files based on which “slot” the operands are in. More specifically, instructions refer to registers in a particular order, and the position of a register in this order defines its slot. In some implementations, the registers for an instruction are accessed in different clock cycles, and the particular cycle in which a register is accessed depends on the slot of that register. In some implementations, instructions that are executed together access the operands of the same slot in the same clock cycle(s). In an example, two instructions that are executed together access the registers of the first slot in the same clock cycle. Thus, if sufficient bandwidth exists to access the registers of each slot of the two instructions, the two instructions can be executed together, and if sufficient bandwidth does not exist, the two instructions cannot be executed together. In an example, two instructions access two vector registers in their first slot. If the vector register file 412 has sufficient bandwidth to satisfy these two accesses in the clock cycle for the first slot, the arbiter 404 is able to schedule the two instructions for execution together; otherwise, the arbiter 404 is unable to schedule the two instructions for execution together.
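The slot-based check can be sketched as a per-slot comparison: for each slot, the operand widths of the two instructions are added and compared with the bandwidth available in that slot's cycle. This Python sketch assumes a single register file and a fixed number of bits per cycle; the function name and the 128-bit figure in the test are hypothetical.

```python
from itertools import zip_longest


def fits_per_slot(instr_a, instr_b, bits_per_cycle):
    """Check whether two instructions can share register bandwidth slot by slot.

    instr_a and instr_b list the operand width in bits for each slot, in slot
    order. Assumption: slot i of both instructions is read in the same cycle,
    and a missing slot consumes no bandwidth.
    """
    for width_a, width_b in zip_longest(instr_a, instr_b, fillvalue=0):
        if width_a + width_b > bits_per_cycle:
            return False
    return True
```

Because the check is made per slot, two instructions whose total operand width is large can still be executed together as long as no single slot exceeds the per-cycle bandwidth.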

There are a number of aspects that the arbiter 404 considers in determining whether the register accesses for two instructions fit within the available register access bandwidth. One aspect is a comparison between the amount of bandwidth required by the instructions and the amount of bandwidth that is available. In one example, a certain number of register file ports are available for register access. Each port is able to provide access to a certain number of bits per clock cycle, and each register access requires a certain number of those bits. In an example, a first instruction accesses three 32-bit vector registers, and a second instruction accesses one 64-bit vector register and two 32-bit registers. In this example, there are three ports to the vector register file 412, each providing 64 bits of bandwidth. In this case, the number of bits required (5 x 32 + 64 = 224) is greater than the number of bits available (3 x 64 = 192), and thus the instructions cannot be executed together. In another example, a first instruction accesses two 32-bit registers and a second instruction accesses three 32-bit registers. In this example, the required 160 bits fit within the available 192 bits, so the instructions can be executed together, provided that no other conflict exists.
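The arithmetic in the two examples above can be reproduced with a short Python sketch. The function name and parameter names are hypothetical, and treating a request exactly equal to the available bandwidth as fitting is an assumption; the source only addresses the strictly greater and strictly smaller cases.

```python
def fits_port_bandwidth(operand_bits, num_ports=3, bits_per_port=64):
    """Compare the total operand bits against the total port bandwidth per cycle.

    operand_bits lists the width in bits of every register operand of the
    instructions considered for execution together.
    Assumption: a request equal to the available bandwidth is treated as fitting.
    """
    required = sum(operand_bits)
    available = num_ports * bits_per_port
    return required <= available


# First example from the text: 3 x 32 bits, plus 64 + 2 x 32 bits = 224 bits.
first_example = [32, 32, 32] + [64, 32, 32]
# Second example from the text: 2 x 32 bits plus 3 x 32 bits = 160 bits.
second_example = [32, 32] + [32, 32, 32]
```

With three 64-bit ports (192 bits per cycle), `fits_port_bandwidth(first_example)` is false and `fits_port_bandwidth(second_example)` is true, matching the text.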

In some implementations, the ports are able to provide a single data rate or a double data rate. In these implementations, each “half” of a port is able to access a different bank of the register file. A bank of the vector register file is a portion of the register file that includes a set of registers that is mutually exclusive with the registers assigned to the other banks. A port is able to provide an increased rate of access to the registers of the register file when the data for the port is sourced from two different banks. In an example, odd-numbered registers are in one bank and even-numbered registers are in another bank. Thus, in considering whether bandwidth is available for the register accesses of instructions, the arbiter 404 considers whether the registers being accessed are found in different banks, which can increase or limit the available bandwidth.
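One way to model the banking effect is sketched below in Python, under loudly stated assumptions: registers are split into an even bank and an odd bank as in the example, every register has the same width, and a port serves two registers in one cycle only when they come from different banks. The function names and the default of three ports are hypothetical.

```python
def ports_needed(register_numbers):
    """Return how many ports are needed to read the given registers in one cycle.

    Assumed model: each port has two halves, and the two halves must read from
    different banks (even-numbered versus odd-numbered registers).
    """
    evens = sum(1 for reg in register_numbers if reg % 2 == 0)
    odds = len(register_numbers) - evens
    # Each even/odd pair shares one port; every unpaired register needs its own.
    return max(evens, odds)


def fits_banked_ports(register_numbers, num_ports=3):
    """Return True if the accesses fit within the assumed number of ports."""
    return ports_needed(register_numbers) <= num_ports
```

In this model, four registers spread evenly across the banks need only two ports, while four registers that all fall in the even bank need four ports, which shows how bank placement can raise or lower the usable bandwidth.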

Another aspect is the register file types of the registers being accessed. Specifically, as described above, different types of register files have their own independent bandwidth. In an example, the vector register file 412 has three ports, each able to access 64 bits, and the scalar register file 414 also has three ports, each able to access 64 bits. Thus, if two instructions have a mix of vector registers and scalar registers, the arbiter 404 considers whether sufficient bandwidth exists across both the vector register file 412 and the scalar register file 414.

Another aspect is whether register values can be obtained from entities other than the register file, which reduces the amount of bandwidth needed for direct access through the register file(s). Some such entities include forwarding circuitry within the execution pipeline of the SIMD unit 138, an operand cache, or a register value that is also used as another operand or by another instruction. Forwarding circuitry is circuitry that compensates for data hazards that occur in the execution pipeline. A data hazard is a situation in which one instruction in the pipeline writes to a register that is read by another instruction in the pipeline. If that other instruction were to fetch the value of the register from the register file, the fetch could occur before the value produced by the first instruction has been written to the register file, and the other instruction would read stale data. The forwarding circuitry prevents this by providing the value to be written directly to the instruction that consumes it. The forwarding circuitry does not consume register file bandwidth. Thus, when values are forwarded in this manner, the bandwidth of the register file is effectively increased. Instructions that execute close together in program order (e.g., within the number of cycles it takes for an instruction to be executed by the pipeline) often use forwarded data. Operand caches cache register values. If an instruction is able to obtain values from such an operand cache, the effective register file bandwidth is increased. Register values can also be duplicated, meaning that a register value is used more than once across one or more instructions. In this case, only one access consumes register file bandwidth.
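The effect of forwarding, operand caching, and duplicated register values on register file traffic can be sketched as a filter over the operand list. In this Python sketch the function name, the register names, and the representation of forwarded and cached registers as sets are all assumptions made for illustration.

```python
def register_file_reads(operands, forwarded=frozenset(), cached=frozenset()):
    """Return the registers that must actually be read from the register file.

    operands: register names used by the instructions considered together.
    forwarded: registers whose values are supplied by forwarding circuitry.
    cached: registers whose values are available in an operand cache.
    """
    reads = []
    seen = set()
    for reg in operands:
        if reg in forwarded or reg in cached:
            continue  # served without touching the register file ports
        if reg in seen:
            continue  # duplicated value: only the first access costs bandwidth
        seen.add(reg)
        reads.append(reg)
    return reads


# Five operand references across two hypothetical instructions.
example_operands = ["v0", "v1", "v0", "v2", "v3"]
example_reads = register_file_reads(example_operands, forwarded={"v1"}, cached={"v3"})
```

Here only `v0` and `v2` need register file ports: `v1` is forwarded, `v3` comes from the operand cache, and the second use of `v0` reuses the first read.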

In summary, the arbiter 404 is able to select two (or more) instructions for execution in parallel when there are sufficient functional units 406 to execute those instructions and when there is sufficient register file bandwidth. There are sufficient functional units 406 when the two instructions have instruction types such that each lane includes the functional units 406 needed to execute both of those types together. There is sufficient register file bandwidth when the operands accessed by the two instructions fit within the available register file bandwidth.

Periodically (e.g., every clock cycle, or whenever the arbiter 404 is ready to schedule instructions for execution), the arbiter 404 makes a determination regarding which instruction(s) to execute and whether multiple instructions can be executed together. The arbiter 404 considers one or more combinations of the pending instructions 403 and determines whether those combinations meet the criteria set forth herein for being executed together. Such combinations are referred to as “eligible combinations.” If an eligible combination exists, the arbiter 404 selects one such combination and causes the instructions of that combination to be executed together. If no eligible combination exists, the arbiter 404 does not cause any combination of instructions to be executed together and instead causes a single instruction to be executed. In various examples, the arbiter 404 selects the eligible combination having the highest priority, as determined by a priority assignment operation. The priority is determined in any technically feasible manner, such as a priority that facilitates round-robin execution or a priority that helps ensure forward progress. In some examples, if the instruction having the highest priority is not co-executable with any other instruction, the arbiter 404 selects that instruction for execution alone.
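A single arbitration decision of this kind can be sketched in Python as follows. This is one possible reading rather than the patented selection logic: the sketch assumes the pending list is already sorted from highest to lowest priority, always issues the highest-priority instruction, and pairs it with the highest-priority compatible partner. The names `select_issue` and `can_execute_together` are hypothetical.

```python
def select_issue(pending, can_execute_together):
    """Pick the instruction(s) to issue at one arbitration point.

    pending: instructions ordered from highest to lowest priority.
    can_execute_together: predicate standing in for the functional unit and
    register file bandwidth checks.
    Returns a tuple with two instructions (issued together) or one (issued alone).
    """
    if not pending:
        return ()
    head = pending[0]
    for other in pending[1:]:
        if can_execute_together(head, other):
            return (head, other)  # eligible combination found
    return (head,)  # no eligible partner: issue the head alone


def both_simple(a, b):
    """Toy predicate: only instructions named 'add...' can be paired."""
    return a.startswith("add") and b.startswith("add")
```

For example, with pending instructions `["add0", "matmul", "add1"]` and the toy predicate, the sketch issues `add0` and `add1` together, while `["matmul", "add0", "add1"]` issues `matmul` alone because the highest-priority instruction has no eligible partner.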

FIG. 5 is a flow diagram of a method 500 for executing instructions, according to an example. Although described with respect to the system of FIGS. 1-4, those of skill in the art will recognize that any system configured to perform the steps of the method 500, in any technically feasible order, falls within the scope of the present disclosure.

At step 502, the arbiter 404 is operating at a first time for issuing instructions for execution. It should be understood that the arbiter 404 selects instructions for execution at various times, such as every cycle, as long as resources are available for new instructions. The arbiter 404 identifies that sufficient processing resources exist to execute a first set of instructions together. In some examples, executing instructions together means that at least one operation for each of the instructions is performed in the same lane of the SIMD unit 138 in the same cycle(s). In some examples, executing instructions together means that at least one operation of at least one of the instructions is performed with hardware that is also used for complex instructions (e.g., one of the adders that would be used for matrix multiplication or other more complex instructions). In some examples, executing instructions together means that the execution rate of the instructions is increased beyond the “normal” execution rate. In some examples, the normal execution rate is the rate at which instructions can be executed even if the instructions conflict over register bandwidth and/or functional units. In some examples, the normal execution rate is the rate at which one lane of the SIMD unit 138 has sufficient hardware to guarantee execution, assuming there is no other reason to stall (such as waiting for data to be fetched from memory). In some examples, the normal execution rate is a single-rate execution rate and the increased execution rate is a double-rate execution rate. In some examples, the instructions to be executed together are from different wavefronts, and in other examples, the instructions to be executed together are from the same wavefront. In examples where the instructions are from different wavefronts, executing the instructions together allows operations for two wavefronts to proceed in the same time in which operations for only one wavefront would proceed if the instructions could not be executed together. In an example, if the instructions can be executed together, one instruction for each of two wavefronts can complete every cycle (or every certain number of cycles), whereas if the instructions cannot be executed together, an instruction from only one wavefront can complete every cycle (or every certain number of cycles).

Step 502 includes identifying that there are sufficient processing resources to execute the instructions together. In some examples, such processing resources include register file bandwidth and functional units. In some examples, each lane has enough functional units 406 to execute certain combinations of instruction types together, but not other combinations of instruction types. In an example, complex instructions such as matrix multiplication require a large number of functional units of certain types, such as adders and multipliers. In addition, because each lane is able to execute each instruction type, each lane includes at least one copy of each of these different functional units. Thus each lane includes multiple copies of the functional units for performing simpler instructions, such as simple addition or multiplication operations. Thus, in one example, sufficient processing resources exist for two instructions if the instructions are of types for which at least one copy of the functional units for executing each of those instructions exists.

In some examples, the register file bandwidth includes bandwidth to one or more register files, such as the vector register file 412 and the scalar register file 414. In some examples, instructions consume register file bandwidth based on the operands they specify. More specifically, instructions are able to specify operands by register name. In addition, these operands implicitly or explicitly indicate the amount of data that needs to be transferred. In one example, the operands are 32-bit operands or 64-bit operands. The arbiter 404 determines that the register file bandwidth is sufficient if the amount of data specified for the multiple instructions is less than the bandwidth available for the specified registers. In various examples, the vector register file 412 and the scalar register file 414 have independent bandwidth, meaning that accesses to the vector register file 412 do not consume bandwidth of the scalar register file 414, and accesses to the scalar register file 414 do not consume bandwidth of the vector register file 412. In addition, it is possible for certain register accesses not to consume bandwidth for other reasons, such as the SIMD unit 138 being able to access the operands from a different location, such as a cache or forwarding circuitry. Operands that do not refer to registers (such as literal values) do not consume register file bandwidth. In some examples, there are restrictions on register file accesses. For example, in some examples, in order to access the full bandwidth, the accesses must be to two (or more) different banks of the register files. In some examples, the maximum bandwidth is accessible if all banks are used, and the bandwidth is limited if register accesses are skewed toward a particular bank. In an example, the vector register file 412 includes two banks, and the register accesses for two instructions use the full bandwidth and are distributed evenly between the banks. In that situation, there is sufficient bandwidth to execute the instructions together. In another example, the register accesses for two instructions use the full bandwidth but are heavily skewed toward one bank. In that example, there is not sufficient bandwidth to execute the instructions together.

Step 502 includes determining both that there are sufficient functional units to execute the instructions together and that there is sufficient register bandwidth to execute the instructions together. At step 504, in response to this determination, the arbiter 404 issues the first set of instructions for execution together. It should be understood that in some implementations, during operation, the arbiter 404 sometimes selects from among two or more sets of instructions that are eligible for execution together. In some examples, this selection occurs based on a priority mechanism.

At step 506, the arbiter 404 determines that no set of instructions meets the criteria for execution together. In one example, this determination occurs because an instruction that cannot be executed together with any other instruction (e.g., because it uses too many functional units 406) has the highest priority and therefore must be executed. In other examples, this determination occurs because the arbiter 404 is unable to find any set of instructions that meets the criteria described with respect to step 502. At step 508, in response to the determination of step 506, the arbiter 404 issues a single instruction instead of issuing multiple instructions together.

It should be understood that the arbiter 404 continuously performs the steps of the method 500 to issue instructions for execution. In addition, although the method 500 includes one iteration in which the arbiter 404 selects instructions for execution together and one iteration in which the arbiter 404 selects an instruction that is not for execution together with any other instruction, it should be understood that this particular pattern is merely illustrative, and that the arbiter 404 freely selects instructions for execution together or not together as runtime circumstances warrant.
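The repeated application of the method 500 can be pictured as a loop that, at each issue time, either issues a set of instructions together (steps 502 and 504) or issues a single instruction (steps 506 and 508). The Python sketch below is a simplified illustration under assumptions: the pending list is in priority order, at most two instructions are issued together, and `can_execute_together` is a hypothetical stand-in for the functional unit and register bandwidth checks.

```python
def run_arbiter(pending, can_execute_together):
    """Issue every pending instruction and return what was issued at each time.

    Each entry of the returned schedule is a tuple holding the instruction(s)
    issued at one issue time.
    """
    queue = list(pending)
    schedule = []
    while queue:
        head = queue.pop(0)
        # Steps 502/506: look for a partner with sufficient processing resources.
        partner = next((x for x in queue if can_execute_together(head, x)), None)
        if partner is not None:
            queue.remove(partner)
            schedule.append((head, partner))  # step 504: issue together
        else:
            schedule.append((head,))  # step 508: issue one instruction alone
    return schedule


def pairable(a, b):
    """Toy predicate: only instructions named 'add...' can be executed together."""
    return a.startswith("add") and b.startswith("add")


example_schedule = run_arbiter(["add0", "matmul", "add1", "add2"], pairable)
```

With the toy predicate, the four instructions are issued in three issue times instead of four: `add0` with `add1`, then `matmul` alone, then `add2` alone.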

It should be understood that many variations are possible based on the disclosure herein. Although features and elements are described above in particular combinations, each feature or element can be used alone without the other features and elements, or in various combinations with or without other features and elements.

The various functional units illustrated in the figures and/or described herein (including, but not limited to, the processor 102, the input driver 112, the input devices 108, the output driver 114, the output devices 110, the accelerated processing device 116, the command processor 136, the graphics processing pipeline 134, the compute units 132, the SIMD units 138, the system 400, or the register allocator 402) may be implemented as a general-purpose computer, a processor, or a processor core, or as a program, software, or firmware that is stored in a non-transitory computer-readable medium or in another medium and is executable by a general-purpose computer, a processor, or a processor core. The methods provided can be implemented in a general-purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general-purpose processor, a special-purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Array (FPGA) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors can be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions being capable of being stored on a computer-readable medium). The results of such processing can be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor that implements features of the present disclosure.

The methods or flow charts provided herein can be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium for execution by a general-purpose computer or a processor. Examples of non-transitory computer-readable storage media include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks and digital versatile disks (DVDs).

Claims

1. A method, comprising:
at a first time for issuing instructions for execution, performing a first identification, including identifying that sufficient processing resources exist within a processing lane to execute a first set of instructions together;
in response to the first identification, executing the first set of instructions together;
at a second time for issuing instructions for execution, performing a second identification, including identifying that no instructions are available for which sufficient processing resources exist for execution together within the processing lane; and
in response to the second identification, executing an instruction independently of any other instruction.

2. The method of claim 1, wherein identifying that sufficient processing resources exist comprises identifying that sufficient functional units exist in the processing lane to execute at least one operation of each instruction of the first set of instructions in a same clock cycle.

3. The method of claim 1, wherein identifying that sufficient processing resources exist comprises determining that each instruction of the first set of instructions comprises a simple instruction for which multiple copies of functional units exist within the processing lane.

4. The method of claim 1, wherein identifying that sufficient processing resources exist comprises determining that one instruction of the first set of instructions comprises a complex instruction for which one copy of the functional units exists within the processing lane and that another instruction of the first set of instructions comprises a simple instruction for which multiple copies of functional units exist within the processing lane.

5. The method of claim 1, wherein identifying that sufficient processing resources exist comprises determining that sufficient bandwidth exists for operands of each instruction of the first set of instructions.

6. The method of claim 5, wherein sufficient bandwidth exists if the bandwidth consumed by the operands of the instructions in the first set of instructions is less than the bandwidth available in register files.

7. The method of claim 6, wherein operands available from sources other than a register file do not consume bandwidth of the register files.

8. The method of claim 1, wherein the instructions in the first set of instructions are from the same wavefront.

9. The method of claim 1, wherein the instructions in the first set of instructions are from different wavefronts.

10. A compute unit, comprising:
a memory configured to store instructions; and
a processing unit configured to:
at a first time for issuing instructions obtained from the memory for execution, perform a first identification, including identifying that sufficient processing resources exist within a processing lane to execute a first set of instructions together;
in response to the first identification, execute the first set of instructions together;
at a second time for issuing instructions obtained from the memory for execution, perform a second identification, including identifying that no instructions are available for which sufficient processing resources exist for execution together within the processing lane; and
in response to the second identification, execute an instruction independently of any other instruction.

11. The compute unit of claim 10, wherein identifying that sufficient processing resources exist comprises identifying that sufficient functional units exist in the processing lane to execute at least one operation of each instruction of the first set of instructions in a same clock cycle.

12. The compute unit of claim 10, wherein identifying that sufficient processing resources exist comprises determining that each instruction of the first set of instructions comprises a simple instruction for which multiple copies of functional units exist within the processing lane.

13. The compute unit of claim 10, wherein identifying that sufficient processing resources exist comprises determining that one instruction of the first set of instructions comprises a complex instruction for which one copy of the functional units exists within the processing lane and that another instruction of the first set of instructions comprises a simple instruction for which multiple copies of functional units exist within the processing lane.

14. The compute unit of claim 10, wherein identifying that sufficient processing resources exist comprises determining that sufficient bandwidth exists for operands of each instruction of the first set of instructions.

15. The compute unit of claim 14, wherein sufficient bandwidth exists if the bandwidth consumed by the operands of the instructions in the first set of instructions is less than the bandwidth available in register files.

16. The compute unit of claim 15, wherein operands available from sources other than a register file do not consume bandwidth of the register files.

17. The compute unit of claim 10, wherein the instructions in the first set of instructions are from the same wavefront.

18. The compute unit of claim 10, wherein the instructions in the first set of instructions are from different wavefronts.

19. An accelerated processing device, comprising:
a plurality of compute units, each compute unit including:
a memory configured to store instructions; and
a processing unit configured to:
at a first time for issuing instructions obtained from the memory for execution, perform a first identification, including identifying that sufficient processing resources exist within a processing lane to execute a first set of instructions together;
in response to the first identification, execute the first set of instructions together;
at a second time for issuing instructions obtained from the memory for execution, perform a second identification, including identifying that no instructions are available for which sufficient processing resources exist for execution together within the processing lane; and
in response to the second identification, execute an instruction independently of any other instruction.

20. The accelerated processing device of claim 19, wherein identifying that sufficient processing resources exist comprises identifying that sufficient functional units exist in the processing lane to execute at least one operation of each instruction of the first set of instructions in a same clock cycle.