KR20090107973A

KR20090107973A - Execution of retargetted graphics processor accelerated code by a general purpose processor

Info

Publication number: KR20090107973A
Application number: KR1020090031039A
Authority: KR
Inventors: Vinod Grover; Bastiaan Joannes Matheus Aarts; Michael Murphy; Jayant B. Kolhe; John Bryan Pormann; Douglas Saylor; Boris Beylin
Original assignee: NVIDIA Corporation
Priority date: 2008-04-09
Filing date: 2009-04-09
Publication date: 2009-10-14
Also published as: KR101118321B1

Abstract

PURPOSE: Execution of retargeted graphics processor accelerated code by a general purpose processor is provided to configure a general purpose processor in order to execute a translated application program. CONSTITUTION: A CPU(102) operates as the control processor of a computer system(100), managing and coordinating the operation of other system components. In particular, the CPU issues commands that control the operation of parallel processors(134) within a multithreaded processing subsystem(112). CPU writes a stream of commands for parallel processors to a command buffer, which may reside in a system memory(104), a subsystem memory(138), or another storage location accessible to both CPU and parallel processors. Parallel processors read the command stream from the command buffer and execute commands asynchronously with respect to the operation of CPU. System memory includes an execution image of an operating system, a device driver(103), and CUDA code(101) that is configured for execution by multithreaded processing subsystem. CUDA code incorporates programming instructions intended to execute on multithreaded processing subsystem. In the context of the present description, code refers to any computer code, instructions, and/or functions that may be executed using a processor.

Description

Execution of Retargetted Graphics Processor Accelerated Code by a General Purpose Processor

Cross-Reference to Related Application

This application claims priority to U.S. Provisional Application Serial No. 61/043,708 (Attorney Docket No. NVDA/SC-08-0007-US0), filed April 9, 2008 and titled "System For Executing GPU-Accelerated Code on Multi-Core Architectures." This related application is incorporated herein by reference.

Technical Field

Embodiments of the present invention relate generally to compiler programs and, more specifically, to an application program that is written for execution by a multi-core graphics processing unit and retargeted for execution by a general purpose processor with shared memory.

Modern graphics processing systems typically include a multi-core graphics processing unit (GPU) configured to execute applications in a multithreaded manner. Graphics processing systems also include memory with portions that are shared between the execution threads and portions that are dedicated to each thread.

NVIDIA's CUDA™ (Compute Unified Device Architecture) technology provides a C language environment that enables programmers and developers to write software applications to solve complex computational problems such as video and audio encoding, modeling for oil and gas exploration, and medical imaging. The applications are configured for parallel execution by a multi-core GPU and typically rely on specific features of the multi-core GPU. Because the same specific features are not available in a general purpose central processing unit (CPU), a software application written using CUDA may not be portable to run on a general purpose CPU.

As the foregoing illustrates, what is needed in the art is a technique that enables application programs written using a parallel programming model for execution on multi-core GPUs to run on general purpose CPUs without requiring the programmer to modify the application program.

One embodiment of the present invention sets forth a method for configuring a general purpose processor to execute a translated application program. The method includes the steps of receiving the translated application program, which is converted from an application program written using a parallel programming model for execution on a multi-core graphics processing unit, and compiling the translated application program to generate compiled code for execution by the general purpose processor. The number of execution cores of the general purpose processor that are available to execute the compiled code is determined, and the general purpose processor is configured to enable that number of execution cores. The compiled code is then launched for execution by the general purpose processor, including that number of execution cores.
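The core-count step above can be sketched in C as follows. This is only an illustration under stated assumptions: the patent does not say how the number of available execution cores is obtained, so the use of sysconf(_SC_NPROCESSORS_ONLN) and the names available_cores and cores_to_enable are choices made for this sketch.

```c
#include <unistd.h>

/* Number of execution cores currently online, never less than 1.
   Assumption: sysconf(_SC_NPROCESSORS_ONLN) is available (Linux, macOS);
   the patent text does not name any particular query mechanism. */
long available_cores(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}

/* Hypothetical policy: enable at most `requested` cores and never more
   than the machine offers; requested <= 0 means "use every core". */
long cores_to_enable(long requested, long available)
{
    if (available < 1)
        available = 1;
    if (requested <= 0 || requested > available)
        return available;
    return requested;
}
```

A runtime built along these lines would call available_cores() once before launching the compiled code and size its pool of worker threads from cores_to_enable().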

One advantage of the disclosed method is that application programs written using a parallel programming model for execution on multi-core GPUs are portable to general purpose CPUs without modification. Portions of the application that depend on specific features of the multi-core GPU are converted by a translator for execution by a general purpose CPU. The application program is partitioned into regions of synchronization-independent instructions. The instructions are classified as convergent or divergent, and divergent memory references that are shared between regions are replicated. Thread loops are inserted to ensure that memory is shared correctly between the various threads during execution by the general purpose CPU.

So that the manner in which the above-recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.

In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one of skill in the art that the present invention may be practiced without one or more of these specific details. In other instances, well-known features have not been described in order to avoid obscuring the present invention.

FIG. 1 is a block diagram illustrating a computer system 100 configured to execute code written using CUDA. Computer system 100 includes a CPU 102 and a system memory 104 communicating via a bus path that includes a memory bridge 105. Memory bridge 105, which may be, for example, a Northbridge chip, is connected via a bus or other communication path 106 (e.g., a HyperTransport link) to an I/O bridge 107. I/O bridge 107, which may be, for example, a Southbridge chip, receives user input from one or more user input devices 108 (e.g., keyboard, mouse) and forwards the input to CPU 102 via path 106 and memory bridge 105. A multithreaded processing subsystem 112 is coupled to memory bridge 105 via a bus or other communication path 113 (e.g., PCI Express, Accelerated Graphics Port, or a HyperTransport link). In one embodiment, multithreaded processing subsystem 112 is a graphics subsystem that delivers pixels to a display device 110 (e.g., a conventional CRT or LCD based monitor). A system disk 114 is also connected to I/O bridge 107. A switch 116 provides connections between I/O bridge 107 and other components such as a network adapter 118 and various add-in cards 120 and 121. Other components (not explicitly shown), including USB or other port connections, CD drives, DVD drives, film recording devices, and the like, may also be connected to I/O bridge 107. The communication paths interconnecting the various components in FIG. 1 may be implemented using any suitable protocols, such as PCI (Peripheral Component Interconnect), PCI Express (PCI-E), AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s), and connections between different devices may use different protocols as is known in the art.

CPU 102 operates as the control processor of computer system 100, managing and coordinating the operation of other system components. In particular, CPU 102 issues commands that control the operation of parallel processors 134 within multithreaded processing subsystem 112. In some embodiments, CPU 102 writes a stream of commands for parallel processors 134 to a command buffer (not shown), which may reside in system memory 104, subsystem memory 138, or another storage location accessible to both CPU 102 and parallel processors 134. Parallel processors 134 read the command stream from the command buffer and execute the commands asynchronously with respect to the operation of CPU 102.
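The producer and consumer roles around the command buffer can be pictured with a small ring buffer in C. This is a single-threaded sketch under assumptions: the patent does not describe the buffer layout, so the fixed capacity, the integer command words, and the names cmd_buffer, cmd_write, and cmd_read are all hypothetical. A real implementation shared between a CPU and parallel processors would also need proper memory ordering, which is omitted here.

```c
#include <stdbool.h>
#include <stddef.h>

enum { CMD_CAPACITY = 8 };       /* hypothetical buffer size */

typedef struct {
    int    slots[CMD_CAPACITY];  /* opaque command words */
    size_t head;                 /* count of commands consumed so far */
    size_t tail;                 /* count of commands produced so far */
} cmd_buffer;

void cmd_init(cmd_buffer *b)
{
    b->head = 0;
    b->tail = 0;
}

/* Producer side (the CPU in the text): append one command if there is room. */
bool cmd_write(cmd_buffer *b, int cmd)
{
    if (b->tail - b->head == CMD_CAPACITY)
        return false;                        /* buffer full */
    b->slots[b->tail % CMD_CAPACITY] = cmd;
    b->tail++;
    return true;
}

/* Consumer side (a parallel processor in the text): take the oldest command. */
bool cmd_read(cmd_buffer *b, int *cmd)
{
    if (b->head == b->tail)
        return false;                        /* nothing pending */
    *cmd = b->slots[b->head % CMD_CAPACITY];
    b->head++;
    return true;
}
```

Because the producer only advances tail and the consumer only advances head, the two sides never wait on each other except when the buffer is full or empty, which is the sense in which the commands execute asynchronously.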

System memory 104 includes an execution image of an operating system, a device driver 103, and CUDA code 101 that is configured for execution by multithreaded processing subsystem 112. CUDA code 101 incorporates programming instructions intended to execute on multithreaded processing subsystem 112. In the context of the present description, code refers to any computer code, instructions, and/or functions that may be executed using a processor. For example, in various embodiments, the code may include C code, C++ code, and so on. In one embodiment, the code may include a language extension of a computer language (e.g., an extension of C, C++, etc.).

The operating system provides the detailed instructions for managing and coordinating the operation of computer system 100. Device driver 103 provides detailed instructions for managing and coordinating the operation of multithreaded processing subsystem 112, and in particular of parallel processors 134. Furthermore, device driver 103 may provide compilation facilities for generating machine code specifically optimized for parallel processors 134. Device driver 103 may be provided in conjunction with the CUDA™ framework provided by NVIDIA Corporation.

In one embodiment, multithreaded processing subsystem 112 incorporates one or more parallel processors 134, which may be implemented using one or more integrated circuit devices such as programmable processors or application specific integrated circuits (ASICs). Parallel processors 134 may include circuitry optimized for graphics and video processing, including, for example, video output circuitry and a graphics processing unit (GPU). In another embodiment, multithreaded processing subsystem 112 may be integrated with one or more other system elements, such as memory bridge 105, CPU 102, and I/O bridge 107, to form a system on chip (SoC). One or more parallel processors 134 may output data to display device 110, or each parallel processor 134 may output data to one or more display devices 110.

Parallel processors 134 advantageously implement a highly parallel processor that includes one or more processing cores, each of which is capable of executing a large number of threads concurrently, where each thread is an instance of a program such as code 101. Parallel processors 134 can be programmed to execute processing tasks relating to a wide variety of applications, including but not limited to linear and nonlinear data transforms, filtering of video and/or audio data, modeling operations (e.g., applying the laws of physics to determine the position, velocity, and other attributes of objects), image rendering operations (e.g., tessellation shader, vertex shader, geometry shader, and/or pixel shader programs), and so forth. Parallel processors 134 may transfer data from system memory 104 and/or local subsystem memory 138 into local (on-chip) memory, process the data, and write the result data back to system memory 104 and/or subsystem memory 138, where such data can be accessed by other system components, including CPU 102 or another multithreaded processing subsystem 112.

A parallel processor 134 may be provided with any amount of subsystem memory 138, including none, and may use subsystem memory 138 and system memory 104 in any combination. For instance, a parallel processor 134 can be a graphics processor in a unified memory architecture (UMA) embodiment. In such embodiments, little or no dedicated subsystem memory 138 is provided, and parallel processor 134 uses system memory 104 exclusively or almost exclusively. In UMA embodiments, a parallel processor 134 may be integrated into a bridge chip or processor chip, or may be provided as a discrete chip with a high-speed link (e.g., PCI-E) that connects parallel processor 134 to system memory 104 via a bridge chip or other communication means.

As noted above, any number of parallel processors 134 can be included in multithreaded processing subsystem 112. For instance, multiple parallel processors 134 can be provided on a single add-in card, multiple add-in cards can be connected to communication path 113, or one or more parallel processors 134 can be integrated into a bridge chip. Where multiple parallel processors 134 are present, they may be operated in parallel to process data at a higher throughput than is possible with a single parallel processor 134. Systems incorporating one or more parallel processors 134 may be implemented in a variety of configurations and form factors, including desktop, laptop, or handheld personal computers, servers, workstations, game consoles, embedded systems, and the like.

In some embodiments of parallel processors 134, single-instruction, multiple-data (SIMD) instruction issue techniques are used to support parallel execution of a large number of threads without providing multiple independent instruction units. In other embodiments, single-instruction, multiple-thread (SIMT) techniques are used to support parallel execution of a large number of generally synchronized threads. Unlike a SIMD execution regime, where all processing engines typically execute identical instructions, SIMT execution allows different threads to more readily follow divergent execution paths through a given thread program. Persons skilled in the art will understand that a SIMD processing regime represents a functional subset of a SIMT processing regime. Functional units within parallel processors 134 support a variety of operations, including integer and floating point arithmetic (e.g., addition and multiplication), comparison operations, Boolean operations (AND, OR, XOR), bit-shifting, and computation of various algebraic functions (e.g., planar interpolation, trigonometric, exponential, and logarithmic functions).

The series of instructions transmitted to a particular processing unit (not shown) within a processing core (not shown) of parallel processors 134 constitutes a thread, as previously defined herein, and the collection of a certain number of threads executing concurrently across the processing units within one processing core is referred to herein as a "thread group." As used herein, a "thread group" refers to a group of threads executing the same program on different input data, with each thread of the group being assigned to a different processing unit in a processing core. A thread group may include fewer threads than the number of processing units, in which case some processing units will be idle during cycles when that thread group is being processed. A thread group may also include more threads than the number of processing units, in which case processing will take place over multiple clock cycles.

Since each processing core can support up to G thread groups concurrently, up to G×M thread groups can be executing in parallel processor 134 at any given time, where M is the number of processing cores in parallel processor 134. Additionally, a plurality of related thread groups may be active (in different phases of execution) at the same time within a processing core. This collection of thread groups is referred to herein as a "cooperative thread array" (CTA). The size of a CTA is generally determined by the programmer and by the amount of hardware resources, such as memory or registers, available to the CTA. The CUDA programming model reflects the system architecture of GPU accelerators. An exclusive local address space is available to each thread, and a shared per-CTA address space is used to pass data between threads within a CTA. Processing cores also have access to off-chip "global" memory, which can include, for example, subsystem memory 138 and/or system memory 104.
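The sizing rules in the two paragraphs above reduce to simple arithmetic, sketched here in C. The function names are hypothetical, and the sketch assumes the simplest possible schedule, in which each processing unit runs one thread of the group per pass; the patent does not specify the schedule at this level of detail.

```c
#include <stddef.h>

/* Passes needed to run `threads` threads on `units` processing units,
   assuming one thread per unit per pass (an assumption of this sketch). */
size_t passes_needed(size_t threads, size_t units)
{
    if (units == 0)
        return 0;
    return (threads + units - 1) / units;    /* ceiling division */
}

/* Unit slots that stay idle over those passes. */
size_t idle_slots(size_t threads, size_t units)
{
    if (units == 0)
        return 0;
    return passes_needed(threads, units) * units - threads;
}

/* Upper bound on concurrently executing thread groups:
   G groups per processing core times M processing cores. */
size_t max_thread_groups(size_t g, size_t m)
{
    return g * m;
}
```

For instance, a group of 5 threads on 8 units needs one pass and leaves 3 units idle, while a group of 20 threads on the same 8 units needs three passes.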

The host portion of a CUDA application program is compiled using conventional methods and tools, while kernel functions specify CTA processing. At the highest level, the CUDA memory model separates the host and device memory spaces, such that host code and kernel code can each directly access only their respective memory spaces. Application programming interface (API) functions allow data to be copied between the host and device memory spaces. In the shared-memory CPU execution of the CUDA programming model, a controlling CPU thread can execute in parallel with the parallel CTAs without potential data races. The host memory space is defined by the C programming language, and the device memory spaces are specified as global, constant, local, shared, and texture. All threads may access the global, constant, and texture memory spaces. As already explained, access to the local space is limited to a single thread, and access to the shared space is limited to the threads of a CTA. This memory model encourages using small memory spaces for low-latency accesses and encourages judicious use of the large memory spaces, which typically have longer latency.
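The visibility rules for the device memory spaces listed above can be written down as a small predicate in C. This is a sketch, not part of the patent: the enum, the thread_id struct, and the function can_access are hypothetical names, and the predicate models only which threads may see a given space, not latency or read-only restrictions.

```c
#include <stdbool.h>

typedef enum {
    SPACE_GLOBAL,
    SPACE_CONSTANT,
    SPACE_TEXTURE,
    SPACE_SHARED,
    SPACE_LOCAL
} mem_space;

/* Hypothetical identity of a thread: which CTA it is in, and its index there. */
typedef struct {
    unsigned cta;
    unsigned tid;
} thread_id;

/* May `reader` access data that belongs to `owner` in the given space? */
bool can_access(mem_space space, thread_id owner, thread_id reader)
{
    switch (space) {
    case SPACE_GLOBAL:
    case SPACE_CONSTANT:
    case SPACE_TEXTURE:
        return true;                               /* visible to all threads */
    case SPACE_SHARED:
        return owner.cta == reader.cta;            /* threads of one CTA */
    case SPACE_LOCAL:
        return owner.cta == reader.cta && owner.tid == reader.tid;
    }
    return false;
}
```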

A CUDA program, such as code 101, is typically organized as a set of synchronous or asynchronous executions of CTAs in one, two, or three dimensions (e.g., x, y, and z). A 3-tuple index uniquely identifies threads within a thread block. Thread blocks themselves are distinguished by an implicitly defined 2-tuple variable. The ranges of these indexes are defined at runtime, and the runtime environment checks that the indexes conform to any hardware limitations. Each CTA may be executed by a parallel processor 134 in parallel with other CTAs. Many CTAs may run in parallel, with each parallel processor 134 executing one or more CTAs. The runtime environment is responsible for managing the execution of CUDA code 101 synchronously or asynchronously, as required. Threads within a CTA communicate and synchronize with each other by using shared memory and a barrier synchronization primitive called synchthreads(). CUDA guarantees that the threads within a thread block will be live simultaneously, and provides constructs for threads within a thread block to perform fast barrier synchronizations and local data sharing. Distinct thread blocks within a CTA (defined by one or more dimensions) have no ordering imposed on their creation, execution, or retirement. In addition, parallel CTAs are not allowed access to system calls, including I/O. The CUDA programming model enforces global synchronization only between parallel CTAs, and provides intrinsic atomic operations for limited communication between blocks within a CTA.
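As an illustration of the 3-tuple thread index, the sketch below flattens an (x, y, z) index into a single linear thread number and back, and performs a range check of the kind the runtime environment is described as making. The type dim3_t and the function names are hypothetical; the x-fastest ordering follows the usual CUDA convention and is an assumption here rather than something this text specifies.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a 3-tuple of indexes or extents. */
typedef struct {
    size_t x, y, z;
} dim3_t;

/* True when every component of idx lies inside the extents in dims. */
bool index_in_range(dim3_t idx, dim3_t dims)
{
    return idx.x < dims.x && idx.y < dims.y && idx.z < dims.z;
}

/* Flatten (x, y, z) into one number, with x varying fastest. */
size_t linear_tid(dim3_t idx, dim3_t dims)
{
    return idx.x + dims.x * (idx.y + dims.y * idx.z);
}

/* Inverse of linear_tid for the same extents (all extents must be nonzero). */
dim3_t unflatten_tid(size_t linear, dim3_t dims)
{
    dim3_t idx;
    idx.x = linear % dims.x;
    idx.y = (linear / dims.x) % dims.y;
    idx.z = linear / (dims.x * dims.y);
    return idx;
}
```

A linear thread number of this kind is convenient on a CPU, because a single loop counter can then stand in for the whole 3-tuple.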

The body of each thread, referred to as a kernel, is specified using CUDA and may be represented in standard C using memory model annotations and the barrier synchronization primitive. The semantics of a CUDA program is that each kernel is executed by all the threads in a CTA in an order that respects the memory ordering implied by the barrier synchronization primitive. In particular, all shared memory references within a CTA that occur before a barrier synchronization primitive must be completed before any shared memory references that occur after the barrier synchronization primitive.

Each instance of a barrier synchronization primitive in kernel code conceptually represents a separate logical barrier and should be treated as static. It is illegal to invoke a barrier synchronization primitive in both paths of an if-else construct when CUDA threads may take different branches of the construct. Although all threads within a thread block will reach one of the synchronization primitives, the two primitives represent separate barriers, each of which requires that either all threads reach it or no threads reach it. Therefore, such a kernel will not execute correctly. More generally, CUDA code is not guaranteed to execute correctly if a synchronization primitive is contained within any control flow construct that behaves differently for different threads within a thread block.
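The all-or-none rule can be checked mechanically for a small example. The C sketch below models a hypothetical kernel whose if branch ends in one barrier (A) and whose else branch ends in another (B), and counts how many threads of a block arrive at each. The names and the model are assumptions made for illustration; no real barrier is executed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Arrival counts for the two textually distinct barriers of a
   hypothetical kernel of the form:
       if (cond(tid)) { ...; barrier A; } else { ...; barrier B; }  */
typedef struct {
    size_t at_a;
    size_t at_b;
} barrier_arrivals;

/* takes_if[tid] says which branch thread tid follows. */
barrier_arrivals count_arrivals(const bool *takes_if, size_t n_threads)
{
    barrier_arrivals r = {0, 0};
    for (size_t tid = 0; tid < n_threads; ++tid) {
        if (takes_if[tid])
            r.at_a++;
        else
            r.at_b++;
    }
    return r;
}

/* One barrier instance is well formed only if all threads or none reach it. */
bool barrier_ok(size_t arrivals, size_t n_threads)
{
    return arrivals == 0 || arrivals == n_threads;
}

/* The kernel is well formed only if both barrier instances are. */
bool kernel_barriers_ok(const bool *takes_if, size_t n_threads)
{
    barrier_arrivals r = count_arrivals(takes_if, n_threads);
    return barrier_ok(r.at_a, n_threads) && barrier_ok(r.at_b, n_threads);
}
```

When every thread takes the same branch, one barrier sees the whole block and the other sees no one, so the check passes. As soon as the threads split, each barrier sees only part of the block and neither can complete.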

FIG. 2 is a block diagram illustrating a computer system 200, according to one embodiment of the present invention. Computer system 200 includes a CPU 202 and a system memory 204 communicating via a bus path that includes a memory bridge 205. Memory bridge 205, which may be, for example, a Northbridge chip, is connected via a bus or other communication path 106 (e.g., a HyperTransport link) to an I/O (input/output) bridge 107. CPU 202 produces output for display on a display device 210 (e.g., a conventional CRT or LCD based monitor).

Multithreaded processing subsystem 112 is not included in computer system 200, and CUDA code 101 is not adapted for execution by a general purpose processor such as CPU 202. CUDA code 101 is adapted for execution by multithreaded processing subsystem 112 and is translated using translator 220 to produce translated code 201, which does not include the barrier synchronization primitive. In order for CPU 202 to run the program represented by code 101, code 101 must first be translated into code 201. The translated code may then be compiled by compiler 225 for execution by CPU 202. Compiler 225 may perform optimizations that are specific to CPU 202. Translating code refers to converting code written in a first computer language into a second computer language. Compiling code refers to converting code written in a computer language (e.g., source code) into another computer language (e.g., object code). Translator 220 is described in conjunction with FIG. 3A, and compiler 225 is described in conjunction with FIG. 4. Compiler 225 may be included within a device driver 203 that is configured to interface between code 101, code 201, and CPU 202. A runtime environment 227 is configured to implement functions for the compiled code, such as input and output, memory management, and the like. Runtime environment 227 also launches the compiled code for execution by CPU 202. Translator 220 performs optimizing transformations to serialize operations across the fine-grained threads of a CUDA thread group into a single CPU thread, while runtime environment 227 schedules thread groups as work units for parallel processing by CPU 202.

The primary obstacle to the portability of CUDA applications, which are designed to run on GPUs, for execution by general purpose CPUs is the granularity of parallelism. Conventional CPUs do not support the hundreds of hardware threads that a single CUDA CTA requires. Therefore, the primary goal of a system implementing the CUDA programming model on a general purpose CPU is to distribute the task-level parallelism across the available CPU cores. At the same time, the system must consolidate the microthreads within a task into a single CPU thread to prevent excessive scheduling overhead and frequent intercore synchronization.

FIG. 3A is a flow diagram of method steps for translating code 101, which is written for execution by a multi-core graphics processing unit such as multithreaded processing subsystem 112, into code 201 for execution by a general purpose processor such as CPU 202, according to one embodiment of the present invention. Translator 220 is configured to perform one or more of the steps shown in FIG. 3A in order to preserve the barrier synchronization primitive semantic used in code 101. Translator 220 "unrolls" the parallel threads by partitioning code 101 around the barrier synchronization primitives, reduces the use of shared state, improves the locality of references for memory access, and inserts thread loops to transform the CUDA-specific code for execution by a general purpose processor. Good execution performance can be achieved using CPU 202 to execute code 201 without changing the CUDA code 101 that is targeted for execution by multithreaded processing subsystem 112. Compiler 225 may exploit the vector instruction capability provided by CPU 202 and may perform optimizations when compiling code 201 for execution.

In step 300, translator 220 receives code 101, e.g., CUDA code 101, written for execution by a multi-core GPU such as multithreaded processing subsystem 112 or a processor that includes one or more parallel processors 134. The code received in step 300 is represented as a control flow graph consisting of basic block nodes connected by edges. Each basic block specifies the operations performed by the target environment, e.g., CPU 202. In step 305, translator 220 partitions CUDA code 101 around the barrier synchronization primitives to produce partitioned code. The partitioned code is shown in FIGS. 3B and 3C, and the partitioning process is described in conjunction with those figures. A synchronization partition is a region of code within which the ordering of operations is determined entirely by the control flow and data flow properties of the basic blocks within the partition. A partition has the property that a thread loop can be inserted around it to run the parallel threads. The control flow graph may be used to produce a synchronization partition control flow graph by replacing each synchthreads primitive with an edge, thereby separating the basic block nodes into different partitions.
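By way of illustration only, the partitioning of step 305 can be sketched in Python for the simplest case of straight-line code. This is a simplified sketch under stated assumptions, not the disclosed translator: the kernel body is modeled as a flat list of statement strings rather than a control flow graph, and the marker `BARRIER` and the function `partition_at_barriers` are hypothetical names introduced here.

```python
# Hypothetical model: a kernel body is a flat list of statement strings, and
# the string "BARRIER" stands in for each barrier synchronization primitive.
BARRIER = "BARRIER"


def partition_at_barriers(statements):
    """Split a straight-line statement list into synchronization partitions."""
    partitions, current = [], []
    for stmt in statements:
        if stmt == BARRIER:
            # The barrier itself is dropped; it becomes the boundary between
            # the partition before it and the partition after it.
            partitions.append(current)
            current = []
        else:
            current.append(stmt)
    partitions.append(current)
    # Empty partitions (e.g., from back-to-back barriers) carry no work.
    return [p for p in partitions if p]


kernel = [
    "a = load(tid)",
    "shared[tid] = a",
    BARRIER,
    "b = shared[tid - 1]",
    "store(b)",
]
print(partition_at_barriers(kernel))
```

Barriers nested inside conditionals or loops would require the control-flow-graph treatment described above, which this sketch does not attempt.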

In step 310, the partitioned code is classified so that each statement is identified as either convergent or divergent. The partitioned code may include expressions and statements. An expression is a computation that may involve constants, implicit threadIDs, and named variables created by the programmer, but that has no side effects or assignments. A simple statement is defined as a computational expression resulting in a single assignment. A general statement can also represent a barrier, a control flow conditional or loop construct, or a sequential block of statements. The CTA dimensions x, y, and z are propagated through the code to determine whether each operation depends on one or more of the CTA dimensions. Operations that reference a threadID (thread identifier) in dimension x, y, and/or z are considered divergent, since a thread that references a CTA dimension may diverge from other threads of the same CTA during execution. For example, an operation that depends on threadID.x is divergent in the x dimension. Another operation that does not depend on threadID.x is convergent in the x dimension. Divergent statements require thread loops for each CTA dimension that they reference.
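The propagation of CTA dimensions through the code can be sketched as a simple forward data flow pass. The Python sketch below is an assumption-laden illustration rather than the disclosed classifier: each statement is modeled as a `(target, operands)` pair, only data dependences are followed (control dependences are ignored), and the function name `variance_vectors` is hypothetical.

```python
def variance_vectors(assignments):
    """Compute, per assignment, the set of CTA dimensions it depends on.

    assignments: list of (target, [operand names]) in program order.
    """
    # The implicit thread identifiers seed the analysis.
    variance = {
        "threadID.x": {"x"},
        "threadID.y": {"y"},
        "threadID.z": {"z"},
    }
    result = []
    for target, operands in assignments:
        dims = set()
        for name in operands:
            # Constants and names never seen before contribute no dimension.
            dims |= variance.get(name, set())
        variance[target] = dims
        result.append((target, frozenset(dims)))
    return result


program = [
    ("n", ["blockDim"]),
    ("col", ["threadID.x"]),
    ("row", ["threadID.y"]),
    ("idx", ["row", "n", "col"]),
]
for target, dims in variance_vectors(program):
    label = "divergent in " + ",".join(sorted(dims)) if dims else "convergent"
    print(target, "->", label)
```

An empty set marks a convergent statement; a non-empty set lists the dimensions over which a thread loop would be needed.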

In step 315, the partitioned code is optimized for performance using the classification information to produce optimized code. For example, instructions within a partition may be reordered to fuse operations, so that operations with the same classification are grouped together and can fall within the same thread loop that is inserted in step 325. Operations are ordered such that operations with fewer threadID dimensions in their variance vector precede operations that depend on more threadID dimensions. This reordering is valid because a statement must have a variance vector that is a superset of the variance vectors of the statements it depends on. Thus, statements with only one dimension in their variance vector cannot depend on any statement with a different dimension, or with more than one dimension, in its variance vector.
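The ordering rule can be sketched as a stable sort keyed on the variance vector. This Python sketch is illustrative only and rests on an assumption not stated in the description: every variable is assigned exactly once, so that only read-after-write dependences exist. Under that assumption a dependent statement always has a variance vector that is a superset of its producer's, so a stable sort by vector size never moves it ahead of the producer. The name `reorder_by_variance` is hypothetical.

```python
def reorder_by_variance(statements):
    """Order statements so fewer-dimension variance vectors come first.

    statements: list of (text, variance) pairs, variance being a set of
    CTA dimension names. Python's sort is stable, so statements with the
    same key keep their original relative order.
    """
    return sorted(statements, key=lambda s: (len(s[1]), tuple(sorted(s[1]))))


stmts = [
    ("col = threadID.x", {"x"}),
    ("n = width * 2", set()),
    ("row = threadID.y", {"y"}),
    ("k = col + 1", {"x"}),
    ("idx = row * n + col", {"x", "y"}),
]
for text, dims in reorder_by_variance(stmts):
    print(sorted(dims), text)
```

After sorting, statements that share a variance vector are adjacent, which is what allows them to share one thread loop.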

In step 320, thread-local memory references in the optimized code are promoted to array references as needed, to ensure that each instance of an object has a unique location in which to store a value. In particular, data that is carried from one partition to another needs to be duplicated so that it is available in each partition. A variable that meets one of the following conditions is promoted to an array reference: a local variable that has a cross-partition dependency (assigned in one partition and referenced in another partition).

In step 320, translator 220 promotes thread-local memory references to array references. The program shown in Table 1 includes a synchronization barrier primitive and divergent references.

The program shown in Table 1 is partitioned into a first partition before the synchthreads primitive and a second partition after the synchthreads primitive. The second partition includes references (leftIndex and rightIndex) that are computed in the first partition and depend on a CTA dimension. If the divergent references are not promoted, the second partition will incorrectly use the values computed by the last iteration of the first partition. The second partition should use the value computed for each corresponding iteration of threadId.x of the first partition. To ensure that the computation is correct, the divergent references are promoted as shown in Table 2.
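The effect of promotion can be demonstrated with a small Python sketch. It does not reproduce Tables 1 and 2; the index arithmetic, the CTA size `DIM_X`, and both function names are assumptions chosen only to make the stale-value problem visible.

```python
DIM_X = 4  # hypothetical CTA size in the x dimension


def without_promotion():
    shared = [0] * DIM_X
    out = [0] * DIM_X
    for tid_x in range(DIM_X):            # thread loop, first partition
        left_index = (tid_x - 1) % DIM_X  # scalar, overwritten every iteration
        shared[tid_x] = 10 * tid_x
    # (the barrier was here)
    for tid_x in range(DIM_X):            # thread loop, second partition
        out[tid_x] = shared[left_index]   # stale: value from the last iteration
    return out


def with_promotion():
    shared = [0] * DIM_X
    out = [0] * DIM_X
    left_index = [0] * DIM_X              # promoted: one slot per thread
    for tid_x in range(DIM_X):
        left_index[tid_x] = (tid_x - 1) % DIM_X
        shared[tid_x] = 10 * tid_x
    for tid_x in range(DIM_X):
        out[tid_x] = shared[left_index[tid_x]]
    return out


print(without_promotion())  # every thread reads the same element
print(with_promotion())     # each thread reads its own left neighbor
```

In the unpromoted version all four results come from the same shared element, because only the last thread's index survives the first loop.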

In step 325, thread loops are generated for those statements that contain threadID dimensions in their variance vectors. Adaptive loop nesting is used to simultaneously evaluate transformations equivalent to loop interchange, loop fission, and loop invariant removal to achieve the best redundancy removal. The nested loops are dynamically generated over the values of each dimension of the threadID tuple to best suit the application, rather than assuming a particular loop nest and evaluating the application based on that nest. After the statements are ordered in step 315, loops may be generated for threadID dimensions only around those statements that contain that dimension in their variance vector. To remove loop overhead, translator 220 may fuse adjacent statement groups where one group has a variance vector that is a subset of the other's.
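A much reduced form of loop generation can be sketched as a text emitter. The Python sketch below is not the adaptive loop nesting described above: it assumes the statements are already ordered, groups only adjacent statements with identical variance vectors, uses a fixed z, y, x nesting order, and performs no fusion of subset groups. The function name `emit_thread_loops` and the C-like output format are hypothetical.

```python
def emit_thread_loops(statements):
    """Emit C-like text with loops only over the dimensions each group needs.

    statements: ordered list of (text, variance) pairs, variance being a
    set drawn from {"x", "y", "z"}.
    """
    lines, i = [], 0
    while i < len(statements):
        dims = statements[i][1]
        group = []
        # Collect adjacent statements that share the same variance vector.
        while i < len(statements) and statements[i][1] == dims:
            group.append(statements[i][0])
            i += 1
        # Fixed nesting order (an assumption of this sketch).
        order = [d for d in ("z", "y", "x") if d in dims]
        for depth, d in enumerate(order):
            lines.append(
                "  " * depth
                + f"for (tid_{d} = 0; tid_{d} < dim_{d}; tid_{d}++) {{"
            )
        for text in group:
            lines.append("  " * len(order) + text)
        for depth in reversed(range(len(order))):
            lines.append("  " * depth + "}")
    return lines


example = [
    ("n = w * 2;", set()),
    ("col[tid_x] = tid_x;", {"x"}),
    ("out[tid_y][tid_x] = n;", {"x", "y"}),
]
print("\n".join(emit_thread_loops(example)))
```

The convergent statement is emitted once with no loop, the x-divergent statement gets a single loop, and only the (x, y) statement gets a two-level nest.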

FIG. 3B is a conceptual diagram illustrating input code 330 that is translated into partitioned code 350, according to one embodiment of the present invention. Input code 330 is configured for execution by multithreaded processing subsystem 112 and includes code sequences 331 and 332 that are separated by synchronization barrier instruction 336. All threads in a CTA will complete execution of code sequence 331 before any one of the threads begins execution of code sequence 332. Translator 220 partitions input code 330 to produce partitioned code 350, where partition 351 includes the instructions represented by code sequence 331 and partition 352 includes the instructions represented by code sequence 332. A thread loop 353 is inserted around partition 352 to ensure that the synchronization semantic is maintained when partitioned code 350 is executed by a general purpose processor that does not natively support the synchronization barrier instruction. In this example, code partition 351 includes convergent references and partition 352 may include divergent references. Therefore, thread loop 353 is inserted around partition 352.

In step 325 of FIG. 3A, translator 220 inserts thread loops (such as thread loop 353) into the optimized code in order to produce code 201 that is translated for execution by CPU 202. Each partition may have a thread loop inserted for each CTA dimension. An example of synchronization partitioning and thread loop insertion is shown in Tables 3 and 4. The program shown in Table 3 is translated into the program shown in Table 4.

The program in Table 3 uses explicit synchronization to ensure correct sharing of memory between the various threads in a CTA. Translator 220 partitions the program into two partitions, each of which depends on the x CTA dimension. Therefore, a thread loop is inserted around each of the two partitions to ensure that the translated program performs the operations in the correct order.
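Why one thread loop per partition is needed can be seen in a small Python sketch. It is not the program of Tables 3 and 4; the kernel body (each thread doubles one input into shared memory, then reads its right neighbor's slot), the size `DIM_X`, and the function names are assumptions made for illustration.

```python
DIM_X = 4  # hypothetical CTA size in the x dimension


def translated_kernel(data):
    """One thread loop per partition: the barrier ordering is preserved."""
    shared = [0] * DIM_X
    out = [0] * DIM_X
    for tid_x in range(DIM_X):    # thread loop around the first partition
        shared[tid_x] = data[tid_x] * 2
    # Barrier boundary: the first loop finishes for every thread
    # before the second loop starts.
    for tid_x in range(DIM_X):    # thread loop around the second partition
        out[tid_x] = shared[(tid_x + 1) % DIM_X]
    return out


def single_loop_kernel(data):
    """Incorrect: a single loop around both partitions ignores the barrier."""
    shared = [0] * DIM_X
    out = [0] * DIM_X
    for tid_x in range(DIM_X):
        shared[tid_x] = data[tid_x] * 2
        out[tid_x] = shared[(tid_x + 1) % DIM_X]  # neighbor not yet written
    return out


print(translated_kernel([1, 2, 3, 4]))
print(single_loop_kernel([1, 2, 3, 4]))
```

The single-loop version reads shared slots before the owning "thread" iteration has written them, which is exactly the hazard the barrier rules out on the GPU.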

A simpler technique for translating a program for execution by a general purpose processor inserts explicit thread loops for each CTA dimension, so that it is not necessary to determine the dimension dependency for references within the same partition. For example, the program shown in Table 5 is translated into the program shown in Table 6. Note that one or more of the thread loops inserted in Table 6 may be unnecessary, since the program was produced without determining the dimension dependency.
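The cost of the simpler technique can be illustrated with a Python sketch that wraps a partition in loops over all three CTA dimensions regardless of what the partition uses. This is not the program of Tables 5 and 6; the helper `run_cta`, the partition `body`, and the CTA shape are hypothetical.

```python
from itertools import product


def run_cta(partition, dims):
    """Run one partition for every thread of a CTA of shape dims = (x, y, z)."""
    dim_x, dim_y, dim_z = dims
    # x varies fastest, mirroring an innermost x loop.
    for tid_z, tid_y, tid_x in product(range(dim_z), range(dim_y), range(dim_x)):
        partition(tid_x, tid_y, tid_z)


calls = []
out = [0] * 4


def body(tid_x, tid_y, tid_z):
    calls.append((tid_x, tid_y, tid_z))
    out[tid_x] = tid_x * tid_x   # depends on the x dimension only


run_cta(body, (4, 2, 1))
print(out, len(calls))  # 8 executions produce only 4 distinct results
```

Because `body` depends only on x, the y loop repeats identical work; dimension-dependency analysis would let a translator drop that loop.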

FIG. 3C is a conceptual diagram illustrating input code 333 that is translated into optimized code 360, according to one embodiment of the present invention. Input code 333 is configured for execution by multithreaded processing subsystem 112 and includes code sequences 334 and 338 that are separated by synchronization barrier instruction 335. All threads in a CTA will complete execution of code sequence 334 before any one of the threads begins execution of code sequence 338. Translator 220 partitions input code 333 to produce partitioned code 360, where partition 361 includes the instructions represented by code sequence 334 and partitions 362, 364, and 365 include the instructions represented by code sequence 338.

Partition 362 includes a first portion of instructions that are divergent in a first CTA dimension. Partition 364 includes a second portion of instructions that are convergent. Partition 365 includes a third portion of instructions that are divergent in a second CTA dimension. A thread loop 363 is inserted around partition 362 to ensure that the synchronization semantic is maintained when partitioned code 360 is executed by a general purpose processor that does not natively support the synchronization barrier instruction. Thread loop 363 iterates over the first CTA dimension. A thread loop 366 is inserted around partition 365 to iterate over the second CTA dimension.

Table 7 shows an example CUDA kernel, and Table 8 shows a translation of the CUDA kernel for execution by a general purpose processor. The example kernel multiplies a list of small matrices. Each thread block computes one small matrix multiplication out of the list, while each thread computes one element of the result matrix for its block.

Note that the statement at line (9) of Table 7 has a variance vector of (x,y), since col depends on the x dimension and row depends on the y dimension. The z dimension is never used, so no loop that iterates over z is inserted. Typical cost analysis techniques may be used to decide cases such as statements (5) and (6) of the example kernel shown in Table 7. Since each depends on only one threadID dimension, choosing either nesting order of the x and y index loops will force either redundant execution of a statement or a redundant loop outside the main loop nest of the partition.
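The shape of such a translation can be sketched in Python. This is not a reproduction of Tables 7 and 8: the function names, the list-of-pairs input format, and the choice to nest the x loop inside the y loop are assumptions. With that nesting, `row` is computed once per y iteration, while `col` is recomputed on every inner iteration, which is the kind of redundancy the nesting-order trade-off refers to.

```python
def small_matmul_block(a, b, size):
    """Work of one thread block: C = A x B for one size-by-size matrix pair."""
    c = [[0] * size for _ in range(size)]
    for tid_y in range(size):        # thread loop over the y dimension
        row = tid_y                  # depends on y only
        for tid_x in range(size):    # thread loop over the x dimension
            col = tid_x              # depends on x only, recomputed per row
            total = 0
            for k in range(size):
                total += a[row][k] * b[k][col]
            c[row][col] = total      # variance vector (x, y)
    # No loop over z: the kernel never references that dimension.
    return c


def multiply_list(pairs, size):
    # Each list entry corresponds to one thread block (CTA).
    return [small_matmul_block(a, b, size) for a, b in pairs]


pairs = [
    ([[1, 2], [3, 4]], [[5, 6], [7, 8]]),
    ([[1, 0], [0, 1]], [[9, 8], [7, 6]]),
]
print(multiply_list(pairs, 2))
```

Because the entries of the list are independent, a runtime could hand different entries to different CPU cores while each entry runs as serial loops.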

FIG. 4 is a flow diagram of method steps for execution of translated code 201 by a general purpose processor, such as CPU 202, according to one embodiment of the present invention. In step 400, compiler 225 compiles translated code 201, optionally performing CPU-specific optimizations, to produce compiled code. In step 405, device driver 203 determines the number of execution cores that are available in CPU 202. Translated code 201 may be automatically scaled for execution on the available execution cores for improved performance. In step 410, runtime environment 227 or device driver 203 configures CPU 202 to enable the number of execution cores that will execute translated code 201.

Runtime environment 227 may create a number of operating system (OS) runtime threads, and that number can be controlled by an environment variable. By default, the number of cores in the system may be used as the number of OS runtime threads. In step 410, the number of CUDA threads to be launched may be evaluated and statically partitioned among the runtime threads. Each runtime thread executes a portion of the compiled code sequentially and waits on a barrier. When all runtime threads have reached the barrier, the CTA has completed. In step 415, runtime environment 227 or device driver 203 launches the compiled code for execution by CPU 202.
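The static partitioning and the closing barrier can be sketched with Python's standard `threading` module. This is a rough sketch, not the disclosed runtime environment: the environment variable name `RUNTIME_THREADS`, the functions `static_chunks` and `launch`, and the per-thread `kernel` callback are all hypothetical, and CPython threads only illustrate the scheduling structure rather than delivering true multi-core speedup.

```python
import os
import threading


def static_chunks(total, workers):
    """Split range(total) into contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(total, workers)
    chunks, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def launch(kernel, total_threads, workers=None):
    if workers is None:
        # RUNTIME_THREADS is a made-up variable name for this sketch;
        # the default falls back to the number of cores.
        workers = max(1, int(os.environ.get("RUNTIME_THREADS", os.cpu_count() or 1)))
    done = threading.Barrier(workers + 1)  # the workers plus the launching thread

    def run(chunk):
        for tid in chunk:   # each runtime thread runs its share sequentially
            kernel(tid)
        done.wait()         # then waits on the barrier

    for chunk in static_chunks(total_threads, workers):
        threading.Thread(target=run, args=(chunk,)).start()
    done.wait()             # returns once every runtime thread has arrived


squares = [0] * 10


def square(tid):
    squares[tid] = tid * tid


launch(square, 10, workers=3)
print(squares)
```

The launching thread takes part in the barrier so that `launch` returns only after all chunks have finished, mirroring the completion condition described above.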

Translator 220, compiler 225, and runtime environment 227 are used to convert CUDA application programs into code for execution by a general purpose CPU. The CUDA programming model supports bulk synchronous task parallelism, where each task is composed of fine-grained SPMD threads. Use of the CUDA programming model has been limited to programmers willing to write specialized code for execution by GPUs. This specialized code may be converted for execution by a general purpose CPU without requiring the programmer to rewrite the CUDA application program. The three key abstractions supported by CUDA are SPMD thread blocks, barrier synchronization, and shared memory. Translator 220 serializes the operations across the fine-grained threads of a CUDA thread block into a single CPU thread and performs optimizing transformations to convert a CUDA application program.

While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. For example, aspects of the present invention may be implemented in hardware or software or in a combination of hardware and software. One embodiment of the invention may be implemented as a program product for use with a computer system. The program(s) of the program product define functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips, or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive, or any type of solid-state random-access semiconductor memory) on which alterable information is stored. Such computer-readable storage media, when carrying computer-readable instructions that direct the functions of the present invention, are embodiments of the present invention. Therefore, the scope of the present invention is determined by the claims that follow.

FIG. 1 is a block diagram illustrating a computer system.

FIG. 2 is a block diagram illustrating a computer system, according to one embodiment of the present invention.

FIG. 3A is a flow diagram of method steps for translating code written for execution by a multi-core graphics processing unit into code for execution by a general purpose processor, according to one embodiment of the present invention.

FIG. 3B is a conceptual diagram illustrating input code that is translated into partitioned code, according to one embodiment of the present invention.

FIG. 3C is a conceptual diagram illustrating input code that is translated into optimized code, according to one embodiment of the present invention.

FIG. 4 is a flow diagram of method steps for execution of translated code by a general purpose processor, according to one embodiment of the present invention.

<Description of reference numerals for the main parts of the drawings>

100: computer system

101: code

102: CPU

103: device driver

104: system memory

Claims

A computing system configured to execute a translated application program, the computing system comprising:

a general purpose processor configured to execute a compiler;

a system memory coupled to the general purpose processor and configured to store the translated application program and compiled code;

the compiler, configured to receive the translated application program, which is converted from an application program written using a parallel programming model for execution on a multi-core graphics processing unit, and to compile the translated application program to generate the compiled code for execution by the general purpose processor;

a device driver configured to determine a number of execution cores of the general purpose processor that are available to execute the translated application program, and to configure the general purpose processor to enable that number of execution cores; and

a runtime environment configured to launch the compiled code for execution by the general purpose processor that includes the number of execution cores.

The computing system of claim 1,

wherein the translated application program includes a first loop nest around a first region of a partitioned application program to ensure that all threads of a cooperative thread array complete execution of the first region of the partitioned application program before any thread of the cooperative thread array begins execution of a second region of the partitioned application program.

The computing system of claim 2,

wherein the first loop iterates over one or more dimensions of the cooperative thread array.

The computing system of claim 1,

wherein the translated application program is generated by partitioning the application program into regions of synchronization-independent instructions to produce a partitioned application program, and inserting a loop around at least one region of the partitioned application program, wherein the loop iterates over a cooperative thread array dimension that corresponds to a number of threads that are executed concurrently by parallel processors within the multi-core graphics processing unit.

The computing system of claim 4,

wherein a first region of the partitioned application program includes instructions that are before a synchronization barrier instruction, and a second region of the partitioned application program includes instructions that are after the synchronization barrier instruction.

The computing system of claim 5,

wherein additional loops are inserted around at least one region of the partitioned application program to produce the translated application program, the additional loops iterating over different cooperative thread array dimensions.

The computing system of claim 4,

wherein the partitioned application program is classified to identify each statement as either convergent or divergent with respect to the cooperative thread array dimension.

The computing system of claim 1,

wherein the general purpose processor is further configured to execute the compiled code.

The computing system of claim 1,

wherein the general purpose processor is further configured to perform optimizations that are specific to the general purpose processor.

The computing system of claim 1,

wherein the application program written using a parallel programming model for execution on a multi-core graphics processing unit is a Compute Unified Device Architecture (CUDA) application program.