TWI776338B - Compiler adapted in graph processing unit and non-transitory computer-readable medium - Google Patents

Compiler adapted in graph processing unit and non-transitory computer-readable medium Download PDF

Info

Publication number
TWI776338B
Authority
TW
Taiwan
Prior art keywords
instruction
executed
path
branch instruction
compiler
Prior art date
Application number
TW109146968A
Other languages
Chinese (zh)
Other versions
TW202225953A (en)
Inventor
陳中和
陳惇介
許峰銘
林聖堯
Original Assignee
國立成功大學
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 國立成功大學 filed Critical 國立成功大學
Priority to TW109146968A priority Critical patent/TWI776338B/en
Priority to US17/214,965 priority patent/US11567745B2/en
Publication of TW202225953A publication Critical patent/TW202225953A/en
Application granted granted Critical
Publication of TWI776338B publication Critical patent/TWI776338B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/443Optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3804Instruction prefetching for branches, e.g. hedging, branch folding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

A compiler includes a front-end module, an optimization module, and a back-end module. The front-end module pre-processes a source code to generate an intermediate code. The optimization module optimizes the intermediate code. The back-end module translates the optimized intermediate code to generate a machine code. The optimization includes translating a branch instruction in the intermediate code into performing the following operations: building a post-dominator tree for the branch instruction to find an immediate post-dominator of the branch instruction as a reconvergence node of a first path and a second path of the branch instruction; and inserting a specific instruction in front of the reconvergence node, so that when the specific instruction on the first path has been executed, execution jumps to the instructions of the second path.

Description

Compiler applied to a graphics processor, and non-transitory computer-readable storage medium

The present invention relates to the technical field of compilers, and in particular to a compiler applied to a graphics processor.

In recent years, the rise of the Internet of Things (IoT) and the rapid development of fields such as artificial intelligence and machine learning have greatly increased the volume of data to be processed. Traditional cloud computing can no longer cope with such massive real-time data processing and is therefore being replaced by application architectures based on distributed computing, such as fog computing, edge computing, and end-user computing. Edge computing, for example, moves the computation of applications, data, and services from central network nodes to nodes at the logical edge of the network. In other words, edge computing decomposes large services that were originally handled entirely by central nodes into smaller, more manageable parts and distributes them to edge nodes for processing. Edge nodes are closer to user terminal devices, which speeds up data processing and transmission and reduces latency.

Accordingly, general-purpose graphics processing units (GPGPUs) have come into wide use for applications of this kind, which must process large amounts of data and are highly parallelizable. Besides processing graphics data, such graphics processing units can also perform general-purpose computing tasks originally handled by the central processing unit, tasks that usually have nothing to do with graphics processing. Because modern graphics processors have powerful parallel processing capabilities and programmable pipelines, a GPGPU can greatly outperform a traditional central processing unit on single-instruction multiple-data (SIMD) workloads in which the computation far outweighs the data scheduling and transfer.

However, most graphics processors use each vendor's own system architecture and compiler, which usually support only applications written for that vendor's own architecture and language. Even when these vendors release some support for open-source software, the compiler and related software or hardware must still follow their definitions. For example, the traditionally adopted Open Computing Language (OpenCL) compiler is AMD CLOC, which is closed-source software available only on the x86 platform. In other words, developers cannot modify it, add instructions to it, or optimize it, which creates real difficulties in development and use. How to provide a portable OpenCL compilation platform and an optimizing compiler that improves the performance of OpenCL-capable graphics processors is therefore a current issue.

An object of the present invention is to provide a compiler applied to a graphics processor, and a non-transitory computer-readable storage medium.

To achieve the above object, the present invention provides a compiler applied to a graphics processor capable of general-purpose computation, configured to compile an application program executed by the graphics processor to generate machine code corresponding to the application program for execution by a plurality of streaming multiprocessors in the graphics processor. The compiler includes a front-end module, an optimization module, and a back-end module. The front-end module is configured to pre-process the source code corresponding to the application program to generate intermediate code. The optimization module is configured to optimize the intermediate code. The back-end module is configured to translate the optimized intermediate code to generate the machine code. The optimization translates each branch instruction in the intermediate code into performing the following operations: building a post-dominator tree for the branch instruction to find an immediate post-dominator of the branch instruction as the reconvergence node of the instructions of the branch instruction's first path and second path; and inserting a specific instruction in front of the reconvergence node, so that when the instructions of the branch instruction's first path are executed, execution jumps, after the specific instruction on the first path completes, to the instructions of the branch instruction's second path, and the instructions starting at the reconvergence node are executed only after the specific instruction on the second path has also completed.

In an embodiment of the present invention, the branch instruction is executed simultaneously by a plurality of streaming processors included in the streaming multiprocessor to which it is assigned. The instructions of the first path are executed simultaneously, using a first thread mask, by a plurality of first streaming processors and a plurality of second streaming processors among those streaming processors, and the instructions of the second path are executed simultaneously, using a second thread mask, by the first streaming processors and the second streaming processors.

In an embodiment of the present invention, by the time the specific instruction on the first path has finished executing, only the results produced by the first streaming processors have been stored; and by the time the specific instruction on the second path has finished executing, only the results produced by the second streaming processors have been stored.

In an embodiment of the present invention, when the instructions of the first path of the branch instruction are executed, use of the first thread mask ends once the specific instruction is reached; and when the instructions of the second path of the branch instruction are executed, use of the second thread mask ends once the specific instruction is reached.

In an embodiment of the present invention, the optimization further translates each function-call instruction in the intermediate code into performing the following operation: expanding all content of the function called by the call instruction inline, directly in the caller that uses the call instruction.

In an embodiment of the present invention, the optimization further translates each loop instruction in the intermediate code into performing the following operations: analyzing the loop instruction to determine the number of iterations; and fully unrolling the instructions executed within the loop according to the number of iterations.

In an embodiment of the present invention, the front-end module is the clang compiler, configured to generate the intermediate code defined by the low-level virtual machine (LLVM).

In an embodiment of the present invention, the pre-processing includes macro processing, static analysis, and generating a syntax tree corresponding to the source code.

The present invention further provides a non-transitory computer-readable storage medium configured to store a plurality of instructions which, when executed by a processor in a computer system, cause the processor to perform a compilation method that compiles an application program executed by a graphics processor in the computer system to generate machine code corresponding to the application program for execution by a plurality of streaming multiprocessors in the graphics processor. The compilation method includes: pre-processing the source code corresponding to the application program to generate intermediate code; optimizing the intermediate code; and translating the optimized intermediate code to generate the machine code. The optimization translates each branch instruction in the intermediate code into performing the following operations: building a post-dominator tree for the branch instruction to find an immediate post-dominator of the branch instruction as the reconvergence node of the instructions of the branch instruction's first path and second path; and inserting a specific instruction in front of the reconvergence node, so that when the instructions of the first path of the branch instruction are executed, execution jumps, after the specific instruction on the first path completes, to the instructions of the second path of the branch instruction, and the instructions starting at the reconvergence node are executed only after the specific instruction on the second path has also completed.

By compiling the above branch-related instructions, call instructions, and loop instructions with the corresponding optimizations, the present invention allows the software stack to cooperate better with the operation of the hardware and achieves a substantial overall performance improvement, thereby providing developers with a convenient open-source execution environment.

In order to make the above and other objects, features, and advantages of the present invention more clearly understood, preferred embodiments of the present invention are described in detail below in conjunction with the accompanying drawings.

Please refer to FIG. 1, which is a block diagram of a graphics processor 100 according to a preferred embodiment of the present invention. The general-purpose graphics processor 100 has a single-instruction multiple-thread (SIMT) architecture and includes an interconnection network module 110, a plurality of streaming multiprocessors (SMs) 120, a work scheduling module 130, and a memory 140. The interconnection network module 110 is electrically connected to each streaming multiprocessor 120, the work scheduling module 130, and the memory 140, and is configured to transfer data among these elements. The streaming multiprocessors 120 are configured to perform computation and execute instructions. Each streaming multiprocessor 120 includes a warp scheduling module 121 and a plurality of streaming processors (SPs) 122, whose purposes are described later. The work scheduling module 130 is configured to communicate with an external central processing unit (not shown), receive work assigned by the central processing unit, and schedule the work to the streaming multiprocessors 120 for execution.

A thread is the smallest unit of a program executed by the general-purpose graphics processor 100, and its scheduling is dispatched through two layers of scheduling modules: the work scheduling module 130 and the warp scheduling module 121. When the central processing unit sends new work, the work scheduling module 130 receives the program to be executed in units of thread grids, partitions and schedules it, and then dispatches it in units of thread blocks to each streaming multiprocessor 120 for execution. After receiving a thread block, a streaming multiprocessor 120 divides it into multiple warps according to the SIMD width and performs computation in units of warps. The warps are scheduled by the warp scheduling module 121 and dispatched to the streaming processors 122 for execution. The threads in the same warp are executed simultaneously by the streaming processors 122 in the streaming multiprocessor 120. For example, if the streaming multiprocessor 120 contains 32 streaming processors 122 (that is, the SIMD width is 32), each warp is arranged to contain up to 32 threads, which these 32 streaming processors 122 execute in parallel; if a warp contains fewer than 32 threads, some of the corresponding streaming processors 122 are idle for that warp. It should be understood that a program executed on a graphics processor is generally called a kernel, and one kernel corresponds to one thread grid; each thread grid contains multiple thread blocks, and each thread block in turn contains multiple threads.
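
As a minimal illustration of this warp partitioning (our own sketch, not part of the patent), the following C++ fragment splits a thread block into warps of the SIMD width and records which lanes of a partial warp stay idle:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int kSimdWidth = 32; // SIMD width assumed in the embodiment above

struct Warp {
    uint32_t active_mask; // bit i set => lane i carries a live thread
};

// Split a thread block of block_size threads into warps. Lanes beyond the
// end of the block stay inactive, matching the idle streaming processors
// described above for a partial warp.
std::vector<Warp> partition_block(int block_size) {
    std::vector<Warp> warps;
    for (int base = 0; base < block_size; base += kSimdWidth) {
        int live = std::min(kSimdWidth, block_size - base);
        uint32_t mask = (live == kSimdWidth) ? 0xFFFFFFFFu : ((1u << live) - 1u);
        warps.push_back({mask});
    }
    return warps;
}

int main() {
    // 70 threads -> two full warps plus one warp with only 6 active lanes.
    for (const Warp& w : partition_block(70))
        std::printf("active mask: %08x\n", w.active_mask);
}
```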

Please refer to FIG. 2, which is a schematic diagram of the software stack of the general-purpose graphics processor 100 according to a preferred embodiment of the present invention. As shown in FIG. 2, the top layer is the TensorFlow runtime 210, on which developers can use the application libraries supported by TensorFlow for machine-learning and deep-learning model development. The OpenCL runtime 220 then supports the general-purpose graphics processor 100 to achieve massive parallel computation and improve performance. In other words, both TensorFlow CNN applications and OpenCL applications can be accelerated on the general-purpose graphics processor 100. Finally, the Heterogeneous System Architecture (HSA) runtime 230 provides a common hardware interface, serving as a bridge between software and hardware for communicating with the general-purpose graphics processor 100, so as to reduce the design complexity of the OpenCL runtime 220. The general-purpose graphics processor 100 begins operating after receiving information from the software side, and finally transfers the results back to the memory on the central-processor side, achieving program acceleration.

However, without compiler support, the software stack cannot establish a complete system platform for the general-purpose graphics processor 100, so the compiler occupies a very important position in the overall software and hardware system. In the present invention, the compiler 240 is an OpenCL LLVM compiler that supports the general-purpose graphics processor 100; the compiler 240 can perform optimizations and define its own instruction set, so that the hardware and software cooperate well, improving execution efficiency.

Specifically, with respect to the TensorFlow runtime 210, enabling TensorFlow applications to run on the OpenCL architecture first requires understanding how TensorFlow Stream Executor and TF-Coriander fit together. TensorFlow Stream Executor is the common interface of the kernel application programming interface that Google defines for TensorFlow. Architecturally, Stream Executor serves as the hardware abstraction layer for each target platform: the kernel applications above it issue resource-management commands to the virtual device through a unified interface, such as memory allocation, instruction dispatch, and kernel process monitoring. Platform developers can also place platform-specific optimizations in their kernel implementations to optimize the execution efficiency of each kernel on the platform.

Native TensorFlow GPU support covers only graphics-processor devices that use the CUDA programming language; developers of other platforms must design a Stream Executor for their target platform themselves. Because TensorFlow provides many kinds of kernel operations, complete platform support would require substantial engineering effort, and TensorFlow updates would be difficult to track and maintain. To reduce the complexity of adding new hardware, a CUDA-on-CL architecture has been proposed that uses Coriander's source-to-source compiler to translate native CUDA applications into host code and device code that an OpenCL device can execute. It thereby converts TensorFlow's native CUDA code into OpenCL device kernels and provides a Stream Executor designed for OpenCL, maintained as an independent fork of TensorFlow, namely TF-Coriander.

TF-Coriander translates TensorFlow's built-in CUDA code into OpenCL device kernel code through the Coriander compiler, and replaces cuBLAS and cuDNN in CUDA with OpenCL libraries such as clBLAST [11] and DNN [12], building a TensorFlow that supports OpenCL devices for use with OpenCL 1.2 devices.

In addition, as background to the HSA runtime 230: today's computing platforms are generally composed of heterogeneous hardware such as central processing units (CPUs), graphics processing units (GPUs), and application-specific integrated circuits (ASICs). For this reason, Apple proposed an open language framework, the Open Computing Language (OpenCL). OpenCL provides a unified abstract software architecture and language for hardware of various architectures and uses the same application programming interface to connect to the target hardware, providing functions such as device memory allocation, device kernel compilation, and device code dispatching. To support the hardware of each platform, the OpenCL runtime is implemented in the software architecture as a shared library (Linux) or dynamically loadable library (NT). Each hardware vendor implements the application programming interface for its hardware according to the OpenCL specification.

The OpenCL application architecture divides program code into host code and device code (kernels). Most of what the host code executes consists of the C++ classes and runtime API provided by the OpenCL runtime, while for target devices such as graphics processors and accelerators, OpenCL kernel code must be written separately, following the OpenCL programming model and designed so that kernels can be dispatched. OpenCL kernel code is a C99-based programming language that, together with the kernel application programming interface, provides parallel computing through task partitioning and data partitioning.
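
This host-code/device-code split can be made concrete with a minimal OpenCL host program in C++ (illustrative only: the kernel, variable names, and default-device choice are our assumptions, and error checking and resource release are omitted for brevity). The device code travels as a C99 source string that the runtime compiles and dispatches:

```cpp
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Device code (kernel): a C99-based vector addition carried as a source
// string inside the host program. Names are illustrative, not from the patent.
static const char* kKernelSrc = R"(
__kernel void vec_add(__global const float* a,
                      __global const float* b,
                      __global float* c) {
    size_t i = get_global_id(0);   // work-item index supplied by the runtime
    c[i] = a[i] + b[i];
}
)";

int main() {
    const size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    // Host code: locate a device and build the kernel through the runtime API.
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, nullptr);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, nullptr);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kKernelSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "vec_add", nullptr);

    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), a.data(), nullptr);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), b.data(), nullptr);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float),
                               nullptr, nullptr);
    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    // Dispatch over an NDRange of n work-items, then read back the result.
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, n * sizeof(float), c.data(),
                        0, nullptr, nullptr);
    std::printf("c[0] = %f\n", c[0]);  // expect 3.0
    return 0;
}
```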

As for the HSA runtime 230 itself: to integrate hardware platforms of different architectures such as CPUs, GPUs, and DSPs, the HSA Foundation proposed the Heterogeneous System Architecture (HSA) software architecture. Just as OpenCL provides a common software development framework for parallel computing, HSA aims to provide a common hardware interface. Unlike OpenCL, which standardizes a unified application development interface, HSA standardizes a unified hardware operation interface, simplifying the development of the bridging interface between upper layers (such as OpenCL) and the underlying hardware.

In this embodiment, to provide OpenCL kernel applications with the special computation instructions supported by the general-purpose graphics processor 100, a device library 250 is additionally provided for use with the compiler 240. The device library 250 includes an OCKL module 251, an OCML module 252, and an OpenCL module 253. The OCKL module 251 is configured to provide the application programming interface for the parameters a kernel needs at run time (for example, work-item ID, thread-block size, and thread-grid size). The OCML module 252 is configured to provide the application programming interface for mathematical operations. The OpenCL module 253 is configured to provide the OpenCL kernel application programming interface corresponding to the functions of the OCKL module 251 and the OCML module 252. Through the device library 250, the compiler 240 can provide the resources of the OpenCL kernel application programming interface so that developers can use its internal special computation instruction set.

Please refer to FIG. 3, which is a block diagram of the compiler 240 according to a preferred embodiment of the present invention. The compiler 240 may be implemented as a computer program and stored in a storage device. The storage device includes a non-transitory computer-readable recording medium or another device having a storage function. The computer program includes one or more computer-executable instructions, which may be executed by one or more processors to carry out the compilation operations of the compiler 240. Specifically, the compiler 240 may serve the general-purpose graphics processor in a computer system. The computer system includes a central processing unit, the general-purpose graphics processor, and a memory connected to the central processing unit. The compiler 240 may be stored in the memory and executed by the central processing unit to compile an application program executed by the general-purpose graphics processor 100 (for example, a kernel written in the OpenCL language) to generate the machine code (binary code) corresponding to that application program. The compiled machine code can then be executed by the streaming multiprocessors 120 of the general-purpose graphics processor 100 of FIG. 1, with thread dispatch and execution as described above, which is not repeated here. By function, the compiler 240 can be divided into a front-end module 310, an optimization module 320, and a back-end module 330. The front-end module 310 is configured to pre-process the source code corresponding to the application program to generate an intermediate representation (IR). The optimization module 320 is configured to optimize the intermediate code. The back-end module 330 is configured to translate the optimized intermediate code into assembly code and to call an assembler to translate the assembly code into machine code.

In this embodiment, the compiler 240 uses the LLVM architecture as its development platform. LLVM was designed with componentization as a goal: each compiler function is split into its own sub-module, so the compiler's core components can be shared across different languages and different target architectures. The transfer mechanism for intermediate data uses the intermediate language defined by LLVM (LLVM-IR), a platform-independent, high-level abstract intermediate code usable by both the front-end module 310 and the back-end module 330.

Specifically, the front-end module 310 is responsible for language-related processing. For example, the front-end module 310 can translate the source code to produce the internally required abstract syntax tree (AST) data structure, pre-process the source code, and then translate the processed source code to generate the aforementioned LLVM-IR for the back-end module 330. The pre-processing may include macro processing, static analysis, and so on. Macro processing covers language-specification functions such as macro expansion and constant-term handling. Static analysis examines characteristics of the code, such as program size, variable usage, program complexity, and performance.

In this embodiment, the front-end module 310 may be the Clang compiler, which produces the corresponding LLVM-IR. In one embodiment, Clang first performs the aforementioned pre-processing on the source code and then translates it through a token-based parser into Clang's own syntax tree, the Clang AST. After the Clang AST is generated, Clang can apply language-specific optimizations to it and convert the Clang AST into LLVM-IR.

The optimization module 320 can apply language-dependent optimizations to the LLVM-IR, such as constant pre-processing and conditional-expression optimization.

The back-end module 330 consolidates the instructions of the LLVM-IR produced by the front-end module 310 and the optimization module 320, and generates instructions and a file format executable by the target. In other words, the back-end module 330 translates the LLVM-IR to produce machine code/files executable by the streaming multiprocessors 120 in the general-purpose graphics processor 100.

In the present invention, for certain instructions contained in the intermediate code (that is, the LLVM-IR), the optimization module 320 of the compiler 240 performs further optimization, described below.

In one embodiment, when the intermediate code contains a branch instruction, the optimization module 320 can optimize it so that it is translated into corresponding machine code that performs the following operations: building a post-dominator tree for the branch instruction to find an immediate post-dominator (IPDOM) of the branch instruction as the reconvergence point of the instructions of the branch instruction's first path and second path; and inserting a specific instruction (for example, a jump instruction) in front of the reconvergence point, so that when the instructions of the first path of the branch instruction are executed, execution jumps, after the specific instruction on the first path completes, to the instructions of the second path of the branch instruction instead of continuing with the remaining instructions starting at the reconvergence point; only after the specific instruction on the second path has been executed does execution continue with the remaining instructions starting at the reconvergence point.
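
To make the analysis concrete, the following self-contained C++ sketch (our own illustration, not the patent's implementation) computes post-dominator sets and immediate post-dominators for a four-block control-flow graph shaped like the one in FIG. 4; it reports block C as the immediate post-dominator, and hence the reconvergence point, of both paths and of the branch itself:

```cpp
#include <bitset>
#include <cstdio>
#include <vector>

// CFG of FIG. 4: 0 = condition block, 1 = block A, 2 = block B, 3 = block C.
// succ[v] lists the successors of block v; block 3 is the exit.
const std::vector<std::vector<int>> succ = {{1, 2}, {3}, {3}, {}};
constexpr int N = 4;

int main() {
    // pdom[v] = set of blocks that post-dominate v. Standard iterative
    // data flow: pdom(v) = {v} U intersection of pdom(s) over successors s.
    std::vector<std::bitset<N>> pdom(N);
    for (int v = 0; v < N; ++v) pdom[v].set();   // start from "all blocks"
    pdom[3].reset(); pdom[3].set(3);             // exit post-dominates only itself

    for (bool changed = true; changed; ) {
        changed = false;
        for (int v = 0; v < N; ++v) {
            if (succ[v].empty()) continue;
            std::bitset<N> meet; meet.set();
            for (int s : succ[v]) meet &= pdom[s];  // intersect successors' sets
            meet.set(v);
            if (meet != pdom[v]) { pdom[v] = meet; changed = true; }
        }
    }

    // The immediate post-dominator of v is its closest strict post-dominator:
    // the one that every other strict post-dominator of v also post-dominates.
    auto ipdom = [&](int v) {
        for (int c = 0; c < N; ++c) {
            if (c == v || !pdom[v][c]) continue;
            bool immediate = true;
            for (int d = 0; d < N; ++d)
                if (d != v && d != c && pdom[v][d] && !pdom[c][d])
                    immediate = false;
            if (immediate) return c;
        }
        return -1;
    };
    std::printf("ipdom(A)=%d ipdom(B)=%d ipdom(branch)=%d\n",
                ipdom(1), ipdom(2), ipdom(0));   // all print 3 (block C)
}
```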

Please refer to FIG. 4, which is a schematic diagram of the operation of a branch instruction 400 according to an embodiment of the present invention. As shown in FIG. 4, a branch instruction means that different operations are executed conditionally. In the condition-judgment block 410, if condition A for executing block A 420 is met, execution proceeds down the first path where block A 420 lies; if condition B for executing block B 430 is met, execution proceeds down the second path where block B 430 lies. As described earlier, the general-purpose graphics processor 100 uses the SIMT architecture: the same instruction is executed by multiple streaming processors simultaneously, but on different data addresses. For a branch instruction, divergence occurs when different data lead to different post-branch target addresses; in the end, the inconsistent lane targets within the streaming processors make SIMT execution impossible. In this embodiment, the general-purpose graphics processor 100 executes diverged instructions in a masked-execution mode. Specifically, the general-purpose graphics processor 100 still executes diverged instructions in SIMT fashion, but uses a lane mask to decide which lanes (that is, the channels through which the warp scheduling module assigns threads to the streaming processors) are active, and decides according to the lane mask whether an execution result is to be written/stored into the cache/registers/memory; after that flow finishes, it switches to the other lane mask and continues execution.
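
For illustration, divergence of this kind arises from kernel code as simple as the following (our own example, written as plain C++; in OpenCL C it would be a __kernel and tid would come from get_global_id(0)):

```cpp
// With SIMT execution, threads whose data satisfy condition A enter block A
// while the rest enter block B — the warp diverges at the if.
void branchy_kernel(const float* data, float* out, int tid) {
    float x = data[tid];
    if (x >= 0.0f) {          // condition A -> first path (block A 420)
        out[tid] = x * 2.0f;
    } else {                  // condition B -> second path (block B 430)
        out[tid] = -x;
    }
    out[tid] += 1.0f;         // block C 450: both paths reconverge here
}
```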

Taking the branch instruction 400 of FIG. 4 as an example, assume the warp contains six threads, three of which satisfy condition A and are executed by the streaming processors that receive data through lanes 441, while the other three satisfy condition B and are executed by the streaming processors that receive data through lanes 442. For the streaming multiprocessor executing this warp, all six threads are still executed simultaneously by the six streaming processors connected to lanes 441 and 442, first running the instructions of the first path (comprising block A 420 and block C 450) while using the first lane mask. Consequently, after the instructions of the first path have executed, only the computation results of the data delivered through lanes 441 are written/stored into the cache/registers/memory, while the results of the data delivered through lanes 442 are discarded. Next, the six streaming processors connected to lanes 441 and 442 continue simultaneously with the instructions of the second path (comprising block B 430 and block C 450), this time using the second lane mask. After the instructions of the second path have executed, only the computation results of the data delivered through lanes 442 are written/stored into the cache/registers/memory, while the results of the data delivered through lanes 441 are discarded. In one embodiment, the first lane mask and the second lane mask may be data structures with one bit per lane; each bit corresponds to one lane, and the content of the bit determines whether the data of the corresponding lane are valid. For example, in the first lane mask the three bits corresponding to lanes 441 may all be high and the three bits corresponding to lanes 442 may all be low; in the second lane mask the three bits corresponding to lanes 441 may all be low and the three bits corresponding to lanes 442 may all be high. Only the results computed by lanes whose mask bits are high are valid; the results of lanes whose bits are low are invalid and are not written.
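
The following C++ sketch (an illustration built on the one-bit-per-lane assumption above, not the patent's hardware) models this masked write-back for the six-lane example, using 0b000111 as the first lane mask and 0b111000 as the second:

```cpp
#include <cstdint>
#include <cstdio>

constexpr int kLanes = 6;

// Execute one path for all lanes in lockstep (SIMT), but commit a lane's
// result only if its bit in the lane mask is set.
void run_path(uint32_t lane_mask, float (*body)(float),
              const float* in, float* regs) {
    for (int lane = 0; lane < kLanes; ++lane) {
        float result = body(in[lane]);        // every lane computes...
        if (lane_mask & (1u << lane))         // ...but only masked-in lanes
            regs[lane] = result;              // write back their result
    }
}

int main() {
    float in[kLanes] = {1, 2, 3, -1, -2, -3};
    float regs[kLanes] = {};
    uint32_t mask_a = 0b000111;   // lanes 0-2 took condition A (lanes 441)
    uint32_t mask_b = 0b111000;   // lanes 3-5 took condition B (lanes 442)

    run_path(mask_a, [](float x) { return x * 2.0f; }, in, regs); // block A
    run_path(mask_b, [](float x) { return -x; }, in, regs);       // block B

    for (float r : regs) std::printf("%g ", r);   // prints: 2 4 6 1 2 3
    return 0;
}
```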

In the example of FIG. 4, notice that for diverged execution, the instructions of block C 450 are executed twice, once on the first path and once on the second. If block C 450 is a large program, this substantially degrades the execution performance of the whole general-purpose graphics processor.

Please refer to FIG. 5 and FIG. 6 together. FIG. 5 is a schematic diagram of the post-dominator tree 500 built from the branch instruction 400 of FIG. 4, and FIG. 6 is a schematic diagram of the corresponding operations after the branch instruction 400 is translated according to a preferred embodiment of the present invention. In this embodiment, when the compiler of the present invention encounters the branch instruction 400 in the intermediate code during optimization, it can run a post-dominator tree analysis to build the post-dominator tree 500 shown in FIG. 5. From the post-dominator tree 500 it can be found that the post-dominators (PDOM) and the immediate post-dominator (IPDOM) of both block A 420 and block B 430 are block C 450, so block C 450 can be determined to be the reconvergence point after the divergence of branch instruction 400. A specific instruction (for example, a jump instruction) can then be inserted in front of block C 450, so that when execution of the instructions of block A 420 of the branch instruction 400 (that is, the instructions of the first path) reaches the specific instruction, execution transfers to the instructions of block B 430 of the branch instruction 400 (that is, the instructions of the second path) instead of continuing with the instructions of block C, namely the remaining instructions of the first path starting at the reconvergence point (including the instructions of the reconvergence point). When execution of the instructions of block B 430 reaches the specific instruction, the divergence of the branch instruction ends; at that point the lane mask can be cleared, so that the instructions after the specific instruction (that is, the instructions of block C 450) are executed simultaneously by the streaming processors connected to lanes 441 and 442, avoiding the duplicated execution and thereby improving the execution efficiency and performance of the general-purpose graphics processor 100.
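
Schematically, the effect of the inserted instruction can be captured by the following runnable C++ comparison (our own schematic; the comments stand in for the masked machine code of FIG. 6), which counts how many times block C executes under each scheme:

```cpp
#include <cstdio>

// Without the inserted jump, each masked pass runs its path *and* block C,
// so C executes twice; with the jump inserted in front of the reconvergence
// point, C executes once with the mask cleared and all lanes active.

int c_executions = 0;
void block_C() { ++c_executions; }

void naive_scheme() {
    c_executions = 0;
    /* mask = A-lanes */ /* run block A */ block_C();  // first pass: A then C
    /* mask = B-lanes */ /* run block B */ block_C();  // second pass: B then C again
    std::printf("naive: block C ran %d times\n", c_executions);          // 2
}

void reconverging_scheme() {
    c_executions = 0;
    /* mask = A-lanes */ /* run block A */ ;  // inserted jump: skip C, go to B
    /* mask = B-lanes */ /* run block B */ ;  // inserted jump: end divergence
    /* mask cleared  */ block_C();            // C runs once, all lanes active
    std::printf("reconverging: block C ran %d times\n", c_executions);   // 1
}

int main() { naive_scheme(); reconverging_scheme(); }
```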

In one embodiment, when the intermediate code contains a function-call (call) instruction, the optimization module 320 can optimize it so that it is translated into corresponding machine code that performs the following operation: expanding all content of the function called by the call instruction (the callee) inline, directly in the caller that uses the call instruction. Because call instructions introduce complicated divergence problems, they raise hardware cost and hurt efficiency. Therefore, when the compiler 240 of the present invention processes call-related instructions, it directly inserts the body of the designated function in place of every call site, that is, it fully expands the content of the called function inside the caller, so as to avoid divergence and save the extra overhead of every function call.
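
A minimal before/after view of this inline expansion, using example code of our own:

```cpp
// Before inlining: each call site pays call/return overhead and, on a SIMT
// machine, risks divergence inside the callee.
float scale(float x) { return x * 0.5f + 1.0f; }   // callee

float before(const float* v, int i) {
    return scale(v[i]) + scale(v[i + 1]);           // two call instructions
}

// After inlining: the callee's body replaces each call site, leaving
// straight-line code with no call instructions at all.
float after(const float* v, int i) {
    float a = v[i] * 0.5f + 1.0f;       // body of scale(), expanded in place
    float b = v[i + 1] * 0.5f + 1.0f;   // body of scale(), expanded in place
    return a + b;
}
```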

In one embodiment, when the intermediate code contains loop instructions (for example, loop instructions, for instructions, and so on), the optimization module 320 can optimize them so that they are translated into corresponding machine code that performs the following operations: analyzing the loop instruction for the number of iterations; and fully unrolling the instructions executed inside the loop according to that number. Because branch instructions cause divergence, a streaming multiprocessor facing a branch instruction blocks the dispatch of all instructions after the branch, executes the branch only after the instructions already in the pipeline have completed, and can continue dispatching subsequent instructions only after jumping to the designated target, lowering pipeline utilization. To reduce the number of instructions a branch requires, this embodiment uses loop unrolling to fully expand the instructions inside a loop instruction according to its iteration count, as far as resources allow, thereby lowering the fraction of branch instructions executed inside loops.
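
A minimal before/after view of full loop unrolling for a loop whose trip count is known at compile time (again, our own example code):

```cpp
// Before: a 4-iteration loop executes a compare-and-branch every iteration,
// which stalls dispatch in the SIMT pipeline as described above.
float sum_looped(const float* v) {
    float s = 0.0f;
    for (int i = 0; i < 4; ++i)   // trip count 4 is known at compile time
        s += v[i];
    return s;
}

// After full unrolling: the loop body is replicated 4 times and the
// compare-and-branch disappears entirely.
float sum_unrolled(const float* v) {
    float s = 0.0f;
    s += v[0];
    s += v[1];
    s += v[2];
    s += v[3];
    return s;
}
```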

In summary, the general-purpose graphics processor provided by the present invention is accompanied by an execution platform designed according to the OpenCL specification and a corresponding OpenCL LLVM compiler, thereby providing application programming interfaces that conform to and support OpenCL/TensorFlow. In addition, by compiling the above branch-related instructions, call instructions, loop instructions, and so on with the corresponding optimizations, the software stack cooperates better with the operation of the hardware and achieves a substantial overall performance improvement, providing developers with a convenient open-source execution environment.

Although the present invention has been disclosed above through preferred embodiments, they are not intended to limit the invention. Anyone skilled in the art may make various changes and modifications without departing from the spirit and scope of the present invention; the scope of protection of the present invention shall therefore be defined by the appended claims.

100 general-purpose graphics processor
110 interconnection network module
120 streaming multiprocessor
121 warp scheduling module
122 streaming processor
130 work scheduling module
140 memory
210 TensorFlow runtime
220 OpenCL runtime
230 HSA runtime
240 compiler
250 device library
251 OCKL module
252 OCML module
253 OpenCL module
310 front-end module
320 optimization module
330 back-end module
400 branch instruction
410 condition-judgment block
420 block A
430 block B
441, 442 lanes
450 block C
500 post-dominator tree

FIG. 1 is a block diagram of a graphics processor according to a preferred embodiment of the present invention. FIG. 2 is a schematic diagram of the software stack of a general-purpose graphics processor according to a preferred embodiment of the present invention. FIG. 3 is a block diagram of a compiler according to a preferred embodiment of the present invention. FIG. 4 is a schematic diagram of the operation of a branch instruction according to an embodiment of the present invention. FIG. 5 is a schematic diagram of the post-dominator tree built from the branch instruction of FIG. 4. FIG. 6 is a schematic diagram of the corresponding operations after translation of a branch instruction according to a preferred embodiment of the present invention.

240 compiler
310 front-end module
320 optimization module
330 back-end module

Claims (12)

1. A compiler configured to compile an application program to be executed by a graphics processor so as to generate a machine code corresponding to the application program for execution by a plurality of streaming multiprocessors in the graphics processor, wherein the compiler comprises: a front-end module configured to perform a pre-processing on a source code corresponding to the application program to generate an intermediate code; an optimization module configured to perform an optimization processing on the intermediate code; and a back-end module configured to perform a translation processing on the optimized intermediate code to generate the machine code; wherein the optimization processing comprises translating each branch instruction in the intermediate code into performing the following operations: building a post dominator tree for the branch instruction to find an immediate post dominator of the branch instruction as a convergent node of the instructions of a first path and the instructions of a second path of the branch instruction; and inserting a specific instruction at the front end of the convergent node, such that when the instructions of the first path of the branch instruction are executed, after the specific instruction on the first path has been executed, execution jumps to the instructions of the second path of the branch instruction, and the instructions starting from the convergent node are executed only after the specific instruction on the second path has been executed; wherein the optimization processing further comprises translating each function-call instruction in the intermediate code into performing the following operation: expanding all contents of the function called by the function-call instruction inline directly in the caller that uses the function-call instruction.

2. The compiler of claim 1, wherein the branch instruction is executed simultaneously by a plurality of stream processors included in the one of the streaming multiprocessors to which the branch instruction is assigned, wherein the instructions of the first path are executed simultaneously by a plurality of first stream processors and a plurality of second stream processors among the stream processors using a first thread mask, and the instructions of the second path are executed simultaneously by the first stream processors and the second stream processors using a second thread mask.

3. The compiler of claim 2, wherein when the specific instruction on the first path has been executed, only the results executed by the first stream processors are stored, and when the specific instruction on the second path has been executed, only the results executed by the second stream processors are stored.

4. The compiler of claim 2, wherein when the instructions of the first path of the branch instruction are executed, use of the first thread mask ends after the specific instruction is executed; and when the instructions of the second path of the branch instruction are executed, use of the second thread mask ends after the specific instruction is executed.

5. The compiler of claim 1, wherein the optimization processing further comprises translating each loop instruction in the intermediate code into performing the following operations: analyzing the number of iterations of the loop instruction; and fully unrolling the instructions executed within the loop instruction according to the number of iterations.

6. The compiler of claim 1, wherein the front-end module is a clang compiler configured to generate the intermediate code defined by the low-level virtual machine (LLVM).

7. The compiler of claim 6, wherein the pre-processing comprises macro processing, static analysis, and generating a syntax tree corresponding to the source code.

8. A non-transitory computer-readable storage medium configured to store a plurality of instructions which, when executed by a processor in a computer system, cause the processor to perform a compilation method to compile an application program to be executed by a graphics processor in the computer system so as to generate a machine code corresponding to the application program for execution by a plurality of streaming multiprocessors in the graphics processor, the compilation method comprising: performing a pre-processing on a source code corresponding to the application program to generate an intermediate code; performing an optimization processing on the intermediate code; and performing a translation processing on the optimized intermediate code to generate the machine code; wherein the optimization processing comprises translating each branch instruction in the intermediate code into performing the following operations: building a post dominator tree for the branch instruction to find an immediate post dominator of the branch instruction as a convergent node of the instructions of a first path and the instructions of a second path of the branch instruction; and inserting a specific instruction at the front end of the convergent node, such that when the instructions of the first path of the branch instruction are executed, after the specific instruction on the first path has been executed, execution jumps to the instructions of the second path of the branch instruction, and the instructions starting from the convergent node are executed only after the specific instruction on the second path has been executed; wherein the optimization processing further comprises translating each function-call instruction in the intermediate code into performing the following operation: expanding all contents of the function called by the function-call instruction inline directly in the caller that uses the function-call instruction.

9. The non-transitory computer-readable storage medium of claim 8, wherein the branch instruction is executed simultaneously by a plurality of stream processors included in the one of the streaming multiprocessors to which the branch instruction is assigned, wherein the instructions of the first path are executed simultaneously by a plurality of first stream processors and a plurality of second stream processors among the stream processors using a first thread mask, and the instructions of the second path are executed simultaneously by the first stream processors and the second stream processors using a second thread mask.

10. The non-transitory computer-readable storage medium of claim 9, wherein when the specific instruction on the first path has been executed, only the results executed by the first stream processors are stored, and when the specific instruction on the second path has been executed, only the results executed by the second stream processors are stored.

11. The non-transitory computer-readable storage medium of claim 9, wherein when the instructions of the first path of the branch instruction are executed, use of the first thread mask ends after the specific instruction is executed; and when the instructions of the second path of the branch instruction are executed, use of the second thread mask ends after the specific instruction is executed.

12. The non-transitory computer-readable storage medium of claim 8, wherein the optimization processing further comprises translating each loop instruction in the intermediate code into performing the following operations: analyzing the number of iterations of the loop instruction; and fully unrolling the instructions executed within the loop instruction according to the number of iterations.
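The convergent node recited in claims 1 and 8 is the immediate post dominator of the branch: the first point that every path leaving the branch must eventually pass through. Below is a minimal sketch of how such a node can be found on a small control-flow graph with an iterative post-dominator set computation; the block names and the data-flow formulation are illustrative assumptions, not the patent's implementation, and the sketch assumes every block reaches the exit block.

```python
# Minimal sketch (illustrative, not the patented implementation):
# compute post-dominator sets on a small CFG, then take the branch's
# immediate post dominator as the convergent node of its two paths.
# Assumes every block reaches exit_block and only exit_block has no
# successors.

def post_dominators(cfg, exit_block):
    """cfg maps each block name to its list of successor blocks."""
    blocks = set(cfg)
    pdom = {b: set(blocks) for b in blocks}   # start from "all blocks"
    pdom[exit_block] = {exit_block}
    changed = True
    while changed:                            # iterate to a fixed point
        changed = False
        for b in blocks - {exit_block}:
            new = {b} | set.intersection(*(pdom[s] for s in cfg[b]))
            if new != pdom[b]:
                pdom[b] = new
                changed = True
    return pdom

def immediate_post_dominator(pdom, block):
    """The strict post dominator closest to `block`."""
    strict = pdom[block] - {block}
    for p in strict:
        # p is immediate iff every other strict post dominator of
        # `block` also post-dominates p.
        if all(q in pdom[p] for q in strict):
            return p
    return None

# Hypothetical diamond-shaped CFG for a two-path branch.
cfg = {
    "branch": ["then", "else"],
    "then": ["join"],
    "else": ["join"],
    "join": ["exit"],
    "exit": [],
}
pdom = post_dominators(cfg, "exit")
print(immediate_post_dominator(pdom, "branch"))   # -> "join"
```

In the claims' terms, the specific instruction would then be inserted at the front end of `join`, so the first path hands control to the second path before either path runs past the convergent node.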
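Claims 2 through 4 (and 9 through 11) describe how both paths of a divergent branch are stepped through by the same set of stream processors under two thread masks, with only the masked lanes storing results on each path. The following is a minimal software simulation of that SIMT behavior; the function names and the lane model are hypothetical illustrations, since real hardware applies the masks inside the streaming multiprocessor rather than in Python.

```python
# Minimal sketch (not the patented hardware) of SIMT-style divergence:
# every lane executes both paths, but a thread mask decides whose
# results are stored; at the convergent node both masks are released.

def simt_branch(threads, cond, then_fn, else_fn):
    first_mask = [cond(t) for t in threads]       # first thread mask
    second_mask = [not m for m in first_mask]     # second thread mask
    results = [None] * len(threads)
    # First path: all lanes execute, only masked lanes store results.
    for i, t in enumerate(threads):
        out = then_fn(t)
        if first_mask[i]:
            results[i] = out
    # Jump to the second path; again only masked lanes store results.
    for i, t in enumerate(threads):
        out = else_fn(t)
        if second_mask[i]:
            results[i] = out
    # Convergent node: masks end, all lanes continue together.
    return results

print(simt_branch(range(8), lambda t: t % 2 == 0,
                  lambda t: t * 10, lambda t: t + 100))
# -> [0, 101, 20, 103, 40, 105, 60, 107]
```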
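The final clause of claims 1 and 8 expands every called function inline in its caller, removing call and return overhead and leaving a single flat instruction stream for the branch and loop transformations to work on. A small before-and-after illustration follows; `scale`, `kernel_with_call`, and `kernel_inlined` are made-up names used only to show that the transformation preserves results.

```python
# Illustrative sketch of inline expansion (assumed example names).

def scale(v, k):                    # callee
    return v * k

def kernel_with_call(x):
    return scale(x, 3) + 1          # function-call instruction

def kernel_inlined(x):
    return (x * 3) + 1              # callee body expanded in the caller

# The inlined form computes the same value with no call at the site.
assert kernel_with_call(5) == kernel_inlined(5) == 16
```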
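Claims 5 and 12 fully unroll each loop whose iteration count the compiler can determine, replacing the loop control with straight-line copies of the body. The sketch below illustrates the effect on a loop with a known trip count of four; it shows the semantics of the transformation, not the compiler's actual pass.

```python
# Illustrative sketch of full loop unrolling with a known trip count.

def summed_with_loop(a):            # a is assumed to have 4 elements
    s = 0
    for i in range(4):              # loop instruction, trip count = 4
        s += a[i]
    return s

def summed_unrolled(a):
    s = 0
    s += a[0]                       # body copy 1
    s += a[1]                       # body copy 2
    s += a[2]                       # body copy 3
    s += a[3]                       # body copy 4
    return s

assert summed_with_loop([1, 2, 3, 4]) == summed_unrolled([1, 2, 3, 4]) == 10
```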
TW109146968A 2020-12-30 2020-12-30 Compiler adapted in graph processing unit and non-transitory computer-readable medium TWI776338B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
TW109146968A TWI776338B (en) 2020-12-30 2020-12-30 Compiler adapted in graph processing unit and non-transitory computer-readable medium
US17/214,965 US11567745B2 (en) 2020-12-30 2021-03-29 Compiler adapted in graphics processing unit and non-transitory computer-readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW109146968A TWI776338B (en) 2020-12-30 2020-12-30 Compiler adapted in graph processing unit and non-transitory computer-readable medium

Publications (2)

Publication Number Publication Date
TW202225953A TW202225953A (en) 2022-07-01
TWI776338B true TWI776338B (en) 2022-09-01

Family

ID=82119065

Family Applications (1)

Application Number Title Priority Date Filing Date
TW109146968A TWI776338B (en) 2020-12-30 2020-12-30 Compiler adapted in graph processing unit and non-transitory computer-readable medium

Country Status (2)

Country Link
US (1) US11567745B2 (en)
TW (1) TWI776338B (en)

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5179702A (en) * 1989-12-29 1993-01-12 Supercomputer Systems Limited Partnership System and method for controlling a highly parallel multiprocessor using an anarchy based scheduler for parallel execution thread scheduling
IL100990A (en) * 1991-02-27 1995-10-31 Digital Equipment Corp Multilanguage optimizing compiler using templates in multiple pass code generation
IL100989A (en) * 1991-02-27 1995-10-31 Digital Equipment Corp Analyzing inductive expressions in a multilanguage optimizing compiler
US5966539A (en) * 1994-03-01 1999-10-12 Digital Equipment Corporation Link time optimization with translation to intermediate program and following optimization techniques including program analysis code motion live variable set generation order analysis, dead code elimination and load invariant analysis
CA2200812A1 (en) * 1997-03-24 1998-09-24 Archambault, Roch George Optimizing compilation of pointer variables
JP3220055B2 (en) * 1997-07-17 2001-10-22 松下電器産業株式会社 An optimizing device for optimizing a machine language instruction sequence or an assembly language instruction sequence, and a compiler device for converting a source program described in a high-level language into a machine language or an assembly language instruction sequence.
US6317871B1 (en) * 1997-07-18 2001-11-13 Compaq Computer Corporation System for ensuring the accuracy of file structures in a source-to-source computer program translator
US6260190B1 (en) * 1998-08-11 2001-07-10 Hewlett-Packard Company Unified compiler framework for control and data speculation with recovery code
US6751792B1 (en) * 2000-10-04 2004-06-15 Sun Microsystems, Inc. Using value-expression graphs for data-flow optimizations
US7254809B2 (en) * 2003-07-30 2007-08-07 International Business Machines Corporation Compilation of unified parallel C-language programs
US8789032B1 (en) * 2009-02-27 2014-07-22 Google Inc. Feedback-directed inter-procedural optimization
US9378003B1 (en) * 2009-07-23 2016-06-28 Xilinx, Inc. Compiler directed cache coherence for many caches generated from high-level language source code
US8387036B2 (en) * 2010-01-27 2013-02-26 Oracle America, Inc. Method and system for execution profiling using loop count variance
US9195458B2 (en) * 2013-07-31 2015-11-24 International Business Machines Corporation System and/or method for computing interprocedural dominators
US10769016B2 (en) * 2014-02-26 2020-09-08 Pure Storage, Inc. Storing a plurality of correlated data in a dispersed storage network
US9594668B1 (en) * 2015-09-04 2017-03-14 International Business Machines Corporation Debugger display of vector register contents after compiler optimizations for vector instructions
US11455153B2 (en) * 2019-03-18 2022-09-27 Advanced Micro Devices, Inc. Dynamic instances semantics
US10802806B1 (en) * 2019-03-29 2020-10-13 Advanced Micro Devices, Inc. Generating vectorized control flow using reconverging control flow graphs
US11018672B1 (en) 2019-12-27 2021-05-25 Kepler Computing Inc. Linear input and non-linear output majority logic gate

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020095667A1 (en) * 2000-09-27 2002-07-18 Archambault Roch Georges Optimizing compilation by forward store movement
US20070234276A1 (en) * 2006-03-31 2007-10-04 Intel Corporation Method, system, and program of a compiler to parallelize source code
CN112035397A (en) * 2019-06-04 2020-12-04 三星电子株式会社 Electronic system including FPGA and method of operating the same

Also Published As

Publication number Publication date
US20220206768A1 (en) 2022-06-30
TW202225953A (en) 2022-07-01
US11567745B2 (en) 2023-01-31

Similar Documents

Publication Publication Date Title
US11216258B2 (en) Direct function call substitution using preprocessor
JP2738692B2 (en) Parallel compilation method
Rubinsteyn et al. Parakeet: A just-in-time parallel accelerator for Python
JP2013533533A (en) Workload distribution and parallelization within a computing platform
TWI806550B (en) Processor operation method, related computer system, and non-transitory computer-accessible storage medium
US7181730B2 (en) Methods and apparatus for indirect VLIW memory allocation
EP2815313B1 (en) Rasterization of compute shaders
CN109933327B (en) OpenCL compiler design method and system based on code fusion compiling framework
Noaje et al. Source-to-source code translator: OpenMP C to CUDA
Su et al. Automatic generation of fast BLAS3-GEMM: A portable compiler approach
Benkner et al. High-level support for pipeline parallelism on many-core architectures
TWI776338B (en) Compiler adapted in graph processing unit and non-transitory computer-readable medium
CN116861359A (en) Operator fusion method and system for deep learning reasoning task compiler
Andrade et al. ParallelME: A parallel mobile engine to explore heterogeneity in mobile computing architectures
CN107203406B (en) Processing method for distributed storage structure
Lin et al. Enable OpenCL compiler with Open64 infrastructures
CN113721899A (en) GPDSP-oriented lightweight efficient assembly code programming method and system
JP2004240953A (en) Computer system, its simultaneous multithreading method, and cache controller system
Acosta et al. Paralldroid: Performance analysis of gpu executions
Leupers Compiler optimization for media processors
Agathos et al. Compiler-assisted, adaptive runtime system for the support of OpenMP in embedded multicores
Pedersen et al. Resumable Java Bytecode - Process Mobility for the JVM.
Li et al. Gpu-s2s: a compiler for source-to-source translation on gpu
Malik et al. Tandem virtual machine—An efficient execution platform for GALS language SystemJ
Kim et al. Automatic H.264 encoder synthesis for the Cell processor from a target-independent specification

Legal Events

Date Code Title Description
GD4A Issue of patent certificate for granted invention patent