TW202328905A - General-purpose graphics processor core function configuration setting prediction system and method having a set screening module, a resource occupancy screening module, an execution module, a training data extraction module, and a modeling module - Google Patents


Info

Publication number
TW202328905A
Authority
TW
Taiwan
Prior art keywords
configuration
execution
general
module
graphics processor
Prior art date
Application number
TW111100267A
Other languages
Chinese (zh)
Other versions
TWI782845B (en)
Inventor
郭錦福
莊宗儒
Original Assignee
國立高雄大學
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 國立高雄大學
Priority to TW111100267A priority Critical patent/TWI782845B/en
Application granted granted Critical
Publication of TWI782845B publication Critical patent/TWI782845B/en
Publication of TW202328905A publication Critical patent/TW202328905A/en


Abstract

The present invention provides a configuration setting prediction system and method for kernel functions of a general-purpose graphics processor, applicable to the general-purpose graphics processor. The system includes: a set screening module that performs a preliminary screening of the parameter sets of the configuration settings of the kernel function of the general-purpose graphics processor, the screened parameter sets serving as input parameters; a resource occupancy screening module that performs a secondary screening of the input parameters using the resource occupancy values of warps, registers, and shared memory in the streaming multiprocessor to obtain first execution configuration settings; an execution module that executes the first execution configuration settings on the general-purpose graphics processor to obtain execution times; a training data extraction module that extracts, according to the execution times, second execution configuration settings from the first execution configuration settings as training data; and a modeling module that uses the training data to build a configuration setting prediction model for the kernel function of the general-purpose graphics processor. The present invention can therefore predict the configuration setting of the kernel function quickly and accurately.

Description

System and method for configuration setting prediction of general-purpose graphics processor kernel functions

The present invention relates to a configuration setting prediction system and method for kernel functions of a general-purpose graphics processor, and in particular to such a system and method that uses machine learning to build a supervised learning model.

With the development of artificial intelligence, machine learning, and deep learning, these technologies demand large amounts of computing power, and models are becoming increasingly complex. If only a central processing unit (CPU) is used for computation, training a model takes considerably longer. On the other hand, because the operations on the individual elements of a neural network layer are independent of one another and do not affect each other's results, training can be accelerated with a general-purpose graphics processing unit (General-Purpose Graphics Processing Unit, GPGPU; for brevity, GPU hereinafter), shortening the time required for the training process. A GPU has a highly parallel architecture: whereas a CPU has only a few or a few dozen cores, a GPU has thousands of CUDA (Compute Unified Device Architecture) cores that can operate in parallel. Although each individual core is not as powerful as a CPU core, which can execute more complex instructions and handle varied workloads, the sheer number of GPU cores greatly improves throughput for suitable computations.

However, the configuration setting affects how the GPU's underlying resources are used during execution. Although almost any configuration will allow a kernel function to run, achieving the best execution performance is far from easy. Finding the optimal configuration may require testing every possible combination of configuration settings. Moreover, different kernel functions behave differently and use very different resources and logic, so their optimal configurations also differ. Even for the same kernel function, the optimal configuration may change across GPUs or across input sizes. A brute-force enumeration, testing every optimization technique and factor, can certainly find the best configuration among all combinations, but its time and space cost is prohibitive, making it impossible to obtain the best configuration quickly at execution time.

Under these circumstances, to improve GPU computing performance the prior art estimates the resource occupancy of the GPU hardware while a kernel function executes. From the usage of three limited GPU resources (warps, i.e. thread groups, registers, and shared memory), it computes the proportion of warps in use on a streaming multiprocessor (SM). When the GPU executes a kernel function, these three limited resources determine how many warps can reside on an SM at the same time, and the number of warps that an SM can execute concurrently has a fixed upper limit. Resource occupancy can therefore be calculated as "estimated number of warps / maximum number of warps", which roughly reflects parallel efficiency.
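The occupancy estimate above ("estimated number of warps / maximum number of warps") can be sketched as follows. The per-SM limits used here are illustrative values only (they match some NVIDIA architectures but differ across devices); a real implementation would query the device:

```python
def occupancy(tb_size, regs_per_thread, smem_per_block,
              max_warps_per_sm=64, max_regs_per_sm=65536,
              max_smem_per_sm=98304, warp_size=32,
              max_blocks_per_sm=32):
    """Estimate SM occupancy = resident warps / warp limit.

    Resident blocks are limited by whichever of the three
    resources (warps, registers, shared memory) runs out first.
    """
    warps_per_block = -(-tb_size // warp_size)  # ceil division
    by_warps = max_warps_per_sm // warps_per_block
    by_regs = max_regs_per_sm // (regs_per_thread * tb_size)
    by_smem = (max_smem_per_sm // smem_per_block
               if smem_per_block else max_blocks_per_sm)
    blocks = min(by_warps, by_regs, by_smem, max_blocks_per_sm)
    return blocks * warps_per_block / max_warps_per_sm
```

With 256 threads per block and 32 registers per thread, the register and warp limits both allow 8 resident blocks, giving full occupancy; doubling the register use halves it.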

However, when weighing the utilization of the three limited resources (warps, registers, and shared memory) across configuration settings, a high occupancy in a single resource does not necessarily mean better performance, so considering warp utilization alone also fails to reliably identify the better-performing configuration settings.

On the other hand, a kernel function may process data of many possible sizes, and the combinations of configuration settings it can adopt may also be numerous, so exhaustively testing all data sizes together with all their corresponding configuration settings is time-consuming and infeasible. Furthermore, in the prior art the authors proposed performance and energy optimization for single and multiple kernel functions, selecting higher-occupancy configurations for testing so as to avoid testing all possible configurations; however, they still applied brute-force testing to the remaining configurations, and occupancy-based screening can misjudge, eliminating configurations that perform better but have lower occupancy. [Prior art documents] [Non-patent literature]

[Non-Patent Document 1] K. Iliakis, S. Xydis, and D. Soudris, "LOOG: Improving GPU Efficiency With Light-weight Out-Of-Order Execution," IEEE Computer Architecture Letters, vol. 18, no. 2, pp. 166-169, July-Dec. 2019. [Non-Patent Document 2] J. Guerreiro, A. Ilic, N. Roma, and P. Tomás, "Multi-kernel Auto-Tuning on GPUs: Performance and Energy-Aware Optimization," 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 2015.

[Technical problem to be solved by the invention]

Accordingly, the present invention proposes a two-stage mechanism, offline and online, that can quickly and accurately suggest, at the moment a kernel function is to be executed, a configuration setting that completes the computation in less execution time. [Technical means]

In the offline stage, the present invention first performs a preliminary screening of the parameter sets, then jointly considers the resource occupancy of each configuration setting, keeping the configuration settings likely to perform well; the kernel function is then executed under each of these configuration settings and its execution time is recorded. Afterwards, the training data needed to build the model are extracted. In the online stage, the present invention uses a machine learning method to build a prediction model that recommends a configuration setting with better performance.

Specifically, the present invention provides a configuration setting prediction system for kernel functions of a general-purpose graphics processor, applicable to the general-purpose graphics processor, comprising:
a set screening module, which performs a preliminary screening of the parameter sets of the configuration settings of the kernel function of the general-purpose graphics processor and takes the screened parameter sets as input parameters;
a resource occupancy screening module, which performs a secondary screening of the input parameters using the resource occupancy values of warps, registers, and shared memory in the streaming multiprocessor, yielding first execution configuration settings;
an execution module, which executes the first execution configuration settings on the general-purpose graphics processor to obtain execution times;
a training data extraction module, which, according to the execution times, extracts second execution configuration settings from the first execution configuration settings as training data; and
a modeling module, which uses the training data to build a configuration setting prediction model for the kernel function of the general-purpose graphics processor.

Further, the parameter sets may include a data size set, a required configuration set, and an optional configuration set.

Further, for each data size set, the training data extraction module may extract from the required configuration set and the optional configuration set according to the execution time, and use the data size set together with the extracted required configuration set and optional configuration set as training data.

Further, the modeling module may build the model with the K-nearest-neighbors algorithm or the logistic regression algorithm.
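As a rough illustration of this modeling step, the following is a from-scratch K-nearest-neighbors sketch (kept dependency-free rather than relying on a particular library); the feature vectors (input dimensions) and labels (thread-block sizes) are hypothetical examples, not data from the invention:

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict a configuration label for `query` by majority
    vote among the k nearest training samples (squared
    Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_X, train_y)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: (width, height) -> best TB size.
train_X = [(64, 64), (128, 128), (1024, 1024), (2048, 2048)]
train_y = [128, 128, 256, 256]
```

A query near the small inputs is assigned the small-input label, and a query near the large inputs the large-input one.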

Further, the set screening module may preliminarily screen the data size set by setting an interval on the element attributes of the data size set.

Further, the set screening module may preliminarily screen the required configuration set by selecting the number of threads per thread block in multiples of 32.

Further, the set screening module may further optimize the optional configuration set; the optimization method may be tiling or shared memory padding.

Further, the set screening module may preliminarily screen the optional configuration set as follows: select several data size sets; within their ranges pick sub-ranges from the front, middle, and rear of those ranges; test those sub-ranges against all the required configuration sets and all the optional configuration sets; and select the optional configuration sets that appear in the majority of the better-performing execution results.

The present invention also provides a configuration setting prediction method for kernel functions of a general-purpose graphics processor, applicable to the general-purpose graphics processor, comprising:
performing a preliminary screening of the parameter sets of the configuration settings of the kernel function of the general-purpose graphics processor, the screened parameter sets serving as input parameters;
performing a secondary screening of the input parameters using the resource occupancy values of warps, registers, and shared memory in the streaming multiprocessor to obtain first execution configuration settings;
executing the first execution configuration settings on the general-purpose graphics processor to obtain execution times;
extracting, according to the execution times, second execution configuration settings from the first execution configuration settings as training data;
building a configuration setting prediction model for the kernel function of the general-purpose graphics processor from the training data; and
predicting a configuration setting with the prediction model.

Further, the parameter sets may include a data size set, a required configuration set, and an optional configuration set. [Effects of the invention]

With the configuration setting prediction system and method of the present invention for kernel functions of a general-purpose graphics processor, the resource occupancy values of warps, registers, and shared memory in the streaming multiprocessor are jointly considered as screening criteria, allowing the execution performance of configuration settings to be estimated more accurately.

Moreover, through the preliminary screening of the parameter sets in the offline stage, the cost incurred in the configuration setting prediction process can be effectively reduced according to the user's parameters.

Moreover, with the prediction model of the online stage, the present invention can reduce the number of accesses to global memory, greatly reducing the time spent fetching data from memory, thereby improving the runtime efficiency of the kernel function and quickly recommending appropriate required configuration settings.

Moreover, by first executing different data sizes under the relevant configuration settings to obtain execution times, and then importing these existing data into the prediction model in the online stage, the present invention enables the prediction model to recommend a configuration setting with a shorter execution time effectively and quickly.

Moreover, if multiple parameter sets are further adopted, the present invention can reduce the number of data sizes tested in the offline stage, and in the online stage it can also recommend an appropriate execution configuration according to the data size.

In summary, the present invention proposes a system and method that predict configuration settings quickly and accurately; the predicted configuration settings have small error and better execution performance.

The configuration setting prediction system and method for kernel functions of a general-purpose graphics processor of the present invention are described below through exemplary embodiments. It should be noted that the following exemplary embodiments serve only to illustrate the present invention and do not limit its scope.

FIG. 1 is a block diagram of the configuration setting prediction system for kernel functions of a general-purpose graphics processor, and FIG. 2 is a flowchart of the configuration setting prediction method for kernel functions of a general-purpose graphics processor. The prediction method of the present invention is divided into two stages, offline and online. The offline stage analyzes the execution times obtained under the various configuration setting combinations for different data sizes, and screens the sources of the training data used for model building in the online stage; the online stage extracts the features and labels of the training data needed to build the model and builds the prediction model from those data. [Offline stage]

The offline stage is described first. As shown in FIG. 2, the offline stage is divided into three parts:
(1) preliminary screening of the parameter sets;
(2) resource occupancy screening;
(3) execution of the configuration settings.

In part (1), a kernel function may process data of many possible sizes, and the configuration setting combinations it can adopt may be numerous; for example, the parameter sets of a configuration setting may include three kinds of parameters: data size, required configuration, and optional configuration. If these parameter sets were not preliminarily screened, then with a data size set X = {x_1, x_2, x_3, ..., x_p}, a required configuration set M = {m_1, m_2, m_3, ..., m_q}, and an optional configuration set O = {o_1, o_2, o_3, ..., o_r} to be processed in the offline stage, |X| * |M| * |O| combinations would have to be tested; this number of combinations is extremely large and would cost a great deal of time and space. Therefore, the set screening module of the present invention filters out, based on the relevant characteristics of each parameter set, parameters that are unnecessary or cannot possibly yield an appropriate configuration, thereby limiting the range of possible appropriate configurations.
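The scale of the unscreened search space can be illustrated with purely hypothetical set sizes (the actual sizes depend on the kernel function and its parameter ranges):

```python
# Illustrative (hypothetical) set sizes before screening.
p, q, r = 1000, 1024, 8        # |X|, |M|, |O|
full_search = p * q * r        # every combination must be executed

# After screening (also hypothetical): a coarser data-size grid,
# TB sizes restricted to multiples of 32, and a single
# representative optional configuration.
p2, q2, r2 = 100, 32, 1
screened = p2 * q2 * r2

print(full_search, screened)
```

Even with these modest numbers, screening shrinks the search from millions of executions to a few thousand.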

In the following, the preliminary screening mechanism is illustrated using the three sets: data size, required configuration, and optional configuration. [Screening mechanism for the data size set]

The data size set X = {x_1, x_2, x_3, ..., x_p} has size |X| = p. Suppose each element has n attributes, so an element of the set is x_i = (x_{i,1}, x_{i,2}, ..., x_{i,n}). In principle each attribute of an element is unbounded, and the attributes may be mutually independent or dependent, depending on the kernel function implemented for the actual application.

For example, let the maximum and minimum limits of attribute j of the elements in the data size set be X_j^max and X_j^min respectively, so that X_j^min ≤ x_{i,j} ≤ X_j^max. To reduce the number of combinations, the value of attribute j of element x_i is restricted to the range between X_j^min and X_j^max, and since it represents a count of data items, x_{i,j} is a positive integer.

To reduce the number of data sizes and make them regular, the present invention selects the data sizes to be executed at fixed intervals. If the interval of attribute j is d_j, the number of values of attribute j drops from (X_j^max − X_j^min + 1) to ⌈(X_j^max − X_j^min + 1) / d_j⌉, and the size p of the data size set is correspondingly reduced by the same factor in each attribute. [Screening mechanism for the required configuration set]
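A minimal sketch of this fixed-interval sampling; the bounds and interval below are hypothetical values:

```python
def sample_attribute(lo, hi, step):
    """Keep every `step`-th value of an attribute in [lo, hi];
    this reduces the candidate count from (hi - lo + 1) to
    ceil((hi - lo + 1) / step)."""
    return list(range(lo, hi + 1, step))

# Hypothetical attribute range [32, 512] sampled at interval 96:
sizes = sample_attribute(32, 512, 96)
```

The 481 raw values collapse to ⌈481 / 96⌉ = 6 representatives.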

The required configuration set is M = {m_1, m_2, m_3, ..., m_q}, whose elements are m_k = (m_{k,1}, m_{k,2}); attributes m_{k,1} and m_{k,2} denote the number of thread blocks (TB Number) and the number of threads per thread block (TB Size), respectively.

Suppose the data size combination to be processed is known to be x_i = (x_{i,1}, x_{i,2}, ..., x_{i,n}) and its total operation count is Op_Total, the number of operations the kernel function must perform on data of this size. One operation is defined as two operands and one operator, and the operator need not be an atomic operation. Because different kernel functions have different program logic and behavior, the way the total operation count is computed also differs.

Take the convolution kernel function as an example. The work of a single convolution is to multiply each element of the filter array (Filter) by the corresponding element of the input image array (Input_Image) and sum the products. Each filter element performs one multiplication and one addition with the corresponding input element, so the operation count of a single convolution is the filter array size multiplied by 2. The filter then slides right and downward in order until the feature map array (Feature_Map) is complete, so the total operation count is the number of slides multiplied by the operation count of a single convolution.

Let Width_Input and Height_Input be the width and height of the input image, and Width_Filter and Height_Filter the width and height of the filter. The total operation count of the convolution is then 2 * (Width_Input − Width_Filter + 1) * (Height_Input − Height_Filter + 1) * (Width_Filter * Height_Filter).
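The operation-count formula can be written out directly (a valid, no-padding, stride-1 convolution, as the formula assumes):

```python
def conv_total_ops(w_in, h_in, w_f, h_f):
    """Total operation count of the convolution kernel:
    one multiply and one add per filter element (2 * w_f * h_f),
    at every sliding position of the filter over the input."""
    slides = (w_in - w_f + 1) * (h_in - h_f + 1)
    return 2 * slides * (w_f * h_f)
```

For a 5×5 input and a 3×3 filter there are 9 sliding positions of 18 operations each, 162 operations in total.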

Let the total operation count of the kernel function be Op_Total, the operation count handled by one thread block be Op_per_TB, and the operation count handled by each thread be Op_per_thread. The number of thread blocks is then m_{k,1} = Op_Total / Op_per_TB. Since the operations handled by the threads of one thread block amount to Op_per_thread × m_{k,2}, where Op_per_thread is the work of one thread and m_{k,2} is the number of threads per thread block, the relation m_{k,1} = Op_Total / (Op_per_thread × m_{k,2}) can be derived.

If another optimization method such as tiling is used, the workload of each thread increases, which in turn affects m_{k,1}. If the tiling setting is t, meaning that the implementation increases each thread's workload to t times the original, the relation becomes m_{k,1} = Op_Total / (t × Op_per_thread × m_{k,2}).
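The two relations for m_{k,1} can be combined in one helper. Rounding up is an added assumption here (so that every operation is covered when the division is not exact); the convolution numbers in the usage are taken from the 5×5 / 3×3 example above:

```python
def tb_number(op_total, op_per_thread, tb_size, tiling=1):
    """Thread-block count implied by the relation
    m_k1 = Op_Total / (t * Op_per_thread * m_k2),
    rounded up so all operations are covered."""
    per_block = tiling * op_per_thread * tb_size
    return -(-op_total // per_block)  # ceil division
```

With Op_Total = 162, Op_per_thread = 18, and 3 threads per block, 3 blocks are needed; a tiling factor of 3 triples each thread's work and cuts the block count to 1.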

For example, the present invention defines an upper limit M1^max and a lower limit M1^min for m_{k,1}, and an upper limit M2^max and a lower limit M2^min for m_{k,2}. According to the NVIDIA GPU specification (CUDA Toolkit Documentation v11.3.0), a TB can contain at most 1024 threads, so M2^max and M2^min are 1024 and 1 respectively; for the number of TBs, M1^max and M1^min are 2³¹ − 1 and 1 respectively.

Depending on the characteristics of the kernel function, the per-thread workload Op_per_thread is either fixed or not fixed. In the convolution kernel, each thread uses its own index to fetch the corresponding filter array elements and convolve them with the input image array elements, so every thread is assigned a fixed amount of work. In the parallel reduction kernel, by contrast, different threads handle different amounts of work during the computation, so Op_per_thread is not fixed.

Take a kernel function with a fixed operation count as an example. If Op_per_thread is fixed and no other optimization method is used, then when the kernel function processes a fixed data size, the total operation count Op_Total is fixed, so the relation between m_{k,1} and m_{k,2} is m_{k,1} = Op_Total / (Op_per_thread × m_{k,2}), in which Op_per_thread and Op_Total are known. Given either m_{k,1} or m_{k,2}, the other value can be computed, so the size q of the required configuration set is M2^max − M2^min + 1.

To reduce time and space costs, the number of elements of the required configuration set M is reduced as follows, by way of example. The GPU executes with the warp as its basic unit, and one warp contains at most 32 threads; if a warp holds fewer than 32 threads, streaming processors (SPs) may sit idle and performance may suffer, so the present invention selects m_{k,2} in multiples of 32. With this selection step, when Op_per_thread is fixed, the size q of the set M can be reduced from (M2^max − M2^min + 1) to ⌈(M2^max − M2^min + 1) / 32⌉; when it is not fixed, the size of M is reduced from (M1^max − M1^min + 1) × (M2^max − M2^min + 1) to (M1^max − M1^min + 1) × ⌈(M2^max − M2^min + 1) / 32⌉. [Screening mechanism for the optional configuration set]
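A sketch of the multiples-of-32 screening, assuming the CUDA limit of 1024 threads per block cited above:

```python
def tb_size_candidates(lo=1, hi=1024, warp=32):
    """Candidate thread-block sizes m_k2 screened to multiples
    of the warp size, shrinking (hi - lo + 1) raw options to
    about (hi - lo + 1) / warp."""
    first = ((lo + warp - 1) // warp) * warp  # round lo up
    return list(range(first, hi + 1, warp))

cands = tb_size_candidates()
# 1024 raw options reduced to 32 warp-aligned ones: 32, 64, ..., 1024
```
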

The optional configuration set is O = {o_1, o_2, o_3, ..., o_r}. Suppose s optimization techniques are available; each element then has s attributes, and an element of the set is defined as o_l = (o_{l,1}, o_{l,2}, ..., o_{l,s}). Attribute u of element o_l varies with the optimization methods used by different kernel functions.

Taking the convolution operation as an example, the following optimization methods may be used: tiling and shared memory padding. Optional configurations are very flexible and come in two forms: use-or-not switches and numeric settings. Suppose o_{l,1} denotes shared memory padding and o_{l,2} the tiling workload. Shared memory padding is either used or not, so attribute o_{l,1} has 2 possible values; tiling sets the thread workload to a multiple of the basic workload, where the basic workload is defined as not using tiling, or using tiling with the parameter set to 1.

In the following exemplary embodiments, the present invention does not define upper and lower limits for o_{l,u}, because the ranges of these attributes vary greatly across kernel functions.

By way of example, several data-size combinations can be selected for screening: within the data-size range to be tested, sub-ranges from the front, middle, and rear of the range are chosen as representatives of the trend. All necessary configuration combinations and all optional configuration combinations corresponding to these selected data sizes are then tested, the trends in the execution results are observed, and the optional configurations that appear in the majority of the better-performing results are selected as the optional configuration settings for subsequent tests, reducing the size r of the optional configuration set O from |O| to 1.

The way a representative optional configuration is chosen for each data size varies by kernel function. For example, one cannot expect all data sizes of the convolution kernel to perform well under the same optional configuration. During this kernel's computation the filter array is reused heavily, so it is placed in shared memory, whose low latency reduces execution time and improves performance. The baseline shared-memory usage therefore varies with the filter array: the larger the filter array, the more shared memory is consumed. If the same tiling workload setting as for a smaller filter array is used, the shared-memory demand may become too high, leaving the kernel with insufficient shared memory to execute. The method of choosing a representative optional configuration must therefore be adjusted according to the data-size combination.

For example, with the parallel reduction kernel, the optimization methods it uses do not affect shared-memory usage across different data-size combinations, so using a single representative optional configuration will not exhaust any particular hardware resource for some data sizes; that configuration can therefore be applied as the representative to all execution combinations. [Resource occupancy screening]

Part (2) of the offline stage is resource occupancy screening. The resource occupancy screening module computes the resource occupancy of every combination of elements from the filtered data-size set, the necessary configuration set, and the optional configuration set. Exploiting the property that high resource occupancy tends to correlate with better performance, it further eliminates combinations likely to perform poorly, reducing the number of combinations.

First, the terms and symbols needed to compute resource utilization are defined. Each SM has resource limits, such as the number of registers, the shared-memory size, the number of warps that can execute concurrently, and the number of threads that can execute concurrently; the present invention denotes these limits by MaxReg_SM, MaxSmem_SM, MaxWarp_SM, and MaxThread_SM, respectively.

Likewise, for a kernel at execution time, the number of threads per thread block is defined as NumT_perTB, the number of registers used per thread as NumReg_perThread, the shared-memory size used per thread block as NumSmem_perTB, and the number of thread blocks that can execute concurrently on a single SM as NumTB_perSM.

As an example, consider the case where a single kernel executes and not all of its thread blocks fit concurrently onto the GPU's SMs. The resource occupancies per SM are: (a) warp occupancy (Occupancy_Warp), the fraction of warp resources used in an SM; (b) register occupancy (Occupancy_Reg), the fraction of registers used in an SM; (c) shared-memory occupancy (Occupancy_Smem), the fraction of shared memory used in an SM. From these occupancy figures, the likely performance under a given configuration setting can be estimated.

The warp, register, and shared-memory occupancies of an SM are computed separately. Because an SM limits the number of warps, registers, and shared-memory bytes available, the number of thread blocks that can run concurrently is computed under each of these three limits, and the minimum of the three is taken as the number of thread blocks that can run on one SM (NumTB_perSM).

The per-limit counts of concurrently runnable thread blocks are computed as follows: (1) limited by warps, the count is NumTB_WarpLimited = MaxWarp_SM / ⌈NumT_perTB / 32⌉; (2) limited by registers, the count is NumTB_RegLimited = MaxReg_SM / (⌈NumT_perTB / 32⌉ × ⌈(32 × NumReg_perThread) / 256⌉ × 256), where the per-warp register usage is rounded up to the allocation granularity described below; (3) limited by shared memory, the count is NumTB_SmemLimited = MaxSmem_SM / (⌈NumSmem_perTB / 256⌉ × 256), with the per-block shared memory likewise rounded up as described below.
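A sketch of the three limit computations in Python, folding in the 256-unit allocation granularity described below (the function and argument names transliterate the patent's symbols and are otherwise hypothetical; the hardware defaults are the RTX 2060 values used later in the text):

```python
import math

def blocks_per_sm(num_t_per_tb, num_reg_per_thread, num_smem_per_tb,
                  max_warp_sm=32, max_reg_sm=65536, max_smem_sm=65536):
    """Thread blocks runnable on one SM under each resource limit."""
    warps_per_tb = math.ceil(num_t_per_tb / 32)
    # registers are allocated per warp in multiples of 256
    regs_per_warp = math.ceil(32 * num_reg_per_thread / 256) * 256
    # shared memory is allocated per block in multiples of 256 bytes
    smem_per_tb = math.ceil(num_smem_per_tb / 256) * 256

    tb_warp_limited = max_warp_sm / warps_per_tb
    tb_reg_limited = max_reg_sm / (warps_per_tb * regs_per_warp)
    tb_smem_limited = max_smem_sm / smem_per_tb
    # the minimum of the three limits bounds what actually runs
    num_tb_per_sm = int(min(tb_warp_limited, tb_reg_limited, tb_smem_limited))
    return tb_warp_limited, tb_reg_limited, tb_smem_limited, num_tb_per_sm

# the worked example from the text: 128 threads, 50 registers, 16,384 B
print(blocks_per_sm(128, 50, 16384))
# warp-limited 8.0, register-limited ~9.14, smem-limited 4.0 -> NumTB_perSM = 4
```

Running it on the configuration used in the worked example below reproduces the counts 8, about 9.14, and 4, with 4 as the minimum.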

Note that when register resources are allocated, if the number of registers assigned to a warp is not a multiple of 256, the allocation is automatically padded up to the next multiple of 256. Therefore, when computing a warp's register usage, one must divide by 256, take the ceiling, and multiply back by 256 to obtain the register usage actually allocated at run time.

There are three ways to obtain NumReg_perThread, the number of registers used per thread. The first is to inspect the code and estimate each thread's register usage from it. The second is to invoke the CUDA compiler nvcc with the -res-usage option, which reports an estimate of register usage. The third is to run the kernel under CUDA's built-in profiling tool nvprof with the --print-gpu-trace option, which reports each thread's register usage.

Only the third method yields the exact value of NumReg_perThread, but it is also the most costly. To keep cost as low as possible, use the first or second method; conversely, when high accuracy is needed, choose the third.

Similarly, when shared-memory resources are allocated, if the number of bytes used by a thread block is not a multiple of 256, the shortfall is automatically padded so that the allocation becomes a multiple of 256. The computation must therefore divide by 256, take the ceiling, and multiply back by 256 to obtain the shared-memory usage actually allocated at run time.

After computing the number of concurrently runnable thread blocks under each of the three resource limits and taking the minimum as the per-SM runnable block count, the run-time usage of each resource on an SM can be derived and divided by that resource's per-SM limit, giving the following occupancy formulas: (1) Occupancy_Warp = (NumTB_perSM × ⌈NumT_perTB / 32⌉) / MaxWarp_SM; (2) Occupancy_Reg = (NumTB_perSM × NumT_perTB × NumReg_perThread) / MaxReg_SM; (3) Occupancy_Smem = (NumTB_perSM × NumSmem_perTB) / MaxSmem_SM.
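The three occupancy formulas can be sketched as follows (illustrative only; `num_tb_per_sm` is assumed to be the minimum block count computed earlier, and the hardware defaults are the RTX 2060 limits from the worked example):

```python
import math

def occupancies(num_tb_per_sm, num_t_per_tb, num_reg_per_thread,
                num_smem_per_tb, max_warp_sm=32, max_reg_sm=65536,
                max_smem_sm=65536):
    """Warp, register, and shared-memory occupancy of one SM."""
    warps_per_tb = math.ceil(num_t_per_tb / 32)
    occ_warp = num_tb_per_sm * warps_per_tb / max_warp_sm
    occ_reg = num_tb_per_sm * num_t_per_tb * num_reg_per_thread / max_reg_sm
    occ_smem = num_tb_per_sm * num_smem_per_tb / max_smem_sm
    return occ_warp, occ_reg, occ_smem

# the worked example: NumTB_perSM = 4, 128 threads, 50 registers, 16,384 B
print(occupancies(4, 128, 50, 16384))
# warp occupancy 0.5, register occupancy ~0.39 (about 0.4), smem occupancy 1.0
```

Substituting the worked example's values reproduces the occupancies 0.5, approximately 0.4, and 1 quoted below.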

As an example, suppose a configuration setting has NumT_perTB = 128 threads per thread block, NumReg_perThread = 50 registers per thread, and NumSmem_perTB = 16,384 bytes of shared memory per thread block.

Taking the RTX 2060 graphics card as an example, for a single SM the maximum number of concurrent warps MaxWarp_SM, the number of registers MaxReg_SM, the number of shared-memory bytes MaxSmem_SM, and the maximum number of concurrent threads MaxThread_SM are 32, 65,536, 65,536 bytes, and 1,024, respectively.

Under the warp, register, and shared-memory limits, the computed runnable thread-block counts are NumTB_WarpLimited = 8 (= 32 / (128 / 32)), NumTB_RegLimited ≈ 9.14 (= 65,536 / (4 × 1,792), where 1,792 is 32 × 50 = 1,600 registers per warp padded up to a multiple of 256), and NumTB_SmemLimited = 4 (= 65,536 / 16,384). The smallest of the three is taken as NumTB_perSM, so NumTB_perSM = 4.

Finally, substituting back into the occupancy formulas above yields the three occupancies: Occupancy_Warp = 0.5 (= 4 × 4 / 32), Occupancy_Reg ≈ 0.4 (= 4 × 128 × 50 / 65,536), and Occupancy_Smem = 1 (= 4 × 16,384 / 65,536).

The three values above are combined into a single score, which may be their average or a weighted average, and the combined value is used as the screening criterion. There are two screening methods. The first selects a threshold value and deletes combinations whose combined occupancy is below it. The second sorts all combinations by combined occupancy in descending order, selects a proportion (e.g., 70%), and deletes the combinations below that proportion (e.g., deletes the bottom 70%). The larger either value, the more time and space cost is saved, but the rate of mistaken deletions also rises.
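Both elimination methods can be sketched as follows (hypothetical: each combination is reduced here to its combined occupancy score):

```python
def filter_by_threshold(scores, limit):
    """Keep combinations whose combined occupancy is at least `limit`."""
    return [s for s in scores if s >= limit]

def filter_by_proportion(scores, drop_ratio):
    """Sort descending and drop the bottom `drop_ratio` of combinations."""
    ranked = sorted(scores, reverse=True)
    keep = len(ranked) - int(len(ranked) * drop_ratio)
    return ranked[:keep]

scores = [0.9, 0.3, 0.7, 0.5, 0.6, 0.2, 0.8, 0.4, 0.55, 0.65]
print(filter_by_threshold(scores, 0.5))   # every score >= 0.5 survives
print(filter_by_proportion(scores, 0.7))  # only the top 30% survive
```

The threshold variant keeps a data-dependent number of combinations, while the proportion variant always removes a fixed fraction, which is why the text warns that a poorly chosen proportion can delete too many combinations at once.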

Eliminating by threshold value reduces the chance of mistaken deletion compared with eliminating by proportion: if the proportion is chosen poorly, too many combinations may be deleted at once, whereas choosing a threshold makes it easier to control how many combinations are removed. [Execution configuration settings]

Part (3) of the offline stage is the execution module, which has the GPU actually execute the screened configuration settings and records their execution times; these serve as the source of the training data needed to build the model in the online stage.

This is further illustrated with an embodiment. Taking the convolution operation as an example, and assuming an input image array of size 1000×1000 and a filter array of size 3×3, the resource occupancies and execution times of some combinations are shown in Table 1.

Table 1
Thread-block size | Warp occupancy | Register occupancy | Shared-memory occupancy | Execution time
160 | 93.75% | 41.02% | 9.38% | 6.5446×10⁻² ms
512 | 100.0% | 43.75% | 7.81% | 7.5167×10⁻² ms
544 | 53.12% | 23.24% | 4.3% | 9.7753×10⁻² ms
992 | 96.88% | 42.38% | 7.03% | 7.8713×10⁻² ms

Observe that with a thread-block size of 160, although the warp occupancy of 93.75% is lower than the 100% and 96.88% of block sizes 512 and 992, the shared-memory occupancy of 9.38% is higher than their 7.81% and 7.03%; evidently a high warp occupancy does not necessarily mean better execution performance. The present invention's practice of taking the other occupancies into account therefore improves the accuracy of the screening criterion.

When the thread-block size is 544, all three of its resource occupancies are lower than those of the other three configurations, and its execution time is the longest; an occupancy that is too low indicates poor execution performance. The present invention therefore deletes combinations whose occupancy is too low, according to the resource occupancy screening criterion.

Suppose that after the element-screening process over the three sets, the number of combinations left to execute is y, and the resource occupancy elimination rate is z; after resource occupancy screening, the combinations that must be tested drop from y to y × (1 − z). If the original number of combinations to test is |X| × |M| × |O|, then after the four screening steps above, with the per-thread workload Op_per_thread fixed and a resource occupancy elimination rate of z, the count is reduced by the factor-of-32 shrinkage of M, the reduction of O to a single representative, and the further factor of (1 − z).
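The combined effect of the screenings can be checked with a short calculation (all counts here are invented for illustration; the 1/32 factor models the multiples-of-32 pruning of M, and the optional set O is reduced to one representative):

```python
import math

def total_combinations(num_x, num_m, num_o):
    """Combinations before screening: |X| * |M| * |O|."""
    return num_x * num_m * num_o

def after_screening(num_x, num_m, num_o, z, warp_size=32):
    """Combinations after screening: M shrinks by ~32x, O shrinks to a
    single representative, and occupancy screening drops a fraction z
    of what remains."""
    return num_x * math.ceil(num_m / warp_size) * 1 * (1 - z)

print(total_combinations(50, 1024, 6))   # 307,200 combinations originally
print(after_screening(50, 1024, 6, 0.4)) # only 960 remain to execute
```

Even modest screening parameters cut the number of combinations that must actually be executed on the GPU by more than two orders of magnitude in this toy setting.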

Through its screening mechanisms, the present invention greatly reduces the originally enormous number of execution combinations, lowering time and space costs while retaining comparatively good feature data as the training data needed for model building in the online stage. [Online stage]

Next, the online stage extracts and adjusts training data from the combinations remaining after the offline screening, and feeds the training data into a model to build it. The built model can then be used to predict configuration settings.

Referring to Figure 2, the online stage divides into two parts: (1) extracting training data; (2) building the model and making predictions. [Extracting training data]

Building a prediction model with machine learning requires feature data and corresponding label data. Accordingly, the training data extraction module must extract suitable data, divided into features and labels, from the execution results of the execution combinations obtained in the offline stage.

For example, the data-size combination can serve as the feature data, while the corresponding necessary and optional configurations are extracted as the label data. Originally every data size is tested against all necessary and optional configuration combinations; this step takes the combination with the best execution result among them as the representative of that data-size combination.
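Selecting, for each data-size combination, the configuration with the best (shortest) execution time can be sketched as follows (the record layout and all values are hypothetical):

```python
def extract_training_data(results):
    """results: list of (data_size, config, exec_time) tuples.

    Returns one (feature, label) pair per data size: the data size as
    the feature and its fastest configuration as the label.
    """
    best = {}
    for data_size, config, exec_time in results:
        if data_size not in best or exec_time < best[data_size][1]:
            best[data_size] = (config, exec_time)
    return [(size, cfg) for size, (cfg, _) in best.items()]

results = [
    ((1000, 3), (160, "pad"), 0.065),
    ((1000, 3), (512, "pad"), 0.075),
    ((1000, 3), (544, "no-pad"), 0.098),
    ((1200, 5), (256, "pad"), 0.081),
    ((1200, 5), (128, "no-pad"), 0.079),
]
print(extract_training_data(results))
```

Each data-size combination thus contributes exactly one training record: its feature vector paired with the label of its fastest configuration.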

When the per-thread workload Op_per_thread is fixed, this step selects, from all the configuration combinations tested for a given data size, the single execution combination with the best execution result, ultimately reducing the training data to one record per data-size combination. The purpose is to let the machine-learning prediction model, during training, clearly determine which class a given data-size combination should be assigned to, lowering the probability of misclassification. [Building the model]

For example, the modeling module adopts supervised learning, and either the K-nearest neighbors (KNN) algorithm or the logistic regression (LR) algorithm can be chosen as the model.

In this part, the modeling module may have to adjust the dimensionality of the training data. For the feature data, the KNN and LR models require at least two dimensions; for the label data, the KNN model imposes no dimensional restriction, while the LR model requires one dimension. However, different kernel functions yield training data of different dimensionality from the execution results, so the dimensions may need adjusting before the data can be fed into the model. If the features and labels extracted for a kernel are both two-dimensional, no dimensional adjustment is needed before feeding them to the K-nearest neighbors algorithm; if the extracted feature data is one-dimensional, a dimension must be added, and to avoid affecting the feature data the values in the added dimension can be uniformly filled with 1.
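The padding of a one-dimensional feature with a constant 1, followed by a nearest-neighbour lookup, can be sketched as follows (a stdlib-only stand-in for a KNN library; the training sizes, labels, and k value are invented for illustration):

```python
import math
from collections import Counter

def pad_features(xs):
    """Lift 1-D features to 2-D by appending a constant 1."""
    return [(x, 1) for x in xs]

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# 1-D data sizes padded to 2-D, each labeled with its best configuration
train_x = pad_features([1000, 1010, 1500, 1510, 2000])
train_y = ["cfg_a", "cfg_a", "cfg_b", "cfg_b", "cfg_c"]
print(knn_predict(train_x, train_y, (1005, 1)))  # nearest sizes favour cfg_a
```

Because the appended coordinate is the same constant for every sample, it leaves all pairwise distances unchanged and so does not distort the neighbourhood structure, which is why padding with 1 is a safe way to satisfy the model's dimensionality requirement.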

A configuration-setting prediction model for the general-purpose graphics processor kernel function is thereby established; using this model, a set of configuration settings with better performance can be recommended.

In summary, the model built by the general-purpose graphics processor kernel-function configuration-setting prediction system and method of the present invention can quickly and accurately predict configuration settings with better performance. The system and method also effectively reduce the amount of data needed to build the prediction model, producing a highly accurate kernel-function configuration-setting prediction model at relatively low time and space cost. [Embodiment 1]

KNN is used as the prediction model, with the convolution kernel function. R(a, b)_KNN denotes the prediction mechanism, where R denotes the offline-stage screening, a is the setting of the selection interval used in the data-size set screening mechanism, and b is the criterion set in the resource occupancy screening mechanism. The criterion takes two forms: (1) P denotes elimination by proportion; (2) T denotes elimination by threshold value.

In the data-size set of the convolution kernel, the input image array width (height) ranges from a minimum of 1,000 to a maximum of 2,000, with selection intervals of 5, 10, 15, and 20; the filter array width (height) ranges from 3 to 74. In the resource occupancy screening mechanism, the proportion-based elimination values are set to 20%, 40%, and 60%, where a proportion of 20% means deleting the 20% of configuration settings with the lowest resource occupancy and leaving the 80% with higher occupancy; the threshold-based elimination values are set to 50%, 55%, and 60%, where a threshold of 50% means deleting configuration settings whose average resource occupancy is below 50%.

A set number of records is taken for each filter array width (height) size, giving the corresponding total of test records; the input image array width (height) in each test record is chosen at random.

Table 2 compares the actual space costs of the convolution kernel function. It can be seen that the larger the selection interval, P value, or T value used, the more time cost can be reduced; the choice of these values is therefore important, as it is closely tied to space cost and execution performance.

Table 2

Figure 3 compares the time spent on prediction as a proportion of the execution time of the predicted configuration setting; the smaller this proportion, the lower the time cost of the prediction stage in this embodiment. The figure shows that for the various settings of the screening parameters (the selection interval, P value, or T value), the proportion is below 0.7%, meaning the proposed prediction mechanism incurs minimal time cost in the prediction stage and can recommend a set of configuration settings very quickly. [Embodiment 2]

The logistic regression algorithm is used as the prediction model, with the parallel reduction kernel function. R(a, b)_LR denotes the prediction mechanism, where R denotes the offline-stage screening, a is the setting of the selection interval used in the data-size set screening mechanism, and b is the criterion set in the resource occupancy screening mechanism. The criterion takes two forms: (1) P denotes elimination by proportion; (2) T denotes elimination by threshold value.

In the data-size set of the parallel reduction kernel, the input array length ranges from eight million to twelve million, with selection intervals of 10,000, 20,000, 30,000, and 40,000. The resource occupancy proportion values are set to 0%, 70%, 80%, and 90%, and the threshold values to 50%, 55%, and 60%.

A total of 500 test records are used to analyze the execution performance of the parallel reduction kernel; the input array size in each test record is chosen at random.

Figure 4 compares the time spent on prediction as a proportion of the execution time of the predicted configuration setting; the smaller this proportion, the lower the time cost of the prediction stage in this embodiment. The figure shows that for the various settings of the screening parameters (the selection interval, P value, or T value), the proportion is below 1.2%, meaning the proposed prediction mechanism incurs minimal time cost in the prediction stage and can recommend a set of configuration settings very quickly.

1: general-purpose graphics processor kernel-function configuration-setting prediction system 11: set screening module 12: resource occupancy screening module 13: execution module 14: training data extraction module 15: modeling module

[Fig. 1] Block diagram of the general-purpose graphics processor kernel-function configuration-setting prediction system. [Fig. 2] Flow chart of the general-purpose graphics processor kernel-function configuration-setting prediction method. [Fig. 3] Comparison of the time spent on prediction as a proportion of the execution time of the predicted configuration settings. [Fig. 4] Comparison of the time spent on prediction as a proportion of the execution time of the predicted configuration settings.

Claims (10)

A general-purpose graphics processor kernel-function configuration-setting prediction system, applicable to the general-purpose graphics processor, characterized by comprising: a set screening module that performs preliminary screening on the parameter sets of the configuration settings of the general-purpose graphics processor kernel function and takes the screened parameter sets as input parameters; a resource occupancy screening module that performs secondary screening on the input parameters using the resource occupancy values of warps, registers, and shared memory in the streaming multiprocessor to obtain first execution configuration settings; an execution module that executes the first execution configuration settings on the general-purpose graphics processor to obtain execution times; a training data extraction module that extracts second execution configuration settings from the first execution configuration settings according to the execution times as training data; and a modeling module that builds a configuration-setting prediction model of the general-purpose graphics processor kernel function from the training data.
The general-purpose graphics processor kernel-function configuration-setting prediction system as claimed in claim 1, wherein the parameter sets comprise a data-size set, a necessary configuration set, and an optional configuration set.
The general-purpose graphics processor kernel-function configuration-setting prediction system as claimed in claim 2, wherein, for each data-size set, the training data extraction module extracts from the necessary configuration set and the optional configuration set according to the execution time, and takes the data-size set together with the extracted necessary configuration set and optional configuration set as the training data.
The general-purpose graphics processor kernel-function configuration-setting prediction system as claimed in claim 1 or 2, wherein the modeling module builds the model using the K-nearest neighbors algorithm or the logistic regression algorithm.
The general-purpose graphics processor kernel-function configuration-setting prediction system as claimed in claim 2, wherein the set screening module preliminarily screens the data-size set by setting an interval on the element attributes of the data-size set.
The general-purpose graphics processor kernel-function configuration-setting prediction system as claimed in claim 2, wherein the set screening module preliminarily screens the necessary configuration set by selecting the number of threads per thread block in multiples of 32.
The general-purpose graphics processor kernel-function configuration-setting prediction system as claimed in claim 2, wherein the set screening module further optimizes the optional configuration set, the optimization method being tiling or shared-memory padding.
如請求項2所述之通用型圖形處理器核心函式之組態設定預測系統,其中,該集合篩選模組對該可選組態集合的初步篩選方法為:選定數個該數據量集合,在數個該數據量集合範圍內挑選該些範圍中的前、中、後段子範圍,測試該些子範圍、所有該必要組態集合及所有該可選組態集合,將擁有多數執行效能較佳的執行結果中的該可選組態集合篩選出。The general-purpose graphic processor core function configuration setting prediction system as described in claim 2, wherein, the preliminary screening method of the set screening module for the optional configuration set is: select several sets of the data volume, Select the front, middle, and back sub-ranges of these ranges within the range of several data sets, and test these sub-ranges, all of the necessary configuration sets and all of the optional configuration sets, which will have the majority of execution performance. This set of optional configurations is filtered out from the best execution results. 一種通用型圖形處理器核心函式之組態設定預測方法,其適用於該通用型圖形處理器,其特徵係包含: 對該通用型圖形處理器核心函式之組態設定的參數集合進行初步篩選,將篩選完的該參數集合作為輸入參數; 將該輸入參數藉由在串流複合處理器中Warp、暫存器與共享記憶體的資源佔有率值進行二次篩選,得到第一執行組態設定; 將該第一執行組態設定以該通用型圖形處理器執行,得到執行時間; 根據該執行時間由該第一執行組態設定提取出第二執行組態設定,作為訓練資料; 藉由該訓練資料建立出一該通用型圖形處理器核心函式之組態設定預測模型;及 藉由該預測模型預測出一組組態設定。 A configuration setting prediction method for the core function of a general-purpose graphics processor, which is suitable for the general-purpose graphics processor, and its characteristics include: Preliminary screening is performed on the parameter set of the configuration setting of the general-purpose graphics processor core function, and the parameter set that has been screened is used as an input parameter; The input parameter is secondarily screened by the resource occupancy values of Warp, temporary register and shared memory in the stream composite processor to obtain the first execution configuration setting; Setting the first execution configuration to be executed by the general-purpose graphics processor to obtain an execution time; Extracting a second execution configuration from the first execution configuration according to the execution time as training data; Establishing a configuration setting prediction model of the 
general-purpose GPU kernel function by using the training data; and A set of configuration settings is predicted by the prediction model. 如請求項9所述之通用型圖形處理器核心函式之組態設定預測方法,其中,該參數集合包含數據量集合、必要組態集合及可選組態集合。The method for predicting the configuration setting of a general-purpose graphics processor core function as described in Claim 9, wherein the parameter set includes a data volume set, a necessary configuration set, and an optional configuration set.
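Claims 6 and 9 describe a two-stage screening: keep only thread-block sizes that are multiples of 32 (one warp), then filter by the occupancy that warps, registers, and shared memory permit on a streaming multiprocessor. The sketch below illustrates that idea; the hardware limits, register/shared-memory figures, and the 0.5 threshold are assumed illustrative values, not taken from the patent.

```python
# Illustrative SM resource limits (assumptions, not from the patent).
WARP_SIZE = 32
MAX_WARPS_PER_SM = 64
MAX_REGISTERS_PER_SM = 65536
MAX_SHARED_MEM_PER_SM = 49152   # bytes
MAX_BLOCKS_PER_SM = 32

def estimated_occupancy(threads_per_block, regs_per_thread, smem_per_block):
    """Fraction of the SM's warp slots a launch with this config can fill."""
    if threads_per_block % WARP_SIZE != 0:
        return 0.0
    warps_per_block = threads_per_block // WARP_SIZE
    # How many blocks each resource allows to be resident at once:
    by_warps = MAX_WARPS_PER_SM // warps_per_block
    by_regs = MAX_REGISTERS_PER_SM // (regs_per_thread * threads_per_block)
    by_smem = (MAX_SHARED_MEM_PER_SM // smem_per_block
               if smem_per_block else MAX_BLOCKS_PER_SM)
    blocks = min(by_warps, by_regs, by_smem, MAX_BLOCKS_PER_SM)
    return blocks * warps_per_block / MAX_WARPS_PER_SM

def screen_configs(candidates, regs_per_thread, smem_per_block, threshold=0.5):
    """Stage 1: multiples of 32; stage 2: occupancy threshold."""
    stage1 = [t for t in candidates if t % WARP_SIZE == 0]
    return [t for t in stage1
            if estimated_occupancy(t, regs_per_thread,
                                   smem_per_block) >= threshold]

configs = screen_configs(range(1, 1025), regs_per_thread=32,
                         smem_per_block=4096)
```

With these assumed limits, small blocks such as 32 threads are rejected (their shared-memory footprint caps resident blocks at too few warps) while larger multiples of 32 survive, which is the kind of pruning the secondary screening performs before any kernel is actually timed.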
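Claim 4 allows the modeling module to use a K-nearest-neighbor algorithm. A toy sketch of KNN prediction over (data size, best configuration) pairs follows; the training rows are fabricated purely for illustration and are not data from the patent.

```python
from collections import Counter

# Fabricated training rows: (data_size, best threads-per-block observed).
train = [
    (1_000, 128), (5_000, 128), (20_000, 256),
    (50_000, 256), (200_000, 512), (800_000, 512),
]

def knn_predict(data_size, k=3):
    """Majority vote among the k training rows nearest in data size."""
    nearest = sorted(train, key=lambda row: abs(row[0] - data_size))[:k]
    votes = Counter(cfg for _, cfg in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(4_000))   # → 128: neighbors are 5_000, 1_000, 20_000
```

In practice the training rows would come from the timed first execution configuration settings of claim 9, and the feature vector would carry the full data volume set rather than a single scalar; the voting mechanics stay the same.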
TW111100267A 2022-01-04 2022-01-04 Configuration setting prediction system and method for general-purpose graphics processor core functions TWI782845B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW111100267A TWI782845B (en) 2022-01-04 2022-01-04 Configuration setting prediction system and method for general-purpose graphics processor core functions

Publications (2)

Publication Number Publication Date
TWI782845B TWI782845B (en) 2022-11-01
TW202328905A true TW202328905A (en) 2023-07-16

Family

ID=85794370


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9269120B2 (en) * 2012-11-06 2016-02-23 Intel Corporation Dynamically rebalancing graphics processor resources
CN111190712A (en) * 2019-12-25 2020-05-22 北京推想科技有限公司 Task scheduling method, device, equipment and medium
CN112363842B (en) * 2020-11-27 2023-01-06 Oppo(重庆)智能科技有限公司 Frequency adjusting method and device for graphic processor, electronic equipment and storage medium
CN112870726B (en) * 2021-03-15 2023-04-18 腾讯科技(深圳)有限公司 Resource allocation method, device and storage medium for graphic processor
