TW201243618A - Load balancing in heterogeneous computing environments - Google Patents

Load balancing in heterogeneous computing environments

Info

Publication number
TW201243618A
TW201243618A
Authority
TW
Taiwan
Prior art keywords
processor
workload
processing unit
steps
following
Prior art date
Application number
TW100147983A
Other languages
Chinese (zh)
Other versions
TWI561995B (en)
Inventor
Jayanth N Rao
Eric C Samson
Original Assignee
Intel Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 13/094,449 (published as US 2012/0192200 A1)
Application filed by Intel Corp
Publication of TW201243618A
Application granted
Publication of TWI561995B


Abstract

Load balancing may be achieved in heterogeneous computing environments by first evaluating the operating environment and workload within that environment. Then, if energy usage is a constraint, energy usage per task for each device may be evaluated for the identified workload and operating environments. Work is scheduled on the device that maximizes the performance metric of the heterogeneous computing environment.
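The decision flow summarized in the abstract can be sketched as follows. The device list, field names, and helper function are illustrative assumptions, not part of the patent.

```python
# Minimal sketch of the abstract's flow: evaluate the environment and
# workload, fold per-task energy into the score only when energy usage is
# a constraint, then schedule on the device with the best resulting metric.

def pick_device(devices, energy_constrained):
    """Return the device dict whose performance metric is best."""
    def score(dev):
        s = dev["perf_metric"]                  # higher is better
        if energy_constrained:
            s = s / dev["energy_per_task"]      # penalize energy-hungry devices
        return s
    return max(devices, key=score)

devices = [
    {"name": "cpu", "perf_metric": 8.0, "energy_per_task": 4.0},
    {"name": "gpu", "perf_metric": 10.0, "energy_per_task": 10.0},
]

print(pick_device(devices, energy_constrained=False)["name"])  # gpu
print(pick_device(devices, energy_constrained=True)["name"])   # cpu
```

With no energy constraint the raw performance metric wins; under a constraint, performance per unit of energy decides.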

Description

DESCRIPTION OF THE INVENTION

CROSS-REFERENCE TO RELATED APPLICATIONS

This is a non-provisional patent application claiming the priority benefit of U.S. Provisional Patent Application No. 61/434,947, filed on January 21, 2011, which is hereby expressly incorporated herein by reference.

FIELD OF THE INVENTION

This application relates generally to graphics processing techniques and, more particularly, to techniques for load balancing between a central processing unit and a graphics processing unit.

BACKGROUND OF THE INVENTION

Many computing devices include both a general-purpose central processing unit and a graphics processing unit. The graphics processing unit is primarily dedicated to graphics uses, while the central processing unit handles general tasks such as running applications.

Load balancing can improve performance by switching tasks among the different available devices within a system or network. Load balancing can also be used to reduce energy usage.

A heterogeneous computing environment includes different types of processing or computing devices within the same system or network. Thus, a typical platform having both a central processing unit and a graphics processing unit is one example of a heterogeneous computing environment.

SUMMARY OF THE INVENTION

In accordance with one embodiment of the present invention, a method is provided that comprises electronically selecting, between at least two processors, one processor to execute a workload based on the workload characteristics and the capabilities of the two processors.

In accordance with one embodiment of the present invention, a non-transitory computer-readable medium is provided that stores instructions to be executed by a processor to distribute workloads among at least two processors based on the workload characteristics and the capabilities of two or more processors, one processor executing a given workload.

In accordance with one embodiment of the present invention, an apparatus is provided that comprises a graphics processing unit and a central processing unit coupled to the graphics processing unit, the central processing unit selecting a processor to execute a workload based on the workload characteristics and the capabilities of the two processors.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a flow chart for one embodiment;
Figure 2 depicts plots used to determine the average energy per task; and
Figure 3 is a hardware depiction of one embodiment.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

In a heterogeneous computing environment, such as Open Computing Language (OpenCL), a given workload can execute on any computing device in the environment. On some platforms there are two such devices, a central processing unit (CPU) and a graphics processing unit (GPU). A heterogeneity-aware load balancer schedules workloads on the available processors so as to maximize the achievable performance within the electromechanical and design constraints.

Although a given workload can execute on any computing device in the environment, each device has its own characteristics, so it may be best suited to executing certain kinds of workloads. Ideally, there would be a perfect predictor of workload characteristics and behavior, so that a given workload could be scheduled on the processor that maximizes performance. In general, though, the best that can be done is an approximation of such a performance predictor that can be implemented in real time. The performance predictor may use both heuristic and statistical information about the workload (static and dynamic) and its operating environment (static and dynamic).

Operating-environment estimation considers the processor capabilities that match particular operating conditions. For example, there may be platforms on which the CPU is more capable than the GPU, or vice versa. Even on a given client platform, the GPU may be more capable than the CPU for some workloads.

The operating environment may have static characteristics. Examples of static characteristics include device type or class, operating-frequency range, number and location of cores, samplers and the like, arithmetic bit precision, and electromechanical constraints. Examples of dynamic device capabilities that determine dynamic operating-environment characteristics include actual frequency and temperature headroom, actual energy headroom, the actual number of idle cores, the actual state of electromechanical characteristics and headroom, and power-policy choices such as battery mode versus adaptive mode.

Some floating-point math/transcendental functions are emulated on the GPU, while the CPU can execute them natively at full performance. This can also be determined at compile time.

Some OpenCL algorithms use "shared local memory." A GPU may have specialized hardware supporting this memory model, which can offset the benefit of load balancing.

Any prior knowledge about the workload, including characteristics such as how its size affects actual performance, may be used to determine how beneficial load balancing can be. As another example, an older version of a given GPU may lack 64-bit support.

Applications may also have characteristics that clearly support, or defeat, the benefit of load balancing. In image processing, a GPU with sampler hardware outperforms the CPU. For surface sharing with graphics application program interfaces (APIs), OpenCL permits surface sharing between Open Graphics Language (OpenGL) and DirectX. For such use cases, using the GPU to avoid copying surfaces from video memory to system memory may be preferable.

A workload's preemption requirements can affect the benefit of load balancing. For OpenCL on an IVB platform, the IVB OpenCL implementation must permit preemption and continued forward progress of OpenCL workloads on an IVB GPU.

An application that attempts to micromanage balancing for a specific hardware target may, if misapplied, defeat any attempt to load-balance the CPU and GPU.

Dynamic workload characterization refers to obtaining information about the workload in real time. This includes long-term, short-term, past, and current history. For example, what is executing right now is current history, while the average time for a new task to be processed may be long-term history, depending on the averaging period or time constant. The time previously spent on a particular core is an example of past history. All of these measures can serve as useful predictors of the future performance that applies to scheduling the next task.

Referring to Figure 1, a sequence for load balancing in accordance with some embodiments may be implemented in software, hardware, or firmware. In a software embodiment, it may be implemented by a non-transitory computer-readable medium storing instructions. Examples of such non-transitory computer-readable media include optical, magnetic, or semiconductor storage devices.

In some embodiments, the sequence may begin by evaluating the operating environment, which may be important for assessing static or dynamic device capabilities. Next, the system may evaluate the characteristics of the particular workload, as indicated in block 12. Workload characteristics may be broadly categorized as static or dynamic. The sequence may then determine whether there are any energy-usage constraints, as indicated in block 14. Load balancing in implementations that must account for energy usage may differ from implementations in which energy usage is not a consideration.

If energy usage is in fact a constraint, the sequence determines processor energy usage, task by task, for the identified workload and operating environment (block 16). Finally, in any case, work may be scheduled on a processor to maximize a performance metric, as indicated in block 18. If there is no energy-usage constraint, block 16 may simply be bypassed.

The target scheduling policy or algorithm may maximize any given metric, often normalized to a set of benchmark suites. The scheduling policy or algorithm may be designed based on both static and dynamic characterization. From these static and dynamic characteristics, a metric is produced for each device that scores its suitability for scheduling the workload. The device with the best score for a particular processor type is likely to be scheduled on that processor type.

Under an energy constraint, the platform may have a maximum frequency limit. An energy-constrained platform may implement a sampler form of the scheduling algorithm best suited to energy-limited operation, whenever energy headroom remains. One version, a shortest-schedule estimator, can drive scheduling and load-balancing decisions. A bursty workload executes in short but sparsely distributed clusters. To the platform, a bursty workload may appear energy-limited for one task and frequency-limited for another. If it is not known in advance that a workload will be bursty, but an estimate of the likelihood that it will be is available, that estimate can be used to drive scheduling decisions.

The processor energy to run a task is:

  energy for processor A to run the next task
    = power consumed by processor A x duration on processor A

  energy for processor B to run the next task
    = power consumed by processor B x duration on processor B

When workload behavior is not known in advance, these quantities must be estimated. If actual energy consumption is not directly available (for example, from on-die energy counters), estimates of the individual components of energy consumption may be used instead. For example (generalizing to a processor X),

  energy for processor X to run the next task
    = estimated power for processor X x estimated duration on processor X

  power_estimate_for_processor_X
    = static_power_estimate(v, f, T) + dynamic_power_estimate(v, f, T, t)

where static_power_estimate(v, f, T) is a value that takes the dependence on voltage v, normalized frequency f, and temperature T into account but is not updated in real time with the workload, while dynamic_power_estimate(v, f, T, t) does take workload-dependent real-time information t into account. For example,

  dynamic_power_estimate(v, f, T, n)
    = (1 - b) x dynamic_power_estimate(v, f, T, n - 1)
      + b x instantaneous_power_estimate(v, f, T, n)

where "b" is a constant controlling how far into the past the dynamic_power_estimate looks. Then,

  instantaneous_power_estimate(v, f, T, n) = C_estimate x v^2 x f + I(v, T) x v

where C_estimate is a variable that tracks the capacitive portion of the workload power, and I(v, T) tracks the leakage-dependent portion of the workload power. Similarly, it is possible to estimate a workload's duration from clock-count measurements for past and present workloads together with the processor frequency. The parameters defined in these expressions may be assigned values based on profiling data.

As an example of energy-economy self-biasing, a new task may be scheduled based on which processor type last finished a task. On average, a processor that completes tasks quickly becomes available more often. If there is no current information, a preset initial processor may be used. Alternatively, the metrics produced for processor A and processor B may be used to assign work to the last-finishing processor, as long as the energy for the last-finishing processor to run the task is below:

  G x (energy for the processor that did not finish last to run the task),

where "G" is a value chosen to maximize overall performance.

In Figure 2, the horizontal axis shows the most recent events at the left of the figure and older events toward the right. C, D, E, F, G, and Y are OpenCL tasks. Processor B runs some non-OpenCL tasks ("other"), and both processors experience some idle periods. The next OpenCL task to be scheduled is task Z. All processor A tasks are shown at equal power levels, also equal to that of processor B's OpenCL task Y, to reduce the complexity of the example.
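The last-finisher self-bias with threshold "G" described above can be sketched as follows. This is a minimal illustration, not the patented implementation; the processor names, the default, and the sample energies are assumptions.

```python
def choose_processor(last_finisher, energy, g, default="A"):
    """Pick a processor for the next task.

    last_finisher: "A", "B", or None if there is no history yet.
    energy: dict mapping processor name -> estimated energy to run the task.
    g: tuning constant chosen to maximize overall performance.
    """
    if last_finisher is None:
        return default  # no current information: use a preset initial processor
    other = "B" if last_finisher == "A" else "A"
    # Keep work on the last-finishing (faster-turnaround) processor only while
    # its per-task energy stays below G times the other processor's energy.
    if energy[last_finisher] < g * energy[other]:
        return last_finisher
    return other

print(choose_processor(None, {"A": 2.0, "B": 3.0}, g=1.5))  # A (no history)
print(choose_processor("B", {"A": 2.0, "B": 2.5}, g=1.5))   # B (2.5 < 1.5 * 2.0)
print(choose_processor("B", {"A": 1.0, "B": 2.0}, g=1.5))   # A (2.0 >= 1.5 * 1.0)
```

Larger values of G bias the scheduler more strongly toward keeping work on the processor that frees up soonest, trading energy for turnaround.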

OpenCL task Y took a long time (top of Figure 2) and therefore consumed more energy than the other OpenCL tasks run on processor A (bottom of Figure 2).

A new task is scheduled on the preferred processor until the time it takes for a new task to be processed on that processor exceeds a threshold, at which point tasks are assigned to the other processor. If there is no current information, a preset initial processor may be used. Alternatively, if energy-aware context work for the intended processor exceeds a threshold and the estimated energy cost of switching processors is reasonable, the energy-aware context work may be assigned to the other processor.

A new task may also be scheduled on the processor with the shortest average time to process a new batch buffer. If there is no current information, a preset initial processor may be used.

Additional permutations of these concepts are possible.
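The running estimates used by such policies, for example the dynamic_power_estimate recurrence given earlier, could be maintained along these lines. This is a sketch under stated assumptions: C_estimate, the leakage model I(v, T), and every constant below are placeholders rather than values from the patent.

```python
class PowerEstimator:
    """Sketch of the running power estimate described in the text.

    instantaneous power ~ C_estimate * v**2 * f + I(v, T) * v
    dynamic estimate    ~ (1 - b) * previous + b * instantaneous
    """

    def __init__(self, c_estimate, leakage, b):
        self.c_estimate = c_estimate  # tracks capacitive part of workload power
        self.leakage = leakage        # I(v, T): leakage-current model (assumed)
        self.b = b                    # how quickly old samples are forgotten
        self.dynamic = 0.0

    def instantaneous(self, v, f, t_deg):
        return self.c_estimate * v ** 2 * f + self.leakage(v, t_deg) * v

    def update(self, v, f, t_deg):
        inst = self.instantaneous(v, f, t_deg)
        self.dynamic = (1 - self.b) * self.dynamic + self.b * inst
        return self.dynamic

# Placeholder leakage model: current grows with voltage and temperature.
est = PowerEstimator(c_estimate=1e-9,
                     leakage=lambda v, t_deg: 1e-3 * v * (1 + 0.01 * t_deg),
                     b=0.25)
for _ in range(8):
    watts = est.update(v=1.0, f=1.5e9, t_deg=50.0)
print(round(watts, 3))  # approaches the steady instantaneous value, ~1.5 W here
```

The constant b plays the role described in the text: larger b weights the most recent sample more heavily, smaller b looks further into the past.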
Many different types of estimators could be used interchangeably as predictors (proportional-integral-derivative (PID) controllers, Kalman filters, and so on). There are also many different ways to approximate the available energy headroom, depending on what is convenient in a particular implementation. Other implementation permutations may be considered as well, based on performance characterization and/or other metrics such as shortest processing time, memory footprint, and so on.

Metrics that may be used to adjust or modulate policy decisions or decision thresholds to account for energy efficiency or power budget include CPU and GPU utilization, frequency, energy consumption, efficiency and budget, CPU and GPU input/output (I/O) utilization, memory utilization, electromechanical state (for example, operating temperature and its optimal range), floating-point operations, and CPU and GPU utilization specific to OpenCL or other heterogeneous computing environment types.

For example, if processor A is known to be currently I/O-bound but processor B is not, that fact can be used to reduce processor A's projected energy efficiency for running a new task, and thus reduce the likelihood that processor A is selected.

A good load-balancing implementation not only exploits all relevant information about the workload and the computing environment to maximize performance; it can also change the characteristics of the operating environment.

In an accelerated (turbo) implementation, there is no guarantee that the CPU and GPU acceleration points are energy-economical. The acceleration design target is peak performance for non-heterogeneous, non-cooperative CPU/GPU workloads. For cooperative CPU/GPU workloads, the allocation of the available energy budget is not determined by any consideration of energy economy or end-user-perceived benefit.

However, OpenCL is a workload type that can use the CPU and GPU cooperatively, and compared with other workload types, the end-user-perceived benefit of its power-budget allocation is less ambiguous.
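As an illustration of how such runtime metrics could modulate the selection, the sketch below reduces a processor's projected efficiency when it is currently I/O-bound. The 0.5 penalty factor and the per-processor numbers are invented for the example; they are not from the patent.

```python
def projected_efficiency(base_efficiency, io_bound, penalty=0.5):
    """Scale a processor's projected energy efficiency for a new task.

    A processor that is currently I/O-bound has its projected efficiency
    reduced, making it less likely to be selected for the next task.
    """
    return base_efficiency * (penalty if io_bound else 1.0)

def select(processors):
    """processors: dict mapping name -> (base_efficiency, io_bound)."""
    return max(processors, key=lambda name: projected_efficiency(*processors[name]))

procs = {"A": (10.0, True), "B": (7.0, False)}
print(select(procs))  # B: A's nominal 10.0 is halved to 5.0 while it is I/O-bound
```

The same shape accommodates the other metrics listed above (temperature headroom, memory utilization, and so on) as additional multiplicative or additive adjustments to the score.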

For example, processor A may generally be the better processor for OpenCL-style tasks, while processor A is already running at its maximum operating frequency and power budget remains. Processor B may then run the OpenCL workload cooperatively as well. As long as doing so does not reduce processor A's power budget to the point that it no longer runs at its maximum frequency, using processor B cooperatively to improve throughput (assuming processor B can complete the tasks quickly enough) is quite reasonable. Maximum performance may be obtained with the lowest processor B frequency (and/or core count) that does not degrade processor A's performance and still fits the available budget, rather than with the default operating-system or PCU selection used for non-OpenCL workloads.

The scope of this algorithm can be widened further. Some task characteristics may be evaluated at compile time and also at run time to derive more accurate estimates of the time and resources needed to execute the task. The setup time for OpenCL on the CPU and GPU is another example.

If a given task must complete within a time limit, multiple queues with various priorities may be implemented. The scheduler then favors tasks in higher-priority queues over those in lower-priority queues.

In OpenCL, inter-task dependencies are known at run time through OpenCL event objects. That information may be used to minimize dependency-related latency.

GPU tasks are typically scheduled for execution by creating a command buffer. The command buffer may contain multiple tasks, for example based on dependencies. The number of tasks or subtasks submitted to a device may be based on this algorithm.

GPUs are typically used to render graphics API tasks. The scheduler may be responsible for preempting any OpenCL or GPU task that endangers interactivity or the graphically visible experience (that is, a task that runs longer than its scheduled completion time). Such a task may be preempted while non-OpenCL or rendering workloads are also running.

The computing system 130 shown in Figure 3 may include a hard drive 134 and removable media 136, coupled by a bus 104 to chipset core logic 110. The computer system may be any computer system, including a smart mobile device such as a smart phone, tablet, or mobile Internet device. A keyboard and mouse 120, or other conventional components, may be coupled via a bus 105 to the graphics processor 112, which in turn is coupled to the main or host processor 100. The graphics processor 112 may also be coupled by a bus 106 to a frame buffer 114, and the frame buffer 114 may be coupled by a bus 107 to a display screen 118. In one embodiment, the graphics processor 112 may be a multithreaded, multicore parallel processor using a single instruction, multiple data (SIMD) architecture.

The processor-selection algorithm may be implemented by one of the at least two processors being evaluated. In that case, when choosing between the graphics and central processors, the selection may, in one embodiment, be made by the central processing unit. In other cases, a specialized or dedicated processor may execute the selection algorithm.

In a software implementation, the relevant code may be stored in any suitable semiconductor, magnetic, or optical memory, including the main memory 132 or any available memory within the graphics processor. Thus, in one embodiment, code to perform the sequence of Figure 1 may be stored in a non-transitory machine- or computer-readable medium, such as the memory 132, and in one embodiment may be executed by the processor 100 or the graphics processor 112.

Figure 1 is a flow chart. In some embodiments, the sequences depicted in this flow chart may be implemented in hardware, software, or firmware. In a software embodiment, a non-transitory computer-readable medium, such as a semiconductor, magnetic, or optical memory, may store instructions executable by a processor to implement the sequences shown in Figure 1.

The graphics processing techniques described herein may be implemented in various hardware architectures. For example, graphics functionality may be integrated within a chipset. Alternatively, a discrete graphics processor may be used. As still another embodiment, the graphics functions may be implemented by a general-purpose processor, including a multicore processor.

References throughout this specification to "one embodiment" or "an embodiment" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one implementation encompassed within the present invention. Thus, appearances of the phrases "one embodiment" or "in an embodiment" are not necessarily referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be instituted in other suitable forms than the particular embodiment illustrated, and all such forms may be encompassed within the claims of the present application.

While the present invention has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a flow chart for one embodiment;
Figure 2 depicts plots used to determine the average energy per task; and
Figure 3 is a hardware depiction of one embodiment.

REFERENCE NUMERALS

10-18: blocks; 100: main or host processor; 104-107: buses; 110: chipset core logic; 112: graphics processor; 114: frame buffer; 118: display screen; 120: keyboard and mouse; 130: computing system; 132: memory; 134: hard drive; 136: removable media.


Claims

201243618
VII. Claims:
1. A method, comprising the following step: electronically selecting, between at least two processors, one processor to execute a workload based on the workload characteristics and the capabilities of the two processors.
2. The method of claim 1, including the following step: evaluating, for the workload, which processor has lower energy usage.
3. The method of claim 1, including the following step: selecting between a graphics and a central processing unit.
4. The method of claim 1, including the following steps: identifying energy usage constraints, and selecting a processor to execute the workload based on the energy usage constraints.
5. The method of claim 1, including the following step: scheduling work on the processor having a better performance metric for a given workload.
6. The method of claim 5, including the following step: evaluating the performance metric under static and dynamic workloads.
7. The method of claim 5, including the following step: selecting the processor that can execute the workload in the shortest time.
8. A non-transitory computer-readable medium storing instructions to be executed by a processor to perform the following step: allocating a workload between at least two processors, one processor to execute a workload, based on the workload characteristics and the capabilities of the two or more processors.
9. The medium of claim 8, further storing instructions for performing the following step: evaluating, for the workload, which processor has lower energy usage.
10. The medium of claim 8, further storing instructions for performing the following step: selecting between a graphics and a central processing unit.
11. The medium of claim 8, further storing instructions for performing the following steps: identifying energy usage constraints, and selecting a processor to execute the workload based on the energy usage constraints.
12. The medium of claim 8, further storing instructions for performing the following step: scheduling work on the processor having a better performance metric for a given workload.
13. The medium of claim 12, further storing instructions for performing the following step: evaluating the performance metric under static and dynamic workloads.
14. The medium of claim 12, further storing instructions for performing the following step: selecting the processor that can execute the workload in the shortest time.
15. An apparatus, comprising: a graphics processing unit; and a central processing unit coupled to the graphics processing unit, the central processing unit to select a processor to execute a workload based on the workload characteristics and the capabilities of the two processors.
16. The apparatus of claim 15, the central processing unit to: evaluate, for the workload, which processor has lower energy usage.
17. The apparatus of claim 15, the central processing unit to: identify energy usage constraints, and select a processor to execute the workload based on the energy usage constraints.
18. The apparatus of claim 15, the central processing unit to: schedule work on the processor having a better performance metric for a given workload.
19. The apparatus of claim 18, the central processing unit to: evaluate the performance metric under static and dynamic workloads.
20. The apparatus of claim 18, the central processing unit to: select the processor that can execute the workload in the shortest time.
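The selection policy recited in these claims — prefer the device with the better performance metric for the workload, or the lowest energy per task when an energy-usage constraint applies — can be illustrated with a minimal sketch. All processor names, profile numbers, and helper functions below are illustrative assumptions, not part of the patent:

```python
# Hedged sketch of the scheduling policy recited in the claims:
# between a CPU and a GPU, pick the device that maximizes the chosen
# performance metric -- shortest predicted execution time, or lowest
# energy per task when an energy-usage constraint is identified.
# All profile numbers are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class ProcessorProfile:
    name: str
    throughput: float   # work items per second (assumed estimate)
    overhead_s: float   # fixed dispatch/launch cost (assumed estimate)
    power_watts: float  # average active power (assumed estimate)

def predicted_time(p: ProcessorProfile, work_items: int) -> float:
    # Predicted wall-clock time: fixed launch cost plus throughput-limited work.
    return p.overhead_s + work_items / p.throughput

def energy_per_task(p: ProcessorProfile, work_items: int) -> float:
    # Energy for the task: average power times predicted time.
    return p.power_watts * predicted_time(p, work_items)

def select_processor(processors, work_items, energy_constrained=False):
    # Claims 4/11/17: under an energy-usage constraint, pick the
    # lowest-energy device; otherwise (claims 5-7) pick the device
    # with the better performance metric, here shortest time.
    metric = energy_per_task if energy_constrained else predicted_time
    return min(processors, key=lambda p: metric(p, work_items))

# Illustrative profiles: the GPU has higher throughput but a fixed
# launch overhead, so the workload's size decides which device wins.
cpu = ProcessorProfile("cpu", throughput=2.0e6, overhead_s=0.0, power_watts=35.0)
gpu = ProcessorProfile("gpu", throughput=8.0e6, overhead_s=0.05, power_watts=20.0)

small = select_processor([cpu, gpu], work_items=10_000)     # launch cost dominates
large = select_processor([cpu, gpu], work_items=4_000_000)  # throughput dominates
```

With these assumed numbers the small workload lands on the CPU and the large one on the GPU, mirroring claim 1's dependence on both workload characteristics and device capabilities.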
TW100147983A 2011-04-26 2011-12-22 Load balancing in heterogeneous computing environments TWI561995B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US13/094,449 US20120192200A1 (en) 2011-01-21 2011-04-26 Load Balancing in Heterogeneous Computing Environments

Publications (2)

Publication Number Publication Date
TW201243618A true TW201243618A (en) 2012-11-01
TWI561995B TWI561995B (en) 2016-12-11

Family

ID=48093886

Family Applications (1)

Application Number Title Priority Date Filing Date
TW100147983A TWI561995B (en) 2011-04-26 2011-12-22 Load balancing in heterogeneous computing environments

Country Status (1)

Country Link
TW (1) TWI561995B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI675289B (en) * 2013-10-31 2019-10-21 三星電子股份有限公司 Electronic systems including heterogeneous multi-core processors and methods of operating same

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6867779B1 (en) * 1999-12-22 2005-03-15 Intel Corporation Image rendering
US7093147B2 (en) * 2003-04-25 2006-08-15 Hewlett-Packard Development Company, L.P. Dynamically selecting processor cores for overall power efficiency
US7386739B2 (en) * 2005-05-03 2008-06-10 International Business Machines Corporation Scheduling processor voltages and frequencies based on performance prediction and power constraints
JP4308241B2 (en) * 2006-11-10 2009-08-05 インターナショナル・ビジネス・マシーンズ・コーポレーション Job execution method, job execution system, and job execution program
US8284205B2 (en) * 2007-10-24 2012-10-09 Apple Inc. Methods and apparatuses for load balancing between multiple processing units
JPWO2009150815A1 (en) * 2008-06-11 2011-11-10 パナソニック株式会社 Multiprocessor system
US9507640B2 (en) * 2008-12-16 2016-11-29 International Business Machines Corporation Multicore processor and method of use that configures core functions based on executing instructions


Also Published As

Publication number Publication date
TWI561995B (en) 2016-12-11

Similar Documents

Publication Publication Date Title
US20120192200A1 (en) Load Balancing in Heterogeneous Computing Environments
US11106495B2 (en) Techniques to dynamically partition tasks
US8046468B2 (en) Process demand prediction for distributed power and resource management
US8914515B2 (en) Cloud optimization using workload analysis
JP5075274B2 (en) Power aware thread scheduling and dynamic processor usage
US8707300B2 (en) Workload interference estimation and performance optimization
KR101812583B1 (en) Apparatus or task assignment, method for task assignment and a computer-readable storage medium
JP5946068B2 (en) Computation method, computation apparatus, computer system, and program for evaluating response performance in a computer system capable of operating a plurality of arithmetic processing units on a computation core
KR20170062493A (en) Heterogeneous thread scheduling
EP2446357A1 (en) High-throughput computing in a hybrid computing environment
JP5345990B2 (en) Method and computer for processing a specific process in a short time
Jeon et al. TPC: Target-driven parallelism combining prediction and correction to reduce tail latency in interactive services
Stavrinides et al. Energy-aware scheduling of real-time workflow applications in clouds utilizing DVFS and approximate computations
US10089155B2 (en) Power aware work stealing
Sahba et al. Improving IPC in simultaneous multi-threading (SMT) processors by capping IQ utilization according to dispatched memory instructions
Di et al. Optimization of composite cloud service processing with virtual machines
Huang et al. Novel heuristic speculative execution strategies in heterogeneous distributed environments
Feliu et al. Symbiotic job scheduling on the IBM POWER8
JP2020513613A (en) Application profiling for power performance management
US10846086B2 (en) Method for managing computation tasks on a functionally asymmetric multi-core processor
US20220050716A1 (en) Virtual machine placement method and virtual machine placement device implementing the same
Azhar et al. SLOOP: QoS-supervised loop execution to reduce energy on heterogeneous architectures
Kang et al. Priority-driven spatial resource sharing scheduling for embedded graphics processing units
TW201243618A (en) Load balancing in heterogeneous computing environments
Mariani et al. Arte: An application-specific run-time management framework for multi-core systems