JP2022546271A

JP2022546271A - Method and apparatus for predicting kernel tuning parameters

Info

Publication number: JP2022546271A
Application number: JP2022510786A
Authority: JP
Inventors: カーンジャハーンダッド; イサムローウェルダニエル
Original assignee: Advanced Micro Devices Inc
Current assignee: Advanced Micro Devices Inc
Priority date: 2019-09-04
Filing date: 2020-08-25
Publication date: 2022-11-04
Also published as: US20210065051A1; WO2021045935A1; EP4026004A4; CN114286985A; EP4026004A1; KR20220054397A

Abstract

A processing device for improving processing performance is provided. The processing device comprises a memory configured to store data and a processor in communication with the memory. The processor is configured to receive tuning parameters, each having a numeric value, for executing a portion of a program on an identified hardware device, and to convert the numeric values of the tuning parameters to words. The processor is also configured to predict, using one or more machine language learning algorithms, which combination of the words to use to execute the portion of the program on the identified hardware device based on performance efficiency, and to convert the predicted combination of words to corresponding numeric values for executing the portion of the program on the identified hardware device. [Representative drawing] FIG. 3

Description

(Cross-reference to related application)
This application claims the benefit of U.S. Patent Application No. 16/560,954, entitled "METHOD AND APPARATUS FOR PREDICTING KERNEL TUNING PARAMETERS" and filed on September 4, 2019, the entirety of which is incorporated herein by reference.

The performance efficiency of a program is determined, for example, by the speed or time in which the program's instructions execute on hardware (e.g., an integrated circuit (IC) or chip). The physical characteristics and specifications of hardware vary between hardware generations or versions. Accordingly, the performance efficiency of a program typically varies widely between different generations of a hardware device. Programs typically include tuning parameters that are used to alter the performance efficiency of the program on different hardware.

A more detailed understanding can be had from the following description, given by way of example in conjunction with the accompanying drawings.

FIG. 1 is a block diagram of an example device in which one or more features of the disclosure can be implemented.
FIG. 2 is a block diagram of the device of FIG. 1, illustrating additional detail.
FIG. 3 is a block diagram illustrating an example method of predicting tuning parameters for a program.
FIG. 4 is a diagram illustrating an example method of implementing the language learning and prediction shown in FIG. 3.

Before a program is deployed to execute on an identified hardware device, the program is typically profiled for the identified hardware by executing the program using different combinations of the program's tuning parameters, which results in varying performance efficiencies. The program tuning parameters for the identified hardware are then selected based on the resulting performance efficiencies.
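The profiling step described above can be sketched as a loop that runs the program once per parameter combination and keeps the fastest. This is only an illustration: `run_kernel`, `profile`, and the parameters `tile_size` and `use_builtin_sum` are hypothetical names, and the workload is a stand-in chosen so that every combination computes the same result at a different speed.

```python
import itertools
import time

def run_kernel(tile_size, use_builtin_sum):
    """Hypothetical stand-in for the program being profiled.

    Every parameter combination returns the same sum; only the speed differs.
    """
    data = list(range(20000))
    total = 0
    for start in range(0, len(data), tile_size):
        tile = data[start:start + tile_size]
        if use_builtin_sum:
            total += sum(tile)
        else:
            for x in tile:
                total += x
    return total

def profile(param_space):
    """Time every combination of parameter values and return the fastest."""
    names = list(param_space)
    best_combo, best_time = None, float("inf")
    for values in itertools.product(*(param_space[n] for n in names)):
        combo = dict(zip(names, values))
        start = time.perf_counter()
        run_kernel(**combo)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_combo, best_time = combo, elapsed
    return best_combo, best_time

param_space = {"tile_size": [1, 4, 16], "use_builtin_sum": [True, False]}
best_combo, best_time = profile(param_space)
print(best_combo, best_time)
```

Which combination wins depends on the machine the sketch runs on, which is the reason profiling is repeated for each identified hardware device.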

The performance efficiency of a program varies with the values of the program's tuning parameters. A program typically includes multiple tuning parameters (e.g., 10 parameters), each having multiple selectable values (e.g., 10 values). Different combinations of these tuning parameter values all compute correct results, but with differing performance efficiency.
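To give a sense of scale for the example figures above, the following sketch counts the combinations for 10 parameters with 10 values each. The 1 ms cost per profiling run is an assumption made only for illustration; it does not come from the source.

```python
import math

def search_space_size(values_per_parameter):
    """Number of distinct combinations of tuning parameter values."""
    return math.prod(values_per_parameter)

size = search_space_size([10] * 10)   # 10 parameters, 10 values each
print(size)                           # 10000000000

# Assumed cost of 1 ms per profiling run (illustrative only).
seconds = size * 0.001
days = seconds / 86400
print(round(days))                    # about 116 days
```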

Conventional profiling systems determine tuning parameter values for a program (e.g., a GPU compute kernel) using search algorithms that traverse the solution space. For example, for a matrix multiplication instruction, a conventional system must repeatedly search a database of stored tuning parameter values for each combination of matrix sizes to be multiplied.

These search algorithms are costly and time consuming. For example, conventional search algorithms typically require expensive computational resources and a large amount of time to tune a program (e.g., a kernel). In addition, the tuning applies only to the programs selected for tuning. Running a program that was not selected typically results in degraded performance, and users who choose to tune their own kernels experience long delays. Furthermore, these conventional search algorithms do not provide tuning parameter values that account for the different input sizes of each program or the different types of problems the program is trying to solve.

The devices and methods described herein efficiently determine tuning parameter values for a program to be executed on identified hardware, without using inefficient search algorithms, by using machine learning algorithms to predict the tuning parameter values based on input values (e.g., input tensor values including image dimensions, matrix dimensions, the number of color channels, and the number of operations to perform).

In contrast to conventional machine learning models, which output numeric values based on input numeric values, the machine learning algorithms described herein convert the input numeric values to words (i.e., one or more letters of a language) and use a language model to predict parameters from the input words. The language learning algorithm learns to translate from a source language (e.g., an input word or word sequence converted from one or more numeric values) to a target language (e.g., an output word sequence). The output words are then converted back to numeric values to obtain executable tuning parameter values.
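The number-to-word and word-to-number conversions can be illustrated with a simple round trip. The token format `name_value` and the function names below are assumptions for illustration; the source does not specify how words are formed.

```python
def number_to_word(name, value):
    """Encode a numeric value as a discrete token (a 'word').

    The 'name_value' format is a hypothetical choice for this sketch.
    """
    return f"{name}_{value}"

def word_to_number(word):
    """Decode a token back into its parameter name and numeric value."""
    name, _, value = word.rpartition("_")
    return name, int(value)

inputs = {"tile": 16, "unroll": 4}
words = [number_to_word(n, v) for n, v in inputs.items()]
decoded = dict(word_to_number(w) for w in words)

print(words)    # ['tile_16', 'unroll_4']
print(decoded)  # {'tile': 16, 'unroll': 4}
```

Treating each value as a token means the model sees `tile_16` and `tile_32` as distinct vocabulary entries rather than as points on a numeric scale.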

The tuning parameter values are predicted based on tuning parameter values that are input to the program in a sequence (as opposed to in parallel), and the tuning parameter values are encoded as discrete words rather than as scalar numbers. The discrete words are then translated using neural machine translation techniques (e.g., techniques that translate sentences from one language to another using a combination of multi-layer perceptrons (MLPs) and other ML primitives such as convolutions, activations, batch normalization, dropout, and recurrent neural networks (RNNs)).

In contrast to conventional language models, the machine learning language algorithms described herein predict tuning parameter values based on constraints that are determined in advance (i.e., prior to runtime), such as combinations of parameter values that are invalid, the maximum number of registers allocated per thread, and the amount of memory accessible per thread. The constraints prevent values that cannot coexist, or that produce invalid results, from being predicted as tuning parameter values. The constraints therefore make the prediction process more efficient, because the tuning parameter values are predicted from a smaller space (i.e., a smaller number of potential parameter values), and make the predictions more accurate, because the selected tuning parameter values avoid invalid results.

A processing device for improving processing performance is provided. The processing device includes a memory configured to store data and a processor in communication with the memory. The processor is configured to receive tuning parameters, each having a numeric value, for executing a portion of a program on an identified hardware device, and to convert the numeric values of the tuning parameters to words. The processor is also configured to predict, using one or more machine language learning algorithms, which combination of the words to use to execute the portion of the program on the identified hardware device based on performance efficiency, and to convert the predicted combination of words to corresponding numeric values for executing the portion of the program on the identified hardware device.

A method of improving processing performance is provided. The method includes receiving tuning parameters, each having a numeric value, for executing a portion of a program on an identified hardware device, and converting the numeric values of the tuning parameters to words. The method also includes predicting, using one or more machine language learning algorithms, which combination of the words to use to execute the portion of the program on the identified hardware device based on performance efficiency, and converting the predicted combination of words to corresponding numeric values for executing the portion of the program on the identified hardware device.

A non-transitory computer-readable storage medium is provided that includes instructions for causing a computer to execute a method. The method includes receiving tuning parameters, each having a numeric value, for executing a portion of a program on an identified hardware device, and converting the numeric values of the tuning parameters to words. The method also includes predicting, using one or more machine language learning algorithms, which combination of the words to use to execute the portion of the program on the identified hardware device based on performance efficiency, and converting the predicted combination of words to corresponding numeric values for executing the portion of the program on the identified hardware device.

As used herein, a program includes any sequence of instructions to be executed using one or more processors to perform a procedure or routine (e.g., an operation, computation, function, process, or job). As used herein, execution of programmed instructions (e.g., applications, drivers, operating systems, or other software) on a processor includes any of a plurality of stages, such as, but not limited to, fetching, decoding, scheduling for execution, beginning execution, and execution of a particular portion of the programmed instructions (e.g., rendering of video on a full screen). Programmed instructions include tuning parameters and tuning parameter settings having tunable (i.e., changeable) values that are used to control the performance efficiency of a program executing on a hardware device.

FIG. 1 is a block diagram of an example device 100 in which one or more features of the disclosure can be implemented. The device 100 can include, for example, a computer, a gaming device, a handheld device, a set-top box, a television, a mobile phone, or a tablet computer. The device 100 includes a processor 102, a memory 104, a storage 106, one or more input devices 108, and one or more output devices 110. The device 100 can also optionally include an input driver 112 and an output driver 114. It is understood that the device 100 can include additional components not shown in FIG. 1.

In various alternatives, the processor 102 includes a central processing unit (CPU), a graphics processing unit (GPU), a CPU and a GPU located on the same die, or one or more processor cores, where each processor core can be a CPU or a GPU. In various alternatives, the memory 104 is located on the same die as the processor 102 or is located separately from the processor 102. The memory 104 includes volatile or non-volatile memory (e.g., random access memory (RAM), dynamic RAM, or a cache).

The storage 106 includes fixed or removable storage (e.g., a hard disk drive, a solid state drive, an optical disk, or a flash drive). The input devices 108 include, without limitation, a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 110 include, without limitation, a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).

The input driver 112 communicates with the processor 102 and the input devices 108 and permits the processor 102 to receive input from the input devices 108. The output driver 114 communicates with the processor 102 and the output devices 110 and permits the processor 102 to send output to the output devices 110. Note that the input driver 112 and the output driver 114 are optional components, and that the device 100 operates in the same manner if the input driver 112 and the output driver 114 are not present. The output driver 114 includes an accelerated processing device (APD) 116 that is coupled to a display device 118. The APD 116 is configured to accept compute commands and graphics rendering commands from the processor 102, to process those compute and graphics rendering commands, and to provide pixel output to the display device 118 for display. As described in further detail below, the APD 116 includes one or more parallel processing units configured to perform computations in accordance with a single-instruction-multiple-data (SIMD) paradigm.
Thus, although various functionality is described herein as being performed by or in conjunction with the APD 116, in various alternatives the functionality described as being performed by the APD 116 is additionally or alternatively performed by other computing devices having similar capabilities that are not driven by a host processor (e.g., the processor 102) and that are configured to provide graphical output to the display device 118. For example, it is contemplated that any processing system that performs processing tasks in accordance with a SIMD paradigm may be configured to perform the functionality described herein. Alternatively, it is contemplated that computing systems that do not perform processing tasks in accordance with a SIMD paradigm perform the functionality described herein.

FIG. 2 is a block diagram of the device 100, illustrating additional details related to execution of processing tasks on the APD 116. The APD 116 includes a plurality of compute units 132, a processing pipeline (e.g., a graphics processing pipeline) 134, and a scheduler 136. The processor 102 maintains, in system memory 104, one or more control logic modules for execution by the processor 102. The control logic modules include an operating system 120, a kernel mode driver 122, and applications 126. These control logic modules control various features of the operation of the processor 102 and the APD 116. For example, the operating system 120 communicates directly with hardware and provides an interface to the hardware for other software executing on the processor 102. The kernel mode driver 122 controls operation of the APD 116 by, for example, providing an application programming interface (API) through which software executing on the processor 102 (e.g., the applications 126) accesses various functionality of the APD 116. The kernel mode driver 122 also includes a just-in-time compiler that compiles programs for execution by processing components of the APD 116 (such as the SIMD units 138 described in further detail below).

The APD 116 executes commands and programs for selected functions, such as graphics operations and non-graphics operations, that may be suited for parallel processing. The APD 116 can be used for executing graphics pipeline operations, such as pixel operations and geometric computations, and for rendering an image to the display device 118 based on commands received from the processor 102. The APD 116 also executes compute processing operations that are not directly related to graphics operations, such as operations related to video, physics simulations, computational fluid dynamics, or other tasks, based on commands received from the processor 102.

The APD 116 includes compute units 132 that include one or more SIMD units 138 configured to perform operations at the request of the processor 102 in a parallel manner according to a SIMD paradigm. The SIMD paradigm is one in which multiple processing elements share a single program control flow unit and program counter and thus execute the same program, but are able to execute that program with different data. In one example, each SIMD unit 138 includes sixteen lanes, where each lane executes the same instruction at the same time as the other lanes in the SIMD unit 138 but can execute that instruction with different data. Lanes can be switched off with predication if not all lanes need to execute a given instruction. Predication can also be used to execute programs with divergent control flow. More specifically, for programs with conditional branches or other instructions where control flow is based on calculations performed by an individual lane, predication of the lanes corresponding to control flow paths not currently being executed, together with serial execution of the different control flow paths, allows for arbitrary control flow.
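The lane masking (predication) described above can be simulated in a few lines. This is a software sketch only: the function `simd_if_else` is a hypothetical name, the 16-lane width is taken from the example in the text, and real hardware applies the mask per instruction rather than per Python loop.

```python
LANES = 16  # lane count from the example in the text

def simd_if_else(data, condition, then_op, else_op):
    """Run both sides of a branch serially, masking off inactive lanes."""
    assert len(data) == LANES
    mask = [condition(x) for x in data]   # per-lane predicate
    result = list(data)
    # Pass 1: only lanes whose predicate is true are active.
    for lane in range(LANES):
        if mask[lane]:
            result[lane] = then_op(data[lane])
    # Pass 2: the remaining lanes execute the other control flow path.
    for lane in range(LANES):
        if not mask[lane]:
            result[lane] = else_op(data[lane])
    return result

out = simd_if_else(list(range(16)),
                   lambda x: x % 2 == 0,   # branch condition
                   lambda x: x * 10,       # "then" path
                   lambda x: -x)           # "else" path
print(out[:4])  # [0, -1, 20, -3]
```

Both paths are walked one after the other, so a divergent branch costs roughly the sum of the two paths even though each lane only keeps one result.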

The basic unit of execution in the compute units 132 is a work-item. Each work-item represents a single instantiation of a program that is to be executed in parallel in a particular lane. Work-items can be executed simultaneously as a "wavefront" on a single SIMD unit 138. One or more wavefronts are included in a "work group," which includes a collection of work-items designated to execute the same program. A work group can be executed by executing each of the wavefronts that make up the work group. In alternatives, the wavefronts are executed sequentially on a single SIMD unit 138, or partially or fully in parallel on different SIMD units 138. A wavefront can be thought of as the largest collection of work-items that can be executed simultaneously on a single SIMD unit 138. Thus, if commands received from the processor 102 indicate that a particular program is to be parallelized to such a degree that the program cannot execute on a single SIMD unit 138 simultaneously, then that program is broken up into wavefronts that are parallelized on two or more SIMD units 138 or serialized on the same SIMD unit 138 (or both parallelized and serialized as needed). The scheduler 136 is configured to perform operations related to scheduling the various wavefronts on the different compute units 132 and SIMD units 138.
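The splitting of a program into wavefronts can be sketched as follows. Two assumptions are made for illustration: the wavefront size is taken to equal the 16-lane example, and wavefronts are assigned to SIMD units round-robin. The source does not state the policy used by the scheduler 136, and the function names are hypothetical.

```python
def split_into_wavefronts(num_work_items, wavefront_size=16):
    """Group work-item IDs into wavefronts of at most wavefront_size items."""
    return [list(range(start, min(start + wavefront_size, num_work_items)))
            for start in range(0, num_work_items, wavefront_size)]

def schedule(wavefronts, num_simd_units):
    """Assign wavefronts round-robin (an assumed policy).

    Wavefronts in the same queue run serially on one SIMD unit;
    different queues can run in parallel on different SIMD units.
    """
    queues = [[] for _ in range(num_simd_units)]
    for index, wavefront in enumerate(wavefronts):
        queues[index % num_simd_units].append(wavefront)
    return queues

wavefronts = split_into_wavefronts(40)
queues = schedule(wavefronts, num_simd_units=2)
print(len(wavefronts))           # 3
print([len(q) for q in queues])  # [2, 1]
```

With 40 work-items, the last wavefront holds only 8 items, so half of its lanes would be idle.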

The parallelism afforded by the compute units 132 is suitable for graphics-related operations such as pixel value calculations, vertex transformations, and other graphics operations. Thus, in some instances, the graphics processing pipeline 134, which accepts graphics processing commands from the processor 102, provides computation tasks to the compute units 132 for execution in parallel.

The compute units 132 are also used to perform computation tasks that are not related to graphics or that are not performed as part of the "normal" operation of the graphics processing pipeline 134 (e.g., custom operations performed to supplement the processing performed for operation of the graphics processing pipeline 134). An application 126 or other software executing on the processor 102 transmits programs that define such computation tasks to the APD 116 for execution.

FIG. 3 is a block diagram illustrating an example method 300 of predicting tuning parameters for a program to be executed on an identified hardware device. Each portion of the method 300, such as encoding, converting, language learning, comparing, and predicting, is performed, for example, by a processor such as the APD 116.

Tuning parameters include parameters that are categorical in nature (e.g., parameters representing options provided to a program to alter its performance efficiency) and parameters having numeric values for tuning a particular quantity, such as the amount of data accessed from memory (e.g., main memory), the number of parallel memory accesses (e.g., reads and writes) made across a link, the number of channels of an input image (e.g., the color channels of an image), the number of output channels (e.g., the output channels of a hyperspectral image), and pipeline depths (e.g., input depth and output depth). Target values for the tuning parameters are determined according to input parameters such as image height, image width, the total number of input channels, the total number of output channels, and the number of images processed at one time. In addition, tuning parameters vary, in part, in that some parameters are interpreted differently from one program to another.

As shown at 302 in FIG. 3, the method 300 includes receiving numeric values for a plurality of tuning parameters of a program to be executed on an identified hardware device (e.g., an identified version of a hardware device). Each of the numeric tuning parameter values is received, for example by the APD 116, in a sequence (i.e., one after another).

As shown at 304 in FIG. 3, the method 300 includes encoding the numeric values in the sequence of tuning parameters. The encoding is performed by converting the tuning parameter values from numeric values to words of a language. Converting the tuning parameter values includes converting a single numeric value to a word, converting one or more numeric values to a word, and converting a single numeric value to multiple words. Examples of encoding include one-hot encoding and dense vectors generated from a one-hot encoding.
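The two encodings named above can be illustrated as follows. The vocabulary and the embedding matrix are made up for this sketch (the matrix would normally be learned), and the function names are hypothetical. Multiplying a one-hot vector by an embedding matrix simply selects one row of the matrix.

```python
def one_hot(word, vocabulary):
    """Vector with a single 1.0 at the position of the word in the vocabulary."""
    vec = [0.0] * len(vocabulary)
    vec[vocabulary.index(word)] = 1.0
    return vec

def dense_vector(one_hot_vec, embedding):
    """Multiply the one-hot row vector by an embedding matrix."""
    dim = len(embedding[0])
    return [sum(one_hot_vec[i] * embedding[i][d] for i in range(len(embedding)))
            for d in range(dim)]

vocabulary = ["tile_8", "tile_16", "tile_32"]     # hypothetical words
embedding = [[0.1, 0.9],                          # illustrative values,
             [0.4, 0.2],                          # not learned weights
             [0.7, 0.5]]

oh = one_hot("tile_16", vocabulary)
dense = dense_vector(oh, embedding)
print(oh)     # [0.0, 1.0, 0.0]
print(dense)  # [0.4, 0.2]
```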

Each converted word is provided to a machine language model 312. As part of the language learning and prediction process 306, and subject to the constraints 314, a machine language learning and prediction algorithm is used to predict which words to use to execute the program on the identified hardware device based on performance efficiency. That is, the machine language learning algorithm predicts which combination of words (corresponding to numeric tuning parameter values) will result in efficient execution of the portion of the program on the identified hardware device (e.g., it predicts whether the combination of words will execute the portion of the program faster, or in less time, than other combinations of words).

The machine language model 312 processes the converted word values of the tuning parameters according to one or more machine learning primitives. Examples of machine learning primitives include convolutional neural networks (CNNs); convolution and pooling layers; recurrent neural networks (RNNs), including unidirectional and bidirectional long short-term memory (LSTM) cells or gated recurrent units (GRUs); and densely connected deep neural networks with dropout and different activation functions.

The words are predicted based on the constraints 314, which include, for example, combinations of parameter values that are invalid, the maximum number of registers allocated per thread, and the amount of memory accessible per thread. The constraints 314 prevent the prediction of tuning parameter values that cannot coexist with one or more other tuning parameter values or that produce invalid results. The constraints improve efficiency because the prediction is made over a smaller space. In addition, the constraints improve the accuracy of the prediction because the predicted tuning parameter values do not produce invalid results.
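A constraint check of this kind can be sketched as a predicate applied to each candidate combination. The limits and the register and memory cost models below are invented for illustration; real limits depend on the identified hardware device, and the parameter names `tile` and `unroll` are hypothetical.

```python
MAX_REGISTERS_PER_THREAD = 256   # assumed limit, for illustration only
MAX_MEMORY_PER_THREAD = 4096     # assumed limit in bytes, for illustration only

def is_valid(candidate):
    """Return True if the candidate respects both per-thread limits."""
    registers = candidate["tile"] * candidate["unroll"]   # toy register model
    memory = candidate["tile"] * candidate["tile"] * 4    # toy memory model
    return (registers <= MAX_REGISTERS_PER_THREAD
            and memory <= MAX_MEMORY_PER_THREAD)

candidates = [{"tile": t, "unroll": u}
              for t in (8, 16, 32, 64)
              for u in (1, 4, 8)]
valid = [c for c in candidates if is_valid(c)]

print(len(candidates), len(valid))  # 12 9
```

Here every candidate with `tile` equal to 64 is rejected by the memory limit, so the predictor would only have to rank 9 combinations instead of 12.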

As shown at 308 in FIG. 3, the method 300 includes decoding the predicted tuning parameter values. The decoding is performed by converting the predicted tuning parameter word values back to numeric values. Then, as shown at 310 in FIG. 3, the predicted numeric tuning parameter values are provided as predicted executable tuning parameter values. The portion of the program is executed on the identified hardware device using the predicted executable tuning parameter values.

An example of the language learning and prediction process 306 is now described in more detail with reference to FIG. 4. As described above, in contrast to conventional systems for determining tuning parameters, which receive multiple tuning parameter values in parallel, tuning parameter values are predicted according to features of the present disclosure based on tuning parameter values that are input to the program as a sequence. That is, each of the input tuning parameter values is received in succession (i.e., in order), and the tuning parameter values are predicted as a sequence.

FIG. 4 shows an example of a method 400 for implementing the language learning and prediction shown at 306 in FIG. 3, in which each of the input tuning parameter values is received in succession. As described in more detail below, FIG. 4 shows stages of the prediction sequence, including filtering intermediate tuning parameter value candidates using the constraints 314 and predicting the next tuning parameter value candidate in the sequence using predicted tuning parameter value candidates (e.g., candidates determined to be more likely than other tuning parameter value candidates to execute the portion of the program with better performance efficiency). Each part of the method 400, such as encoding, converting, language learning, comparing, filtering, determining, and predicting, is performed by a processor such as, for example, the APD 116.

As shown in FIG. 4, each word 402(1)-402(n) of an input word sequence 402 is received. A representation learning process 404 is performed on each word 402(1)-402(n) according to one or more machine learning primitives (e.g., one or more of the machine learning primitives described above) to determine internal representations 406 (e.g., compressed representations of the words 402(1)-402(n) internal to the machine language model 312). Each block of the representation learning 404 represents, for example, a memory cell used to determine the internal representation of the corresponding word of the input word sequence 402.

For example, during the representation learning 404, the internal representation of the first word 402(1) is output (e.g., temporarily stored) as the internal representation 406 of the first word 402(1). The internal representation of the first word 402(1) is also provided upstream (as indicated by the left-to-right arrow between the memory cell of the first word 402(1) and the memory cell of the second word 402(2)) and is used to determine the internal representation of the second word 402(2).

An intermediate internal representation of the second word 402(2) is determined based on the internal representations of the first word 402(1) and the second word 402(2). The intermediate internal representation of the second word 402(2) is then output (e.g., temporarily stored) as the internal representation 406 of the second word 402(2). The internal representation of the second word 402(2) is also provided upstream toward the memory cell of the third word 402(3) (as indicated by the left-to-right arrow between the memory cell of the second word 402(2) and the memory cell of the third word 402(3)) and is used to determine the internal representation of the third word 402(3). This process continues upstream (i.e., in the direction of the left-to-right arrows of the representation learning 404) for each remaining word of the input word sequence 402.

In the example shown in FIG. 4, the representation learning 404 includes bidirectional learning. That is, the internal representation of each word 402(1)-402(n) is also provided downstream (i.e., in the direction of the right-to-left arrows of the representation learning 404). Accordingly, the internal representation of each word 402(1)-402(n-1) is determined based on the upstream words of the input word sequence 402 (i.e., directly based on the next upstream word in the sequence and indirectly based on the other upstream words of the input word sequence 402). Features of the present disclosure can also be implemented using, for example, unidirectional learning (i.e., in the direction of the left-to-right arrows).
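The flow of state between memory cells in both directions can be sketched with a toy scalar recurrence. This is not an LSTM or GRU: the `tanh` update, the fixed weights `w_in` and `w_state`, and the scalar inputs are simplifying assumptions used only to show how each representation comes to depend on its neighbors in the sequence.

```python
import math


def recurrent_pass(inputs, w_in=0.5, w_state=0.5):
    """Carry a state from cell to cell; return the state after each input."""
    state = 0.0
    states = []
    for x in inputs:
        # Each cell combines its own input with the state of the previous cell.
        state = math.tanh(w_in * x + w_state * state)
        states.append(state)
    return states


def bidirectional_representations(inputs):
    """Pair a left-to-right state with a right-to-left state for each word."""
    forward = recurrent_pass(inputs)
    backward = recurrent_pass(inputs[::-1])[::-1]
    return list(zip(forward, backward))


reps = bidirectional_representations([1.0, 2.0, 3.0])
for forward_state, backward_state in reps:
    print(round(forward_state, 4), round(backward_state, 4))
```

Dropping the backward pass and keeping only `forward` corresponds to the unidirectional variant mentioned above.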

The internal representations 406 of the words are used to predict a sequence of words for executing the portion of the program on the identified hardware device. The prediction process includes generating an intermediate word sequence 408 and an output word sequence 410. As described below, multiple tuning parameter candidates, including candidates determined to be more likely than other candidates to yield better performance efficiency, are used to predict the sequence of words for execution. For example, if the first candidate does not satisfy one or more of the constraints 314, the next most likely candidate is used to predict the word in the sequence.

In one example, the numerical tuning parameter candidates used during the prediction process are determined in advance (i.e., before execution). For example, a predetermined number of predictions k is propagated, resulting in k predictions.
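One common way to propagate a fixed number k of predictions through a sequence is beam search. The disclosure does not name this technique, so the sketch below is only an assumed illustration; the per-position log-probability tables are hypothetical and, for simplicity, do not depend on previously chosen words.

```python
import heapq


def beam_search(step_scores, k):
    """Keep the k highest-scoring partial sequences at every position.

    step_scores is a list with one dict per position, mapping each
    candidate word to a log-probability (hypothetical values).
    """
    beams = [(0.0, [])]  # (cumulative log-probability, words so far)
    for scores in step_scores:
        expanded = [
            (total + log_prob, sequence + [word])
            for total, sequence in beams
            for word, log_prob in scores.items()
        ]
        beams = heapq.nlargest(k, expanded, key=lambda beam: beam[0])
    return beams


steps = [{"a": -0.1, "b": -2.0}, {"c": -0.5, "d": -0.6}]
result = beam_search(steps, k=2)
for total, sequence in result:
    print(sequence, round(total, 2))
```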

Next, the internal representations of the words 402(1)-402(n) of the word sequence 402 are provided to a similar machine learning structure to generate the intermediate word sequence 408. Each block of the intermediate word sequence 408 in FIG. 4 represents, for example, a memory cell used to intermediately predict the corresponding word 408(1)-408(n) (i.e., a tuning parameter value candidate) of the intermediate word sequence 408.

The first word 408(1) (i.e., the first candidate) of the intermediate word sequence 408 is intermediately predicted, based on one or more of the machine learning primitives described above, for executing the portion of the program on the identified hardware device. The internal representation of the first word 408(1) is analyzed based on the one or more constraints 314 for the portion of the program (e.g., a portion of a kernel). That is, if the first word 408(1) satisfies each of the one or more constraints 314, the first word 408(1) is intermediately predicted as a parameter value candidate of the output word sequence 410. If the first word 408(1) does not satisfy each of the one or more constraints 314, the first word 408(1) is not selected as a parameter value candidate of the output word sequence 410.

The internal representation of the first word 408(1) is also provided to the next memory cell (i.e., the next upstream memory cell) to determine the second word 408(2) of the intermediate word sequence 408. If the second word 408(2), rather than the first word 408(1), satisfies each of the one or more constraints 314, the second word 408(2) is intermediately predicted as a parameter value candidate of the output word sequence 410. If the second word 408(2) does not satisfy each of the one or more constraints 314, the second word 408(2) is not selected as a parameter value candidate of the output word sequence 410. The process continues for each remaining word of the intermediate word sequence 408.
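The filtering step can be sketched as a fallback over scored candidates: the most likely word that satisfies every constraint is kept, and words that fail are skipped. The scores, the word format, and the `is_valid` predicate are hypothetical stand-ins for the model output and the constraints 314.

```python
def pick_word(scored_candidates, is_valid):
    """Return the most likely candidate word that satisfies every constraint.

    scored_candidates is a list of (word, score) pairs; is_valid is a
    predicate standing in for the constraint check. Returns None if no
    candidate passes.
    """
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    for word, _score in ranked:
        if is_valid(word):
            return word
    return None


candidates = [("tile=64", 0.7), ("tile=32", 0.2), ("tile=16", 0.1)]
# Assume, for illustration, that tile=64 exceeds a per-thread limit.
print(pick_word(candidates, lambda word: word != "tile=64"))
```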

The prediction process also includes an attention mechanism that compares tuning parameter values with other tuning parameter values to predict which combination of tuning parameter values yields better performance efficiency than other combinations of tuning parameter value candidates for executing the portion of the program on the identified hardware device.
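The disclosure does not specify the form of the attention mechanism. As an assumed illustration only, the sketch below uses dot-product attention, in which a query vector is compared with each candidate's key vector and the resulting softmax weights form a weighted combination of the candidates' value vectors. All vectors are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attend(query, keys, values):
    """Compare the query with every key and blend the values accordingly."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [
        sum(weight * value[i] for weight, value in zip(weights, values))
        for i in range(dim)
    ]
    return weights, context


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
weights, context = attend(query, keys, values)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```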

For example, the tuning parameter value candidates 410(1)-410(n) of the output word sequence 410 are compared and ranked according to their likelihood of executing the portion of the program with better performance efficiency than other tuning parameter value candidates. One or more tuning parameter value candidates of the output word sequence 410 (e.g., candidates determined to be more likely than other candidates to yield better performance efficiency) are provided back to the memory cells of the intermediate word sequence 408 to intermediately predict one or more words 408(1)-408(n) of the intermediate word sequence 408. Accordingly, the machine learning algorithm learns to predict tuning parameter values based on the input tuning parameter values (e.g., the values of the input word sequence 402) and the predicted tuning parameter value candidates that are fed back to the machine learning algorithm.

Next, as shown at block 308 of FIG. 3, the predicted tuning parameter value candidates 410(1)-410(n) of the output word sequence 410 are converted back into numerical values and provided as the predicted executable tuning parameter values 310 shown in FIG. 3 for executing the portion of the program on the identified hardware device.

It should be understood that many variations are possible based on the disclosure herein. Although features and elements are described above in particular combinations, each feature or element can be used alone without the other features and elements, or in various combinations with or without other features and elements.

The methods provided can be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, application specific integrated circuits (ASICs), field programmable gate array (FPGA) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors may be manufactured by configuring a manufacturing process using the results of processing hardware description language (HDL) instructions and other intermediate data including netlists (such instructions being capable of being stored on a computer-readable medium). The results of such processing may be maskworks that are used in a semiconductor manufacturing process to manufacture a processor that implements application profiling for power-performance management.

The various functional units shown in the figures and/or described herein (including, but not limited to, the processor 102, the input driver 112, the input devices 108, the output driver 114, the output devices 110, the accelerated processing device 116, the scheduler 136, the graphics processing pipeline 134, the compute units 132, and the SIMD units 138) may be implemented as a general purpose computer, a processor, or a processor core, or as a program, software, or firmware that is stored in a non-transitory computer-readable storage medium or in another medium and is executable by a general purpose computer, a processor, or a processor core.

The methods or flowcharts provided herein may be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium for execution by a general purpose computer or a processor. Examples of non-transitory computer-readable storage media include a read-only memory (ROM), a random access memory (RAM), registers, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks and digital versatile disks (DVDs).

Claims

A processing device for improving processing performance, comprising:
a memory configured to store data;
a processor in communication with the memory;
The processor
receiving tuning parameters each having a numerical value for executing the portion of the program on the identified hardware device;
converting the numerical values of the tuning parameters into words;
predicting, using one or more machine language learning algorithms and based on performance efficiency, which combination of words is better for executing the portion of the program on the identified hardware device; and
converting predicted word combinations into corresponding numerical values for execution of the portion of the program on the identified hardware device;
is configured to do
processing device.

the processor is configured to successively determine a machine language learning representation for each word input in a word sequence;
2. The processing device of claim 1.

the processor is configured to determine a machine language learning representation of a word in the word sequence based on a determined machine language learning representation of another word in the word sequence;
3. The processing device of claim 2.

The processor
predicting an intermediate sequence of words based on the word-by-word machine language learning representation;
determining any word of the intermediate sequence of words as a candidate word for the predicted word combination if the word satisfies each of one or more predetermined constraints;
determining that the any word of the intermediate sequence of words is not a candidate word for the predicted word combination if the word does not satisfy each of the one or more predetermined constraints;
is configured to do
3. The processing device of claim 2.

each of the one or more predetermined constraints indicates whether the combination of words would produce an invalid result by executing a portion of the program;
5. The processing device of claim 4.

The processor
determining the plurality of words of the intermediate sequence of words as candidate words for the predicted word combination;
predicting the next word in the intermediate sequence of words based on candidate words determined to be more likely to execute the portion of the program with greater performance efficiency than other candidate words;
is configured to do
5. The processing device of claim 4.

the performance efficiency is a measure of speed or time to execute a portion of the program;
the processor is configured to determine, based on the performance efficiency, which combination of the words is better for executing the program on the identified hardware device;
2. The processing device of claim 1.

the received numeric values of the plurality of tuning parameters are tensor input values;
2. The processing device of claim 1.

the one or more machine language learning algorithms include at least one of a convolutional neural network, a recurrent neural network, and a combined neural network;
2. The processing device of claim 1.

A method for improving processing performance comprising:
receiving tuning parameters each having a numerical value for executing the portion of the program on the identified hardware device;
converting the numerical values of the tuning parameters into words;
predicting, using one or more machine language learning algorithms and based on performance efficiency, which combination of words is better for executing the portion of the program on the identified hardware device; and
converting predicted word combinations into corresponding numerical values for execution of the portion of the program on the identified hardware device;
Method.

further comprising successively determining a machine language learning representation for each word entered in the word sequence;
11. The method of claim 10.

further comprising determining machine language learning representations of words in the word sequence based on determined machine language learning representations of other words in the word sequence;
12. The method of claim 11.

predicting an intermediate sequence of words based on the word-by-word machine language learning representation;
determining any word of the intermediate sequence of words as a candidate word for the predicted word combination if the word satisfies each of one or more predetermined constraints;
determining that the any word of the intermediate sequence of words is not a candidate word for the predicted word combination if the word does not satisfy each of the one or more predetermined constraints; further comprising
11. The method of claim 10.

each of the one or more predetermined constraints indicates whether the combination of words would produce an invalid result by executing a portion of the program;
14. The method of claim 13.

determining the plurality of words of the intermediate sequence of words as candidate words for the predicted word combination;
predicting the next word in the intermediate sequence of words based on candidate words determined to be more likely to execute the portion of the program with greater performance efficiency than other candidate words; further including,
14. The method of claim 13.

further comprising ranking each of the plurality of words of the intermediate sequence of words according to the candidate word's likelihood of executing the portion of the program with greater performance efficiency than the other candidate words;
16. The method of claim 15.

the performance efficiency is a measure of speed or time to execute a portion of the program;
the method includes determining, based on the performance efficiency, which combination of the words is better for executing the program on the identified hardware device;
12. The method of claim 11.

the received numeric values of the plurality of tuning parameters are tensor input values;
12. The method of claim 11.

the one or more machine language learning algorithms include at least one of a convolutional neural network, a recurrent neural network, and a combined neural network;
12. The method of claim 11.

A computer-readable storage medium having instructions for causing a computer to perform a method, comprising:
The method includes:
receiving tuning parameters each having a numerical value for executing the portion of the program on the identified hardware device;
converting the numerical values of the tuning parameters into words;
predicting, using one or more machine language learning algorithms and based on performance efficiency, which combination of words is better for executing the portion of the program on the identified hardware device; and
converting predicted word combinations into corresponding numerical values for execution of the portion of the program on the identified hardware device;
computer readable storage medium.