JP6957528B2 - Fingerprinting of redundant threads using compiler-inserted transformation code

Info

Publication number: JP6957528B2
Application number: JP2018565057A
Authority: JP
Inventors: Daniel I. Lowell
Original assignee: Advanced Micro Devices Inc
Current assignee: Advanced Micro Devices Inc
Priority date: 2016-06-21
Filing date: 2017-06-21
Publication date: 2021-11-02
Anticipated expiration: 2037-06-21
Also published as: EP3472698A4; US20170364332A1; US10013240B2; JP2019526102A; KR20190010592A; EP3472698B1; KR102410349B1; CN109313551B; WO2017223189A1; CN109313551A; EP3472698A1

Description

Processing units such as central processing units (CPUs), graphics processing units (GPUs), and accelerated processing units (APUs) implement multiple compute units (e.g., processor cores) to process multiple instructions concurrently or in parallel. For example, a GPU can be implemented using multiple compute units that each include multiple processing elements for executing streams of instructions (conventionally referred to as "work-items" or "threads") concurrently or in parallel. Compute units that operate according to a single-instruction-multiple-data (SIMD) architecture execute the same instructions using different data sets. The number of threads that can execute concurrently or in parallel on a processing unit such as a GPU can range from tens of threads to thousands of threads, and engineers would like to harness this capability for applications other than the two-dimensional (2D) or three-dimensional (3D) graphics applications that are typically implemented on a GPU. However, general-purpose applications require a higher level of fault tolerance than conventional graphics applications in order to avoid application errors or system crashes.

The present disclosure may be better understood, and its numerous features and advantages made apparent to those skilled in the art, by referencing the accompanying drawings. The use of the same reference symbols in different drawings indicates similar or identical items.

FIG. 1 is a block diagram of an accelerated processing device according to some embodiments.

FIG. 2 is a block diagram illustrating a hierarchy of groups of threads that can be executed on the accelerated processing device shown in FIG. 1 according to some embodiments.

FIG. 3 is a block diagram of an accelerated processing device that selectively bypasses comparisons of the fingerprints of redundant threads based on the number of event triggers that have occurred since a previous comparison of the fingerprints according to some embodiments.

FIG. 4 is a diagram illustrating modification of program code by transformation code inserted by a compiler during compilation of the program code according to some embodiments.

FIG. 5 is a flow diagram of a method of selectively bypassing or performing share-and-compare operations between redundant threads to detect errors according to some embodiments.

FIG. 6 is a flow diagram of a method of performing an exit check to determine whether to perform a share-and-compare operation between redundant threads before exiting the program code according to some embodiments.

Redundant multithreading (RMT) can be used to improve the reliability of processing units by executing two or more redundant threads on different processing elements and then comparing the results of the redundant threads to detect errors. Detecting a difference between the results produced by two redundant threads executing the same instructions on the same data indicates an error in at least one of the redundant threads. Similarities and differences between the results produced by three or more redundant threads executing the same instructions on the same data can be used to detect an error and, in some cases, to correct it, e.g., by applying a voting scheme to the three or more results. Mechanisms for passing data between the redundant threads to support RMT error detection or correction incur significant overhead. For example, a spin-lock mechanism may be used to synchronize the passing of data and messages between the redundant threads. The performance of an RMT system can be significantly degraded by this overhead, at least in part because a conventional RMT system compares the results produced by the redundant threads before every store instruction (or other event trigger) to avoid storing data that may contain an error.
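The two comparison modes described above (detection with two redundant results, correction by voting with three or more) can be illustrated with a minimal Python sketch. The function name `check_redundant_results` and the requirement of a strict majority are assumptions made for this illustration; the text itself only refers to "a voting scheme".

```python
from collections import Counter


def check_redundant_results(results):
    """Compare the results produced by redundant threads.

    Returns a (status, value) pair where status is "ok", "corrected",
    or "error". Hypothetical helper, not taken from the patent.
    """
    if len(results) < 2:
        raise ValueError("need at least two redundant results")
    if len(set(results)) == 1:
        # All redundant threads agree: no error detected.
        return "ok", results[0]
    if len(results) >= 3:
        # Voting: accept the most common value if it has a strict majority
        # (the majority requirement is an assumption of this sketch).
        (value, votes), = Counter(results).most_common(1)
        if votes > len(results) // 2:
            return "corrected", value
    # Two disagreeing results, or no majority: detection only.
    return "error", None
```

With two threads a mismatch can only be detected, because there is no way to tell which thread is wrong; a third thread makes correction possible.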

The overhead of a software-implemented RMT error detection or correction mechanism can be reduced, without reducing error detection accuracy, by selectively bypassing comparisons of the results of operations performed by redundant threads depending on whether an event trigger for the comparison (e.g., execution of a store instruction by the redundant threads) has occurred a configurable number of times (e.g., more than once) since the previous comparison of results. The overhead associated with storing the results of earlier operations for a subsequent comparison can be reduced, without significantly reducing the probability of detecting an error, by hashing the result produced by a thread together with a previously encoded value, or with an initial value associated with the thread, to produce an encoded value for each redundant thread. The encoded values form a fingerprint of each thread that represents multiple results associated with multiple event triggers. The values of the fingerprints of the redundant threads are shared and compared after several event triggers have occurred in the redundant threads. An error is detected if the values of the fingerprints differ, which in some variations triggers an error recovery process. The redundant threads can include three or more redundant threads, in which case error correction is performed by using a voting scheme to select the most common fingerprint value as the correct value. In some embodiments, the fingerprint is computed by hashing the result to be stored and the address at which the value is to be stored together with the previous value of the fingerprint.
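As a rough illustration of folding several store results into one fingerprint, the sketch below chains a CRC-32 over (address, value) pairs. CRC-32 is a stand-in chosen for this example; the text does not prescribe a particular hash, and the names `update_fingerprint`, `fingerprint_of`, and `INITIAL_FINGERPRINT` are hypothetical.

```python
import struct
import zlib

INITIAL_FINGERPRINT = 0  # hypothetical per-thread initial value


def update_fingerprint(fingerprint, address, value):
    """Hash one store (address and value) into the running fingerprint."""
    record = struct.pack("<QQ", address, value)  # two unsigned 64-bit fields
    # zlib.crc32 accepts a starting value, so the previous fingerprint
    # is carried into the new one.
    return zlib.crc32(record, fingerprint)


def fingerprint_of(stores, initial=INITIAL_FINGERPRINT):
    """Fold a sequence of (address, value) stores into a single fingerprint."""
    fingerprint = initial
    for address, value in stores:
        fingerprint = update_fingerprint(fingerprint, address, value)
    return fingerprint
```

Two redundant threads that perform identical stores end with identical fingerprints, so a single comparison covers all of the stores folded in since the last comparison. A thread whose stored value was corrupted somewhere along the way ends with a different fingerprint, up to the collision probability of the chosen hash.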

In some variations, a compiler is used to transform the program code executed by the redundant threads into the fingerprinting scheme. The transformation code causes the redundant threads to selectively bypass share-and-compare operations and to roll the encoded values for the bypassed share-and-compare operations into a single encoded fingerprint. For example, if the event trigger for a share-and-compare operation is a store instruction, the compiler inserts code to generate a lookup table of the codes used to hash the results stored by the redundant threads with the corresponding previous fingerprint values. The compiler also initializes a counter for each redundant thread. The counter is incremented in response to the redundant thread executing an event trigger and is used to determine how many times share-and-compare operations have been bypassed. The compiler also inserts transformation code to perform the hash, to check whether to bypass or perform the comparison of the values of the redundant threads' fingerprint variables, to share and compare those values, and to determine whether to perform any outstanding share-and-compare operation before exiting the program code.
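The kind of rewriting described in this paragraph can be pictured as a pass over a list of instructions. The toy sketch below treats a program as a list of strings and inserts placeholder operations; all instruction names are hypothetical, and placing the hash and skip check immediately before each store is an assumption of this sketch rather than something the paragraph specifies.

```python
def insert_transformation_code(program):
    """Return a copy of `program` with placeholder transformation code added.

    `program` is a list of instruction names. Names starting with "store"
    are treated as event triggers and names starting with "exit" as exit
    code. This is an illustration, not a real compiler pass.
    """
    # Initialization inserted once at the start of the program.
    transformed = ["init_lookup_table", "init_fingerprint", "init_counter"]
    for instruction in program:
        if instruction.startswith("store"):
            # Fold the store into the fingerprint, then decide whether to
            # bypass or perform the share-and-compare operation.
            transformed.append("hash_into_fingerprint")
            transformed.append("skip_check_or_share_and_compare")
        elif instruction.startswith("exit"):
            # Flush any comparison that is still outstanding.
            transformed.append("exit_check")
        transformed.append(instruction)
    return transformed
```

For example, `insert_transformation_code(["block_a", "store_x", "block_b", "exit"])` leaves the original instructions in order and surrounds the store and the exit with the inserted operations.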

FIG. 1 is a block diagram of an accelerated processing device 100 according to some embodiments. The accelerated processing device (APD) 100 can be used to implement various types of processing units such as a central processing unit (CPU), a graphics processing unit (GPU), a general-purpose GPU (GPGPU), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), and the like. The APD 100 can be configured to implement one or more virtual machines, such as a low-level virtual machine (LLVM) that emulates the operation of a computer system and provides a platform for executing applications. The APD 100 is also configured to implement an operating system and, in some embodiments, each virtual machine executes a separate instance of the operating system. The APD 100 is further configured to execute kernels that perform operations such as graphics pipeline operations including, for example, pixel operations, geometric computations, and image rendering. The APD 100 can also execute non-graphics processing operations such as video operations, physics simulations, and computational fluid dynamics.

The APD 100 includes multiple compute units 101, 102, 103, which are collectively referred to herein as "the compute units 101-103". The compute units 101-103 can be configured to operate as pipelines that concurrently execute different instances of the same kernel. For example, in some variations the compute units 101-103 are single-instruction-multiple-data (SIMD) processor cores that execute the same instructions in parallel using different data. Some embodiments of the APD 100 implement more or fewer compute units than the three shown.

The compute unit 101 includes processing elements 105, 106, 107 (collectively referred to herein as "the processing elements 105-107"). Some embodiments of the processing elements 105-107 are configured to perform the arithmetic and logical operations indicated by instructions that are scheduled for execution by the processing elements 105-107 in the compute unit 101. The compute unit 101 also includes a memory such as a local data store (LDS) 110. The instructions or data stored in the LDS 110 are visible to the processing elements 105-107 but are not visible to entities on the compute units 102, 103. The LDS 110 therefore enables sharing among the processing elements 105-107 of the compute unit 101. The LDS 110 can be implemented using dynamic random access memory (DRAM), embedded DRAM (eDRAM), phase change memory (PCM), and the like. In the interest of clarity, FIG. 1 shows only the processing elements 105-107 and the LDS 110 implemented in the compute unit 101. However, the compute units 102, 103 also include corresponding processing elements and a corresponding LDS.

Each of the processing elements 105-107 executes a separate instance of the kernel. The instances of the kernel executed by the processing elements 105-107 may be referred to as work-items, tasks, or threads. In some variations, the instructions executed by the threads and the data operated on by the instructions are accessed from the LDS 110, and the results of the operations performed by the threads are stored in the LDS 110. The processing elements 105-107 also include private memories 115, 116, 117, which are collectively referred to herein as "the memories 115-117". The memory 115-117 in each processing element 105-107 is visible only to the corresponding processing element. For example, the memory 115 is visible only to the processing element 105 and is not visible to the processing elements 106, 107. The memories 115-117 can be implemented using dynamic random access memory (DRAM), embedded DRAM (eDRAM), phase change memory (PCM), and the like.

The APD 100 also includes a global data store (GDS) 120, which is a memory that is visible to all of the compute units 101-103 implemented in the APD 100. As used herein, the term "visible" indicates that the compute units 101-103 are able to access information in the GDS 120, e.g., by performing stores to write information to the memory or loads to read information from the memory. The GDS 120 can therefore be used to facilitate sharing between threads that are being executed by the processing elements of the compute units 101-103. Some embodiments of the GDS 120 are also visible to other processing units that may be interconnected to the APD 100. For example, the GDS 120 may be visible to a CPU (not shown in FIG. 1) that is connected to the APD 100. The GDS 120 can be implemented using dynamic random access memory (DRAM), embedded DRAM (eDRAM), phase change memory (PCM), and the like.

Redundant threads are executed by the processing elements 105-107 in the APD 100. A fingerprint can be generated for each redundant thread by encoding the results of the operations performed by that thread. The results of the operations performed by the redundant threads, or the fingerprints of the redundant threads, can then be compared to detect (or, in some cases, correct) errors that occur during execution of the redundant threads. Comparison of the results or the associated fingerprints is typically performed in response to a trigger event such as a store instruction, so that errors are detected or corrected before the results are committed to memory. Some embodiments of the APD 100 are configured to selectively bypass comparisons of the fingerprints of the redundant threads depending on whether an event trigger for the comparison has occurred a configurable number of times since the previous comparison of the encoded values of the results. The configurable number can be set to a value of two or more.

FIG. 2 is a block diagram illustrating a hierarchy 200 of groups of threads according to some embodiments. Some variations of the hierarchy 200 represent threads that are executed concurrently or in parallel by the APD 100 shown in FIG. 1. The hierarchy 200 includes a kernel 205 that represents program code that can be executed by processing elements such as the processing elements 105-107 shown in FIG. 1. Instances of the kernel 205 are grouped into workgroups 210, 211, 212, which are collectively referred to herein as "the workgroups 210-212". Each workgroup 210-212 has a local size that defines the number of threads in the workgroup and a group identifier that uniquely identifies each of the workgroups 210-212. In some embodiments, the workgroups 210-212 are collections of related threads that execute concurrently or in parallel. For example, the workgroup 210 includes threads 215, 216, 217, which are collectively referred to herein as "the threads 215-217". The threads 215-217 are assigned different local identifiers that identify the threads 215-217 within the workgroup 210. The threads 215-217 are also assigned global identifiers that identify the threads 215-217 globally across the threads allocated to all of the workgroups 210-212. The threads in each workgroup 210-212 can be synchronized with other threads in the same workgroup.

The workgroups 210-212 are allocated for execution on corresponding compute units. For example, the workgroup 210 can be executed on the compute unit 101 shown in FIG. 1, the workgroup 211 can be executed on the compute unit 102 shown in FIG. 1, and the workgroup 212 can be executed on the compute unit 103 shown in FIG. 1. The threads within the workgroups 210-212 are scheduled for execution on corresponding processing elements within the allocated compute unit. For example, the thread 215 can be scheduled for execution on the processing element 105, the thread 216 can be scheduled for execution on the processing element 106, and the thread 217 can be scheduled for execution on the processing element 107 shown in FIG. 1.

Redundant multithreading (RMT) is performed by the APD 100 to detect and, in some cases, correct errors that occur during processing of threads such as the threads 215-217. Redundant threads are created by instantiating multiple threads to execute the same instructions using the same data. The redundant threads execute on different processing elements. The global identifier, or the combination of the group identifier and the local identifier, indicates the data that is to be processed by the corresponding thread. For example, the global identifier of a thread can be used to compute memory addresses and to make control decisions for the thread. Redundant threads can therefore be created by mapping the global identifiers (or the combinations of group identifier and local identifier) of multiple threads to a single global identifier so that the threads execute the same kernel code on the same data. Software-implemented RMT techniques are disclosed in U.S. Pat. No. 9,274,904, which is incorporated herein by reference in its entirety. The overhead incurred by software-implemented RMT techniques can be reduced by selectively bypassing comparisons of the encoded values (which may also be referred to as fingerprints) of the results of operations performed by the redundant threads depending on whether an event trigger for the comparison has occurred a configurable number of times since the previous comparison of the encoded values of the results.
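A minimal sketch of the identifier mapping described above follows. It assumes that threads with adjacent hardware global identifiers form a redundant group; that pairing rule, the `redundancy` parameter, and the function names are assumptions made for illustration, since the paragraph only states that several global identifiers are mapped to a single one.

```python
def logical_id(global_id, redundancy=2):
    """Map a hardware global ID to (logical global ID, replica index).

    Assumes adjacent hardware IDs belong to the same redundant group.
    """
    return divmod(global_id, redundancy)


def kernel(global_id, data):
    """Toy kernel: each logical thread doubles one element of `data`."""
    index, _replica = logical_id(global_id)
    # Addressing uses the logical ID, so both replicas read the same element.
    return data[index] * 2
```

Hardware threads 4 and 5 both map to logical thread 2, so they execute the same code on `data[2]` and should produce the same result unless an error occurs.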

FIG. 3 is a block diagram of an APD 300 that selectively bypasses comparisons of fingerprints based on the number of event triggers that have occurred since a previous comparison of the fingerprints according to some embodiments. The APD 300 is used to implement some embodiments of the APD 100 shown in FIG. 1. The APD 300 is configured to execute a kernel 305. Threads 310, 315 represent redundant instances of the kernel 305 that execute the same instructions on the same data. In some embodiments, the threads 310, 315 are identified by the same identifier (e.g., the same global identifier, group identifier, or local identifier). Although two threads 310, 315 are shown in FIG. 3, other embodiments may include more redundant threads.

The APD 300 includes processing elements 320, 325 for executing the redundant threads. Each processing element 320, 325 includes a corresponding private memory 330, 335. Although two processing elements 320, 325 are shown in FIG. 3, some embodiments of the APD 300 include more processing elements. The APD 300 also includes a memory 340 that is visible to both of the processing elements 320, 325. For example, if the processing elements 320, 325 are implemented in the same compute unit, the memory 340 can be a local data store such as the LDS 110 shown in FIG. 1. As another example, if the processing elements 320, 325 are implemented in different compute units, the memory 340 can be a global data store such as the GDS 120 shown in FIG. 1.

A compiler implemented by the APD 300 inserts transformation code into the program code defined by the kernel 305 during compilation. When executed by the threads 310, 315, the transformation code causes the threads 310, 315 to selectively bypass comparisons of the fingerprints that the threads 310, 315 generate from the results of executing portions or blocks of the program code. The threads 310, 315 selectively bypass the comparison of the fingerprints depending on whether an event trigger for the comparison has occurred a configurable number of times since the previous comparison of the fingerprints. In some embodiments, the compiler inserts transformation code that causes one or more of the threads 310, 315 to allocate a lookup table 345 containing the code values used to hash results into the fingerprints. For example, a 256-element array of 8-bit values can be allocated in the local shared memory space of the thread group to form the lookup table 345. The transformation code can also cause one or more of the threads 310, 315 to initialize the code values in the lookup table 345. For example, one or more of the threads 310, 315 can insert 256 unique 8-bit cache elements into the array that forms the lookup table 345. Alternatively, some embodiments perform the hash using other algorithms that do not require the lookup table 345 (e.g., a hash algorithm based on an exclusive-OR operation).
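One well-known hash that uses a 256-entry table of unique 8-bit values is Pearson hashing, sketched below as a possible reading of the lookup table 345. The text does not name the algorithm, so treating the table as a Pearson permutation, seeding it with `random.Random`, and the function names are all assumptions of this sketch.

```python
import random


def build_lookup_table(seed=0):
    """Return 256 unique 8-bit code values (a permutation of 0..255)."""
    table = list(range(256))
    # Any fixed permutation works; all redundant threads must use the same one.
    random.Random(seed).shuffle(table)
    return table


def hash_bytes(table, fingerprint, data):
    """Fold the bytes of `data` into an 8-bit fingerprint (Pearson style)."""
    for byte in data:
        fingerprint = table[fingerprint ^ byte]
    return fingerprint
```

Because the table is a permutation, two inputs of equal length that differ in exactly one byte always hash to different 8-bit fingerprints when they start from the same fingerprint value. Inputs that differ in several bytes can still collide, which is the usual cost of a short fingerprint.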

The transformation code inserted into the program code includes register initialization code that causes the threads 310, 315 to initialize corresponding registers 350, 355 for storing the values of the fingerprints. For example, the threads 310, 315 can allocate the registers 350, 355 in the corresponding private memories 330, 335. The threads 310, 315 can also use the code values stored in the lookup table 345 to initialize the values of the fingerprints stored in the registers 350, 355. The transformation code also includes counter initialization code that causes the threads 310, 315 to initialize corresponding counters 360, 365, which are used to count the number of times that the threads 310, 315 have bypassed a comparison of the fingerprint values in response to an event trigger such as a store instruction. The threads 310, 315 can initialize the counters 360, 365 to a default value such as zero.

The transformation code includes hash code that causes the threads 310, 315 to hash the value of the current result (and, in some cases, the address of the location where the result is to be stored) with the current value of the fingerprint stored in the corresponding register 350, 355. For example, if the threads 310, 315 have not yet bypassed a comparison of the fingerprint values, the value of the current result (e.g., the value to be stored in memory) and the address of the storage location are hashed with the initial value of the fingerprint stored in the corresponding register 350, 355. As another example, if the threads 310, 315 have already bypassed one or more previous comparisons of the fingerprint values, the value and address of the current result are hashed with the current fingerprint value, which was previously generated by hashing the values and addresses of earlier results into the fingerprint each time a comparison was bypassed. A skip check is included in the transformation code and is used to compare the values of the counters 360, 365 with a configurable value that determines whether the comparison is bypassed or performed. The configurable value is set to a value greater than one. Larger configurable values further reduce the overhead incurred by the share-and-compare algorithm that the APD 300 uses for error detection.
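The interplay of the hash code, the counter, and the skip check can be sketched as a small per-thread state object. The class name, the `COMPARE_INTERVAL` constant, the use of CRC-32 as the hash, and resetting the counter once a comparison becomes due are assumptions of this sketch.

```python
import zlib

COMPARE_INTERVAL = 4  # hypothetical configurable value (must exceed 1)


class RedundantThreadState:
    """Per-thread fingerprint register and bypass counter (illustrative)."""

    def __init__(self, initial_fingerprint=0):
        self.fingerprint = initial_fingerprint
        self.counter = 0

    def on_store(self, address, value):
        """Hash a store into the fingerprint; return True if a compare is due."""
        record = address.to_bytes(8, "little") + value.to_bytes(8, "little")
        self.fingerprint = zlib.crc32(record, self.fingerprint)
        self.counter += 1
        # Skip check: bypass the comparison until enough triggers accumulate.
        if self.counter >= COMPARE_INTERVAL:
            self.counter = 0  # assumed: the count restarts after a comparison
            return True
        return False

    def exit_check(self):
        """True if stores were hashed since the last comparison."""
        return self.counter > 0
```

With an interval of 4, a thread that executes ten stores performs a comparison after the fourth and eighth stores only, and the exit check reports that the last two stores still need to be compared before the program exits.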

The transformation code also includes share-and-compare code that causes the threads 310, 315 to share the values in the corresponding registers 350, 355 so that the values can be compared for error detection or correction. For example, if the skip check performed by the thread 310 indicates that the value of the counter 360 is greater than or equal to the configurable value, the thread 310 can share the fingerprint stored in the register 350 by copying the value of the fingerprint from the register 350 to a share buffer 370 implemented in the memory 340. Synchronization, spin-locks, or other techniques may be used to coordinate use of the share buffer 370. The skip check performed by the thread 315 likewise indicates that the value of the counter 365 is greater than or equal to the configurable value. The thread 315 can then access the shared fingerprint value associated with the thread 310 and compare it with the fingerprint value stored in the register 355. If the two values are equal, the APD 300 determines that no error has occurred. If the two values differ, the APD 300 determines that an error has occurred and can initiate an error procedure, which may include error correction.
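The share-and-compare step can be mimicked on a CPU with two Python threads and a barrier. In this sketch a dictionary stands in for the share buffer 370 and `threading.Barrier` stands in for whatever synchronization or spin-lock the device would use; both choices, and the function name, are assumptions made for illustration.

```python
import threading


def share_and_compare(fingerprint_a, fingerprint_b):
    """Return True if the two redundant fingerprints match."""
    share_buffer = {}                 # stand-in for the shared memory buffer
    barrier = threading.Barrier(2)    # stand-in for device synchronization
    outcome = {}

    def producer():
        # First thread publishes its fingerprint, then signals readiness.
        share_buffer["fingerprint"] = fingerprint_a
        barrier.wait()

    def consumer():
        # Second thread waits until the value is published, then compares.
        barrier.wait()
        outcome["match"] = share_buffer["fingerprint"] == fingerprint_b

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outcome["match"]
```

A result of `False` corresponds to the case in which the device would start its error procedure. The cost of the barrier is the overhead that the skip check is meant to amortize over several stores.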

FIG. 4 is a diagram illustrating modification of program code 400 by translation code that a compiler inserts during compilation of the program code 400, according to some embodiments. The program code 400 can be part of the code included in a kernel such as the kernel 205 shown in FIG. 2 or the kernel 305 shown in FIG. 3. The program code 400 includes a code block 401, an event trigger 402, a code block 403, an event trigger 404, a code block 405, and exit code 406. The event triggers 402, 404 are instructions that trigger the share and compare operations used to compare the fingerprints of redundant threads to detect or correct errors. For example, the event triggers 402, 404 can be store instructions used to store a value at a location in memory. The stored value is therefore the result of the event trigger, and it is used to determine the fingerprint of the corresponding thread, for example by hashing the stored value and the address of the storage location together with the previous fingerprint value or the initial value. The number of event triggers 402, 404 encountered by a thread executing the program code 400 may be deterministic or non-deterministic. For example, the code blocks 401, 403, 405 can include loops, conditional instructions, branch instructions, and the like, which can cause different instances of the kernel to encounter different numbers of event triggers 402, 404 during execution of the program code 400.

The program code 400 is transformed by the compiler during compilation. For example, some variations of the compiler insert table generation code 410 before the first code block 401 in the program code 400, so that a thread executing the modified program code 415 allocates and initializes a table such as the lookup table 345 shown in FIG. 3. Some embodiments of the table generation code 410 implement a 32-bit cyclic redundancy check (CRC) hash routine and allocate a 256-element array of 8-bit values to store the encoded values used by the CRC hash routine. The table generation code 410 that allocates and populates the table can be expressed in pseudocode.
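As an illustration of table generation for a CRC hash routine, the following is a minimal sketch and not the pseudocode of the original disclosure. It assumes the widely used reflected CRC-32 polynomial 0xEDB88320, which the text does not name, and it stores 32-bit entries, as is conventional for a byte-wise CRC-32, rather than the 8-bit elements stated above. The names `make_crc_table` and `crc32_with_table` are hypothetical.

```python
import zlib

CRC32_POLY = 0xEDB88320  # reflected CRC-32 polynomial (assumed, not from the source)

def make_crc_table():
    """Build the 256-entry lookup table used by a byte-wise CRC-32."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            # Shift right one bit; apply the polynomial when the low bit is set.
            crc = (crc >> 1) ^ CRC32_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table

def crc32_with_table(data, table, crc=0):
    """Hash bytes with the table; `crc` carries a previous result forward."""
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = table[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

TABLE = make_crc_table()
# Cross-check the table-driven routine against the standard library.
assert crc32_with_table(b"123456789", TABLE) == zlib.crc32(b"123456789")
```

Building the table once, before the first code block runs, lets every later hash step process one byte per table lookup instead of one bit per loop iteration.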

The compiler further transforms the program code 400 into modified program code 420 by inserting counter initialization code 425, which is used to initialize counters such as the counters 360, 365 shown in FIG. 3. The modified program code 420 also includes hash code 430, 431 inserted after each of the event triggers 402, 404. The hash code 430, 431 is used to hash the results associated with the event triggers 402, 404 to form, after each event trigger 402, 404, an encoded value that represents the fingerprint of the corresponding thread. The hash code 430, 431 can be expressed in pseudocode.
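The hash step can be sketched as a running CRC that folds each stored value and its address into the fingerprint. This is an illustration under stated assumptions, not the pseudocode of the original disclosure: the function name `update_fingerprint`, the initial value of 0, and the 64-bit little-endian packing of value and address are all hypothetical. `zlib.crc32` accepts a starting value as its second argument, which is what carries the previous fingerprint forward.

```python
import struct
import zlib

INITIAL_FINGERPRINT = 0  # assumed initial value; the source does not give one

def update_fingerprint(fingerprint, value, address):
    """Fold one store (value and target address) into the running fingerprint."""
    # Packing both fields as unsigned 64-bit integers is an assumption.
    record = struct.pack("<QQ", value, address)
    return zlib.crc32(record, fingerprint)

# Two stores, as might follow event triggers 402 and 404.
fp = INITIAL_FINGERPRINT
fp = update_fingerprint(fp, 42, 0x1000)
fp = update_fingerprint(fp, 43, 0x1008)
print(hex(fp))
```

Because the fingerprint accumulates across stores, a single comparison of the final value covers every store hashed since the last comparison, which is what makes bypassing intermediate comparisons safe for detection purposes.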

Skip check code 435, 436 is inserted to determine whether a thread executing the program code 400 has encountered the event triggers 402, 404 a configurable number of times since it last shared and compared fingerprints with one or more corresponding redundant threads. Share and compare code 440, 441 is inserted to cause the thread executing the program code 400 to share its fingerprint value with, and compare it against, the fingerprint values computed by one or more redundant threads. The share and compare code 440, 441 is executed only if the corresponding skip check code 435, 436 indicates that the thread has encountered the configurable number of event triggers 402, 404 since the thread's last share and compare.

The final transformed code 445 is generated by inserting an exit check 450, which is used to determine whether any outstanding share and compare operations need to be performed before the thread exits the program code 400 at the exit block 406.

FIG. 5 is a flow diagram of a method 500 of selectively bypassing or performing share and compare operations between redundant threads to detect errors, according to some embodiments. The method 500 is implemented in some embodiments of the APD 100 shown in FIG. 1 or the APD 300 shown in FIG. 3. The APD is configured to execute multiple instances of a kernel using multiple threads that are allocated to corresponding processing elements implemented in the APD. A compiler in the APD transforms the program code in the kernel to generate translation code that causes the threads executing the instances of the kernel to selectively bypass or perform the share and compare operations. For example, the compiler can transform the program code as shown in FIG. 4.

At block 505, the APD identifies a thread for execution and generates one or more threads that are redundant with the identified thread. As described herein, redundant threads can be identified using the same identifier (e.g., a global identifier, a group identifier, or a local identifier). The APD allocates the redundant threads to different processing elements. At block 510, the redundant threads execute on the different processing elements. The redundant threads can execute concurrently or in parallel, and in some embodiments the redundant threads are synchronized.

At block 515, the thread detects an event trigger (e.g., a store instruction) that potentially triggers a share and compare operation to detect or correct errors. At block 520, the thread increments the corresponding counter in response to detecting the event trigger. At block 525, the thread rolls a hash of the result associated with the event trigger (e.g., the data stored in response to the store instruction) into the corresponding fingerprint. For example, in some variations, the thread hashes the value of the result with a previous fingerprint that was generated from the results of previous event triggers. The thread can also hash other information with the fingerprint, such as the address of the location where the data indicated by the store instruction is stored.

At decision block 530, before performing a share and compare operation between the redundant threads, the thread compares the value of the counter with a threshold that indicates the configurable number of share and compare operations to be bypassed. If the counter is less than or equal to the threshold, the method 500 flows to block 510 and the thread continues executing. If the counter is greater than the threshold, at block 535 the thread performs the share and compare operation to determine whether the fingerprints of the redundant threads match. The APD can determine that an error has occurred if the fingerprints do not match, and can perform error reporting or recovery if an error is detected during the share and compare operation performed at block 535. At block 540, the thread resets the value of the corresponding counter. The method 500 then flows to block 510 and the thread continues executing.
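Blocks 515 through 540 can be sketched as a loop over store events. This is a simplified illustration, assuming two redundant store streams that advance in lock-step so one counter serves both. The function name `run_redundant`, the CRC-based fingerprint, and the zero initial values are hypothetical. The bypass test follows the flow described above: a comparison happens only when the counter exceeds the threshold.

```python
import struct
import zlib

def run_redundant(stores_a, stores_b, threshold):
    """Run two redundant store streams; return (compare_points, errors)."""
    fp_a = fp_b = 0   # fingerprints, assumed to start at 0
    counter = 0       # event triggers seen since the last comparison
    compare_points, errors = [], []
    pairs = zip(stores_a, stores_b)
    for event, (a, b) in enumerate(pairs, start=1):
        counter += 1                                          # block 520
        fp_a = zlib.crc32(struct.pack("<QQ", *a), fp_a)       # block 525
        fp_b = zlib.crc32(struct.pack("<QQ", *b), fp_b)
        if counter <= threshold:                              # block 530: bypass
            continue
        compare_points.append(event)                          # block 535
        if fp_a != fp_b:
            errors.append(event)                              # mismatch detected
        counter = 0                                           # block 540
    return compare_points, errors

# Ten stores of (value, address); the second copy has one corrupted value.
good = [(i, 0x1000 + 8 * i) for i in range(10)]
faulty = list(good)
faulty[1] = (good[1][0] ^ 0x4, good[1][1])   # inject a bit flip at event 2

print(run_redundant(good, good, threshold=3))    # ([4, 8], [])
print(run_redundant(good, faulty, threshold=3))  # ([4, 8], [4, 8])
```

With a threshold of 3, the fault injected at event 2 is detected at the comparison on event 4, so detection is deferred but not lost. In the ten-store run, events 9 and 10 are never compared inside the loop, which is the gap that the exit check 450 exists to close.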

FIG. 6 is a flow diagram of a method 600 of performing an exit check to determine whether a share and compare operation between redundant threads should be performed before exiting the program code, according to some embodiments. The method 600 is implemented in some embodiments of the APD 100 shown in FIG. 1 or the APD 300 shown in FIG. 3. The APD is configured to execute multiple instances of a kernel using multiple threads that are allocated to corresponding processing elements implemented in the APD. The compiler of the APD transforms the program code in the kernel to insert the exit check. For example, the compiler can transform the program code 400 to insert the exit check 450, as shown in FIG. 4.

At block 605, the redundant threads execute on different processing elements. The redundant threads can execute concurrently or in parallel, and in some embodiments the redundant threads are synchronized. At block 610, the thread detects an exit condition. At decision block 615, the thread checks the value of the corresponding counter in response to detecting the exit condition. If the value of the counter is greater than zero (or another default value), which means that a share and compare operation has been bypassed at least once since the last share and compare operation between the redundant threads, the method 600 flows to block 620. At block 620, the thread performs a share and compare based on the current value of the fingerprint. The method 600 then flows to block 625 and the thread executes the exit code. If the value of the counter is equal to zero (or another default value), which means that there are no outstanding share and compare operations, the method 600 flows directly to block 625 and the thread executes the exit code.
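The decision at block 615 reduces to a single test on the counter. The following is a minimal sketch, assuming a default counter value of zero and assuming the redundant thread's fingerprint has already been made available for comparison. The function name `exit_check` and its return convention are hypothetical.

```python
def exit_check(counter, own_fingerprint, shared_fingerprint):
    """Return (compared, error) for the final check before the exit code runs."""
    if counter == 0:
        # Block 615 to block 625: nothing was bypassed since the last
        # comparison, so the thread proceeds straight to the exit code.
        return False, False
    # Block 620: at least one comparison was bypassed, so compare now.
    return True, own_fingerprint != shared_fingerprint

print(exit_check(0, 0x1234, 0x1234))  # (False, False)
print(exit_check(2, 0x1234, 0x1234))  # (True, False)
print(exit_check(2, 0x1234, 0x1235))  # (True, True)
```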

In some embodiments, certain aspects of the techniques described above may be implemented by one or more processors of a processing system executing software. The software comprises one or more sets of executable instructions stored or otherwise tangibly embodied on a non-transitory computer-readable storage medium. The software can include the instructions and certain data that, when executed by the one or more processors, manipulate the one or more processors to perform one or more aspects of the techniques described above. The non-transitory computer-readable storage medium can include, for example, a magnetic or optical disk storage device, solid-state storage devices such as flash memory, a cache, random access memory (RAM), or other non-volatile memory devices. The executable instructions stored on the non-transitory computer-readable storage medium may be in source code, assembly language code, object code, or another instruction format that is interpreted or otherwise executable by one or more processors.

Note that not all of the activities or elements described above in the general description are required, that a portion of a specific activity or device may not be required, and that one or more further activities may be performed, or elements included, in addition to those described. Furthermore, the order in which activities are listed is not necessarily the order in which they are performed. The concepts have also been described with reference to specific embodiments. However, one of ordinary skill in the art will appreciate that various modifications and changes can be made without departing from the scope of the invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the invention.

Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any feature that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature of any or all of the claims. Moreover, the particular embodiments described above are illustrative only, because the disclosed invention may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. No limitations are intended to the details of construction or design shown herein, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified, and all such variations are considered to be within the scope of the disclosed invention. Accordingly, the protection sought herein is as set forth in the claims below.

Claims

A method comprising:
selectively bypassing, at one or more processing elements of a compute unit and in response to an event trigger for a comparison, at least one comparison of results of operations performed by redundant threads of the compute unit,
wherein the selective bypassing is performed based on a determination of whether the event trigger has occurred a configurable number of times since a previous comparison of results of at least one operation previously performed by the redundant threads.

The method of claim 1, wherein the configurable number of times is greater than 1.

The method of claim 1 or 2, wherein the event trigger is a store instruction executed by the redundant threads to store the results in memory.

The method of any one of claims 1 to 3, further comprising generating an encoded value of the results by hashing the results with at least one of a previous encoded value and an initial value.

The method of any one of claims 1 to 4, further comprising modifying program code executed by the redundant threads by inserting, during compilation, code to generate a lookup table of encoded values used to hash the results.

The method of claim 5, wherein modifying the program code during the compilation further comprises initializing a counter for each of the redundant threads, wherein each counter is incremented in response to the corresponding redundant thread executing the event trigger, and wherein the value of the counter is compared with the configurable number of times to determine whether to selectively bypass the at least one comparison.

The method of claim 6, wherein modifying the program code during the compilation further comprises inserting code to hash the results to generate encoded values, to compare the value of the counter with the configurable number of times to determine whether to selectively bypass the at least one comparison, and to share and compare the encoded values among the redundant threads in response to a determination that the event trigger for the comparison has occurred the configurable number of times.

The method of claim 7, wherein modifying the program code during the compilation further comprises inserting code to determine whether the redundant threads are to perform an outstanding share and compare operation before exiting the program code.

A device comprising:
a first processing element to execute a first thread; and
at least one second processing element to execute at least one second thread that is redundant with the first thread,
wherein the first thread and the at least one second thread selectively bypass, in response to an event trigger for a comparison, a comparison of results of operations performed by the first thread and the at least one second thread, the selective bypassing being based on a determination of whether the event trigger has occurred a configurable number of times since a previous comparison of results of at least one operation previously performed by the redundant threads.

The device of claim 9, wherein the event trigger is a store instruction executed by the redundant threads to store the results.

The device of claim 9 or 10, wherein the first thread and the at least one second thread generate encoded values by hashing the results with at least one of a previous encoded value and an initial value.

The device of any one of claims 9 to 11, further comprising a plurality of processing elements including the first processing element and the at least one second processing element, wherein the plurality of processing elements implement a compiler configured to modify program code executed by the first thread and the at least one second thread by inserting, during compilation, code to generate a lookup table of encoded values used to hash the results to generate encoded values.

The device of claim 12, further comprising a memory configured to implement counters for the first thread and the at least one second thread, wherein the compiler is configured to initialize the counters, wherein the first thread or the at least one second thread increments the corresponding counter in response to executing the event trigger, and wherein the value of the counter is compared with the configurable number of times to determine whether to selectively bypass the at least one comparison.

The device of claim 13, wherein the compiler is configured to insert code to hash the results, to compare the value of the counter with the configurable number of times to determine whether to selectively bypass the at least one comparison, and to share and compare the encoded values between the first thread and the at least one second thread in response to a determination that the event trigger for the comparison has occurred the configurable number of times.

The device of claim 14, wherein the compiler is configured to insert code to determine whether the first thread and the at least one second thread are to perform an outstanding share and compare operation before exiting the program code.