JP6709266B2

JP6709266B2 - Processing in neural networks

Info

Publication number: JP6709266B2
Application number: JP2018197162A
Authority: JP
Inventors: Stephen Felix; Simon Christian Knowles; Godfrey Da Costa
Original assignee: Graphcore Ltd
Current assignee: Graphcore Ltd
Priority date: 2017-10-20
Filing date: 2018-10-19
Publication date: 2020-06-10
Anticipated expiration: 2038-10-19
Also published as: US11900109B2; GB201717306D0; US20190121639A1; TW201931107A; GB2568230B; JP2019079524A; CN109697506B; CN109697506A; EP3474193A1; TWI719348B; KR102197539B1; GB2568230A; CA3021426C; KR20190044549A; CA3021426A1

Description

The present disclosure relates to processing data in neural networks.

Neural networks are used in the fields of machine learning and artificial intelligence. A neural network comprises an arrangement of sets of nodes which are interconnected by links and which interact with each other. The principles of neural networks in computing are based on information about how electrical stimuli convey information in the human brain. For this reason the nodes are often referred to as neurons. They may also be referred to as vertices. The links are sometimes referred to as edges. The network can take input data, and certain nodes perform operations on the data. The results of these operations are passed to other nodes. The output of each node is referred to as its activation or node value. Each link is associated with a weight. A weight defines the connectivity between nodes of the neural network. Many different techniques are known by which a neural network can learn, all of which work by altering the values of the weights.

FIG. 1A shows an extremely simplified version of one arrangement of nodes in a neural network. This type of arrangement is often used in learning or training and comprises an input layer of nodes, a hidden layer of nodes, and an output layer of nodes. In reality there are many nodes in each layer, and nowadays there may be more than one layer per section. Each node of the input layer Ni is capable of producing at its output an activation or node value, which is generated by applying a function to the data provided to that node. A vector of node values from the input layer is scaled by a vector of respective weights at the input of each node in the hidden layer. Each weight defines the connectivity between that particular node in the input layer and the node in the hidden layer to which it is connected. In practice, networks may have millions of nodes and be connected multi-dimensionally, so the vector is more often a tensor. The weights applied at the inputs of the node Nh are labelled w₀, …, w₂. Each node in the input layer is connected, at least initially, to each node in the hidden layer. Each node in the hidden layer can apply an activation function to the data provided to it and can similarly generate an output vector which is supplied to each of the nodes N₀ in the output layer. Each node weights its incoming data, for example by computing the dot product of the input activations of the node and its unique weights for the respective incoming links. It then applies an activation function to the weighted data. The activation function may be, for example, a sigmoid. See FIG. 1B.

The network learns by operating on data input at the input layer, assigning weights to the activations from each node, and acting on the data input to each node in the hidden layer (by weighting it and applying the activation function). Thus, the nodes in the hidden layer operate on the data and supply outputs to the nodes in the output layer. Nodes of the output layer may also assign weights to their incoming data. Each weight is characterized by a respective error value. Moreover, each node may be associated with an error condition. The error condition at each node gives a measure of whether the error in the weights of the node falls below a certain level or degree of acceptability. There are different learning approaches, but in each case there is a forward propagation through the network from left to right in FIG. 1A, a calculation of the overall error, and a backward propagation of the error from right to left in FIG. 1A through the network. In the next cycle, each node takes the back-propagated error into account and produces a revised set of weights. In this way, the network can be trained to perform its desired operation.
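The weighting and activation steps described above can be illustrated with a minimal sketch. The function names and the numeric values are hypothetical and are not taken from the disclosure; the sigmoid is used only because the text names it as one possible activation function.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, one possible activation function (see FIG. 1B)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(activations: list[float], weights: list[float]) -> float:
    """Weight the incoming activations with a dot product, then apply the activation function."""
    if len(activations) != len(weights):
        raise ValueError("one weight is needed per incoming link")
    pre_activation = sum(a * w for a, w in zip(activations, weights))
    return sigmoid(pre_activation)


# Hypothetical hidden node Nh with three incoming links weighted w0, w1, w2.
print(neuron_output([1.0, 0.5, -1.0], [0.2, -0.4, 0.1]))
```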

One problem which can arise with a neural network is "overfitting". Large networks with millions or billions of parameters (weights) can easily overfit. Overfitting causes a network to memorize each training sample provided to it (a training sample providing data to the input nodes), rather than being trained to extract relevant features so that the network, once trained, is suited to extracting features from samples more generally. A wide range of techniques has been developed to solve this problem by regularizing neural networks so as to avoid overfitting and memorization.

A technique called "dropout" is discussed in the paper by JE Hinton et al., "Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors", CoRR abs/1207.0580 (2012). According to this technique, for each training example, forward propagation involves randomly deleting half of the activations in each layer. The error is then back-propagated only through the remaining activations. This has been shown to significantly reduce overfitting and to improve network performance.

Another technique, known as "drop connect", is discussed in the paper by Li Wan et al., "Regularisation of Neural Network using Dropconnect", published by the Department of Computer Science, NYU. According to this technique, for each training example, some of the weights are randomly set to zero during forward propagation. This technique has similarly been shown to improve the performance of neural networks.
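The difference between the two techniques is which vector is randomly zeroed before the weighted sum is formed. The following is a minimal software sketch under stated assumptions: `random_mask` and `dot` are hypothetical helper names, the keep probability of 0.5 follows the "half of the activations" description, and each value is assumed to be dropped independently.

```python
import random


def random_mask(values, keep_prob, rng):
    """Keep each value with probability keep_prob, otherwise replace it with zero."""
    return [v if rng.random() < keep_prob else 0.0 for v in values]


def dot(xs, ys):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(xs, ys))


rng = random.Random(0)  # seeded only so that the sketch is repeatable
activations = [0.9, 0.1, 0.4, 0.7]
weights = [0.5, -0.3, 0.8, 0.2]

# Dropout: randomly zero activations, then form the weighted sum.
dropout_sum = dot(random_mask(activations, 0.5, rng), weights)

# Drop connect: randomly zero weights, then form the weighted sum.
drop_connect_sum = dot(activations, random_mask(weights, 0.5, rng))

print(dropout_sum, drop_connect_sum)
```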

Implementing neural networks using known computer technology poses various challenges. For example, it is not straightforward to implement specific techniques such as dropout and drop connect on a CPU or GPU, which can compromise the full benefit that an efficient implementation could deliver.

The present inventors have developed an execution unit for a processor which can efficiently implement dropout or drop connect based on a single instruction in an instruction sequence of the processing unit.

According to one aspect of the invention, there is provided an execution unit for executing a computer program comprising a sequence of instructions, the sequence including a masking instruction, wherein the execution unit is configured to execute the masking instruction which, when executed by the execution unit, masks randomly selected values from a source operand of n values and retains other original values from the source operand, to generate a result which includes the original values from the source operand and the masked values in their respective original positions.

The execution unit may comprise a hardware pseudo-random number generator (HPRNG) configured to generate a randomized bit string, from which is derived a random bit sequence for randomly selecting the values to be masked.

In one embodiment, each randomized bit string output from the HPRNG comprises m bits, and the execution unit is configured to divide the m bits into p sequences and to compare each sequence with a probability value to generate a weighted probability indicator for selectively masking the values.

The masking instruction may comprise a source field identifying the source operand, an indication of a destination register for holding the result, and a probability field defining the probability value.

The execution unit may comprise an input buffer for holding the source operand and an output buffer for holding the result.

The values of the source operand may represent weights of links in a neural network (for example, to implement drop connect).

The values in the source operand may represent activations defining output values of nodes in a neural network (for example, to implement dropout).

The sequence of instructions may comprise an instruction for computing the dot product of the result with a further set of values, and/or an instruction for writing the result to a memory location after the masking instruction has been executed.

The source operand may comprise any suitable number of values of any suitable length, for example (as non-limiting examples) four 16-bit values, two 32-bit values, or eight 8-bit values. In general, n may be one or more.

A corresponding method and computer program are provided.

One aspect provides a method of executing a computer program comprising a sequence of instructions, the sequence including a masking instruction, the method comprising, responsive to execution of the masking instruction, randomly selecting values from a source operand of n values and retaining other original values from the source operand, to generate a result which includes the original values from the source operand and masked values in their respective original positions.

A randomized bit string output comprising m bits may be provided, and the method may comprise dividing the m bits into p sequences and comparing each sequence with a probability value to generate a weighted probability indicator for selectively masking the values.

The masking instruction may comprise a source field identifying the source vector, an indication of a destination register for holding the result, and a probability field defining the probability value.

The sequence of instructions may comprise an instruction for computing the dot product of the result with a further set of values.

The sequence of instructions may comprise an instruction for writing the result to a memory location after the masking instruction has been executed.

Another aspect provides a computer program comprising a sequence of computer-readable instructions stored on a non-transmissible medium, the sequence including a masking instruction which, when executed, randomly selects and masks values from a source operand of n values and retains other original values from the source operand, to generate a result which includes the original values from the source operand and the masked values in their respective original positions.

As used herein, the terms "random" and "randomly" are taken to mean random or pseudo-random.

FIG. 1A is a very simplified schematic of a neural network.
FIG. 1B is a very simplified schematic of a neuron.
FIG. 2 is a schematic view of a processing unit according to an embodiment of the present invention.
FIG. 3 shows the format of a masking instruction.
FIG. 4 is a block diagram of an execution unit for implementing the masking instruction.

FIG. 2 shows a schematic block diagram of an execution unit arranged to execute a single instruction for masking randomly selected values in a vector. The instruction is referred to herein as the rmask instruction. The execution unit 2 forms part of a pipeline 4 in a processing unit. The processing unit comprises an instruction fetch unit 6 which fetches instructions from an instruction memory 10. The processing unit also comprises a memory access stage 8 which is responsible for accessing a data memory 12, either to load data from the memory or to store data into it. A set of registers 14 is provided for holding the source and destination operands of the instructions being executed by the pipeline 4 at any given time. It will readily be understood that the pipeline 4 may contain many different types of execution unit for executing a variety of different instructions, for example for performing mathematical operations. One type of processing unit which may be useful with the present invention uses barrel-threaded time slots, in which a supervisor thread may allocate different worker threads to different time slots for their execution. The rmask instruction described herein may be used with any suitable processing unit architecture.

FIG. 3 shows the rmask instruction format. The instruction has a field 30 which defines the source memory location from which the source operand is retrieved, and a field 32 for the destination memory location in which the output vector is placed. A memory location may be an implicitly or explicitly defined register among the registers 14, or an address in the data memory. The rmask instruction further defines a probability value Src1 34, the probability with which each individual value in the source operand is kept, as described in more detail below. The instruction has an opcode 36 which identifies it as the rmask instruction. The rmask instruction comes in two different forms. In one form, the source operand represents a vector comprising four 16-bit values (a half-precision vector); in the other form, the source operand defines a vector comprising two 32-bit values (a single-precision vector).

In either case, the rmask instruction has the effect of masking (setting to zero) randomly selected values in the source vector. Such a function has several different applications, in particular the functions described above which are known in the field of neural networks as "drop connect" and "dropout". As already explained, in a neural network a vector may be provided which represents weights, these weights representing the links between neurons in the network. Another vector may represent activations, these being the values output by the individual neurons which are connected by the links.

A very common function in neural networks is to compute the dot product of these two vectors. For example, the dot product may be the pre-activation value, which is the value from which the neuron output is calculated using an activation function. Applying the rmask instruction to the vector of weights before this dot product is computed implements the drop connect function: it randomly sets individual weights in the vector to zero.

Applying the rmask instruction to the vector of activations before the dot product is computed implements the dropout function: it randomly sets individual activations in the vector to zero.

The vectors may represent floating-point numbers in single or half precision. One type of vector may comprise four 16-bit values (a half-precision vector), and another type may comprise two 32-bit values (a single-precision vector). FIG. 2 shows an execution unit which operates on four 16-bit values or two 32-bit values. The execution unit 2 in FIG. 2 has an input buffer 20 which can hold four 16-bit values or two 32-bit values. It is shown holding four 16-bit values representing the weights w₀, …, w₃ respectively. The unit has a pseudo-random number generator 22, implemented in hardware, for supplying random numbers to an rmask module 24. An input buffer location 26 is provided to hold the probability with which each individual value in the vector is kept. This probability value may be supplied in the instruction (as shown in FIG. 3), or it may be set by an earlier instruction and accessed by the rmask instruction from a register or memory address. For example, it may be a 17-bit immediate value in the instruction. The rmask module uses the random number generated by the PRNG 22 together with the probability value 26 to randomly mask (set to zero) values of the input vector.

The output vector is placed in an output buffer 28. In FIG. 2, the first and third values are shown masked to zero. Note that the probability is the probability with which each individual value is kept (that is, not masked). For example, if the probability is 0.5, each individual value is masked with probability 0.5, so the probability that all four values are masked to zero is (0.5)⁴, that is, 1/16. The random number generator introduces the required randomness into the masking by indicating, based on the probability, which values are to be masked.
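The arithmetic in this example generalizes as follows, under the assumption that each of the n values is kept independently with probability p (the symbols n, p, and k are introduced here for illustration only):

```latex
\[
P(\text{all } n \text{ values masked}) = (1-p)^{n}, \qquad
P(\text{all } n \text{ values kept}) = p^{n}, \qquad
P(\text{exactly } k \text{ values masked}) = \binom{n}{k}\,(1-p)^{k}\,p^{\,n-k}.
\]
\[
n = 4,\; p = 0.5: \qquad (1-0.5)^{4} = \tfrac{1}{16}.
\]
```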

Referring to FIG. 4, the rmask module operates as follows. The pseudo-random number generator 22 produces a 64-bit output. For each output, four 16-bit fields cf0, …, cf3 are formed and tested against the probability value in the source operand. Note that in single-precision mode two 32-bit fields are formed instead, and are likewise tested against the probability value in the source operand.

When the mask instruction is executed, a 64-bit output of the PRNG is provided. For each output, four 16-bit fields are formed. res0[15:0] denotes the least significant 16 bits of the PRNG output, and the same syntax applies to the other three fields. Each of these fields is assigned to a cf field (cf denotes "comparison field"); the comparison fields cf0, …, cf3 are each 16 bits long.
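The field formation can be sketched in software as below. This is an illustration only: `split_fields` is a hypothetical name, and the assumption that cf1, cf2, and cf3 take bits [31:16], [47:32], and [63:48] respectively is inferred from the statement that the same syntax applies to the other three fields.

```python
def split_fields(prng_output: int, field_bits: int = 16, total_bits: int = 64) -> list[int]:
    """Split one PRNG word into comparison fields, least significant field first."""
    if not 0 <= prng_output < (1 << total_bits):
        raise ValueError("PRNG output does not fit in total_bits")
    field_mask = (1 << field_bits) - 1
    count = total_bits // field_bits
    return [(prng_output >> (i * field_bits)) & field_mask for i in range(count)]


# Half-precision form: four 16-bit fields cf0..cf3.
cf = split_fields(0x0123_4567_89AB_CDEF)
print([hex(f) for f in cf])

# Single-precision form: two 32-bit fields.
print([hex(f) for f in split_fields(0x0123_4567_89AB_CDEF, field_bits=32)])
```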

Four weighted random bits wpb[0], …, wpb[3] are then derived using comparison logic 40, by comparing each comparison field cf0, …, cf3 with the probability value from the source operand src1[15:0].

That is, bit wpb[0] is assigned the value "1" if cf0 < src1. Here, cf0 is a 16-bit unsigned random value which can take values in the range {0, …, 65535}. src1 is a 32-bit wide operand, but only its least significant 17 bits are used by rmask. src1 is permitted to take values in the range {0, …, 65536}, so the probability of being "unmasked" is src1/65536.

Note that, for the permitted src1 range {0, …, 65536}, the 17th bit of src1 (src1[16]) is set only for the value 65536. Since the maximum value of cf0 is 65535, and 65535 < 65536, wpb[0] is automatically "1" whenever src1[16] == 1.
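The comparison step can be sketched as follows. `weighted_bits` is a hypothetical name, and rejecting values above 65536 is an assumption of this sketch, since the text does not say how out-of-range values of src1 are treated.

```python
def weighted_bits(fields: list[int], src1: int) -> list[int]:
    """Return wpb[i] = 1 where the 16-bit field cf_i is below the probability value."""
    prob = src1 & 0x1FFFF  # only the least significant 17 bits of src1 are used
    if prob > 65536:
        raise ValueError("src1 is expected to lie in the range 0..65536")
    return [1 if cf < prob else 0 for cf in fields]


# With src1 = 65536 every field compares as smaller, so every value is unmasked.
print(weighted_bits([0, 1234, 40000, 65535], 65536))

# Exactly src1 of the 65536 possible field values compare as smaller, so the
# unmask probability for a uniformly distributed field is src1 / 65536.
print(sum(weighted_bits([cf], 49152)[0] for cf in range(65536)) / 65536)
```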

The four bits wpb[0], …, wpb[3] are then used to unmask the four 16-bit values in src0[63:0] respectively. That is, for each i in {0, …, 3}, the field aDst[16i+15:16i] is set to aSrc0[16i+15:16i] if wpb[i] == 1'b1, and to 16'b0 otherwise.

Here, 16'b0 denotes a 16-bit wide sequence with the value "0", 1'b0 denotes a bit with the value "0", and 1'b1 denotes a bit with the value "1".

For example, if wpb[1] == 1, then aDst[31:16] = aSrc0[31:16]; otherwise, aDst[31:16] = 0.
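Putting the field formation, comparison, and selection together gives the following software emulation of the half-precision form. It is a sketch under stated assumptions, not the hardware implementation: `rmask_f16x4` is a hypothetical name, the PRNG output is passed in as an argument so that the result is repeatable, and comparison field i is assumed to occupy the same bit positions as value i.

```python
def rmask_f16x4(src0: int, src1: int, prng_output: int) -> int:
    """Emulate rmask on four packed 16-bit values held in a 64-bit word."""
    prob = src1 & 0x1FFFF  # least significant 17 bits of the probability operand
    dst = 0
    for i in range(4):
        cf = (prng_output >> (16 * i)) & 0xFFFF  # comparison field cf_i
        if cf < prob:  # wpb[i] == 1: copy the value through unchanged
            dst |= src0 & (0xFFFF << (16 * i))
        # wpb[i] == 0: the field is left as 16'b0
    return dst


# Four half-precision values 1.0, 2.0, 3.0, 4.0 (0x3C00, 0x4000, 0x4200, 0x4400),
# with a hand-picked PRNG word that masks the first and third values.
src0 = 0x4400_4200_4000_3C00
print(hex(rmask_f16x4(src0, 32768, 0x0000_FFFF_0000_FFFF)))
```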

In one embodiment, the probability value aSrc1 is used to select the probability with which values are randomly "unmasked" (that is, not masked).

If the operand aSrc1 = 65536, all values are always unmasked.

If the operand aSrc1 = 49152, each value is independently masked with a probability of 0.25 and unmasked with a probability of 0.75.

If the operand aSrc1 = 32768, each value is independently masked with a probability of 0.50 and unmasked with a probability of 0.50.

If the operand aSrc1 = 16384, each value is independently masked with a probability of 0.75 and unmasked with a probability of 0.25.

If the operand aSrc1 = 0, all values are always masked.

To "unmask" an input means to reproduce at the output the value represented by the input. The output format and precision are not necessarily the same as the input format.

To "mask" an input means to produce at the output a symbol representing the value zero instead of the value represented by the input.

Note that unmask probability == 1 - mask probability.

In the implementation of rmask described herein, the probability value aSrc1 == 65536 * unmask probability. Given a mask probability, the corresponding probability value can be derived as aSrc1 == 65536 * (1 - mask probability).

It will therefore be understood that either the mask probability or the unmask probability may be defined (the two sum to "1").
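The relationships above can be checked with a short sketch. Both helper names are hypothetical, and rounding to the nearest integer is an assumption made here for probabilities that are not exact multiples of 1/65536.

```python
def src1_from_unmask_probability(p_unmask: float) -> int:
    """aSrc1 == 65536 * unmask probability, rounded to the nearest integer."""
    if not 0.0 <= p_unmask <= 1.0:
        raise ValueError("a probability must lie between 0 and 1")
    return round(65536 * p_unmask)


def src1_from_mask_probability(p_mask: float) -> int:
    """aSrc1 == 65536 * (1 - mask probability)."""
    return src1_from_unmask_probability(1.0 - p_mask)


# Reproduce the operand values listed in the text.
for p_unmask in (1.0, 0.75, 0.5, 0.25, 0.0):
    print(p_unmask, src1_from_unmask_probability(p_unmask))
```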

The rmask instruction has several different applications. For example, when implementing dropout, the rmask instruction can be used just before the neuron outputs are written to memory. All neurons in the next stage of the neural network then pick up the masked activations from the previous layer.

When implementing drop connect, the rmask instruction can be used just after the activations are read from memory and just before the dot product is computed. Because the instruction is agnostic to what the source vector represents, it also has applications other than dropout and drop connect. The instruction is likewise agnostic to the format of the 16-bit or 32-bit scalar values in the source vector. At the output, each scalar value is either left unchanged or has all of its bits masked to zero. The mask instruction can therefore be used to mask a vector of four 16-bit integers, signed or unsigned, or of values in any of a variety of 16-bit floating-point formats. The only requirement on the scalar value format is that the value represented by all bits being zero is "0".
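The format independence can be illustrated by applying the same bitwise masking to packed half-precision floats and to packed signed 16-bit integers. This is a sketch only: `apply_mask_bits` and `mask_packed` are hypothetical names, the mask bits are fixed by hand rather than drawn from a PRNG, and Python's `struct` module stands in for the register packing.

```python
import struct


def apply_mask_bits(src0: int, wpb: list[int]) -> int:
    """Copy 16-bit field i of src0 where wpb[i] is 1 and zero it otherwise."""
    dst = 0
    for i, bit in enumerate(wpb):
        if bit:
            dst |= src0 & (0xFFFF << (16 * i))
    return dst


def mask_packed(fmt: str, values, wpb: list[int]):
    """Pack four 16-bit scalars, mask them bitwise, and unpack in the same format."""
    src0 = int.from_bytes(struct.pack(fmt, *values), "little")
    dst = apply_mask_bits(src0, wpb)
    return struct.unpack(fmt, dst.to_bytes(8, "little"))


wpb = [1, 0, 1, 0]  # keep the first and third values
print(mask_packed("<4e", (1.5, -2.0, 0.25, 8.0), wpb))  # IEEE half precision
print(mask_packed("<4h", (7, -3, 12, -1), wpb))         # signed 16-bit integers
```

The masking logic never inspects the scalar format; it relies only on the all-zero bit pattern representing the value zero in both formats.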

In some number formats, there may be more than one symbol representing the value zero.

For example, in the IEEE half-precision floating-point format, either of the following symbols can be used to represent zero:
16'b0000000000000000 (positive zero)
16'b1000000000000000 (negative zero)
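The two symbols can be inspected with Python's `struct` module, whose `e` format code is IEEE half precision. `half_bits` is a hypothetical helper name used only for this illustration.

```python
import struct


def half_bits(x: float) -> int:
    """Return the 16-bit IEEE half-precision encoding of x as an integer."""
    return int.from_bytes(struct.pack("<e", x), "little")


print(format(half_bits(0.0), "016b"))   # positive zero
print(format(half_bits(-0.0), "016b"))  # negative zero

# Both symbols decode to a value that compares equal to zero.
negative_zero = struct.unpack("<e", (0x8000).to_bytes(2, "little"))[0]
print(negative_zero == 0.0)
```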

The term "random" as used herein can mean "truly random" or "pseudo-random". The rmask instruction can use either a pseudo-random or a true random bit sequence generator.

Pseudo-random numbers are generated by a "pseudo-random number generator" or "PRNG". A PRNG can be implemented in software or in hardware. True random numbers are generated by a "true random number generator" or "TRNG". An example of a TRNG is a "transition effect ring oscillator". The advantage of a PRNG over a TRNG is determinism: running the same program twice with the same starting conditions always gives the same result.
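The determinism property can be demonstrated with any seeded software PRNG. In the sketch below, Python's `random.Random` is used purely as a stand-in for the hardware generator, and `prng_words` is a hypothetical helper name.

```python
import random


def prng_words(seed: int, count: int) -> list[int]:
    """Generate `count` 64-bit words from a software PRNG started from `seed`."""
    rng = random.Random(seed)
    return [rng.getrandbits(64) for _ in range(count)]


first_run = prng_words(seed=42, count=4)
second_run = prng_words(seed=42, count=4)
print(first_run == second_run)  # same starting conditions give the same sequence
```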

The advantage of a TRNG over a PRNG is that its output is truly random. The output of a PRNG satisfies a finite set of arbitrarily chosen mathematical properties, but the state and output of a PRNG are always predictable from the current state and are therefore not truly random.

While particular embodiments have been described, other applications and variants of the disclosed techniques may become apparent to a person skilled in the art once given the disclosure herein. The scope of the present disclosure is not limited by the described embodiments but only by the accompanying claims.

Claims

An execution unit for executing a computer program including a sequence of instructions, the sequence including masking instructions, the execution unit configured to execute the masking instructions, the masking instructions comprising the execution unit. , Masking randomly selected positions in the source operand having n values, and of each of the positions in the result, the selected position is a masked value and the selected An execution unit for generating a result that leaves the original value at the corresponding position of the source operand for positions other than the corresponding position .

2. The execution unit according to claim 1, comprising a hardware pseudo-random number generator (HPRNG) configured to generate a randomized bit string, wherein a sequence of random bit fields for randomly selecting the positions to be masked is derived from the randomized bit string.

3. The execution unit according to claim 2, wherein each randomized bit string output from the HPRNG comprises m bits, and the execution unit is configured to divide the m bits into n bit fields and to compare each bit field with a probability value to generate a weighted probability indicator for selectively masking the values.

4. The execution unit according to claim 3, wherein the masking instruction includes a source field that identifies the source operand, an indication of a destination register for holding the result, and a probability field that defines the probability value.

5. The execution unit according to any one of claims 1 to 4, comprising an input buffer for holding the source operand and an output buffer for holding the result.

6. The execution unit according to claim 1, wherein the values of the source operand represent weights of links in a neural network.

7. The execution unit according to claim 1, wherein the values in the source operand represent activation values that define output values of nodes in a neural network.

8. The execution unit according to any one of claims 1 to 7, wherein the sequence of instructions comprises instructions for implementing a dot product calculation of the result and a further set of values.

9. The execution unit according to any one of claims 1 to 7, wherein the sequence of instructions comprises an instruction for writing the result to a memory location after the masking instruction has been executed.

10. The execution unit according to any one of claims 1 to 9, wherein the source operand comprises four 16-bit values.

11. The execution unit according to any one of claims 1 to 9, wherein the source operand comprises two 32-bit values.

12. A method of executing a computer program comprising a sequence of instructions, the sequence including a masking instruction, the method comprising: in response to execution of the masking instruction, randomly selecting and masking positions in a source operand having n values; and generating a result in which each selected position holds a masked value and every other position retains the original value from the corresponding position of the source operand.

13. The method according to claim 12, comprising providing a randomized bit string output containing m bits, dividing the m bits into n bit fields, and comparing each bit field with a probability value to generate a weighted probability indicator for selectively masking the positions.

14. The method according to claim 12 or 13, comprising generating a randomized bit string, wherein a sequence of random bit fields for randomly selecting the positions to be masked is derived from the randomized bit string.

15. The method according to claim 13, wherein the masking instruction includes a source field that identifies the source operand, an indication of a destination register for holding the result, and a probability field that defines the probability value.

16. The method according to any one of claims 12 to 15, wherein the sequence of instructions includes instructions for implementing a dot product calculation of the result and a further set of values.

17. The method according to any one of claims 12 to 15, wherein the sequence of instructions includes an instruction for writing the result to a memory location after the masking instruction has been executed.

18. A program to be executed by a computer, comprising a sequence of computer-readable instructions stored on a non-transmission medium, the sequence including a masking instruction which, when executed, causes the computer to: randomly select positions to be masked in a source operand having n values; and generate a result in which each selected position holds a masked value and every other position retains the original value from the corresponding position of the source operand.
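
As an informal illustration only, and not part of the claims, the masking operation recited above (an m-bit randomized string divided into n bit fields, each compared with a probability value) might be sketched in Python as follows. Several details are assumptions not stated in the claims: the function name `random_mask`, 16-bit fields, the masked value being 0, the rule that a position is masked when its field is below the threshold, and the use of the software `random.Random` generator in place of an HPRNG.

```python
import random


def random_mask(values, mask_threshold, rng, field_bits=16):
    """Return a copy of `values` with randomly selected positions masked.

    Assumptions (not from the source): fields are `field_bits` wide,
    a position is masked when its field is below `mask_threshold`,
    and the masked value is 0.
    """
    n = len(values)
    m = n * field_bits                  # total width of the random bit string
    bits = rng.getrandbits(m)           # randomized m-bit string
    field_mask = (1 << field_bits) - 1
    result = []
    for i, value in enumerate(values):
        # Extract the i-th bit field and compare it with the threshold.
        field = (bits >> (i * field_bits)) & field_mask
        masked = field < mask_threshold
        # Masked positions become 0; the rest keep the original value.
        result.append(0 if masked else value)
    return result


rng = random.Random(0)                  # software stand-in for an HPRNG
source = [0.5, -1.25, 2.0, 3.5]         # n = 4 values, so m = 64 bits
result = random_mask(source, mask_threshold=1 << 15, rng=rng)
print(result)
```

Under these assumptions, `mask_threshold / 2**field_bits` is the probability that any one position is masked, so a threshold of `1 << 15` masks each value with probability one half.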