JP2021532451A

JP2021532451A - Hierarchical parallelism in a network of distributed neural cores

Info

Publication number: JP2021532451A
Application number: JP2021500263A
Authority: JP
Inventors: アーサー、ジョン、バーノン; キャシディ、アンドリュー、ステファン; フリックナー、マイロン; ダッタ、パラブ; ペナー、ハルトムート; アップスワミー、ラシナクマール; 潤澤田; モダ、ダルメンドラ; エッサー、スティーブン、カイル; タバ、ブライアン、セイショー; クラモ、ジェニファー
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2018-07-12
Filing date: 2019-07-11
Publication date: 2021-11-25
Anticipated expiration: 2039-07-11
Also published as: US20200019836A1; EP3821376A1; WO2020011936A1; CN112384935A; JP7426980B2

Abstract

Hierarchical parallelism is provided in a network of distributed neural cores. In various embodiments, a plurality of neural cores is provided. Each of the plurality of neural cores comprises a plurality of vector compute units configured to operate in parallel. Each of the plurality of neural cores is configured to compute output activations in parallel by applying its plurality of vector compute units to input activations. Each of the plurality of neural cores is assigned a subset of the output activations of a layer of a neural network to compute. Upon receiving a subset of the input activations of the layer of the neural network, each of the plurality of neural cores computes a partial sum for each of its assigned output activations, and computes its assigned output activations from at least the computed partial sums.

Description

Embodiments of the present disclosure relate to neural computation, and more specifically to hierarchical parallelism in a network of distributed neural cores.

According to embodiments of the present disclosure, systems for neural computation are provided. In various embodiments, a plurality of neural cores is provided. Each of the plurality of neural cores comprises a plurality of vector compute units configured to operate in parallel. Each of the plurality of neural cores is configured to compute output activations in parallel by applying its plurality of vector compute units to input activations. Each of the plurality of neural cores is assigned a subset of the output activations of a layer of a neural network to compute. Upon receiving a subset of the input activations of the layer of the neural network, each of the plurality of neural cores computes a partial sum for each of its assigned output activations, and computes its assigned output activations from at least the computed partial sums.

According to embodiments of the present disclosure, methods of and computer program products for neural computation are provided. In various embodiments, a subset of the input activations of a layer of a neural network is received at each of a plurality of neural cores. Each of the plurality of neural cores comprises a plurality of vector compute units configured to operate in parallel. Each of the plurality of neural cores is configured to compute output activations in parallel by applying its plurality of vector compute units to input activations. A subset of the output activations of a layer of the neural network is assigned to each of the plurality of neural cores to compute. Upon receiving a subset of the input activations of the layer of the neural network, each of the plurality of neural cores computes a partial sum for each of its assigned output activations, and computes its assigned output activations from at least the computed partial sums.

FIG. 1 illustrates a neural core according to embodiments of the present disclosure.
FIG. 2 illustrates an exemplary neural core showing inter-core parallelism according to embodiments of the present disclosure.
FIG. 3 illustrates an exemplary neural core showing intra-core parallelism according to embodiments of the present disclosure.
FIG. 4 illustrates a method of neural computation according to embodiments of the present disclosure.
FIG. 5 depicts a computing node according to embodiments of the present disclosure.

An artificial neuron is a mathematical function whose output is a nonlinear function of a linear combination of its inputs. Two neurons are connected if the output of one is an input to the other. A weight is a scalar value encoding the strength of the connection between the output of one neuron and the input of another neuron.

A neuron computes its output, called an activation, by applying a nonlinear activation function to a weighted sum of its inputs. A weighted sum is an intermediate result computed by multiplying each input by the corresponding weight and accumulating the products. A partial sum is a weighted sum of a subset of the inputs. A weighted sum of all inputs may be computed in stages by accumulating one or more partial sums.
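The definitions above can be illustrated with a minimal Python sketch. The function names `weighted_sum` and `partial_sum`, the example values, and the use of tanh as the activation function are illustrative assumptions and do not come from the disclosure. The sketch shows that accumulating partial sums over disjoint input subsets reproduces the weighted sum of all inputs.

```python
import math


def weighted_sum(inputs, weights):
    """Multiply each input by its weight and accumulate the products."""
    return sum(x * w for x, w in zip(inputs, weights))


def partial_sum(inputs, weights, start, stop):
    """Weighted sum over the input subset with indices start..stop-1."""
    return weighted_sum(inputs[start:stop], weights[start:stop])


# Hypothetical example values (not from the disclosure).
x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, -1.0, 0.25, 2.0]

full = weighted_sum(x, w)
# Accumulate two partial sums over disjoint halves of the inputs.
staged = partial_sum(x, w, 0, 2) + partial_sum(x, w, 2, 4)

# tanh stands in for an arbitrary nonlinear activation function.
activation = math.tanh(staged)
```

Here `full` and `staged` hold the same weighted sum, reached in one pass and in two stages respectively.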

A neural network is a collection of one or more neurons. A neural network is often divided into groups of neurons called layers. A layer is a collection of one or more neurons that all receive input from the same layers and all send output to the same layers, and that typically perform a similar function. An input layer is a layer that receives input from a source outside the neural network. An output layer is a layer that sends output to a target outside the neural network. All other layers are intermediate processing layers. A multilayer neural network is a neural network with more than one layer. A deep neural network is a multilayer neural network with many layers.

A tensor is a multidimensional array of numerical values. A tensor block is a contiguous subarray of the elements in a tensor.

Each neural network layer is associated with a parameter tensor V, a weight tensor W, an input data tensor X, an output data tensor Y, and an intermediate data tensor Z. The parameter tensor contains all of the parameters that control the neuron activation functions σ in the layer. The weight tensor contains all of the weights that connect inputs to the layer. The input data tensor contains all of the data that the layer consumes as input. The output data tensor contains all of the data that the layer computes as output. The intermediate data tensor contains any data that the layer produces as intermediate computations, such as partial sums.

The data tensors (input, output, and intermediate) for a layer may be three-dimensional, in which case the first two dimensions may be interpreted as encoding spatial location and the third dimension as encoding different features. For example, when a data tensor represents a color image, the first two dimensions encode the vertical and horizontal coordinates within the image, and the third dimension encodes the color at each location. Every element of the input data tensor X can be connected to every neuron by a separate weight, so the weight tensor W generally has six dimensions, concatenating the three dimensions of the input data tensor (input row a, input column b, input feature c) with the three dimensions of the output data tensor (output row i, output column j, output feature k). The intermediate data tensor Z has the same shape as the output data tensor Y. The parameter tensor V concatenates the three output data tensor dimensions with an additional dimension o that indexes the parameters of the activation function σ.

An element of the output data tensor Y of a layer can be computed as in Equation 1, where the neuron activation function σ is configured by the vector of activation function parameters V[i,j,k,:], and the weighted sum Z[i,j,k] can be computed as in Equation 2.
Y[i,j,k] = σ(V[i,j,k,:]; Z[i,j,k])
Equation 1

For simplicity of notation, the weighted sum in Equation 2 may be referred to as the output, which is equivalent to using a linear activation function Y[i,j,k] = σ(Z[i,j,k]) = Z[i,j,k], with the understanding that the same statements apply without loss of generality when a different activation function is used.
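A minimal Python sketch of the layer computation follows. Equation 2 is not reproduced in this text, so the sketch assumes that the weighted sum Z[i,j,k] contracts the six-dimensional weight tensor against all three dimensions of the input data tensor, as the description of W suggests. The index order `W[a][b][c][i][j][k]`, the function name `layer_output`, the two-parameter activation `scaled_relu`, and all example values are assumptions made for illustration.

```python
from itertools import product


def layer_output(X, W, V, sigma):
    """Compute Y[i][j][k] = sigma(V[i][j][k], Z[i][j][k]) for every output element.

    Assumed layout: X[a][b][c], W[a][b][c][i][j][k], V[i][j][k][o].
    """
    A, B, C = len(X), len(X[0]), len(X[0][0])
    I, J, K = len(V), len(V[0]), len(V[0][0])
    Y = [[[0.0] * K for _ in range(J)] for _ in range(I)]
    for i, j, k in product(range(I), range(J), range(K)):
        # Assumed form of the weighted sum: contract W against all of X.
        z = sum(
            W[a][b][c][i][j][k] * X[a][b][c]
            for a, b, c in product(range(A), range(B), range(C))
        )
        Y[i][j][k] = sigma(V[i][j][k], z)
    return Y


def scaled_relu(params, z):
    """Hypothetical activation with two parameters: scale and offset."""
    scale, offset = params
    return max(0.0, scale * z + offset)


# 1 x 1 x 2 input and 1 x 1 x 1 output (hypothetical values).
X = [[[1.0, 2.0]]]
W = [[[[[[3.0]]], [[[4.0]]]]]]
V = [[[[1.0, -1.0]]]]
Y = layer_output(X, W, V, scaled_relu)
```

Passing `lambda params, z: z` as `sigma` gives the linear-activation case described above, in which the output equals the weighted sum.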

In various embodiments, computation of the output data tensor as described above is decomposed into smaller problems. Each problem may then be solved on one or more neural cores, or on one or more cores of a conventional multicore system, in parallel.

It will be apparent from the above that neural networks are inherently parallel structures. Neurons in a given layer receive inputs X with elements x_i from one or more layers or other inputs. Each neuron computes its state y ∈ Y based on the inputs and on weights W with elements w_i. In various embodiments, the weighted sum of the inputs is adjusted by a bias b, and the result is then passed to a nonlinearity F(·). For example, the activation of a single neuron may be expressed as y = F(b + Σ_i x_i w_i).

Because all neurons in a given layer receive inputs from the same layers and compute their outputs independently, neuron activations can be computed in parallel. Because of this aspect of the overall neural network, performing the computation on parallel distributed cores speeds up the overall computation. Further, within each core, vector operations can be computed in parallel. Even with recurrent input, for example when a layer projects back onto itself, all neurons are still updated simultaneously. In effect, the recurrent connections are delayed so that they align with the next input to the layer.

Referring now to FIG. 1, a neural core according to embodiments of the present disclosure is depicted. Neural core 100 is a tileable computational unit that computes one block of an output tensor. Neural core 100 has M inputs and N outputs. In various embodiments, M = N. To compute an output tensor block, the neural core multiplies an M×1 input tensor block 101 by an M×N weight tensor block 102 and accumulates the products into weighted sums, which are stored in a 1×N intermediate tensor block 103. An O×N parameter tensor block contains the O parameters that specify each of the N neuron activation functions, which are applied to the intermediate tensor block 103 to produce a 1×N output tensor block 105.
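The block computation performed by a single neural core can be sketched in Python as follows. This is an illustrative sketch, not the disclosed hardware: the function name `neural_core`, the list-of-lists layout, the single-parameter activation `shifted_relu`, and the example values are all assumptions.

```python
def neural_core(x_block, w_block, v_block, sigma):
    """Compute one 1 x N output block from an M x 1 input block.

    x_block: list of M inputs.
    w_block: M rows of N weights.
    v_block: O rows of N activation parameters.
    """
    m, n = len(w_block), len(w_block[0])
    if len(x_block) != m:
        raise ValueError("input block length must match the weight block rows")
    # 1 x N intermediate block: one weighted sum per output.
    z_block = [sum(x_block[i] * w_block[i][j] for i in range(m)) for j in range(n)]
    # Column j of the parameter block configures activation function j.
    return [sigma([row[j] for row in v_block], z_block[j]) for j in range(n)]


def shifted_relu(params, z):
    """Hypothetical activation with O = 1 parameter: a threshold."""
    return max(0.0, z - params[0])


x_block = [1.0, 2.0]                 # M = 2
w_block = [[1.0, 0.0], [0.5, -1.0]]  # M x N = 2 x 2
v_block = [[0.5, 0.0]]               # O x N = 1 x 2
y_block = neural_core(x_block, w_block, v_block, shifted_relu)
```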

Multiple neural cores may be tiled in a neural core array. In some embodiments, the array is two-dimensional.

A neural network model is a set of constants that collectively specify the entire computation performed by a neural network, including the graph of connections between neurons as well as the weights and activation function parameters for every neuron. Training is the process of modifying the neural network model to perform a desired function. Inference is the process of applying a neural network to an input to produce an output, without modifying the neural network model.

An inference processing unit is a category of processors that perform neural network inference. A neural inference chip is a specific physical instance of an inference processing unit.

In embodiments that include multiple cores in an array, each core performs part of the computation of the overall neural network. For a given layer with N_{NEURON} output neurons indexed by j = 1:N_{NEURON}, and N_{INPUT} input neurons with values x_i indexed by i = 1:N_{INPUT}, the values of the layer's outputs are given by Equation 3 for j = 1:N_{NEURON} and i = 1:N_{INPUT}.
Y_j = F(b_j + Σ_i x_i w_{ji})
Equation 3

In some embodiments, each core computes some number of the output neurons. For N_{CORE} cores, each core computes on average N_{NEURON}/N_{CORE} output neurons, and each core receives all of the inputs. In this case, every core needs all of the inputs but computes only a portion of the outputs. For example:
Core 1 computes: Y_j = F(b_j + Σ_i x_i w_{ji}) for j = 1:N_{NEURON}/N_{CORE} and i = 1:N_{INPUT}
Core 2 computes: Y_j = F(b_j + Σ_i x_i w_{ji}) for j = N_{NEURON}/N_{CORE}+1 : 2N_{NEURON}/N_{CORE} and i = 1:N_{INPUT}
Core k computes: Y_j = F(b_j + Σ_i x_i w_{ji}) for j = (k-1)N_{NEURON}/N_{CORE}+1 : kN_{NEURON}/N_{CORE} and i = 1:N_{INPUT}
Core N_{CORE} computes: Y_j = F(b_j + Σ_i x_i w_{ji}) for j = (N_{CORE}-1)N_{NEURON}/N_{CORE}+1 : N_{NEURON} and i = 1:N_{INPUT}
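The output-partitioned scheme can be sketched in Python as follows. The helper names `output_slice` and `core_outputs`, the ReLU nonlinearity, and the example layer are illustrative assumptions. For simplicity the sketch also assumes that N_{NEURON} is an exact multiple of N_{CORE}, whereas the text only states an average number of neurons per core.

```python
def output_slice(core, n_neuron, n_core):
    """0-based [start, stop) range of neurons for 1-based core index `core`.

    Assumes n_neuron is an exact multiple of n_core.
    """
    per_core = n_neuron // n_core
    return (core - 1) * per_core, core * per_core


def core_outputs(core, x, w, b, n_core, f):
    """Outputs Y_j computed by one core; the core sees every input x_i."""
    start, stop = output_slice(core, len(b), n_core)
    return [
        f(b[j] + sum(x_i * w_ji for x_i, w_ji in zip(x, w[j])))
        for j in range(start, stop)
    ]


def relu(z):
    return max(0.0, z)


# Hypothetical layer: 2 inputs, 4 output neurons, 2 cores.
x = [1.0, 2.0]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]  # w[j][i]
b = [0.0, 0.5, -1.0, 0.0]

core1 = core_outputs(1, x, w, b, 2, relu)
core2 = core_outputs(2, x, w, b, 2, relu)
layer = core1 + core2  # concatenating the slices gives the whole layer
```

No communication between cores is needed in this scheme, at the cost of delivering every input to every core.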

In some embodiments, each core computes all of the output neurons, but only for a subset of the inputs. For N_{CORE} cores, each core computes all output neurons for on average N_{INPUT}/N_{CORE} input neurons. In this case, a core needs only a portion of the inputs, but the outputs must be collected and summed. For example:
Core 1 computes: Y_{j_core1} = Σ_i x_i w_{ji} for j = 1:N_{NEURON} and i = 1:N_{INPUT}/N_{CORE}
Core 2 computes: Y_{j_core2} = Σ_i x_i w_{ji} for j = 1:N_{NEURON} and i = N_{INPUT}/N_{CORE}+1 : 2N_{INPUT}/N_{CORE}
Core k computes: Y_{j_corek} = Σ_i x_i w_{ji} for j = 1:N_{NEURON} and i = (k-1)N_{INPUT}/N_{CORE}+1 : kN_{INPUT}/N_{CORE}
Core N_{CORE} computes: Y_{j_coreNCORE} = Σ_i x_i w_{ji} for j = 1:N_{NEURON} and i = (N_{CORE}-1)N_{INPUT}/N_{CORE}+1 : N_{INPUT}

To arrive at the complete result, Y_j = F(b_j + Σ_k Y_{j_corek}) for j = 1:N_{NEURON}/N_{CORE} and k = 1:N_{CORE}. In various embodiments, the complete result is computed among the cores or outside the cores.
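The input-partitioned scheme and the final reduction can be sketched in Python as follows. The helper names `input_slice`, `core_partial_sums`, and `reduce_layer`, the ReLU nonlinearity, and the example values are illustrative assumptions, as is the requirement that N_{INPUT} be an exact multiple of N_{CORE}.

```python
def input_slice(core, n_input, n_core):
    """0-based [start, stop) range of inputs for 1-based core index `core`.

    Assumes n_input is an exact multiple of n_core.
    """
    per_core = n_input // n_core
    return (core - 1) * per_core, core * per_core


def core_partial_sums(core, x, w, n_core):
    """Partial sums Y_{j_corek} for every neuron j over this core's inputs."""
    start, stop = input_slice(core, len(x), n_core)
    return [sum(x[i] * w_j[i] for i in range(start, stop)) for w_j in w]


def reduce_layer(partials, b, f):
    """Collect the per-core partial sums, add the bias, apply the nonlinearity."""
    return [f(b_j + sum(p[j] for p in partials)) for j, b_j in enumerate(b)]


def relu(z):
    return max(0.0, z)


# Hypothetical layer: 4 inputs, 2 output neurons, 2 cores.
x = [1.0, 2.0, 3.0, 4.0]
w = [[1.0, 1.0, 1.0, 1.0], [0.5, 0.0, -0.5, 0.0]]  # w[j][i]
b = [0.0, 1.0]

partials = [core_partial_sums(k, x, w, 2) for k in (1, 2)]
y = reduce_layer(partials, b, relu)
```

In this sketch `reduce_layer` plays the role of the collection step, which the text says may run among the cores or outside them.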

In some implementations, each core computes some number of the output neurons, but not every core has access to all of the input neurons. In such embodiments, each core computes partial outputs and distributes them until every core has all of the partial outputs it needs to fully compute its set of neurons. In such embodiments, the cores need only pass information (partial sums) that is non-zero, which allows the array to exploit the high-level structure of the neural network layer. For example, a convolutional neural network requires less computation and communication of inputs, outputs, and partial sums. For example:
Core 1 computes: Y_{j_core1} = Σ_i x_i w_{ji} for j = 1:N_{NEURON} and i = 1:N_{INPUT}/N_{CORE}
Core 2 computes: Y_{j_core2} = Σ_i x_i w_{ji} for j = 1:N_{NEURON} and i = N_{INPUT}/N_{CORE}+1 : 2N_{INPUT}/N_{CORE}
Core k computes: Y_{j_corek} = Σ_i x_i w_{ji} for j = 1:N_{NEURON} and i = (k-1)N_{INPUT}/N_{CORE}+1 : kN_{INPUT}/N_{CORE}
Core N_{CORE} computes: Y_{j_coreNCORE} = Σ_i x_i w_{ji} for j = 1:N_{NEURON} and i = (N_{CORE}-1)N_{INPUT}/N_{CORE}+1 : N_{INPUT}

The complete result is then computed, either sequentially or in an overlapped manner: Y_j = F(b_j + Σ_k Y_{j_corek}) for j = 1:N_{NEURON}/N_{CORE} and k = 1:N_{CORE}.
Core 1 computes: Y_j = F(b_j + Σ_k Y_{j_corek}) for j = 1:N_{NEURON}/N_{CORE} and k = 1:N_{CORE}
Core 2 computes: Y_j = F(b_j + Σ_k Y_{j_corek}) for j = N_{NEURON}/N_{CORE}+1 : 2N_{NEURON}/N_{CORE} and k = 1:N_{CORE}
Core k computes: Y_j = F(b_j + Σ_k Y_{j_corek}) for j = (k-1)N_{NEURON}/N_{CORE}+1 : kN_{NEURON}/N_{CORE} and k = 1:N_{CORE}
Core N_{CORE} computes: Y_j = F(b_j + Σ_k Y_{j_corek}) for j = (N_{CORE}-1)N_{NEURON}/N_{CORE}+1 : N_{NEURON} and k = 1:N_{CORE}
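The combined scheme, in which every core holds a slice of the inputs and owns a slice of the outputs, can be simulated with the following Python sketch. The function name `simulate_array`, the message counter, the ReLU nonlinearity, and the example weights are illustrative assumptions. The sketch assumes that both N_{INPUT} and N_{NEURON} are exact multiples of N_{CORE}, and it models the passing of partial sums as plain list updates rather than as a real on-chip network.

```python
def simulate_array(x, w, b, n_core, f):
    """Return (layer outputs, number of partial sums passed between cores)."""
    n_input, n_neuron = len(x), len(b)
    in_per, out_per = n_input // n_core, n_neuron // n_core
    # inbox[c][j_local] accumulates partial sums for the neurons owned by core c.
    inbox = [[0.0] * out_per for _ in range(n_core)]
    messages = 0
    for src in range(n_core):
        lo, hi = src * in_per, (src + 1) * in_per
        for j in range(n_neuron):
            partial = sum(x[i] * w[j][i] for i in range(lo, hi))
            if partial == 0.0:
                continue  # zero partial sums need not be passed
            owner = j // out_per
            inbox[owner][j % out_per] += partial
            if owner != src:
                messages += 1
    # Each owning core adds the bias and applies the nonlinearity.
    y = [
        f(b[c * out_per + jl] + inbox[c][jl])
        for c in range(n_core)
        for jl in range(out_per)
    ]
    return y, messages


def relu(z):
    return max(0.0, z)


x = [1.0, 2.0, 3.0, 4.0]
b = [0.0, 0.0]
# Locally connected weights: each neuron reads only its own core's inputs.
w_local = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
# Dense weights: every neuron reads every input.
w_dense = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]

y_local, msgs_local = simulate_array(x, w_local, b, 2, relu)
y_dense, msgs_dense = simulate_array(x, w_dense, b, 2, relu)
```

With the locally connected weights no partial sum crosses a core boundary, while the dense weights need one exchange per core. This illustrates how layer structure can reduce communication.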

Referring now to FIG. 2, an exemplary neural core illustrating inter-core parallelism is shown. Core 201 includes crossbar 202, which comprises parallel vector compute units. Crossbar 202 receives inputs 203 and multiplies them by parameters, where M = N_{INPUT}/N_{CORE}. The result of the matrix multiplication is provided to parallel sum and nonlinearity unit 205, which receives partial sum inputs Y_{j_corek} 206 from other cores in the array.

As noted above, each core in the array contains parallel/concurrent neural network compute elements. For example, a given core k computes:
Y_{j_corek} = Σ_i x_i w_{ji} for j = 1:N_{NEURON} and i = (k-1)N_{INPUT}/N_{CORE}+1 : kN_{INPUT}/N_{CORE}
or computes:
Y_j = F(b_j + Σ_k Y_{j_corek}) for j = (k-1)N_{NEURON}/N_{CORE}+1 : kN_{NEURON}/N_{CORE} and k = 1:N_{CORE}
or both.

The N_{NEURON} partial sums (Y_{j_corek}) can be computed simultaneously in separate vector units, with each vector unit performing the multiply and add (Σ_i x_i w_{ji}) for a single neuron j. The N_{NEURON} sums and nonlinearities (Y_j) can be computed simultaneously in separate accumulation and nonlinearity units, each computing for a single neuron j.

Referring now to FIG. 3, an exemplary neural core illustrating intra-core parallelism is shown. Core 301 includes vector multiply/add unit 302 (corresponding to one of the units 202 of FIG. 2), which receives inputs 303, [x_1, ..., x_8], and multiplies them by parameters 304, [w_{j1}, ..., w_{j8}]. The result of the multiplication is provided to parallel sum and nonlinearity unit 305, which receives partial sum inputs 306 from other cores in the array and provides neuron output 307. In this way, a given vector unit operates on many inputs at once to produce a single output or multiple outputs.

As described above, the vector units each compute in a parallel/concurrent manner. For example, the j-th vector unit of the k-th core computes Y_{j_corek} = Σ_i x_i w_{ji}. This operation can be performed in parallel/concurrently. The vector unit performs the summation in parallel, for example using an adder tree, which can be pipelined for further concurrency. For example, with i = 1:8 (as illustrated), the vector unit performs eight parallel multiplications, followed by four two-input additions, then two two-input additions, then one two-input addition. These four parallel operations can be pipelined and run concurrently for higher throughput.
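The adder-tree reduction described above can be sketched in Python as follows. The function name `vector_unit` and the example values are illustrative assumptions, and the sketch assumes a power-of-two number of inputs. It runs sequentially and only models the stage structure: within each stage the operations are independent, so hardware could execute them in parallel.

```python
def vector_unit(x, w):
    """Return (weighted sum, number of addition stages) using an adder tree.

    Assumes len(x) == len(w) and that the length is a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1) or n != len(w):
        raise ValueError("expected equal-length inputs with a power-of-two length")
    # Stage 0: all multiplications are independent of one another.
    level = [x_i * w_i for x_i, w_i in zip(x, w)]
    stages = 0
    while len(level) > 1:
        # Each stage halves the number of values with two-input additions.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        stages += 1
    return level[0], stages


# Eight inputs, as in the illustrated example (hypothetical values).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
w = [1.0] * 8
total, stages = vector_unit(x, w)
```

For eight inputs the tree has three addition stages (four, two, and one additions), which together with the multiplication stage gives the four operations named in the text.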

Referring to Table 1, the total parallelism of various core arrays is shown for core array sizes (columns, 1×1 ... 64×64) and crossbar sizes (rows, 1×1 ... 1024×1024). The total parallelism is the product of the core array size and the crossbar size.
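The product rule stated above can be checked with a short Python sketch. The function name `total_parallelism` is an illustrative assumption. Only the smallest and largest sizes named in the text are evaluated, because the intermediate table entries are not reproduced here.

```python
def total_parallelism(core_rows, core_cols, xbar_rows, xbar_cols):
    """Product of the core array size and the crossbar size."""
    return (core_rows * core_cols) * (xbar_rows * xbar_cols)


smallest = total_parallelism(1, 1, 1, 1)
largest = total_parallelism(64, 64, 1024, 1024)
```

For the largest configuration this evaluates to 4,096 cores times 1,048,576 crossbar elements, or 4,294,967,296 (2^32).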

Referring to FIG. 4, a method of neural computation according to embodiments of the present disclosure is illustrated. At 401, a subset of the input activations of a layer of a neural network is received at each of a plurality of neural cores. Each of the plurality of neural cores comprises a plurality of vector compute units configured to operate in parallel. Each of the plurality of neural cores is configured to compute output activations in parallel by applying its plurality of vector compute units to input activations. At 402, a subset of the output activations of a layer of the neural network is assigned to each of the plurality of neural cores to compute. Upon receiving a subset of the input activations of the layer of the neural network, each of the plurality of neural cores computes, at 403, a partial sum for each of its assigned output activations and computes, at 405, its assigned output activations from at least the computed partial sums. In some embodiments, each of the plurality of cores receives, at 404, a partial sum for at least one of its assigned output activations from another of the plurality of neural cores, and computes, at 405, its assigned output activations from the computed partial sums and the received partial sums.

It will be appreciated that the neural networks provided herein may be used as classifiers or generators.

Referring now to FIG. 5, a schematic of an example of a computing node is shown. Computing node 10 is only one example of a suitable computing node and is not intended to suggest any limitation as to the scope of use or functionality of the embodiments described herein. Regardless, computing node 10 is capable of being implemented and/or performing any of the functionality set forth hereinabove.

In computing node 10 there is a computer system/server 12, which is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with computer system/server 12 include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems or devices.

Computer system/server 12 may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types. Computer system/server 12 may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media, including memory storage devices.

As shown in FIG. 5, computer system/server 12 in computing node 10 is shown in the form of a general-purpose computing device. The components of computer system/server 12 may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components, including system memory 28, to processor 16.

Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, the Peripheral Component Interconnect (PCI) bus, Peripheral Component Interconnect Express (PCIe), and the Advanced Microcontroller Bus Architecture (AMBA).

Computer system/server 12 typically includes a variety of computer-system-readable media. Such media may be any available media accessible by computer system/server 12, and include both volatile and non-volatile media and both removable and non-removable media.

System memory 28 can include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Computer system/server 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 34 can be provided for reading from and writing to a non-removable, non-volatile magnetic medium (not shown, and typically called a "hard drive"). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a "floppy(R) disk"), and an optical disk drive for reading from and writing to a removable, non-volatile optical disk such as a CD-ROM, DVD-ROM, or other optical media can be provided. In such instances, each can be connected to bus 18 by one or more data media interfaces. As will be further depicted and described below, memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the present disclosure.

Program/utility 40, having a set (at least one) of program modules 42, may be stored in memory 28 by way of example, and not limitation, as may an operating system, one or more application programs, other program modules, and program data. Each of the operating system, the one or more application programs, the other program modules, and the program data, or some combination thereof, may include an implementation of a networking environment. Program modules 42 generally carry out the functions and/or methodologies of the embodiments described herein.

Computer system/server 12 may also communicate with one or more external devices 14 such as a keyboard, a pointing device, a display 24, etc.; one or more devices that enable a user to interact with computer system/server 12; and/or any devices (e.g., network card, modem, etc.) that enable computer system/server 12 to communicate with one or more other computing devices. Such communication can occur via input/output (I/O) interfaces 22. Still further, computer system/server 12 can communicate with one or more networks, such as a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet), via network adapter 20. As depicted, network adapter 20 communicates with the other components of computer system/server 12 via bus 18. It should be understood that, although not shown, other hardware and/or software components could be used in conjunction with computer system/server 12. Examples include, but are not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data archival storage systems.

The present disclosure may be embodied as a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.

The computer-readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer-readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy(R) disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer-readable program instructions described herein can be downloaded to respective computing/processing devices from a computer-readable storage medium, or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network, and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.

Computer-readable program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk(R), C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs) may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.

Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special-purpose hardware and computer instructions.

In various embodiments, one or more inference processing units (IPUs) (not shown) are coupled to bus 18. In such embodiments, an IPU receives data from, or writes data to, memory 28 via bus 18. Likewise, an IPU may interact with other components via bus 18 as described herein.

The descriptions of the various embodiments of the present disclosure have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

A neural inference chip comprising:
a plurality of neural cores, each comprising a plurality of vector computing units configured to operate in parallel, wherein
each of the plurality of neural cores is configured to calculate output activations in parallel by applying its plurality of vector computing units to input activations, and
each of the plurality of neural cores is assigned a subset of the output activations of a layer of a neural network for calculation.

The neural inference chip of claim 1, wherein each of the plurality of neural cores is configured to, upon receiving a subset of the input activations of the layer of the neural network:
calculate a partial sum for each of its assigned output activations; and
calculate its assigned output activations from at least the calculated partial sums.

The neural inference chip of claim 2, wherein each of the plurality of neural cores is configured to, upon receiving a subset of the input activations of the layer of the neural network:
receive a partial sum for at least one of its assigned output activations from another one of the plurality of neural cores; and
calculate its assigned output activations from the calculated partial sums and the received partial sum.

The neural inference chip of claim 1, wherein the vector computing units comprise multiply-and-add units.

The neural inference chip of claim 1, wherein the vector computing units comprise accumulation units.

The neural inference chip according to claim 2, wherein the plurality of neural cores perform the calculation of the partial sum in parallel.

The neural inference chip according to claim 2, wherein the plurality of neural cores perform the calculation of the output activation in parallel.

The neural inference chip of claim 2, wherein calculating the partial sum comprises applying at least one of the plurality of vector computing units to multiply the input activations by synaptic weights.

The neural inference chip of claim 2, wherein calculating the assigned output activation comprises applying a plurality of addition units.

The neural inference chip of claim 2, wherein calculating the output activation comprises applying a non-linear function.

The neural inference chip of claim 2, wherein the vector computing units are configured to:
perform a plurality of multiplication operations in parallel;
perform a plurality of addition operations in parallel; and
accumulate the partial sums.

The neural inference chip of claim 2, wherein the plurality of vector computing units are configured to calculate the partial sums in parallel.

The neural inference chip of claim 1, wherein the calculation by each of the plurality of neural cores is pipelined.

The neural inference chip of claim 13, wherein each of the plurality of neural cores is configured to perform each stage of the calculation simultaneously.

The neural inference chip of claim 14, wherein the calculation maintains parallel processing.

A method comprising:
receiving, at each of a plurality of neural cores, a subset of the input activations of a layer of a neural network, each of the plurality of neural cores comprising a plurality of vector computing units configured to operate in parallel, and each of the plurality of neural cores being configured to calculate output activations in parallel by applying the plurality of vector multipliers to input activations;
assigning a subset of the output activations of the layer of the neural network to each of the plurality of neural cores for calculation; and
upon receiving the subset of the input activations of the layer of the neural network, by each of the plurality of neural cores:
calculating a partial sum for each of its assigned output activations; and
calculating its assigned output activations from at least the calculated partial sums.

The method of claim 16, further comprising, upon receiving a subset of the input activations of the layer of the neural network, by each of the plurality of neural cores:
receiving a partial sum for at least one of its assigned output activations from another one of the plurality of neural cores; and
calculating its assigned output activations from the calculated partial sums and the received partial sum.

The method of claim 16, wherein the vector computing units comprise multiply-and-add units.

The method of claim 16, wherein the vector computing units comprise accumulation units.

The method of claim 16, wherein the plurality of neural cores perform the calculation of the partial sums in parallel.

The method of claim 16, wherein the plurality of neural cores perform the calculation of the output activations in parallel.

The method of claim 16, wherein calculating the partial sum comprises applying at least one of the plurality of vector computing units to multiply the input activations by synaptic weights.

The method of claim 16, wherein calculating the assigned output activations comprises applying a plurality of addition units.

The method of claim 16, wherein calculating the output activations comprises applying a non-linear function.

The method of claim 16, further comprising:
performing a plurality of multiplication operations in parallel;
performing a plurality of addition operations in parallel; and
accumulating the partial sums.
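
For illustration only, and not as part of the claims, the partial-sum scheme recited above can be sketched in plain Python. This is a minimal sketch under stated assumptions: the names `split`, `core_partial_sums`, and `layer_forward` are hypothetical, the contiguous partitioning of inputs and outputs is an assumption, `tanh` stands in for the unspecified non-linear function, and the cores are simulated by sequential loops rather than executed in parallel.

```python
import math


def split(n, parts):
    """Split range(n) into `parts` contiguous, nearly equal index lists (assumed partitioning)."""
    base, extra = divmod(n, parts)
    chunks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def core_partial_sums(weights, x, out_idx, in_idx):
    """Work done on one core: for each output in out_idx, multiply the inputs
    in in_idx by their synaptic weights and accumulate one partial sum."""
    return {j: sum(weights[j][i] * x[i] for i in in_idx) for j in out_idx}


def layer_forward(weights, x, num_cores, f=math.tanh):
    """Simulate one layer. Each core owns a subset of outputs and holds a
    subset of inputs; partial sums from every core are added at the owner,
    which then applies the non-linear function f."""
    out_sets = split(len(weights), num_cores)
    in_sets = split(len(x), num_cores)
    y = [0.0] * len(weights)
    for owner in range(num_cores):          # cores would run concurrently in hardware
        totals = {j: 0.0 for j in out_sets[owner]}
        for src in range(num_cores):        # src == owner is the locally computed partial sum
            received = core_partial_sums(weights, x, out_sets[owner], in_sets[src])
            for j, value in received.items():
                totals[j] += value
        for j, value in totals.items():
            y[j] = f(value)
    return y
```

Calling `layer_forward(W, x, num_cores)` with a weight matrix `W` given as a list of rows should agree, up to floating-point rounding, with applying `f` to each full dot product of a row of `W` with `x`, whatever the number of simulated cores.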