JP7495480B2

JP7495480B2 - Vector Reduction Using Shared Scratch-Pad Memory.

Info

Publication number: JP7495480B2
Application number: JP2022513296A
Authority: JP
Inventors: Norrie, Thomas; Rajamani, Gurushankar; Phelps, Andrew Everett; Hedlund, Matthew Leever; Jouppi, Norman Paul
Original assignee: Google LLC
Current assignee: Google LLC
Priority date: 2020-02-26
Filing date: 2020-11-30
Publication date: 2024-06-04
Anticipated expiration: 2040-11-30
Also published as: US11934826B2; WO2021173201A1; US20220156071A1; US20210263739A1; KR20220038148A; JP2023506343A; TW202132976A; US11182159B2; CN117421047A; CN115136115A; CN115136115B; EP4004826A1; JP2024116166A

Description

Background
This specification relates generally to the circuitry of hardware circuits used to perform neural network computations.

A neural network is a machine learning model that uses one or more layers of nodes to generate an output (e.g., a classification) for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to one or more other layers in the network (e.g., other hidden layers or the output layer). Some of the layers generate an output from a received input according to the current values of a respective set of parameters. Some neural networks are convolutional neural networks (CNNs) configured for image processing or recurrent neural networks (RNNs) configured for speech and language processing. Different types of neural network architectures can be used to perform a variety of tasks related to classification or pattern recognition, predictions that involve data modeling, and information clustering.

A neural network layer of a CNN can have an associated set of kernels, which may correspond to parameters or weights. The associated set of kernels is used to process inputs (e.g., a batch of inputs) through the neural network layer to generate a corresponding layer output for computing a neural network inference. A batch of inputs and a set of kernels can be represented as tensors (i.e., multidimensional arrays) of inputs and weights. A hardware circuit that implements a neural network includes memory with locations identified by address values. The memory locations can correspond to elements of a tensor, and the tensor elements can be traversed or accessed using control logic of the circuit. For example, the control logic can determine or compute the memory address value of an element in order to load or store the element's corresponding data value.
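As an illustration of the address computation mentioned above, the following Python sketch maps a tensor element's indices to a memory address. It assumes a dense row-major layout and a 4-byte element size; the function name `element_address` and both assumptions are hypothetical and are not taken from this specification.

```python
def element_address(base, indices, shape, element_size=4):
    """Return the byte address of a tensor element.

    Assumes a dense row-major layout (an illustrative choice, not one
    stated in the specification).
    """
    if len(indices) != len(shape):
        raise ValueError("indices and shape must have the same rank")
    offset = 0
    for index, extent in zip(indices, shape):
        if not 0 <= index < extent:
            raise IndexError("index out of range for tensor dimension")
        # Fold each dimension into a single linear element offset.
        offset = offset * extent + index
    return base + offset * element_size


# Element [1][2][3] of a 2 x 3 x 4 tensor stored at base address 0x1000.
address = element_address(0x1000, (1, 2, 3), (2, 3, 4))
```

Control logic that traverses a tensor would evaluate a mapping of this kind for each element it loads or stores.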

Summary
This document describes techniques for performing data accumulation and vector reduction in a large shared scratchpad memory. In particular, the techniques reduce the overall number of operations required to perform a vector reduction that involves reducing values or outputs generated by computations performed at respective processor cores of a computing system. For example, a system includes a hardware circuit that can have multiple processor cores, together with an architecture that incorporates memory resources of a static random access memory (SRAM). The SRAM memory resources are allocated to be shared among the respective processor cores of the circuit.

Sets of computations performed in the computing system can be distributed among respective cores of one or more hardware circuits to generate respective vectors of values. The shared memory receives the respective vectors of values from respective resources of the processor cores over a direct memory access (DMA) data path of the shared memory. The shared memory performs an accumulation operation on the respective vectors of values using an operator unit coupled to the shared memory. The operator unit is configured to accumulate values based on arithmetic operations encoded at the operator unit. A result vector is generated based on the accumulation operation.

One aspect of the subject matter described in this specification can be embodied in a method performed using a hardware circuit that has a shared memory and multiple processor cores in communication with the shared memory. The method includes: generating a first vector of values based on computations performed at a first processor core; receiving, by the shared memory and over a direct memory access (DMA) data path of the shared memory, the first vector of values from the first processor core; and performing, at the shared memory, an accumulation operation between the first vector of values and a vector stored in the shared memory. The accumulation operation is performed using an operator unit that is i) coupled to the shared memory and ii) configured to accumulate multiple vectors. The method includes generating a result vector based on the accumulation operation.

These and other implementations can each optionally include one or more of the following features. For example, in some implementations, the vector stored in the shared memory was received from a second processor core, and the method includes: performing, by the first processor core, an accumulate-to-memory operation that accumulates the respective values of the first vector of values at a memory location of the shared memory; and performing, by the second processor core, an accumulate-to-memory operation that accumulates the respective values of a second vector of values at the same memory location of the shared memory, where the second vector of values corresponds to the vector stored in the shared memory. In some cases, the second processor core sets a flag (e.g., an initial vector/value flag) so that the values of the second vector are written as initial values, whereas the first processor core sets a different flag (e.g., an accumulate flag) so that the values of the first vector are accumulated with the values of the second vector.
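The two flags described above can be sketched as follows. This is a minimal software model under stated assumptions, not the circuit's actual interface: the `WriteMode` enum, the `SharedMemory` class, and its `dma_write` method are hypothetical names, and addition stands in for whatever arithmetic operation the operator unit encodes.

```python
from enum import Enum


class WriteMode(Enum):
    INIT = "init"              # hypothetical stand-in for the initial vector/value flag
    ACCUMULATE = "accumulate"  # hypothetical stand-in for the accumulate flag


class SharedMemory:
    """Toy model of shared memory cells that accept flagged DMA writes."""

    def __init__(self, size):
        self.cells = [0.0] * size

    def dma_write(self, address, vector, mode):
        for offset, value in enumerate(vector):
            if mode is WriteMode.INIT:
                # Initial values overwrite whatever the cells held before.
                self.cells[address + offset] = value
            else:
                # Accumulated values are combined with the stored values.
                self.cells[address + offset] += value


shared = SharedMemory(size=8)
# The second core writes its vector as initial values.
shared.dma_write(0, [1.0, 2.0, 3.0], WriteMode.INIT)
# The first core accumulates its vector into the same memory location.
shared.dma_write(0, [10.0, 20.0, 30.0], WriteMode.ACCUMULATE)
result = shared.cells[:3]
```

In this model neither core adds the two vectors itself; the combination happens where the data is stored.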

In some implementations, generating the result vector based on the accumulation operation includes: generating the result vector without the first processor core performing a step of pre-accumulating products that result from computations performed at the first processor core; and generating the result vector without the second processor core performing a step of pre-accumulating products that result from computations performed at the second processor core.

In some implementations, generating the result vector includes: generating a vector of accumulated values as a result of performing the accumulation operation on the first vector of values; applying an activation function to each value in the vector of accumulated values; and generating the result vector as a result of applying the activation function to each value in the vector of accumulated values. The accumulation may be performed on the values of the first vector alone, or it may include pair-wise accumulation of the values of the first vector with the values of the vector stored in the shared memory.
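A short Python sketch of this sequence is shown below. It assumes pair-wise addition as the accumulation and ReLU as the activation function; both choices, and the function names, are illustrative assumptions rather than details given in this specification.

```python
def relu(x):
    """Rectified linear unit, used here only as an example activation."""
    return x if x > 0.0 else 0.0


def accumulate_then_activate(stored, incoming, activation=relu):
    """Pair-wise accumulate two vectors, then apply the activation per value."""
    if len(stored) != len(incoming):
        raise ValueError("vectors must have the same length")
    accumulated = [s + v for s, v in zip(stored, incoming)]
    return [activation(a) for a in accumulated]


# Vector already stored in shared memory, and the first vector of values.
result_vector = accumulate_then_activate([1.0, -4.0, 0.5], [2.0, 1.0, -0.5])
```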

In some implementations, the respective resource of the first processor core is a first matrix computation unit, and the method further includes generating a first vector of accumulated values, corresponding to the first vector of values, based on matrix multiplications performed using the first matrix computation unit of the first processor core.

In some implementations, the respective resource of the second processor core is a second matrix computation unit, and the method further includes generating a second vector of accumulated values, corresponding to the second vector of values, based on matrix multiplications performed using the second matrix computation unit of the second processor core. The hardware circuit can be a hardware accelerator configured to implement a neural network that has multiple neural network layers, and the method includes generating an output for a layer of the neural network based on the result vector.

The method can further include generating the first vector of values based on computations performed at the first processor core and generating the second vector of values based on computations performed at the second processor core. The computations performed at the first processor core and the computations performed at the second processor core can be part of a mathematical operation governed by a commutative property. In some implementations, the mathematical operation is a floating-point multiplication operation, a floating-point addition operation, an integer addition operation, or a min-max operation. In some implementations, the mathematical operation includes a floating-point addition operation and an integer addition operation. The first processor core and the second processor core can be the same processor core.

In some implementations, the shared memory is configured to function as a shared global memory space that includes memory banks and registers shared between two or more processor cores of the hardware circuit.

Other implementations of this and other aspects include corresponding systems, apparatus, and computer programs, encoded on computer storage devices, that are configured to perform the actions of the methods. A system of one or more computers can be so configured by virtue of software, firmware, hardware, or a combination of them installed on the system that, in operation, causes the system to perform the actions. One or more computer programs can be so configured by virtue of having instructions that, when executed by a data processing apparatus, cause the apparatus to perform the actions.

The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages.

The techniques described in this document leverage the capability of a large shared scratchpad memory that supports a DMA mode which atomically reduces incoming vector data into a shared memory location, rather than simply overwriting the data at that location. In other words, multiple processor cores or processors can concurrently perform reduction operations that update the same shared memory location, and the resulting reduced value is computed as if the operations had occurred one after another, even when they involve values associated with the same memory location. If, instead, each processor simply overwrote the shared memory location, a previous value written by another processor could be lost unintentionally (i.e., the lost update problem). Based on a control loop, the system can detect an "atomic" reduction of data and allow values at the shared memory location to be overwritten in one way or another, so that the memory location holds (or stores) the final result of the reduction operation. In some cases, the various techniques described in this specification can be extended to other memory types that exist across a system, including on-chip and off-chip memory.
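The lost update problem and its atomic counterpart can be illustrated with the sketch below. It is a software analogy only: `overwrite_interleaved` and `AtomicCell` are hypothetical names, and the `threading.Lock` merely stands in for the atomicity that the described memory provides in hardware.

```python
import threading


def overwrite_interleaved(cell, value_a, value_b):
    """Simulate two cores that both read a cell before either writes back."""
    read_a = cell            # core A reads the current value
    read_b = cell            # core B reads the same, soon stale, value
    cell = read_a + value_a  # core A writes its update
    cell = read_b + value_b  # core B overwrites it; core A's update is lost
    return cell


class AtomicCell:
    """A cell whose read-add-write sequence cannot be interleaved."""

    def __init__(self, value=0.0):
        self._value = value
        self._lock = threading.Lock()  # software stand-in for hardware atomicity

    def reduce_add(self, value):
        with self._lock:
            self._value += value

    @property
    def value(self):
        return self._value


lost = overwrite_interleaved(0.0, 5.0, 7.0)

cell = AtomicCell()
threads = [threading.Thread(target=cell.reduce_add, args=(v,)) for v in (5.0, 7.0)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After both updates, `lost` reflects only one contribution, while `cell.value` reflects both, as if the updates had been applied one after another.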

The operator unit is coupled close to the shared memory and supports a variety of arithmetic operations for accumulating vector values into shared memory cells or locations. The arithmetic operations can be based on any reduction operator, such as floating-point atomic addition, integer addition, max, min, max pooling, and even multiplication. An operator unit coupled adjacent to the shared memory offers the advantage of integrating software-managed addressing of shared resources and commutative mathematical operations in a single memory system.

The techniques include a read-modify-write control loop implemented at a control unit of the shared memory. The loop tracks outstanding operations to ensure atomicity, and it stalls or reorders write traffic as needed so that vector values are not accumulated against stale vector values. The read-modify-write control loop also provides performance and energy improvements over inefficient alternative approaches that require reading vector data stored at a first processor core, performing arithmetic operations on the read vector values at a compute unit that is remote from the first core, and then storing or writing the result back to the first processor core. When a system has a large vector memory, these inefficient alternatives can require moving data a considerable distance across the chip. Such approaches needlessly consume compute cycles in the processor core and bandwidth on the wires to and from the core. These inefficiencies also produce deeper computation schedules and needlessly consume register bandwidth.
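One way to picture such a control loop is the cycle-level toy model below. It is a sketch under assumptions, not the control unit's actual design: the class `RmwControlUnit`, its fixed two-cycle latency, its in-order issue policy, and the use of addition are all hypothetical. The model tracks in-flight addresses and stalls a request that targets an address whose earlier read-modify-write has not yet retired.

```python
from collections import deque


class RmwControlUnit:
    """Toy read-modify-write loop that stalls on address conflicts."""

    def __init__(self, size, latency=2):
        self.memory = [0.0] * size
        self.latency = latency  # cycles between issue and write-back (assumed)
        self.pending = deque()  # requests waiting to issue: (address, value)
        self.in_flight = {}     # address -> (cycles remaining, result to write)
        self.stalls = 0

    def submit(self, address, value):
        self.pending.append((address, value))

    def step(self):
        # Retire operations whose write-back is due this cycle.
        for address in list(self.in_flight):
            remaining, result = self.in_flight[address]
            if remaining == 1:
                self.memory[address] = result
                del self.in_flight[address]
            else:
                self.in_flight[address] = (remaining - 1, result)
        # Issue the next request unless it would read a stale value.
        if self.pending:
            address, value = self.pending[0]
            if address in self.in_flight:
                self.stalls += 1  # hold the request until the conflict clears
            else:
                self.pending.popleft()
                self.in_flight[address] = (self.latency, self.memory[address] + value)

    def drain(self):
        while self.pending or self.in_flight:
            self.step()


unit = RmwControlUnit(size=4)
unit.submit(0, 1.0)
unit.submit(0, 2.0)  # same address: must wait for the first update to retire
unit.submit(1, 5.0)
unit.drain()
```

Without the stall, the second request to address 0 would read the old value and the first update would be lost. A real control unit could also reorder non-conflicting traffic ahead of a stalled request, which this in-order model does not attempt.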

The techniques include an accumulate-to-memory feature that is based in part on an accumulate flag generated at a processor core and used together with a DMA path of the shared memory. This feature allows two or more processor cores to accumulate vectors directly into a shared memory location of the shared memory system. The feature can be particularly useful in multi-node systems because it allows DMAs from multiple cores to target the same memory sectors and addresses concurrently, without requiring external synchronization or software locks to arbitrate operations among the cores. For example, this can help in configuring shared memory cells as an all-reduce buffer across multiple chips or across a distributed system of processor cores.

Some implementations relate to methods, systems, and apparatus, including computer-readable media, for performing vector reduction using a shared scratchpad memory of a hardware circuit that has processor cores in communication with the shared memory. For each processor core, a respective vector of values is generated based on computations performed at the processor core. The shared memory receives the respective vectors of values from respective resources of the processor cores over a direct memory access (DMA) data path of the shared memory. The shared memory performs an accumulation operation on the respective vectors of values using an operator unit coupled to the shared memory. The operator unit is configured to accumulate values based on arithmetic operations encoded at the operator unit. A result vector is generated based on the accumulation operation.

The details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other potential features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

FIG. 1 is a block diagram of a computing system with a hardware circuit that includes an example shared memory.
FIG. 2 is a block diagram showing an example of processor cores that communicate with the shared memory of the hardware circuit.
FIG. 3 is a block diagram showing an example of vector processors that communicate with a matrix computation unit of the hardware circuit.
FIG. 4 is a block diagram showing an example accumulation pipeline.
FIG. 5 shows examples of an input tensor, a weight tensor, and an output tensor.
FIG. 6 is a flow diagram showing an example process for performing vector reduction using the shared memory of FIG. 1.

Detailed description
Like reference numbers and designations in the various drawings indicate like elements.

Reduction operations are commonly used in computations that rely on linear algebra, such as compute-intensive workloads for operations involving artificial neural networks. For example, a reduction operation may be needed to average gradient values that are computed across different processing nodes of a distributed system while a neural network is being trained. Reduction operations can occur in a distributed manner, as in an all-reduce operation, or locally for a given computation, such as a matrix multiply tile summation operation.
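The gradient-averaging case can be sketched in a few lines of Python. The function `all_reduce_mean` is a hypothetical, single-process illustration of what an all-reduce computes; it does not model how data actually moves between nodes.

```python
def all_reduce_mean(per_node_gradients):
    """Average gradient vectors element-wise and give every node the result."""
    node_count = len(per_node_gradients)
    # Reduce: sum the values that sit at the same position on every node.
    summed = [sum(values) for values in zip(*per_node_gradients)]
    mean = [s / node_count for s in summed]
    # Broadcast: each node ends up holding its own copy of the reduced vector.
    return [list(mean) for _ in range(node_count)]


# Two nodes, each holding a two-element gradient vector.
averaged = all_reduce_mean([[1.0, 2.0], [3.0, 6.0]])
```

A local reduction such as a matrix multiply tile summation has the same reduce step but no broadcast.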

Performance and power concerns can be important factors in building and executing these operations efficiently in a computing system. Typically, a reduction operation requires pulling data through the memory hierarchy of a system (e.g., a distributed system) into the arithmetic logic units (ALUs) of a processor (or processor core), performing the computation or reduction on the pulled data, and then writing the result back out through the memory system. Implementing these steps in a system is expensive in both performance and power. In addition, in memory that is visible to multiple processor cores, performing reduction operations across cores typically requires synchronization and/or reserving resources in non-overlapping memory regions, which can add significant performance and capacity overhead as well as programming complexity.

Based on the preceding context, this specification describes data processing techniques for performing vector reduction by accumulating vectors of values at one or more memory address locations in a large shared scratchpad memory. The vector reduction and accumulation are performed at the shared scratchpad memory based on software-managed addressing of the memory locations used to write (store) results of computations, rather than on the addressing schemes typically used for hardware-managed cache memory systems. The shared memory includes resources, such as memory cells, that are shared across a distributed system of processor cores. The described techniques include an accumulate-to-memory function for implementing an accumulate reduction step (e.g., for vector reduction) when processing vectors of values. For example, the accumulate reduction step can be implemented across matrix multiplications performed on different sets of inputs that are processed through a layer of a neural network to generate an output for the layer.

The data processing techniques, including the shared scratchpad memory, are implemented using a hardware circuit architecture that is improved relative to prior designs. The hardware circuit can be a special-purpose processor, such as a neural network processor, an application-specific integrated circuit (ASIC), or a hardware accelerator. The hardware circuit is configured to implement a neural network that includes multiple neural network layers. The improved architecture and data processing techniques described in this document allow circuits that represent hardware accelerators to realize gains in speed and bandwidth and thereby further accelerate computations.

The computations can be specific types of mathematical operations, such as floating-point multiplication, floating-point addition, or integer addition operations. The computations may also be among the operations of an example neural network model that are performed to compute an inference or to train the model. In some examples, the computations are used to process inputs through layers of a CNN or RNN to generate outputs that correspond to a neural network inference, or to compute gradients with respect to parameters of the CNN or RNN in order to update the parameters of the neural network during training.

The mathematical operations are governed by a commutative property and can involve atomic reductions (e.g., atomic floating-point reductions). An atomic reduction is processed as an accumulate or vector reduction step in which vectors are accumulated directly into a memory location of the shared memory (or into a vector stored there), without any need to synchronize activity among the cores that supply the vectors to be accumulated. In other words, two or more cores of a hardware circuit can accumulate values into a central address location of a shared memory cell in any order, and the final result vector still provides the correct mathematical result of the accumulation. In one example, the accumulation involves pair-wise accumulation of values of a first vector provided by a first core with values of a second vector provided by a second core.
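The order independence described above can be checked with a small sketch. The helper `accumulate_in_order` and the sample vectors are hypothetical; integer addition and max are used because they give bit-identical results in every order. Floating-point addition is also commutative, but because it is not associative, the rounding of a long accumulation can differ slightly with arrival order.

```python
from itertools import permutations


def accumulate_in_order(vectors, op):
    """Accumulate vectors into one set of cells in the order they arrive."""
    cells = list(vectors[0])  # the first arrival initializes the cells
    for vector in vectors[1:]:
        cells = [op(cell, value) for cell, value in zip(cells, vector)]
    return cells


# One vector per core; the arrival order at shared memory is not fixed.
core_vectors = [[1, 8, 3], [4, 2, 9], [7, 5, 6]]

sums = {
    tuple(accumulate_in_order(list(order), lambda a, b: a + b))
    for order in permutations(core_vectors)
}
maxima = {
    tuple(accumulate_in_order(list(order), max))
    for order in permutations(core_vectors)
}
```

Each set contains a single tuple, which shows that all six arrival orders produce the same result vector.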

FIG. 1 is a block diagram of a computing system 100 that includes an example hardware circuit 101. As noted above, the hardware circuit 101 can represent a hardware accelerator or some other special-purpose processor. In some cases, the system 100 is an example computing system for accelerating tensor or neural network computations associated with artificial deep neural networks (DNNs), such as RNNs or CNNs. For example, the system 100 is configured to implement a CNN on an example hardware accelerator and to pass data values to the hardware accelerator to generate outputs for computing an inference. In some implementations, the system 100 is a system-on-chip. For example, the system-on-chip can include the hardware circuit 101 and some (or all) of the other components and devices that this document describes as being included in the system 100.

The hardware circuit 101 may be a hardware accelerator configured to accelerate the execution and/or performance of a neural network model. For example, execution of the neural network model may be accelerated relative to execution of the model on an example general-purpose machine, such as a central processing unit (CPU). Similarly, performance and execution of the neural network model may be accelerated relative to execution of the model on another hardware accelerator, such as a graphics processing unit (GPU), that lacks the improved hardware features and techniques described in this specification.

The system 100, including the circuit 101, includes a system memory 102 and a shared memory 104. The system memory 102 can represent a high-bandwidth memory ("HBM 102") or an input/output (I/O) device that exchanges data communications with the processor cores 105-1, 105-2 of the hardware circuit 101. The data communications can generally include writing data values to vector memories 106, 108 located in a particular processor core 105-1, 105-2, or reading data from the vector memories 106, 108 of that processor core. For example, the HBM 102 can exchange data communications with the processor core 105-1 to pass inputs to the core and to receive outputs generated by one or more computing resources of the core.

The data values can represent vector elements or arrays of vector values. For example, a first vector array can represent a batch of inputs to be processed through a neural network layer, while a second vector array can represent a set of weights for that layer. Relatedly, a third vector array can represent a vector of accumulated values that corresponds to an output generated at the processor core 105-1, while a fourth vector array can represent a vector of activation values that represents an output generated at the processor core 105-2.

The HBM 102 can be a dynamic random access memory (DRAM) resource of the system 100. In some implementations, the HBM 102 is external or off-chip memory relative to the circuit 101 and is configured to exchange data communications with on-chip vector memory banks (described below) of the system 100. For example, the HBM 102 can be placed at a physical location outside the integrated circuit die that represents the circuit 101. The HBM 102 can therefore be distant from, or non-local to, the computing resources located within the integrated circuit die. Alternatively, the HBM 102, or portions of its resources, can be disposed within the integrated circuit die that represents the circuit 101, so that the HBM 102 is local to or co-located with the computing resources of the circuit.

System 100 may include one or more processor cores 105-1, 105-2. In some implementations, system 100 includes multiple processor cores 105-n, where n is an integer greater than or equal to 1. In the example of FIG. 1, and of FIGS. 2 and 3 described below, system 100 is shown as including two processor cores, but a system 100 that includes the hardware circuit 101 described herein may have more or fewer processor cores. In general, a processor core 105-n is a separate, self-contained processing/computation unit of system 100 (or hardware circuit 101).

Each processor core 105 is configured to independently perform the computations (e.g., neural network computations) required by one or more layers of a multi-layer neural network. The computations may be required to process data for a machine learning workload or to perform a particular task of the workload. The computations performed at a processor core to process inputs through one or more neural network layers may include multiplying a first set of data values (e.g., inputs or activations) by a second set of data values (e.g., weights). For example, the computations may include multiplying input or activation values by weight values in one or more cycles and accumulating the products over many cycles.
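The multiply-then-accumulate pattern described above can be sketched in a few lines of Python. This is only an illustrative model, not the circuit's implementation: the function name `multiply_accumulate` and the `lanes_per_cycle` parameter are assumptions introduced here, and a "cycle" is simply one pass of the loop.

```python
def multiply_accumulate(activations, weights, lanes_per_cycle=4):
    """Model multiplying inputs by weights and accumulating over cycles.

    lanes_per_cycle is a hypothetical parameter: how many products the
    modeled hardware forms in one cycle.
    """
    if len(activations) != len(weights):
        raise ValueError("activations and weights must have the same length")
    accumulator = 0
    cycles = 0
    for start in range(0, len(activations), lanes_per_cycle):
        stop = start + lanes_per_cycle
        # One modeled cycle: multiply a slice of inputs by matching weights.
        products = [a * w for a, w in zip(activations[start:stop],
                                          weights[start:stop])]
        # Accumulate this cycle's products into the running sum.
        accumulator += sum(products)
        cycles += 1
    return accumulator, cycles
```

With five values and two lanes per cycle, for instance, the model takes three cycles to produce the same dot product that a single wide multiply would.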

Different values in the first and second sets of data values are stored at specific memory locations of a memory construct in a processor core of hardware circuit 101. In some implementations, individual values in the first set of data values may correspond to respective elements of an input tensor, while individual values in the second set of data values may correspond to respective elements of a weight (or parameter) tensor. As an example, a neural network layer in a sequence of layers can process a set of inputs (such as inputs of image pixel data) or a set of activation values generated by another neural network layer in the sequence.

A set of inputs or a set of activation values can be represented as a one-dimensional (1D) tensor or as a multi-dimensional tensor (e.g., 2D or 3D) having multiple elements along each of its dimensions. Each memory location that stores a data value can be mapped to a corresponding element of the one-dimensional or multi-dimensional tensor, and the tensor elements can be traversed or accessed using control logic of the circuit. For example, the control logic can determine or compute the memory address value mapped to an element in order to load or store the corresponding data value of that element.
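As a rough illustration of mapping a tensor element to a memory address, the sketch below computes a row-major address from a base address, a tensor shape, and an element's indices. The row-major layout, the 4-byte element size, and the name `element_address` are assumptions made for the example; the text does not specify how the control logic lays out tensors.

```python
def element_address(base_address, indices, shape, element_size=4):
    """Return the byte address of a tensor element, assuming row-major layout."""
    if len(indices) != len(shape):
        raise ValueError("indices and shape must have the same rank")
    linear = 0
    for index, extent in zip(indices, shape):
        if not 0 <= index < extent:
            raise IndexError("index out of range for its dimension")
        # Fold each dimension into a single linear element offset.
        linear = linear * extent + index
    return base_address + linear * element_size


# Element (1, 2) of a 3x4 tensor stored at 0x1000 with 4-byte elements.
address = element_address(0x1000, (1, 2), (3, 4))
```

Traversing a tensor then amounts to stepping the indices and recomputing the address for each load or store.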

Hardware circuit 101 has a dedicated memory hierarchy that includes different memory constructs. Each of these memory constructs has different bandwidth and latency characteristics relative to the other constructs, and their physical placement within hardware circuit 101 may also vary. Example memory constructs include shared memory 104, vector memories 106, 108, and vector registers 110, 112. In general, the memory constructs are operable to store data values to be processed at a neural network layer (such as inputs, vector values related to activations, or gain values), as well as output activations generated by the layer in response to processing inputs or activations through the layer. The generation and storage of output activations, and the various memory constructs used to perform these operations, are described in more detail below with reference to FIGS. 2 and 3.

FIG. 2 is a block diagram 200 showing an example of how resources or sections of shared memory 104 are arranged in hardware circuit 101 to facilitate data communications between various components of the hardware circuit. As noted above, shared memory 104 provides a basis for improving the hardware architecture and data processing techniques of system 100. Shared memory 104 is a larger on-chip SRAM than the on-chip memory of some other neural network processor chips. In some implementations, shared memory 104 can be described as being between (e.g., logically or physically) HBM 102 and the respective vector memories 106, 108 of the corresponding processor cores 105-1, 105-2. For example, an operation that uses shared memory 104 to move data between HBM 102 and vector memories 106, 108 would involve the data traversing shared resources of shared memory 104.

Shared memory 104 may represent a shared central space on the chip or circuit 101. For example, shared memory 104 is configured to function as a shared global memory space that includes memory resources corresponding to memory banks and registers shared among one or more processor cores 105-1, 105-2 of the multiple processor cores that may be present in system 100 and/or included in hardware circuit 101. As described in more detail below, shared memory 104 is configured to function as a software-controlled scratchpad (e.g., similar to an example vector). In some implementations, some (or all) of the resources of shared memory 104 are configured to function as a software-controlled scratchpad (staging resource) rather than as a hardware-managed cache.

System 100 is configured to expose at least two programming interfaces to a user for making use of the data transfer functions provided by shared memory 104. A first interface exposes programmable DMA data transfer functions and operations, while a second, different interface exposes programmable load/store data transfer functions and operations. Each of these interface functions may represent a logical attribute of shared memory 104, described in more detail below.

As noted above, the memory constructs of system 100 have different bandwidth and latency characteristics. For example, shared memory 104 may have higher bandwidth and lower latency than DRAM accesses to HBM 102, but lower bandwidth and higher latency than accesses to vector memories 106, 108. In some examples, shared memory 104 has a lower data capacity than the DRAM asset of HBM 102, but a higher data capacity than the respective vector memories of the processor cores. In general, these differing bandwidth and latency characteristics represent standard memory hierarchy trade-offs.

The memory constructs of system 100, and shared memory 104 in particular, may also vary in their physical placement within hardware circuit 101. Shared memory 104 includes resources, such as memory banks and registers, that may be physically and logically arranged with respect to the placement of particular computational resources of processor cores 105-1, 105-2. In this context, shared memory 104 can generally be characterized by reference to its physical structure and its logical structure. The physical structure of shared memory 104 is described first, followed by its logical structure.

With regard to its physical structure, the resources of shared memory 104 may be physically distributed on a special-purpose or neural network processor chip that corresponds to hardware circuit 101. For example, different subsets, portions, or sections of the resources that form shared memory 104 may be physically distributed at various locations of circuit 101 so that different types of data transfer operations and processing techniques can be performed in system 100. In some implementations, one section of the resources of shared memory 104 can reside inside a processor core of circuit 101, while another section of the resources can reside outside the processor core of circuit 101. In the example of FIG. 2, a section of shared memory 104 is external to each of processor cores 105-1, 105-2, to enable DMA operations that move large blocks of data between memory locations of HBM 102 and memory locations of shared memory 104.

Referring briefly again to HBM 102, this type of system memory may be an external memory structure used by system 100 to provide high-bandwidth data to, and/or exchange high-bandwidth data with, the vector memory of each processor core. In some implementations, HBM 102 is configured for various direct memory access (DMA) operations that obtain data from, or provide data to, memory address locations of the vector memories in the processor cores of circuit 101. More specifically, DMA operations in which HBM 102 exchanges data with vector memories 106, 108 are enabled by an example control scheme and by the memory resources of shared memory 104.

In the examples of FIG. 2 and FIG. 3 (described below), shared memory 104 includes a shared memory control unit 201 ("control unit 201"). Control unit 201 is configured to generate control signals 114 for controlling memory access operations that involve each of HBM 102, shared memory 104, vector memories 106, 108, and vector registers 110, 112.

Control unit 201 implements a control scheme that is distributed across the different memories of system 100 (e.g., HBM 102, shared memory 104, vector memories 106, 108, and vector registers 110, 112). In some implementations, the control scheme is distributed across the different memories based on communication between control unit 201 and a respective control unit of each memory. For example, the control scheme can be distributed across the memories based on control signals that are provided by control unit 201 and processed locally by the respective control units of these different memories. Shared data paths can be used to move data between HBM 102 and the respective vector memories of processor cores 105-1, 105-2. When this occurs, system 100 activates any (and all) control units required for a given memory or data path to manage the data hand-offs that need to take place at the appropriate touch points.

Control unit 201 is configured to execute software instructions and to generate control signals that cause a first portion of the memory resources of shared memory 104 to function as a DMA memory unit. The first portion of the resources can be represented by shared core data path 204 with reference to processor core 105-1 and by shared core data path 224 with reference to processor core 105-2. This representative DMA memory unit is operable to move data between HBM 102 and each of the first processor core 105-1 and the second processor core 105-2, based on the control signals generated by control unit 201.

For example, the control signals may be generated to perform DMA operations that move blocks of data (e.g., vectors): a) between memory locations of shared memory 104 and vector memory 106, using data path 202, shared core data path 204, or data path 206; and b) between memory locations of shared memory 104 and vector memory 108, using data path 222, shared core data path 224, or data path 226. In some implementations, shared memory 104 may alternatively be referred to as shared CMEM 104.

In this document, CMEM generally corresponds to blocks of physically contiguous memory (CMEM) that provide a configuration useful as data buffers and on-chip SRAM storage. As described in more detail below, in system 100 the blocks of CMEM resources are physically distributed in hardware circuit 101 and arranged to be shared among components of processor cores, which may be configured as hardware accelerators or other types of special-purpose processors. Each of shared core data paths 204 and 224 is an example node that may indicate static contention that can occur on the shared data paths as vector data moves across these points in the system.

As shown in the example of FIG. 2, the hardware circuit 101 and system 100 shown in FIG. 1 are configured to include multiple load/store data paths 202, 206, multiple CMEM load data paths 208, 214, 228, 234, and multiple CMEM store data paths 215, 235. Hardware circuit 101 and system 100 also include multiple shared staging blocks 210, 230 (described below). In the example of FIG. 2, each of data paths 202, 222 can be configured as a data path for routing data (e.g., vectors or scalar values) in response to performing a DMA operation, as a data path for routing data in response to performing a CMEM load/store operation, or as both. The DMA operations supported by shared memory 104, together with data paths 202, 206, 222, and 226, can be used to move data between different memory structures with reference to particular memory offset and stride parameters.

For example, system 100 is configured to use shared memory 104 to perform a DMA operation that involves moving or copying 1 megabyte of data from one set of memory locations to another set of memory locations at offset 0x04. Shared memory 104 and system 100 are operable to support various stride functions when performing DMA operations. For example, the DMA operation for moving the 1 megabyte of data may include a stride operation that inserts an address gap every 200 kilobytes relative to the address base or address value of a base memory location.

In some implementations, the stride operation is used to insert the address gaps based on a desired read sequence that will be executed later to read the 1 megabyte of data after it has been moved to its destination locations. For example, the 1-megabyte block of data may be stored based on a stride operation that corresponds to how the data is read or retrieved for processing at different layers of a neural network, or across different sets of filters or weights of a particular neural network layer.
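One way to picture the strided transfer is as a list of chunk descriptors, each recording where a piece of the source data lands in the destination. The sketch below is a simplified model under stated assumptions: the function name `strided_chunks` is hypothetical, kilobytes and megabytes are taken as powers of two, and the 64-byte gap is an arbitrary illustrative value because the text does not give the size of the inserted address gap.

```python
def strided_chunks(total_bytes, chunk_bytes, gap_bytes, dst_offset=0):
    """Plan a copy that leaves an address gap after every chunk.

    Returns (source_offset, destination_address, length) tuples.
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    chunks = []
    src = 0
    dst = dst_offset
    while src < total_bytes:
        length = min(chunk_bytes, total_bytes - src)
        chunks.append((src, dst, length))
        src += length
        # Skip past the copied chunk plus the inserted address gap.
        dst += length + gap_bytes
    return chunks


# 1 MB moved to offset 0x04 with a gap after every 200 KB chunk.
plan = strided_chunks(1024 * 1024, 200 * 1024, gap_bytes=64, dst_offset=0x04)
```

Under these assumptions the plan contains six chunks: five full 200 KB chunks and one shorter tail, each starting one gap further along than a contiguous copy would place it.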

Control unit 201 of shared memory 104 is also configured to cause various load/store operations to be performed. For example, control unit 201 generates control signals to perform load/store operations that move various amounts of data (e.g., vectors or vector values): a) between memory locations of shared memory 104 and memory locations of shared staging block 210, using data path 202, shared core data path 204, or data path 208 (for load operations at core 105-1); and b) between memory locations of shared memory 104 and memory locations of shared staging block 230, using data path 222, shared core data path 224, or data path 228 (for load operations at core 105-2).

Similarly, control signals may be generated to perform load/store operations that move various amounts of data (e.g., vectors or vector values): a) between memory locations of shared memory 104 and vector register 110, using data path 202, shared core data path 204, or data path 215 (for store operations at core 105-1); and b) between memory locations of shared memory 104 and vector register 112, using data path 222, shared core data path 224, or data path 235 (for store operations at core 105-2).

Turning now to the logical structure of shared memory 104, as noted above, system 100 is configured to expose at least two programming interfaces to a user for making use of the data transfer functions provided by shared memory 104. At least one interface exposes programmable DMA functions and another interface exposes programmable CMEM load/store functions, each of which may represent a logical attribute of shared memory 104. For load/store purposes, shared memory 104 is logically exposed as a memory parallel to vector memories 106, 108. In this way, each load/store data path is operable to provide an additional (or parallel) data path for moving blocks of data or specific data through the memory system (such as through the vector registers of the respective processor cores 105-1, 105-2, or through multiple cores of circuit 101). For example, load/store operations may be performed against the memory resources of shared memory 104 concurrently with DMA operations.

More specifically, a DMA operation may be performed to move a vector of values between memory locations of shared memory 104 and vector memory 106 using DMA data path 206, and a load/store operation may be performed concurrently with the DMA operation to move a different vector of values between memory locations of shared memory 104 and shared staging block 210. Similar concurrent operations may take place at processor core 105-2 (or at other cores) using the resources of processor core 105-2 that correspond to the resources of processor core 105-1.

Load/store operations performed using the CMEM resources of shared memory 104 may represent a high-performance function of shared memory 104, or a high-performance way of using shared memory 104, relative to DMA operations. In some implementations, control unit 201 is configured to execute software instructions and to generate control signals that cause a second portion of the memory resources of shared memory 104 to function as a software-controlled staging resource used to perform the load/store operations.

The second portion of the resources can be represented by shared staging block 210 with reference to processor core 105-1 and by shared staging block 230 with reference to processor core 105-2. Each of shared staging blocks 210, 230 may therefore represent a software-controlled staging resource (or scratchpad) formed from a subset of the memory resources of shared memory 104. In some examples, the software-controlled staging resources of system 100 are configured to manage the flow of vector data values from HBM 102 to the respective vector register 110 or 112 of the first processor core 105-1 or the second processor core 105-2.

Shared memory 104 and its resources have the property of being uniquely configurable not only as a DMA memory, for example for moving data between memory constructs (such as HBM 102 or vector memories 106, 108), but also as a load/store memory for moving data directly into the respective vector registers 110, 112 on each processor core 105-1, 105-2. These configurable aspects of shared memory 104 allow its resources and addressing to be scheduled at fine granularity by software running on the cores. For example, shared memory 104 can be a software-managed (rather than hardware-managed) SRAM resource for which the compiler of a processor core specifically manages the addressing of the memory, including the types of data that may or may not be present at memory address locations of shared memory 104.

In some implementations, the software-controlled staging resource of shared memory 104 is configured as a first-in-first-out (FIFO) memory structure (e.g., shared staging block 210 or 230) along the load portion of a load/store data path of a processor core, where the load/store data path includes CMEM store data path 215 or 235 for routing data to be stored in shared CMEM 203 or HBM 102. The FIFO memory structure is configured to temporarily store a set of data values for a threshold number of processor cycles and then route the set of values to the respective vector register 110, 112 of the first processor core 105-1 or the second processor core 105-2. The FIFO memory structure is used to mitigate the register pressure and scheduling complexity that can result from CMEM load operations that have a particular load latency.

In some implementations, the threshold number of clock cycles is determined based on an example high-latency (e.g., 50-cycle) CMEM load operation, which could cause the register pressure and scheduling complexity associated with reserving a given register for the entire 50 cycles. To ease or mitigate concerns about register pressure, a CMEM result FIFO ("CRF") is physically instantiated in hardware circuit 100 using resources of shared memory 104. In the example of FIG. 2, a first CRF is represented by staging block 210 of processor core 105-1, while a second CRF is represented by staging block 230. Each CRF allows an example CMEM load operation to be split into at least two phases: i) a CMEM → CRF phase, in which the CMEM address information is provided, and ii) a CRF → register phase, in which the vector register target is provided.

For example, each of shared staging blocks 210, 230 is configured to receive data values (e.g., scalar or vector values) and temporarily store the data values for a threshold number of processor cycles. At processor core 105-1, the data values are routed to shared staging block 210 along load data path 208 (and shared core data path 204), which connects staging block 210 to other memory locations of shared memory 104. At processor core 105-2, the data values are routed to shared staging block 230 along load data path 228 (and shared core data path 224), which connects staging block 230 to other memory locations of shared memory 104.

Shared staging block 210 is configured to provide the data values to vector register 110 of processor core 105-1 in response to having temporarily stored the data values for the threshold number of processor cycles. Similarly, shared staging block 230 is configured to provide the data values to vector register 112 of processor core 105-2 in response to having temporarily stored the data values for the threshold number of processor cycles.
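The two-phase CMEM load through a result FIFO can be modeled as follows. This is a behavioral sketch only, with hypothetical names (`CmemResultFifo`, `push_from_cmem`, `pop_to_register`) and with CMEM and the register file modeled as plain dictionaries; the 50-cycle latency is the example figure from the text. The point it illustrates is that a destination register is named only in the second phase, so no register needs to be reserved while the load is in flight.

```python
from collections import deque


class CmemResultFifo:
    """Behavioral model of a staging FIFO between CMEM and vector registers."""

    def __init__(self, latency_cycles=50):
        self.latency_cycles = latency_cycles
        self._entries = deque()  # (cycle at which the value is ready, value)

    def push_from_cmem(self, cmem, address, current_cycle):
        """Phase 1 (CMEM -> CRF): only the CMEM address is supplied."""
        ready_cycle = current_cycle + self.latency_cycles
        self._entries.append((ready_cycle, cmem[address]))

    def pop_to_register(self, registers, target, current_cycle):
        """Phase 2 (CRF -> register): only the register target is supplied.

        Returns False if the oldest entry has not yet been held for the
        threshold number of cycles, True once the value has been delivered.
        """
        if not self._entries:
            raise RuntimeError("CRF is empty")
        ready_cycle, value = self._entries[0]
        if current_cycle < ready_cycle:
            return False
        self._entries.popleft()
        registers[target] = value
        return True
```

A scheduler using this model would issue the push early, continue with other work, and pick the destination register only when it issues the pop, in first-in-first-out order.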

System 100 is configured to issue multiple load instructions in the same cycle. For example, system 100 can issue a CMEM load instruction that is executed using data path 208 (or 214) and shared staging block 210 and, in the same cycle, issue a load to vector memory 106 that is executed using data path 212. In some examples, from a software control perspective, a CMEM load operation that traverses data path 214 between resource 210 and vector register 110, and a Vmem load operation that traverses data path 212 between vector memory 106 and vector register 110, can each be issued and executed in the same cycle. In some implementations, vector registers 110, 112 are adapted, relative to prior designs, to include additional ports that allow vector registers 110, 112 to receive concurrent load operations.

For example, vector register 112 is configured to include additional ports that allow the register to receive respective vector payloads from vector memory 108 and shared staging block 230 during concurrent load operations executed at processor core 105-2. In some examples, a single payload loaded into each of vector registers 110, 112 involves 128 separate loads, based on the up to 128 data items that can be moved to vector register 110 or vector register 112 during a single load operation.

共有メモリ１０４のＣＭＥＭロード／格納機能は、ベクトルメモリマクロを通ってデータをルーティングする必要がないので、従前の設計と比較してより高いピーク性能を提供することができる。たとえば、（データパス２１５、２３５に沿った）ロードおよび格納は、ベクトルレジスタ１１０、１１２における利用可能な追加レジスタポートなどにより、ベクトルメモリロードおよび格納と並行して実行することができる。 The CMEM load/store functions of shared memory 104 can provide higher peak performance compared to previous designs because they do not require routing data through vector memory macros. For example, loads and stores (along data paths 215, 235) can be performed in parallel with vector memory loads and stores, such as by using available additional register ports in vector registers 110, 112.

いくつかの実現例において、システム１００は、ベクトルメモリ１０６、１０８を通ってデータパスをトラバースする際に存在し得る帯域幅制限の一部（または全部）をバイパスする並列インターフェイスを共有ステージングブロック２１０、２３０の各々に提供する例示的なロード・格納インターフェイスを含む。この例示的なロード・格納インターフェイスは、例示的なワークロードから追加性能を引き出すことを可能にする、より高いメモリ帯域幅を効果的に提供することができる。たとえば、システム１００は、共有メモリ１０４のリソース（たとえばソフトウェア制御ステージングリソース）を用いてさまざまなロード／格納動作を実行するように構成され、ロード／格納動作は、データをプロセッサコアにおけるベクトルメモリの中を移動させることをバイパスするように実行され得る。 In some implementations, the system 100 includes an exemplary load-store interface that provides a parallel interface to each of the shared staging blocks 210, 230 that bypasses some (or all) of the bandwidth limitations that may exist when traversing a data path through the vector memories 106, 108. This exemplary load-store interface can effectively provide higher memory bandwidth that allows additional performance to be extracted from the exemplary workload. For example, the system 100 can be configured to perform various load/store operations using resources (e.g., software-controlled staging resources) of the shared memory 104, and the load/store operations can be performed to bypass moving data through the vector memory in the processor core.

たとえば、ハードウェア回路１０１のコンポーネントは、共有メモリ１０４と通信して、共有メモリ１０４のメモリバンクまたはレジスタファイルの単一のアドレス場所からデータを読出すことができる。いくつかの例において、メモリ内の単一のアドレスに格納されたデータが読出され、その単一のデータは、プロセッサコアの内部に位置するレジスタファイルまたはステージングブロックに移動させられ得る。たとえば、単一のデータは、共有ＣＭＥＭ１０４のアドレス場所から読出され、共有コアデータパス２２４の中を移動させられ、さらなる処理のためにプロセッサコア１０５－２内の共有ステージングブロック２３０のアドレス場所に移動させられ得る。この動作は、データをベクトルメモリ１０８を介してメモリシステムの中を移動させることをバイパスすることにより、コア１０５－２におけるプロセッサクロックサイクルとベクトルメモリ１０８に接続するデータパスにおける帯域幅とを節約するように実行され得る。 For example, a component of hardware circuit 101 may communicate with shared memory 104 to read data from a single address location of a memory bank or register file of shared memory 104. In some examples, data stored at a single address in memory is read and the single datum may be moved to a register file or staging block located internal to the processor core. For example, the single datum may be read from an address location of shared CMEM 104, moved through shared core data path 224, and moved to an address location of shared staging block 230 in processor core 105-2 for further processing. This operation may be performed to save processor clock cycles in core 105-2 and bandwidth in the data path connecting to vector memory 108 by bypassing moving the data through the memory system via vector memory 108.

FIG. 3 is a block diagram 300 illustrating an example of a vector processor in communication with a matrix computation unit of hardware circuit 101. More specifically, in some implementations, a tensor processor core 302-1 of hardware circuit 101 includes a vector processing unit 304 ("vector processor 304") and a matrix computation unit 308 coupled to vector processor 304. Similarly, another tensor processor core 302-2 of hardware circuit 101 includes a vector processor 306 and a matrix computation unit 310 coupled to vector processor 306. Thus, matrix computation units 308, 310 are respective resources of processor cores 302-1, 302-2.

一般的に、ハードウェア回路１０１は、ニューラルネットワーク層の出力を生成するための計算を実行するように構成される。回路１０１に含まれる行列計算ユニット３０８および３１０の各々は、ニューラルネットワーク層の出力を生成するために用いられる累積値を生成するための計算のサブセットを実行するように構成される。いくつかの実現例において、上述のソフトウェア制御ステージングリソース（たとえばステージングブロック２１０、２３０）は、図１に示されるＨＢＭ１０２から行列計算ユニット３０８、３１０の各々へのデータ（ベクトルオペランドなど）の流れを管理するように構成される。場合によっては、オペランドはＨＢＭ１０２によって提供される入力および重みである。オペランドは、ベクトルプロセッサ３０４または３０６の算術論理ユニット（ＡＬＵ）を用いて実行されるデータ演算に基づくベクトル配列として構造化されてもよい。 In general, the hardware circuit 101 is configured to perform calculations to generate the outputs of the neural network layers. Each of the matrix computation units 308 and 310 included in the circuit 101 is configured to perform a subset of the calculations to generate the accumulation values used to generate the outputs of the neural network layers. In some implementations, the software-controlled staging resources described above (e.g., staging blocks 210, 230) are configured to manage the flow of data (e.g., vector operands) from the HBM 102 shown in FIG. 1 to each of the matrix computation units 308, 310. In some cases, the operands are inputs and weights provided by the HBM 102. The operands may be structured as vector arrays based on data operations performed using an arithmetic logic unit (ALU) of the vector processor 304 or 306.

図３の例では、制御ユニット２０１は、共有メモリ１０４、ベクトルメモリ１０６、１０８、およびベクトルレジスタ１１０、１１２のメモリ場所から複数の入力のバッチおよび重みのセットを取出す（または読出す）動作を管理するための制御信号を生成する。取出された入力および重みをニューラルネットワーク層を通して処理して、行列計算ユニット３０８、３１０で実行される計算に基づいて累積値を生成することができる。累積値をベクトルプロセッサ３０４、３０６で処理して、ニューラルネットワーク層の出力に対応する活性化値を生成することができる。 In the example of FIG. 3, the control unit 201 generates control signals to manage the retrieval (or reading) of batches of inputs and sets of weights from memory locations in the shared memory 104, the vector memories 106, 108, and the vector registers 110, 112. The retrieved inputs and weights may be processed through neural network layers to generate accumulated values based on calculations performed in the matrix computation units 308, 310. The accumulated values may be processed in the vector processors 304, 306 to generate activation values corresponding to the outputs of the neural network layers.

Control signals generated by the control unit 201 are used to store (or write) multiple sets of outputs or output activations generated by the vector processors 304, 306 in the HBM 102 or in other memory locations of the hardware circuit 101 for processing at one or more other neural network layers. More specifically, the system 100 is configured to implement data processing techniques for performing vector reduction, which includes accumulating vectors of values at one or more memory address locations in a large shared scratchpad memory (such as the shared memory 104). As described above, vector reduction and accumulation can be performed at the shared scratchpad memory 104 based on software-managed addressing of locations in memory cells of the shared memory 104. Address locations in memory cells of the shared memory 104 can be used to write (store) results of computations performed at different components of the system 100.

The system 100 includes an operator/accumulator unit 320 ("operator 320") that is coupled (or may be coupled) to the shared memory 104. The operator 320 is configured to accumulate values based on one or more arithmetic operations. The arithmetic operations may be programmed or encoded at the operator 320 in software, firmware, hardware, or a combination of each. The operator 320 may represent a compact portion of compute logic that is coupled close to the memory cells of the shared memory 104 and that performs accumulation operations on vector values being routed to the shared memory cells of the shared memory 104.

In some implementations, the operator 320 is a computation unit that includes hardware circuitry for implementing different types of adders (e.g., normalizing adders) and multipliers, each configured to perform different types of mathematical operations on values having different types of numeric formats. For example, the operator 320 is configured to perform mathematical operations such as floating-point multiplication, floating-point addition, integer addition, and min-max operations. In some other implementations, the operator 320 is included in the system 100 as a hardware feature of the shared memory 104. One or more arithmetic operations or functions of the operator 320 may also be implemented in software and hardware.

演算器３２０は、特定の算術演算を選択するための、または特定の算術演算を実行するように構成された演算器３２０における回路を選択するための論理３２５を含み得る。いくつかの実現例において、演算器３２０は、値のベクトル内の値の１つ以上の数値フォーマット（たとえば２の補数整数および浮動小数点）に基づいて、共有メモリ１０４および／またはハードウェア回路１０１でインスタンス化される。たとえば、数値フォーマットは、ベクトルの数または数値を表すために用いられるデータフォーマットに対応する。いくつかの実現例において、演算器３２０は、正規化ユニットのための回路、プーリングユニットのための回路、またはその双方のための回路を含む。 The operator 320 may include logic 325 for selecting a particular arithmetic operation or for selecting circuitry in the operator 320 configured to perform a particular arithmetic operation. In some implementations, the operator 320 is instantiated in the shared memory 104 and/or the hardware circuitry 101 based on one or more numeric formats (e.g., two's complement integer and floating point) of the values in the vector of values. For example, the numeric format corresponds to a data format used to represent the numbers or values in the vector. In some implementations, the operator 320 includes circuitry for a normalization unit, a pooling unit, or both.
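The role of the selection logic 325 can be pictured with a small software model. The following is a minimal sketch only: the operation names, the `OPERATIONS` table, and the functions `select_operation` and `apply_operator` are hypothetical and do not come from the specification, which describes hardware circuitry rather than code.

```python
import operator

# Hypothetical table mapping an operation name to the "circuit" that performs it.
# In hardware this would be dedicated adder, multiplier, or min/max circuitry.
OPERATIONS = {
    "add": operator.add,
    "mul": operator.mul,
    "min": min,
    "max": max,
}


def select_operation(name):
    """Stand-in for selection logic 325: choose the function for one operation."""
    try:
        return OPERATIONS[name]
    except KeyError:
        raise ValueError(f"unsupported operation: {name}") from None


def apply_operator(name, stored, incoming):
    """Combine a stored vector with an incoming vector, element by element."""
    if len(stored) != len(incoming):
        raise ValueError("vectors must have the same length")
    op = select_operation(name)
    return [op(s, v) for s, v in zip(stored, incoming)]


if __name__ == "__main__":
    print(apply_operator("add", [1, 2], [3, 4]))
    print(apply_operator("max", [1, 5], [3, 4]))
```

A table-driven dispatch is used here only because it mirrors the idea of selecting one circuit among several; selection by numeric format could be modeled the same way with a second key.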

上述のように、記載されている技術は、値のベクトルを処理する際に（たとえばベクトル縮小のための）累積縮小ステップを実行するためのメモリへの累積機能を含む。図３の例では、プロセッサコア３０２－１、３０２－２の各々は、それぞれの累積フラグ３３０、３３５を生成して、共有メモリ１０４の制御ユニット２０１に、値の例示的なベクトルに対してメモリへの累積機能を実行させることができる。値のベクトルは、たとえばデータパス２０６またはデータパス２２６を使用してベクトルを共有メモリ１０４に移動させるＤＭＡ動作を用いて、共有メモリ１０４に移動させることができる。 As mentioned above, the described techniques include an accumulate to memory function for performing an accumulated reduction step (e.g., for vector reduction) when processing a vector of values. In the example of FIG. 3, each of the processor cores 302-1, 302-2 may generate a respective accumulation flag 330, 335 to cause the control unit 201 of the shared memory 104 to perform an accumulate to memory function on an exemplary vector of values. The vector of values may be moved to the shared memory 104, for example, using a DMA operation that moves the vector to the shared memory 104 using data path 206 or data path 226.

図４は、例示的な累積パイプライン４００（「パイプライン４００」）を示すブロック図である。パイプライン４００は、共有メモリ１０４の共有メモリセル４４５に値のベクトルを累積する例示的な動作のための例示的なデータ処理ステップを示す。 FIG. 4 is a block diagram illustrating an exemplary accumulation pipeline 400 ("pipeline 400"). Pipeline 400 illustrates exemplary data processing steps for an exemplary operation of accumulating a vector of values in a shared memory cell 445 of shared memory 104.

個々の入力および重み値などのベクトルオペランドは、プロセッサコアの例示的な行列ユニットの乗算セルを用いて乗算されてからコアのベクトルメモリに格納されるテンソル値として表され得る（４０２）。いくつかの実現例において、ベクトルオペランドの入力は、入力行列または入力テンソルのパーティションに対応する。たとえば、入力テンソルは２つのセクションに分割されてもよく、各セクションの異なるそれぞれの次元からの入力値が、重み値と乗算されて出力値を生成するように特定のプロセッサコアに送信されてもよい。入力テンソルについては、重みテンソルおよび出力テンソルとともに、図５を参照して以下でより詳細に説明する。 Vector operands, such as individual input and weight values, may be represented as tensor values that are multiplied using multiplication cells of an exemplary matrix unit of the processor core and then stored in the core's vector memory (402). In some implementations, the inputs of the vector operands correspond to partitions of an input matrix or input tensor. For example, an input tensor may be split into two sections, and input values from different respective dimensions of each section may be sent to a particular processor core to be multiplied with weight values to generate output values. The input tensors, along with the weight tensors and output tensors, are described in more detail below with reference to FIG. 5.

The final result vector 450 may be based on a final set of output values that represent the outputs computed for a layer of the neural network using each of the inputs of the input matrix/tensor. So, even though the data/input values of the input tensor may be split to be processed at different processor cores, generating a correct and accurate final result vector 450 actually depends on the correct and accurate accumulation of at least two different sets of output values generated by the respective cores. For example, the different sets of output values generated by the respective cores need to be summed or accumulated to generate the correct final result vector 450.

In the example of FIG. 4, the respective processor cores are shown as core_0 (e.g., processor core 302-1) and core_1 (e.g., processor core 302-2). Multiple output values may be generated in response to the matrix multiplications performed by the respective matrix unit (e.g., matrix unit 308 or 310) of each processor core. In some implementations, the output values are stored in the vector memory of the processor core that performs the matrix multiplication before being sent to the shared memory 104 for the accumulation operation. The final result vector 450 can be obtained based on both processor cores aggregating their respective halves of the computations assigned to them. In some implementations, the aggregation to obtain the final result vector corresponds to a "pre-accumulation operation".

ベクトル値を累積する従前のアプローチでは、１つのコアがその結果を別のコアに移動させることが必要であった。これらのアプローチでは、結果値の異なるセットをシステムの異なるコアの間で移動させるために、追加のプロセッササイクル、メモリリソース、計算帯域幅、および特定のソフトウェア制御が必要であった。本明細書の累積縮小技術は、これらの集約が、共有メモリ１０４でネイティブに実行可能な累積機能に基づいて共有メモリシステムで行われることを可能にする。 Previous approaches to accumulating vector values required one core to move its results to another core. These approaches required additional processor cycles, memory resources, computational bandwidth, and specific software control to move different sets of result values between different cores in the system. The accumulation reduction techniques herein allow these aggregations to occur in shared memory systems based on accumulation functions natively executable in the shared memory 104.

プロセッサコアの各々は、それぞれの累積フラグ３３０、３３５を生成して、共有メモリ１０４の制御ユニット２０１に、値の例示的なベクトルに対してメモリへの累積機能を実行させることができる（４０４）。各プロセッサコア１０５で生成された値のベクトルは、上述のようにＤＭＡ動作を用いて共有メモリ１０４に移動させることができる。共有スクラッチパッドメモリ１０４の共有メモリセルまたはアドレス場所に値のベクトルを累積するための技術は、システム１００のプログラマブルＤＭＡデータ転送機能を介して実行することができる。たとえば、共有メモリ１０４のメモリセル内にデータを移動させるように動作可能な任意のＤＭＡ動作は、本文書に記載されている累積技術を用いることができる。このように、図２および図３の例におけるコア０およびコア１の各々は、双方が、共有メモリ１０４の特定の共有メモリセルの同一のアドレス場所にベクトル値を累積することができる。一例では、累積は、コア０によって提供される第１のベクトルの値とコア１によって提供される第２のベクトルの対応する値とのペアワイズ累積に関係する。 Each of the processor cores may generate a respective accumulation flag 330, 335 to cause the control unit 201 of the shared memory 104 to perform an accumulation to memory function on the exemplary vector of values (404). The vector of values generated by each processor core 105 may be moved to the shared memory 104 using DMA operations as described above. The technique for accumulating the vector of values into shared memory cells or address locations of the shared scratch pad memory 104 may be performed via the programmable DMA data transfer functionality of the system 100. For example, any DMA operation operable to move data into memory cells of the shared memory 104 may use the accumulation techniques described in this document. In this manner, each of the cores 0 and 1 in the examples of Figures 2 and 3 may both accumulate vector values into the same address locations of particular shared memory cells of the shared memory 104. In one example, the accumulation involves a pair-wise accumulation of values of a first vector provided by core 0 and corresponding values of a second vector provided by core 1.
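The pairwise accumulation of two cores' vectors into the same address location can be illustrated as follows. This is a sketch under stated assumptions: the class `SharedMemory`, the method `dma_write`, and the boolean `accumulate` argument (standing in for accumulation flags 330, 335) are illustrative names, and treating an accumulate request to an empty location as an initial write is an assumption not stated in the text.

```python
class SharedMemory:
    """Toy model of shared scratchpad memory with an accumulate-to-memory write."""

    def __init__(self):
        self.cells = {}  # address -> list of values (one vector per address)

    def dma_write(self, address, vector, accumulate):
        """Write `vector` to `address`; accumulate element-wise if the flag is set.

        Assumption: accumulating into an address that holds nothing yet
        behaves like an initial write.
        """
        current = self.cells.get(address)
        if accumulate and current is not None:
            if len(current) != len(vector):
                raise ValueError("vector width does not match stored vector")
            self.cells[address] = [c + v for c, v in zip(current, vector)]
        else:
            self.cells[address] = list(vector)


if __name__ == "__main__":
    memory = SharedMemory()
    # core 0 writes its vector as the initial value (flag cleared)
    memory.dma_write(0x10, [1, 2, 3], accumulate=False)
    # core 1 accumulates its vector onto the same address (flag set)
    memory.dma_write(0x10, [10, 20, 30], accumulate=True)
    print(memory.cells[0x10])
```

Neither core reads the other's data in this model; each only issues a write, and the combination happens at the memory.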

In some implementations, the system 100 is configured to provide large-scale vector "store accumulate" in a load/store usage mode of the shared memory 104, rather than using the DMA mode of the shared memory. For example, a shared load/store memory layer between multiple processor cores can be used to implement a "store accumulate" function that removes some (or all) of the need for synchronization between the processor cores. In some implementations, the shared load/store memory layer between multiple processor cores used to implement the store accumulate function includes at least the data paths 212, 232 described above with reference to FIG. 2.

共有メモリ１０４および制御ユニット２０１は、演算器３２０を用いて、値のベクトルのそれぞれに対して累積演算を実行する（４０６）。たとえば、制御ユニット２０１は、ニューラルネットワークの層を通して処理される入力の異なるセットに対して実行される行列乗算全体にわたって累積縮小ステップを実行して、層の出力を生成する。いくつかの実現例において、値のベクトルは、上述の行列乗算の結果として生成される累積値のベクトルのそれぞれであり得る。 The shared memory 104 and the control unit 201 perform an accumulation operation on each of the vectors of values using the operator 320 (406). For example, the control unit 201 performs an accumulation reduction step across matrix multiplications performed on different sets of inputs processed through the layer of the neural network to generate the output of the layer. In some implementations, the vectors of values can be each of the vectors of accumulated values generated as a result of the matrix multiplications described above.

制御ユニット２０１は、特定のベクトル要素における累積を有効または無効にするように１つ以上のベクトル要素をマスクし、異なるベクトルの累積を管理するための制御を実行し、未処理の累積演算を追跡するように構成される（４０８）。 The control unit 201 is configured to mask one or more vector elements to enable or disable accumulation on particular vector elements, perform control to manage accumulation of different vectors, and keep track of outstanding accumulation operations (408).

マスク要素に関して、システム１００は、１６Ｂ（１６ビット）幅のベクトルユニット（たとえばベクトルプロセッサ）を各々が含むマシン（計算サーバまたは関連のハードウェア回路など）を含み得る。ベクトルユニットは、１６ビット幅のデータ要素に対して動作するように構成され得るが、ハードウェア回路（またはサーバ）のリソースによって生成される値のベクトルはわずか９Ｂ幅のベクトルであってもよい。いくつかの実現例において、システム１００は、１つ以上の９要素幅のベクトルに対して動作し、これらのベクトルの各々は、各々が１６ビットである９個のデータ値を含む。この例では、制御ユニット２０１は、共有メモリ１０４の共有メモリ場所に累積すべき値のベクトルのデータ構造を識別することができる。制御ユニット２０１は、このデータ構造に基づいて、共有場所に累積すべき値が、ベクトルユニットの１６Ｂ幅のベクトル構成に対して９Ｂ幅のベクトルであると判断することができる。 With respect to mask elements, the system 100 may include machines (such as computational servers or associated hardware circuitry) that each include a 16B (16-bit) wide vector unit (e.g., a vector processor). The vector units may be configured to operate on 16-bit wide data elements, but the vectors of values generated by the hardware circuit's (or server's) resources may be only 9B wide vectors. In some implementations, the system 100 operates on one or more 9-element wide vectors, each of which includes nine data values that are 16 bits each. In this example, the control unit 201 may identify a data structure for the vector of values to be accumulated in a shared memory location of the shared memory 104. Based on this data structure, the control unit 201 may determine that the values to be accumulated in the shared location are 9B wide vectors for the 16B wide vector configuration of the vector units.

The control unit 201 may execute a mask function 430 that causes the operator 320 to apply arithmetic operations to, for example, only the first 9 fields in a vector when performing an accumulation or reduction. For example, a request from the processor core 302-1 to the shared memory 104 to accumulate a vector into the shared memory cell 445 may be presented as a 16B-wide vector based on the configuration of the vector processing unit 304 of the processor core 302-1. The control unit 201 is configured to determine whether the values being accumulated are represented by the second half of the 16B-wide vector or by a 9B-wide vector represented by the first 9 fields of the 16B-wide vector. Thus, the system 100 is operable to identify and select, or otherwise control, which particular elements in a vector are accumulated into the shared memory cell 445.
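The effect of the mask function 430 can be sketched in a few lines. The function name `masked_accumulate` and the parameter `active_lanes` are hypothetical; the sketch assumes the enabled elements are a leading prefix of the vector, as in the "first 9 fields" example, and uses addition as the accumulation operation.

```python
def masked_accumulate(stored, incoming, active_lanes):
    """Accumulate only the first `active_lanes` elements; leave the rest untouched."""
    if len(stored) != len(incoming):
        raise ValueError("vectors must have the same length")
    if not 0 <= active_lanes <= len(stored):
        raise ValueError("active_lanes is out of range")
    # Build a per-element enable mask: True where accumulation is allowed.
    mask = [index < active_lanes for index in range(len(stored))]
    return [s + v if enabled else s for s, v, enabled in zip(stored, incoming, mask)]


if __name__ == "__main__":
    stored = [1] * 16      # contents of the shared cell, 16 elements wide
    incoming = [2] * 16    # request presented at the full 16-element width
    # Only the first 9 elements carry valid data, so only they are accumulated.
    print(masked_accumulate(stored, incoming, active_lanes=9))
```

A mask selecting the second half of the vector, the other case mentioned above, would follow the same pattern with a different predicate for building `mask`.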

累積制御に関して、制御ユニット２０１は、読出・修正・書込制御４３５（「制御４３５」）を実行して、共有メモリシステムにおける値の異なるベクトルの累積を制御および管理するように構成される。制御４３５は、第１のコアでデータを読出すこと、読出した値に対して、第１のコアから離れている計算ユニットで計算を実行すること、およびその後に第１のコアへの格納／書戻しを行うことが必要な、非効率な代替アプローチに対する性能およびエネルギーの向上を提供する。 With respect to accumulation control, control unit 201 is configured to execute read-modify-write control 435 ("control 435") to control and manage the accumulation of distinct vectors of values in the shared memory system. Control 435 provides performance and energy improvements over inefficient alternative approaches that require reading data in a first core, performing computations on the read values in a computation unit separate from the first core, and then storing/writing back to the first core.

未処理の動作の追跡に関して、制御ユニット２０１は動作トラッカー４４０を実行して、値の異なるベクトルを共有メモリシステムに累積するための未処理の要求および現在の（または待機中の）動作を追跡するように構成される。たとえば、制御ユニット２０１は動作トラッカー４４０を用いて、共有メモリのメモリ場所（共有メモリセル４４５など）に値のベクトルを書込むことを要求する各書込み動作を追跡する。いくつかの実現例において、制御ユニット２０１は、プロセッサコアからの書込み要求に付随する累積フラグ３３０、３３５に基づいて動作を追跡する。累積フラグ３３０、３３５は、値のベクトルが、共有メモリ１０４の特定のメモリ場所で初期値として書込まれるべきであること、または既存の値とともに累積されるべきであることを示す。 With respect to tracking of outstanding operations, the control unit 201 is configured to execute an operation tracker 440 to track outstanding requests and current (or pending) operations to accumulate different vectors of values in the shared memory system. For example, the control unit 201 uses the operation tracker 440 to track each write operation that requests to write a vector of values to a memory location (such as the shared memory cell 445) of the shared memory. In some implementations, the control unit 201 tracks the operations based on an accumulation flag 330, 335 that accompanies the write request from the processor core. The accumulation flag 330, 335 indicates that the vector of values should be written as an initial value or accumulated with an existing value at a particular memory location of the shared memory 104.

The control unit 201 sends control signals to the operator 320 to cause the operator 320 to perform an accumulation operation between the current value stored at a particular memory address location and the vector of values being written to that shared memory location. In some implementations, a request from a processor core to write a vector of values to the shared memory cell 445 requires at least two clock cycles to process. Because processing this write request may require at least two clock cycles, a read/write hazard can occur if another vector is written to the same shared memory location while the control unit 201 is attempting to read the value at that location. In this case, the value being read is not the most recent value, because the write operation was not fully processed before the read was performed.

The control unit 201 uses the operation tracker 440 to determine which requests were sent to the shared memory 104 in the last few clock cycles and whether the value stored at a particular memory location is stale or fresh. The control unit 201 can determine whether the value is stale or fresh based on a timestamp of the last write request or based on the time required to process the last write request (e.g., two or more clock cycles). For example, the timestamp can indicate that three or more clock cycles have elapsed since the last request was initiated or processed at the shared memory 104. If the value is determined to be fresh, the control unit 201 reads the value. If the value is determined to be stale, the control unit 201 stalls the read until the number of clock cycles required for the value to be fresh again for reading or accumulation has elapsed.
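The stale-or-fresh decision can be modeled with a per-address record of the last write. This is a simplified sketch: the class `OperationTracker`, its methods, and the constant `WRITE_LATENCY_CYCLES` are hypothetical, and the two-cycle latency is taken from the "at least two clock cycles" example rather than from any stated hardware parameter.

```python
# Assumption: a write takes this many cycles to land; the text says "at least two".
WRITE_LATENCY_CYCLES = 2


class OperationTracker:
    """Toy model of operation tracker 440: remembers when each address was last written."""

    def __init__(self, write_latency=WRITE_LATENCY_CYCLES):
        self.write_latency = write_latency
        self.last_write_cycle = {}  # address -> cycle at which the last write was issued

    def record_write(self, address, cycle):
        self.last_write_cycle[address] = cycle

    def is_fresh(self, address, cycle):
        """True if no write to `address` is still in flight at `cycle`."""
        last = self.last_write_cycle.get(address)
        return last is None or cycle - last >= self.write_latency

    def earliest_read_cycle(self, address, cycle):
        """Cycle at which a read requested at `cycle` can safely proceed (stall if later)."""
        last = self.last_write_cycle.get(address)
        if last is None:
            return cycle
        return max(cycle, last + self.write_latency)


if __name__ == "__main__":
    tracker = OperationTracker()
    tracker.record_write(0x40, cycle=10)
    print(tracker.is_fresh(0x40, cycle=11))             # write still in flight
    print(tracker.earliest_read_cycle(0x40, cycle=11))  # read is stalled until here
```

Tracking only the most recent write per address is enough for this check because an older write to the same address always completes no later than a newer one in this model.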

システム１００は、共有メモリ場所４４５に格納されている以前の累積を失うことなく、値（たとえばベクトル）を受信してそれを共有メモリ場所の既存の値に累積するように構成される（４１０）。たとえば、システム１００は、メモリ場所（共有メモリセル４４５など）に以前に格納されたベクトル累積を上書き可能な競合状態を緩和するための外部ソフトウェアロックを必要とせずに累積演算を実行するように構成される。システム１００は、ローカルな事前累積演算をそれぞれのプロセッサコアで実行することを必要とせずに、かつ、プロセッサコアの間の事前同期を必要とせずに、累積演算を実行する。たとえば、ローカルな事前累積演算を実行して、所与のプロセッサコアでローカルに計算される部分和のそれぞれのセットを累積してもよい。 The system 100 is configured to receive a value (e.g., a vector) and accumulate it onto an existing value in a shared memory location (410) without losing any previous accumulations stored in the shared memory location 445. For example, the system 100 is configured to perform an accumulation operation without requiring external software locks to mitigate race conditions that may overwrite a vector accumulation previously stored in a memory location (e.g., the shared memory cell 445). The system 100 performs the accumulation operation without requiring a local pre-accumulation operation to be performed on each processor core and without requiring pre-synchronization between the processor cores. For example, a local pre-accumulation operation may be performed to accumulate each set of partial sums that are calculated locally on a given processor core.

The shared memory 104 is configured to natively support this functionality, which represents the atomic aspect of the vector reduction feature of the present technology. For example, multiple cores (e.g., 10 cores) of the system 100 may all be generating different vectors of values, and each core can submit a request to accumulate its respective vector at a shared memory location. In some implementations, the request includes an accumulation flag 330, 335, a corresponding core ID (e.g., core 0, core 1, core N), and the values to be accumulated at the memory location. In some implementations, a large matrix multiplication job may be split between at least two processor cores of the system 100, and this accumulation/vector reduction technique is used to simplify the accumulation of the partial sums or dot products generated from the matrix multiplication.
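The atomic behavior described above can be imitated in software by letting the memory object serialize accumulation requests internally, so that the requesting cores need no lock of their own. In this sketch threads stand in for processor cores, and `AtomicSharedCell`, `accumulate`, and `run_cores` are hypothetical names; the internal `threading.Lock` is only a software substitute for whatever serialization the hardware provides.

```python
import threading


class AtomicSharedCell:
    """Toy shared memory cell whose accumulate operation is atomic by construction."""

    def __init__(self, width):
        self._values = [0] * width
        self._lock = threading.Lock()  # internal to the memory, invisible to the cores
        self.contributors = []         # core IDs that have accumulated, in arrival order

    def accumulate(self, core_id, vector):
        """Read-modify-write the stored vector as one indivisible step."""
        with self._lock:
            self._values = [s + v for s, v in zip(self._values, vector)]
            self.contributors.append(core_id)

    def read(self):
        with self._lock:
            return list(self._values)


def run_cores(num_cores, width):
    """Each simulated core accumulates a vector filled with (core_id + 1)."""
    cell = AtomicSharedCell(width)
    threads = [
        threading.Thread(target=cell.accumulate, args=(core_id, [core_id + 1] * width))
        for core_id in range(num_cores)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return cell


if __name__ == "__main__":
    cell = run_cores(num_cores=10, width=4)
    print(cell.read())  # every element is 1 + 2 + ... + 10
```

The arrival order of the cores may differ from run to run, but because addition is commutative the accumulated result does not depend on it.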

In some implementations, these techniques for accumulating vectors of values at a shared memory cell are used during training of a neural network model. For example, the techniques can be used to implement an all-reduce operation for gradient accumulation, which reduces gradients computed as part of a training step across a distributed system of processor cores. In particular, based on the disclosed accumulation reduction techniques, this gradient accumulation for training a neural network model can be implemented natively at the system 100 as a function of the memory system or the shared memory 104.

FIG. 5 illustrates examples of tensors or multi-dimensional matrices 500 that include an input tensor 504, variations of a weight tensor 506, and an output tensor 508. In FIG. 5, each of the tensors 500 includes elements that correspond to data values for computations performed at a given layer of a neural network. The computations may include multiplying the input/activation tensor 504 with the parameter/weight tensor 506 over one or more clock cycles to generate outputs or output values. Each output value in the set of outputs corresponds to a respective element of the output tensor 508. Multiplying the activation tensor 504 with the weight tensor 506 includes multiplying activations from elements of the tensor 504 with weights from elements of the tensor 506 to generate partial sum(s).

いくつかの実現例において、システム１００のプロセッサコアは、ｉ）ある多次元テンソルにおける離散要素、ｉｉ）ある多次元テンソルの同一のまたは異なる次元に沿った複数の離散要素を含む値のベクトル、またはｉｉｉ）各々の組み合わせ、に対応するベクトルに対して動作する。ある多次元テンソルにおける離散要素または複数の離散要素の各々は、テンソルの次元に応じて、Ｘ、Ｙ座標（２Ｄ）を用いてまたはＸ、Ｙ、Ｚ座標（３Ｄ）を用いて表すことができる。 In some implementations, the processor cores of the system 100 operate on vectors that correspond to i) discrete elements in a multidimensional tensor, ii) vectors of values that include multiple discrete elements along the same or different dimensions of a multidimensional tensor, or iii) combinations of each. Each discrete element or multiple discrete elements in a multidimensional tensor can be represented using X,Y coordinates (2D) or using X,Y,Z coordinates (3D), depending on the dimensionality of the tensor.

システム１００は、バッチ入力を対応する重み値と乗算することにより生成された積に対応する複数の部分和を計算することができる。上述のように、システム１００は、多数のクロックサイクルにわたって積（たとえば部分和）の累算を実行することができる。たとえば、積の累算は、本文書に記載されている技術に基づいて共有メモリ１０４において実行することができる。いくつかの実現例において、入力・重み乗算は、各重み要素を入力ボリュームの離散入力（入力テンソル５０４の行またはスライスなど）の離散入力と乗算した積和として書くことができる。この行またはスライスは、所与の次元（入力テンソル５０４の第１の次元５１０、または入力テンソル５０４の第２の異なる次元５１５など）を表し得る。 The system 100 can calculate multiple partial sums corresponding to products generated by multiplying the batch inputs with corresponding weight values. As described above, the system 100 can perform the accumulation of products (e.g., partial sums) over multiple clock cycles. For example, the accumulation of products can be performed in the shared memory 104 based on techniques described herein. In some implementations, the input-weight multiplication can be written as a sum of products where each weight element is multiplied with a discrete input of the input volume (e.g., a row or slice of the input tensor 504). The row or slice can represent a given dimension (e.g., a first dimension 510 of the input tensor 504, or a second, different dimension 515 of the input tensor 504).
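A small worked example shows why accumulating per-core partial sums reproduces the full sum of products. The sketch assumes two cores and a split along the inner (input) dimension; the function names `matvec` and `split_and_accumulate` and the sample values are illustrative only.

```python
def matvec(weights, inputs):
    """Sum of products: one dot product per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def split_and_accumulate(weights, inputs, split):
    """Compute partial sums on two simulated cores and accumulate them in one cell."""
    # core 0 handles the first `split` inputs, core 1 handles the remainder
    partial_0 = matvec([row[:split] for row in weights], inputs[:split])
    partial_1 = matvec([row[split:] for row in weights], inputs[split:])
    shared_cell = list(partial_0)                                   # initial write
    shared_cell = [s + p for s, p in zip(shared_cell, partial_1)]   # accumulate
    return shared_cell


if __name__ == "__main__":
    weights = [[1, 2, 3, 4],
               [5, 6, 7, 8]]
    inputs = [1, 0, 2, 1]
    print(matvec(weights, inputs))                         # single-core reference
    print(split_and_accumulate(weights, inputs, split=2))  # two-core accumulation
```

For these values core 0 contributes [1, 5] and core 1 contributes [10, 22], and their element-wise sum matches the single-core result [11, 27].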

In some implementations, an example set of computations can be used to compute an output for a convolutional neural network layer. The computations for the CNN layer can involve performing a 2D spatial convolution between a 3D input tensor 504 and at least one 3D filter (weight tensor 506). For example, convolving one 3D filter 506 over the 3D input tensor 504 can produce a 2D spatial plane 520 or 525. The computations can involve computing sums of dot products for a particular dimension of the input volume.

For example, the spatial plane 520 can include output values for sums of products computed from inputs along dimension 510, while the spatial plane 525 can include output values for sums of products computed from inputs along dimension 515. The computations that generate the sums of products for the output values in each of the spatial planes 520 and 525 can be performed in shared memory 104 (e.g., at shared memory cell 445) using the accumulating reduction steps described in this document.
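The convolution described above can be sketched in plain Python as follows. This is a minimal example under stated assumptions: the input is indexed as `[channel][row][column]`, the stride is 1, no padding is used, and the filter is applied without flipping (as is common for CNN layers). The function name `conv2d_single_filter` is hypothetical and the specification does not prescribe this layout.

```python
def conv2d_single_filter(inp, filt):
    """Convolve one 3D filter over a 3D input and return a 2D plane.

    inp:  nested lists indexed [channel][row][col]
    filt: nested lists indexed [channel][row][col], same channel count
    Assumptions (illustrative only): stride 1, no padding, no filter flip.
    """
    channels = len(inp)
    height, width = len(inp[0]), len(inp[0][0])
    k_h, k_w = len(filt[0]), len(filt[0][0])
    out_h, out_w = height - k_h + 1, width - k_w + 1

    plane = [[0.0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            acc = 0.0
            # Each output value is a sum of products over the filter window.
            for c in range(channels):
                for dy in range(k_h):
                    for dx in range(k_w):
                        acc += inp[c][y + dy][x + dx] * filt[c][dy][dx]
            plane[y][x] = acc
    return plane


if __name__ == "__main__":
    ones_input = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]
    ones_filter = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
    print(conv2d_single_filter(ones_input, ones_filter))
```

Each `acc` value here plays the role of a sum of products that, in the described system, could be accumulated at a shared memory cell rather than inside a single core.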

FIG. 6 is a flow diagram that illustrates an example process 600 for performing vector reductions using a shared scratchpad memory of a hardware circuit that has processor cores in communication with the shared memory. In some implementations, process 600 is part of a technique used to accelerate neural network computations using the shared memory of FIG. 1.

Process 600 can be implemented or executed using the system 100 described above. Descriptions of process 600 may reference the above-mentioned computing resources of system 100. The steps or actions of process 600 may be enabled by programmed firmware or software instructions that are executable by one or more processors of the devices and resources described in this document. In some implementations, the steps of process 600 correspond to a method for performing computations to generate an output for a neural network layer using a hardware circuit configured to implement the neural network.

Referring now to process 600, vectors of values are generated at system 100 (602). For example, a respective vector of values is generated for each processor core included in one or more hardware circuits of system 100, based on computations performed at that processor core.

A shared memory of system 100 receives the respective vectors of values (604). For example, the shared memory 104 receives the vectors of values from respective resources of the processor cores using a direct memory access (DMA) data path of the shared memory 104. In some implementations, a vector or vector of values is generated by a single processor core (or by each of multiple processor cores) and then provided to the shared memory of system 100, which performs computations using the vector of values. For example, the shared memory can obtain a vector from a first processor core and perform a reduction operation using the obtained vector and one or more other vectors. The one or more other vectors may have been received or obtained from processor cores other than the first processor core.

In some other implementations, system 100 is configured to execute a direct store operation together with an accumulate operation. For example, system 100 can generate an accumulate flag 330, 335 that is used to store one or more vectors of values directly in a shared memory location of the shared memory 104. The vectors can be from a single processor core or from multiple different processor cores. For example, the processor core 105-1 or 302-2 can generate a control signal that represents the accumulate flag and pass the control signal to the control unit 201 of the shared memory 104. System 100 can be configured to store a vector of values in the vector memory 106, 108 and then execute a DMA operation that moves the vector of values from the vector memory to the shared memory 104.
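The effect of an accumulate flag on a store can be modeled in software roughly as follows. This is only a behavioral sketch: the class `SharedMemory`, the method `dma_write`, and the `accumulate_flag` parameter are names invented for this example, and element-wise addition is assumed as the accumulate operation.

```python
class SharedMemory:
    """Toy model of shared memory locations addressed by an integer key.

    Hypothetical interface for illustration; not the hardware's actual API.
    """

    def __init__(self):
        self.cells = {}

    def dma_write(self, address, vector, accumulate_flag=False):
        """Store `vector` at `address`, or add it to what is already stored.

        With the flag clear, the write overwrites the location.
        With the flag set, the incoming values are added element-wise.
        """
        if accumulate_flag and address in self.cells:
            stored = self.cells[address]
            self.cells[address] = [old + new for old, new in zip(stored, vector)]
        else:
            self.cells[address] = list(vector)
        return self.cells[address]


if __name__ == "__main__":
    mem = SharedMemory()
    mem.dma_write(0, [1.0, 2.0])                        # plain store
    print(mem.dma_write(0, [0.5, 0.5], accumulate_flag=True))  # accumulate
```

The same write path serves both cases, so a core only has to set or clear the flag rather than read the location back and add locally.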

System 100 performs an accumulate operation on the respective vectors of values (606). More specifically, the shared memory 104 performs the accumulate operation as the respective vectors of values are written to a shared memory location. For example, system 100 causes the shared memory 104 to perform the accumulate operation on the respective vectors of values using an operator unit 320 coupled to the shared memory 104. System 100 is operable to accumulate different elements (or values) of the same vector as well as values that correspond to elements of different vectors. The operator unit 320 is configured to accumulate values based on an arithmetic operation encoded at the operator unit. In some implementations, the arithmetic operation is a mathematical operation governed by a commutative property. The arithmetic operation can involve atomic reductions (e.g., atomic floating-point reductions).
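One way to picture an operator unit with an encoded arithmetic operation is a function that combines a stored vector with an incoming vector element by element, using a selectable commutative operation. The sketch below assumes three operations (addition, minimum, maximum); the table `OPERATIONS` and the function `accumulate` are hypothetical names used only for this illustration.

```python
import operator

# Commutative operations the modeled operator unit can be configured with.
# The set of names here is an assumption for the example.
OPERATIONS = {
    "add": operator.add,
    "min": min,
    "max": max,
}


def accumulate(stored, incoming, op_name="add"):
    """Combine two equal-length vectors element-wise with the chosen operation."""
    op = OPERATIONS[op_name]
    return [op(s, v) for s, v in zip(stored, incoming)]


if __name__ == "__main__":
    stored = [1, 5, 3]
    incoming = [4, 2, 6]
    for name in OPERATIONS:
        print(name, accumulate(stored, incoming, name))
```

Because each operation is commutative, swapping `stored` and `incoming` gives the same result, which is what allows vectors to be combined without regard to which one arrived first.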

For example, the atomic reductions are processed as accumulate or vector reduction steps in which vectors of values are accumulated directly into a memory location, such as a shared cell, of the shared memory. In one example, system 100 accumulates multiple vectors generated by multiple different cores as part of an accumulate operation. In another example, system 100 accumulates a value (e.g., a vector) that is already stored in the shared memory 104, such as in a shared cell of the memory, with a value produced by a core. In another example, system 100 accumulates multiple vectors generated by multiple different cores with one or more values already stored in the shared memory 104. The preceding examples, which involve vectors produced at a core and values already stored in the shared memory, also apply to reduction operations and to other types of arithmetic operations that can be implemented using the operator unit 320.

In some other implementations, each of the processor cores 302-1, 302-2 provides vectors that require accumulation, and the values are accumulated directly into the memory location without synchronizing activity between the processor cores 302-1, 302-2. Similarly, the values can be accumulated directly into the memory location without either of the processor cores 302-1, 302-2 having to perform a step of pre-accumulating products (e.g., partial sums) that result from computations performed at either core. In other words, two or more cores of system 100 can accumulate vectors of values, including partial sums, in any order into an address location (e.g., a central address location) of a shared memory cell of memory 104. System 100 is configurable such that, in some implementations, no pre-accumulation operation needs to occur locally at a core, whereas in some other implementations a portion of the partial sums, or certain types of partial sums, can be accumulated at a given core.
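The order-independent behavior described above can be approximated in software with threads standing in for cores. In this sketch a `threading.Lock` plays the role of the hardware's atomic read-modify-write at the shared cell; that substitution, the function name `accumulate_from_cores`, and the use of integer values are all assumptions for the example. Integers are used so the result is exact for every arrival order; floating-point addition is commutative but not associative, so low-order bits can differ with ordering.

```python
import threading


def accumulate_from_cores(core_vectors):
    """Let each 'core' add its vector into one shared cell, in any order.

    The cores never coordinate with each other; only the update of the
    shared cell itself is made atomic (modeled here with a lock).
    """
    width = len(core_vectors[0])
    shared_cell = [0] * width
    cell_lock = threading.Lock()

    def core(vector):
        # No pre-accumulation in the core: the vector is sent as-is.
        with cell_lock:
            for i, value in enumerate(vector):
                shared_cell[i] += value

    threads = [threading.Thread(target=core, args=(vec,)) for vec in core_vectors]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared_cell


if __name__ == "__main__":
    print(accumulate_from_cores([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
```

Whatever order the threads happen to run in, the shared cell ends up holding the element-wise total.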

System 100 generates a result vector (e.g., a final result vector) based on the accumulate operation (608). For example, system 100 generates the final result vector based on performing the accumulate operation using one or more vectors of values and a vector stored in the shared memory. In some implementations, system 100 produces a result vector in which each element results from applying the accumulate operation pairwise to an element of a first vector and the corresponding element of the vector stored in the shared memory. The result vector provides the correct mathematical result of the accumulation regardless of the order in which the respective vectors that are accumulated to generate the final result arrive at the shared memory cell.

In some implementations, to achieve this desired result, one or more DMA operations may be detected, prioritized, and sequenced based on control 435 (e.g., a read-modify-write control loop) that is executed using the control unit 201 and at least the control logic 325 of the operator unit 320. For example, the control unit 201 can detect a set of incoming vectors or vector values, including the corresponding cores that provide the vectors, and use the operator unit 320 to serialize the individual accumulate operations based on a given priority scheme specified by control 435. The priority scheme can be used to stall or reorder write traffic as needed so that vector values are not accumulated against stale vector values.
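A software analogy for serializing accumulate operations under a priority scheme is sketched below. The specification does not describe the actual scheme, so everything here is assumed for illustration: pending writes carry a numeric priority (lower goes first), ties are broken by arrival order, and each write is applied as a complete read-modify-write before the next one starts. The function name `serialize_accumulations` is hypothetical.

```python
import heapq


def serialize_accumulations(stored, pending_writes):
    """Apply pending accumulate writes one at a time.

    pending_writes: list of (priority, core_id, vector) tuples.
    Returns the final stored vector and the order in which cores were served.
    Illustrative scheme only: lower priority value first, then arrival order.
    """
    # The arrival index makes every heap entry unique, so ties never fall
    # through to comparing core ids or vectors.
    queue = [
        (priority, arrival, core_id, vector)
        for arrival, (priority, core_id, vector) in enumerate(pending_writes)
    ]
    heapq.heapify(queue)

    served = []
    while queue:
        _, _, core_id, vector = heapq.heappop(queue)
        current = list(stored)                               # read
        updated = [c + v for c, v in zip(current, vector)]   # modify
        stored = updated                                     # write
        served.append(core_id)
    return stored, served


if __name__ == "__main__":
    writes = [(1, "core_b", [1, 1]), (0, "core_a", [2, 3]), (1, "core_c", [4, 4])]
    print(serialize_accumulations([0, 0], writes))
```

Because each read-modify-write finishes before the next begins, no write is ever combined with a value that another write has already replaced.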

The result vector can be a final result vector that represents a set of outputs for a neural network layer. For example, the neural network layer can be a convolutional neural network layer, and the outputs can be a set of activation values generated in response to convolving each kernel (e.g., the parameters/weights of tensor 506) across a particular input volume of the input tensor 504.

System 100 can generate a vector of accumulated values as a result of performing the accumulate operation on the respective vectors of values. In some implementations, the respective vectors of values are partial sums that correspond to dot products. For example, referring again to the convolutional neural network layer, the inputs of the input volume mentioned above are processed by performing dot product operations using: i) each input value along a given dimension (e.g., dimension 510) of the input tensor 504 and ii) a set of parameters for the convolutional layer. In response to convolving at least one kernel of the weight tensor 506 with a portion of the inputs along the given dimension of the input volume, a corresponding set of dot products, or partial sums, can be accumulated at a memory location of the shared memory 104 to generate a set of accumulated values.

System 100 can apply an activation function to each value in the vector of accumulated values. For example, a layer of a neural network may (or may not) have an activation function, such as ReLU, sigmoid, or tanh, that represents a non-linear function providing non-linearity in the neural network. System 100 generates the result vector in response to applying the activation function to each value in the vector of accumulated values. In some implementations, the hardware circuit 101 is a hardware accelerator configured to implement a neural network that includes multiple neural network layers, and system 100 generates an output for a layer of the neural network based on the result vector. For example, processing a layer input at a neural network layer can involve the layer applying an activation function to generate a set of activation values that are the output of the neural network layer. The activations generated by a first neural network layer can be processed through a second or subsequent layer of the neural network.
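Applying an activation function to each accumulated value is an element-wise map, as the following short Python sketch shows. The helper names `relu`, `sigmoid`, and `apply_activation` are chosen for this example; `math.tanh` and `math.exp` are standard-library functions.

```python
import math


def relu(x):
    """Rectified linear unit: negative inputs become zero."""
    return x if x > 0.0 else 0.0


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_activation(accumulated, activation=relu):
    """Return a result vector with `activation` applied to every accumulated value."""
    return [activation(value) for value in accumulated]


if __name__ == "__main__":
    accumulated = [-1.5, 0.0, 2.0]
    print(apply_activation(accumulated))             # ReLU
    print(apply_activation(accumulated, math.tanh))  # tanh
```

A layer without an activation function would simply pass the accumulated values through unchanged.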

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, data processing apparatus.

Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal (e.g., a machine-generated electrical, optical, or electromagnetic signal) that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

The term "computing system" encompasses all kinds of apparatus, devices, and machines for processing data, including, by way of example, a programmable processor, a computer, or multiple processors or computers. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages and declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array), an ASIC (application-specific integrated circuit), or a GPGPU (general purpose graphics processing unit).

Computers suitable for the execution of a computer program include, and can be based on, by way of example, general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit receives instructions and data from a read-only memory or a random access memory or both. Some elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer also includes, or is operatively coupled to receive data from, transfer data to, or both, one or more mass storage devices for storing data (e.g., magnetic disks, magneto-optical disks, or optical disks). However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name just a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, by way of example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device (e.g., an LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback (e.g., visual, auditory, or tactile feedback), and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user, for example, by sending web pages to a web browser on the user's client device in response to requests received from the web browser.

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component (e.g., a data server), a middleware component (e.g., an application server), a front-end component (e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described in this specification), or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN") and a wide area network ("WAN"), e.g., the Internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

Claims

1. A method implemented using a hardware circuit having a shared memory and a plurality of processor cores in communication with the shared memory, the method comprising:
generating a first vector of values based on vector operands operated on by a vector processing unit of a first processor core;
receiving, by the shared memory and using a direct memory access (DMA) data path of the shared memory, the first vector of values from the first processor core;
performing an accumulation operation between the first vector of values and a vector stored in the shared memory, wherein the accumulation operation is performed using an operator unit that is:
i) configured to accumulate respective values of one or more vectors; and
ii) arranged such that, as the first vector of values is routed to the shared memory, the first vector of values is accumulated, external to the first processor core, into the vector stored in the shared memory; and
generating a result vector based on the accumulation operation.

The vector stored in the shared memory is received from a second processor core, and the method further comprises:
performing an accumulate operation into memory to accumulate each value of the first vector of values using memory locations of the shared memory;
and performing an accumulate operation into memory to accumulate each value of a second vector of values using the memory locations of the shared memory.

generating the result vector based on the accumulation operation,
generating the result vector without the first processor core performing a step of pre-accumulating products resulting from calculations performed by the first processor core;
and generating the result vector without the second processor core performing a step of pre-accumulating products resulting from calculations performed by the second processor core.

Generating the result vector comprises:
generating a vector of accumulated values as a result of performing the accumulation operation on the first vector of values;
applying an activation function to each value in the vector of accumulated values;
and generating the result vector as a result of applying the activation function to each value in the vector of accumulated values.

The method of claim 2 or 3, wherein the respective resource of the first processor core is a first matrix computation unit, and the method further comprises:
generating a first vector of accumulated values corresponding to the first vector of values based on a matrix multiplication performed using the first matrix computation unit of the first processor core.

The method of claim 5, wherein the respective resource of the second processor core is a second matrix computation unit, and the method further comprises:
generating a second vector of accumulated values corresponding to the second vector of values based on a matrix multiplication performed using the second matrix computation unit of the second processor core.

The method of any one of claims 1 to 6, wherein the hardware circuit is a hardware accelerator configured to execute a neural network including a plurality of neural network layers, and the method comprises generating an output of a layer of the neural network based on the result vector.

The method of claim 2, 3, 5, or 6, further comprising:
generating the first vector of values based on calculations performed on the first processor core; and
generating a second vector of values based on calculations performed on the second processor core,
wherein the calculations performed by the first processor core and the calculations performed by the second processor core are part of a mathematical operation that is governed by a commutative property.

The method of claim 8, wherein the mathematical operation is:
a floating-point multiplication operation;
a floating-point addition operation;
an integer addition operation; or
a min-max operation.

The method of claim 8, wherein the mathematical operation includes a floating-point addition operation and an integer addition operation.
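
The practical consequence of relying on commutative operations is that the order in which partial vectors reach the accumulator does not change the result. The following Python sketch checks this for integer addition, min, and max over every arrival order; the helper name and the data are hypothetical, and integer values are used so that the comparison is exact.

```python
from itertools import permutations


def reduce_vectors(vectors, op):
    """Fold vectors element-wise into one accumulator using a binary op."""
    acc = list(vectors[0])
    for vec in vectors[1:]:
        acc = [op(a, b) for a, b in zip(acc, vec)]
    return acc


# Hypothetical partial vectors, e.g. one per write into the shared memory.
partials = [[3, 9, 1], [7, 2, 5], [4, 4, 8]]

ops = {"add": lambda a, b: a + b, "min": min, "max": max}

# For each operation, collect the distinct results over all arrival orders.
order_independent = {
    name: len({tuple(reduce_vectors(list(p), op)) for p in permutations(partials)}) == 1
    for name, op in ops.items()
}
sums = reduce_vectors(partials, ops["add"])
print(order_independent, sums)
```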

The method according to any one of claims 2, 3, 5, 6, and 8 to 10, wherein the first processor core and the second processor core are the same processor core.

The method of any one of claims 1 to 11, wherein the shared memory is configured to function as a shared global memory space including memory banks and registers shared between two or more processor cores of the hardware circuit.

A system comprising:
a processing device;
a hardware circuit having a shared memory and a plurality of processor cores in communication with the shared memory;
and a non-transitory machine-readable storage device for storing instructions executable by the processing device to perform operations, the operations including:
generating a first vector of values based on vector operands operated on by a vector processing unit of the first processor core;
the shared memory receiving, using a direct memory access (DMA) data path of the shared memory, the first vector of values from the first processor core;
performing an accumulation operation between the first vector of values and a vector stored in the shared memory;
wherein the accumulation operation is performed using a calculator unit that is:
i) configured to accumulate respective values of one or more vectors; and
ii) external to the first processor core, such that the first vector of values is accumulated into the vector stored in the shared memory as the first vector of values is routed to the shared memory;
and generating a result vector based on the accumulation operation.
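
The operations of this claim can be pictured with a toy Python model in which the accumulation happens at the shared memory rather than inside the core that produced the vector. This is only a sketch under stated assumptions: the `SharedMemory` class, its `dma_write` method, and the `accumulate` flag are hypothetical names, not an interface described in the specification.

```python
class SharedMemory:
    """Toy model of a shared scratch-pad with an accumulate-on-write path."""

    def __init__(self, size):
        self.cells = [0.0] * size

    def dma_write(self, offset, vector, accumulate=False):
        """Model of a DMA write. With accumulate=True, the hypothetical
        calculator unit adds each incoming value to the stored value
        instead of overwriting it."""
        for i, value in enumerate(vector):
            if accumulate:
                self.cells[offset + i] += value
            else:
                self.cells[offset + i] = value

    def read(self, offset, length):
        return self.cells[offset:offset + length]


shared = SharedMemory(size=8)
shared.dma_write(0, [1.0, 2.0, 3.0])                  # vector already stored
first_vector = [10.0, 20.0, 30.0]                     # produced by a core
shared.dma_write(0, first_vector, accumulate=True)    # accumulated outside the core
result_vector = shared.read(0, 3)
print(result_vector)
```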

the vector stored in the shared memory is received from a second processor core, and the operations further comprise:
performing an accumulate-to-memory operation to accumulate each value of the first vector of values using memory locations of the shared memory;
and performing an accumulate-to-memory operation to accumulate each value of a second vector of values using the memory locations of the shared memory.

generating the result vector based on the accumulation operation comprises:
generating the result vector without the first processor core performing a step of pre-accumulating products resulting from calculations performed by the first processor core;
and generating the result vector without the second processor core performing a step of pre-accumulating products resulting from calculations performed by the second processor core.
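
To make the "without pre-accumulating" condition concrete, the sketch below streams every product vector from two cores straight into one shared accumulator and compares the outcome with a reference path in which each core first sums its own products locally. All names and data are hypothetical; the point is only that both paths yield the same result vector, so the local accumulation step can be left out.

```python
def core_products(weights, activations):
    """Products computed by one core; one product vector per weight row."""
    return [[w * a for w, a in zip(row, activations)] for row in weights]


def accumulate_into(memory, vector):
    """Stand-in for the accumulation performed at the shared memory."""
    for i, value in enumerate(vector):
        memory[i] += value


activations = [1, 2, 3]
core0 = core_products([[1, 0, 2], [3, 1, 1]], activations)
core1 = core_products([[2, 2, 0], [0, 1, 4]], activations)

# Streaming path: every product vector goes straight to the shared memory.
shared = [0, 0, 0]
for product_vector in core0 + core1:
    accumulate_into(shared, product_vector)

# Reference path: each core pre-accumulates locally, then one final add.
local0 = [sum(col) for col in zip(*core0)]
local1 = [sum(col) for col in zip(*core1)]
reference = [a + b for a, b in zip(local0, local1)]
print(shared, reference)
```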

The system of claim 14 or 15, wherein a respective resource of the first processor core is a first matrix computation unit, and the operations further comprise:
generating a first vector of accumulated values corresponding to the first vector of values based on a matrix multiplication performed using the first matrix computation unit of the first processor core.

The system of claim 17, wherein a respective resource of the second processor core is a second matrix computation unit, and the operations further comprise:
generating a second vector of accumulated values corresponding to the second vector of values based on a matrix multiplication performed using the second matrix computation unit of the second processor core.

The system of any one of claims 13 to 18, wherein:
the hardware circuit is a hardware accelerator configured to execute a neural network including a plurality of neural network layers; and
the operations include generating an output of a layer of the neural network based on the result vector.

The system of claim 14, 15, 17, or 18, wherein the operations further comprise:
generating a first vector of values based on calculations performed by the first processor core; and
generating a second vector of values based on calculations performed by the second processor core,
wherein the calculations performed by the first processor core and the calculations performed by the second processor core are part of a mathematical operation governed by the commutative property.

The system of claim 20, wherein the mathematical operation is:
a floating-point multiplication operation;
a floating-point addition operation;
an integer addition operation; or
a min-max operation.

The system of claim 20, wherein the mathematical operation includes a floating-point addition operation and an integer addition operation.

The system of any one of claims 14, 15, 17, 18, and 20-22, wherein the first processor core and the second processor core are the same processor core.

The system of any one of claims 13 to 23, wherein the shared memory is configured to function as a shared global memory space including memory banks and registers shared between two or more processor cores of the hardware circuit.

A computer program storing instructions executable by a processor to cause the processor to perform operations, the operations comprising:
generating a first vector of values based on vector operands operated on by a vector processing unit of the first processor core;
a shared memory receiving the first vector of values from the first processor core using a direct memory access (DMA) data path of the shared memory;
performing an accumulation operation between the first vector of values and a vector stored in the shared memory;
wherein the accumulation operation is performed using a calculator unit that is:
i) configured to accumulate respective values of one or more vectors; and
ii) external to the first processor core, such that the first vector of values is accumulated into the vector stored in the shared memory as the first vector of values is routed to the shared memory;
and generating a result vector based on the accumulation operation.