JP6888073B2

JP6888073B2 - Chip equipment and related products

Info

Publication number: JP6888073B2
Application number: JP2019221533A
Authority: JP
Inventors: シャオリリォウ; ティエンスチェン; ビンルイワン; ヤオジャン
Original assignee: Cambricon Technologies Corp Ltd
Current assignee: Cambricon Technologies Corp Ltd
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2021-06-16
Anticipated expiration: 2037-08-31
Also published as: JP2020177640A

Description

The present disclosure relates to the fields of communications and chip technology, and in particular to a chip device and related products.

Artificial neural networks (ANNs) have been a research hotspot in the field of artificial intelligence since the 1980s. An ANN abstracts the neuron network of the human brain from an information-processing perspective, establishes a simple model, and forms different networks according to different connection schemes. In engineering and academia, it is often referred to simply as a neural network or neural-like network. A neural network is a computational model consisting of a large number of interconnected nodes (or neurons). Existing neural network computation is implemented on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit), and such computation consumes a large amount of power and takes a long time.

Embodiments of the present disclosure provide a neural network calculation method and related products that can shorten calculation time and reduce the power consumption of modules.

In a first aspect, embodiments of the present disclosure provide a neural network calculation method applied in a chip device comprising a main unit and a plurality of basic units. The method includes the following steps. The main unit acquires a data block to be calculated and an operation command, and divides the data block to be calculated into a distribution data block and a broadcast data block according to the operation command. The main unit splits the distribution data block to obtain a plurality of basic data blocks, distributes the plurality of basic data blocks to the plurality of basic units, and broadcasts the broadcast data block to the plurality of basic units. Each basic unit performs an inner product operation on its basic data block and the broadcast data block to obtain an operation result, and sends the operation result to the main unit. The main unit processes the operation results to obtain the command result for the data block to be calculated and the operation command.

Optionally, the main unit broadcasting the broadcast data block to the plurality of basic units includes the following.

The main unit broadcasts the broadcast data block to the plurality of basic units in a single broadcast.

Optionally, the basic unit performing an inner product operation on the basic data block and the broadcast data block to obtain an operation result and sending the operation result to the main unit includes the following.

The basic unit performs inner product processing on the basic data block and the broadcast data block to obtain inner product results, accumulates the inner product results to obtain the operation result, and sends the operation result to the main unit.

Optionally, the operation result is a result of inner product processing, and the main unit processing the operation result to obtain the command result for the data block to be calculated and the operation command includes the following.

The main unit accumulates the operation results to obtain accumulation results, and arranges the accumulation results to obtain the command result for the data block to be calculated and the operation command.

Optionally, the main unit broadcasting the broadcast data block to the plurality of basic units includes the following.

The broadcast data block is divided into a plurality of partial broadcast data blocks, and the plurality of partial broadcast data blocks are broadcast to the plurality of basic units over multiple broadcasts.

The basic unit performs inner product processing once on the partial broadcast data block and the basic data block to obtain an inner product result, accumulates the inner product results to obtain a partial operation result, and sends the partial operation result to the main unit.

The basic unit reuses the partial broadcast data block n times, performing inner product operations between the partial broadcast data block and n basic data blocks to obtain n partial processing results, accumulates the n partial processing results respectively to obtain n partial operation results, and sends the n partial operation results to the main unit, where n is an integer greater than or equal to 2.
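The reuse of one partial broadcast data block against n basic data blocks can be sketched as follows. This is a minimal illustration in plain Python and not the disclosed circuit: the function names `inner` and `reuse_partial_broadcast`, the list-based data layout, and the slice width of 2 are all assumptions made for the example. For brevity the accumulation over successive partial broadcasts is shown in one loop; the text above leaves open where that accumulation takes place.

```python
def inner(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def reuse_partial_broadcast(basic_slices, part):
    """Reuse one partial broadcast block n times (hypothetical helper).

    `basic_slices` holds the matching slice of each of the n basic data
    blocks; the same `part` is paired with every slice, giving n partial
    results.
    """
    return [inner(s, part) for s in basic_slices]


basic_blocks = [[1, 2, 3, 4], [5, 6, 7, 8]]  # n = 2 basic data blocks
broadcast_block = [1, 0, 2, 1]
width = 2  # assumed size of one partial broadcast data block

totals = [0] * len(basic_blocks)
for start in range(0, len(broadcast_block), width):
    part = broadcast_block[start:start + width]  # one partial broadcast
    slices = [blk[start:start + width] for blk in basic_blocks]
    partials = reuse_partial_broadcast(slices, part)
    # Accumulate the n partial results, one accumulator per basic data block.
    totals = [t + p for t, p in zip(totals, partials)]

print(totals)  # [11, 27]
```

Each accumulated value equals the full inner product of the corresponding basic data block with the whole broadcast data block.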

In a second aspect, a chip device comprising a main unit and a plurality of basic units is provided. The main unit acquires a data block to be calculated and an operation command, and divides the data block to be calculated into a distribution data block and a broadcast data block according to the operation command. The main unit splits the distribution data block to obtain a plurality of basic data blocks, distributes the plurality of basic data blocks to the plurality of basic units, and broadcasts the broadcast data block to the plurality of basic units. Each basic unit performs an inner product operation on its basic data block and the broadcast data block to obtain an operation result, and sends the operation result to the main unit. The main unit processes the operation results to obtain the command result for the data block to be calculated and the operation command.

Optionally, the chip device further comprises a branch unit. The branch unit is arranged between the main unit and the basic units and is used to forward data.

Optionally, the main unit is specifically configured to broadcast the broadcast data block to the plurality of basic units in a single broadcast.

Optionally, the basic unit is specifically configured to perform inner product processing on the basic data block and the broadcast data block to obtain inner product results, accumulate the inner product results to obtain the operation result, and send the operation result to the main unit.

Optionally, when the operation result is a result of inner product processing, the main unit accumulates the operation results to obtain accumulation results, and arranges the accumulation results to obtain the command result for the data block to be calculated and the operation command.

Optionally, the main unit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks and to broadcast the plurality of partial broadcast data blocks to the plurality of basic units over multiple broadcasts.

Optionally, the basic unit is specifically configured to perform inner product processing on the partial broadcast data block and the basic data block to obtain inner product results, accumulate the inner product results to obtain a partial operation result, and send the partial operation result to the main unit.

Optionally, the basic unit is specifically configured to reuse the partial broadcast data block n times, performing inner product operations between the partial broadcast data block and n basic data blocks to obtain n partial processing results, accumulate the n partial processing results respectively to obtain n partial operation results, and send the n partial operation results to the main unit, where n is an integer greater than or equal to 2.

Optionally, the main unit comprises one or any combination of a main register and a main on-chip cache circuit.

The basic unit comprises one or any combination of a basic register and a basic on-chip cache circuit.

Optionally, the main unit comprises one or any combination of a vector operation circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transpose circuit, a direct memory access circuit, and a data rearrangement circuit.

Optionally, the basic unit comprises one or any combination of an inner product operation circuit, an accumulator circuit, and the like.

Optionally, there are a plurality of branch units; the main unit is connected to each of the plurality of branch units, and each branch unit is connected to at least one basic unit.

Optionally, there are a plurality of branch units; the plurality of branch units are connected in series and then connected to the main unit, and each branch unit is connected to at least one basic unit.

Optionally, the branch unit is specifically used to forward data between the main unit and the basic units.

Optionally, the branch unit is specifically used to forward data between the main unit and the basic units or other branch units.

Optionally, the data is one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.

Optionally, if the operation command is a multiplication command, the multiplier data block is determined to be the broadcast data block and the multiplicand data block is determined to be the distribution data block.

If the operation command is a convolution command, the input data block is determined to be the broadcast data block and the convolution kernel is determined to be the distribution data block.
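The two rules above can be written as a small lookup. The sketch below is illustrative only: the function name `classify_blocks`, the command strings `"multiply"` and `"convolve"`, and the operand ordering are assumptions and do not come from the disclosure.

```python
def classify_blocks(command, operands):
    """Return (distribution_block, broadcast_block) for an operation command.

    Hypothetical helper: `operands` is (multiplicand, multiplier) for a
    multiplication and (input_data, kernels) for a convolution.
    """
    if command == "multiply":
        multiplicand, multiplier = operands
        # Multiplicand is split and distributed; multiplier is broadcast.
        return multiplicand, multiplier
    if command == "convolve":
        input_data, kernels = operands
        # Convolution kernels are distributed; input data is broadcast.
        return kernels, input_data
    raise ValueError(f"unsupported operation command: {command}")


print(classify_blocks("multiply", ("A", "B")))
print(classify_blocks("convolve", ("input", "kernels")))
```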

In a third aspect, a method of applying the chip device provided by the second aspect is provided. The chip device is used to perform one or any combination of matrix-matrix multiplication, matrix-vector multiplication, convolution operations, and fully connected operations.

In a fourth aspect, a chip integrating the chip device provided by the second aspect is provided.

In a fifth aspect, a smart device comprising the chip provided by the fourth aspect is provided.

Implementing the embodiments of the present disclosure yields the following beneficial effects.

According to the embodiments of the present disclosure, after the data and the operation command are received, the data is divided into distribution data and broadcast data, and the distribution data is split into basic data blocks that are distributed to a plurality of basic units for inner product operations. Because the inner product operations, which account for the largest share of the computation, are spread across a plurality of basic units and executed simultaneously, calculation time is shortened and power consumption is reduced.

To illustrate the technical solutions in the embodiments of the present disclosure more clearly, the drawings used in the description of the embodiments are briefly described below. The drawings described below show some embodiments of the present disclosure, and those skilled in the art can derive other drawings from them without creative effort.

FIG. 1a is a schematic structural diagram of a chip device provided by the present disclosure.
FIG. 1b is a schematic structural diagram of another chip device provided by the present disclosure.
FIG. 1c is a schematic diagram of data distribution in the chip device provided by the present disclosure.
FIG. 1d is a schematic diagram of data return in the chip device.
FIG. 2 is a schematic flowchart of a neural network calculation method provided by an embodiment of the present disclosure.
FIG. 2a is a schematic diagram of multiplying matrix A by matrix B, provided by an embodiment of the present disclosure.
FIG. 3 is a schematic flowchart of a neural network calculation method provided by an embodiment of the present disclosure.
FIG. 3a is a schematic diagram of the single-sample data of fully connected 1.
FIG. 3b is a schematic diagram of the multi-sample data of fully connected 2.
FIG. 3c is a schematic diagram of the M convolution kernel data of convolution 1.
FIG. 3d is a schematic diagram of the input data of convolution 2.
FIG. 3e is a schematic diagram of an operation window of the three-dimensional data block of the input data.
FIG. 3f is a schematic diagram of another operation window of the three-dimensional data block of the input data.
FIG. 3g is a schematic diagram of yet another operation window of the three-dimensional data block of the input data.

The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present disclosure without creative effort fall within the scope of the present disclosure.

The terms "first", "second", "third", "fourth", and the like in the specification, claims, and drawings of the present disclosure are used to distinguish different objects and do not describe a particular order. Furthermore, the terms "include" and "comprise" and their variants are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.

An "embodiment" herein means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present disclosure. Appearances of this term in various places in the specification do not necessarily all refer to the same embodiment, nor do they refer to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art will understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.

The calculation method of a neural network is described below using a CPU as an example. Matrix-matrix multiplication is widely used in neural networks; here, the multiplication of matrix A by matrix B is used as an example to describe how a CPU computes the product. Let C denote the product of matrix A and matrix B, that is, C = A * B.

Accordingly, a CPU or GPU must compute row by row: after finishing the first row it computes the second row, then the third row, and so on until all rows have been computed. In a neural network the data may run to thousands of rows, so the calculation time is very long. During the calculation the CPU remains in an operating state for a long time, and energy consumption is correspondingly high.
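For reference, the row-by-row computation described above corresponds to the textbook triple loop, in which each entry of C is the inner product of a row of A with a column of B. The sketch below is a generic illustration in plain Python, with the hypothetical function name `matmul_row_by_row`; it is not code from the disclosure.

```python
def matmul_row_by_row(A, B):
    """Compute C = A * B one row at a time, as a sequential processor would."""
    rows, inner_dim, cols = len(A), len(B), len(B[0])
    C = []
    for i in range(rows):  # row i is finished before row i + 1 starts
        row = []
        for j in range(cols):
            # Entry (i, j) is the inner product of row i of A and column j of B.
            row.append(sum(A[i][k] * B[k][j] for k in range(inner_dim)))
        C.append(row)
    return C


print(matmul_row_by_row([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Because every row is processed in turn, the running time grows with the number of rows, which is the cost the chip device aims to reduce.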

See FIG. 1b, which is a schematic structural diagram of a chip device. As shown in FIG. 1b, the chip device comprises a main unit circuit, basic unit circuits, and branch unit circuits. The main unit circuit comprises a register and/or an on-chip cache circuit. The main unit may further comprise one or any combination of a vector operation circuit, an ALU (Arithmetic and Logic Unit) circuit, an accumulator circuit, a matrix transpose circuit, a DMA (Direct Memory Access) circuit, a data rearrangement circuit, and the like. Each basic unit may comprise a basic register and/or a basic on-chip cache circuit, and may further comprise one or any combination of an inner product operation circuit, a vector operation circuit, an accumulator circuit, and the like. All of these circuits may be integrated circuits.

When branch units are present, the main unit is connected to the branch units and the branch units are connected to the basic units. The basic units perform inner product operations between data blocks; the main unit sends and receives external data and distributes the external data to the branch units. The branch units send and receive data of the main unit or the basic units. Because the number of units that can be connected to the main unit is limited, branch units are added between the main unit and the basic units to give access to more basic units, which makes calculations on complex data blocks possible. The structure shown in FIG. 1b is therefore suited to calculations on complex data.

The connection structure between the branch units and the basic units may be arbitrary and is not limited to the H-shaped structure of FIG. 1b. Optionally, the path from the main unit to the basic units is a broadcast or distribution structure, and the path from the basic units to the main unit is a gather structure. Broadcast, distribution, and gather are defined as follows.

The manner of data transmission from the main unit to the basic units may include the following.

The main unit is connected to each of a plurality of branch units, and each branch unit is connected to each of a plurality of basic units.

The main unit is connected to one branch unit, that branch unit is connected to another branch unit, and so on, so that a plurality of branch units are connected in series, and each branch unit is connected to a plurality of basic units.

The main unit is connected to each of a plurality of branch units, and each branch unit is connected to a plurality of basic units in series.

The main unit is connected to one branch unit, that branch unit is connected to another branch unit, and so on, so that a plurality of branch units are connected in series, and each branch unit is connected to a plurality of basic units in series.

When data is distributed, the main unit sends data to some or all of the basic units, and the data received by each receiving basic unit may differ.

When data is broadcast, the main unit sends data to some or all of the basic units, and every receiving basic unit receives the same data.

When data is gathered, some or all of the basic units send data to the main unit. Note that the chip device shown in FIG. 1a or FIG. 1b may be a standalone physical chip; in practical applications, it may also be integrated into another chip (for example, a CPU or GPU). Embodiments of the present invention do not limit the physical form of the chip device.
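The three transfer patterns defined above can be modeled in a few lines. The sketch below is a software analogy only: the function names are hypothetical, units are modeled as list positions, and the round-robin assignment in `distribute` is an assumption, since the text does not specify how blocks are assigned to basic units.

```python
def distribute(blocks, num_units):
    """Distribution: each unit may receive different data (round-robin assumed)."""
    inbox = [[] for _ in range(num_units)]
    for idx, block in enumerate(blocks):
        inbox[idx % num_units].append(block)
    return inbox


def broadcast(block, num_units):
    """Broadcast: every receiving unit gets the same data."""
    return [block for _ in range(num_units)]


def gather(unit_outputs):
    """Gather: the units send their data back to the main unit."""
    return [result for outputs in unit_outputs for result in outputs]


print(distribute(["a", "b", "c"], 2))  # [['a', 'c'], ['b']]
print(broadcast("x", 3))               # ['x', 'x', 'x']
print(gather([[1], [2, 3]]))           # [1, 2, 3]
```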

See FIG. 1c, which is a schematic diagram of data distribution in the chip device. The arrows in FIG. 1c indicate the direction of data distribution. As shown in FIG. 1c, after the main unit receives external data, it splits the external data and distributes it to a plurality of branch units, and the branch units send the split data on to the basic units.

See FIG. 1d, which is a schematic diagram of data return in the chip device. The arrows in FIG. 1d indicate the direction of data return. As shown in FIG. 1d, the basic units return data (for example, inner product results) to the branch units, and the branch units return the data to the main unit.

See FIG. 1a, which is a schematic structural diagram of another chip device. The chip device comprises a main unit and basic units, with the main unit connected to the basic units. In the structure shown in FIG. 1a, the basic units are physically connected directly to the main unit, so the number of basic units that can be connected is limited, and the structure is suited to simple data calculations.

See FIG. 2, which illustrates a neural network calculation method that uses the chip device described above. The method is performed using a chip device as shown in FIG. 1a or FIG. 1b. As shown in FIG. 2, the method includes the following steps.

In step S201, the main unit of the chip device acquires a data block to be calculated and an operation command.

The data block to be calculated in step S201 may be a matrix, a vector, three-dimensional data, four-dimensional data, multidimensional data, or the like; the present disclosure does not limit the specific form of the data block. The operation command may be a multiplication command, a convolution command, an addition command, a subtraction command, a BLAS (Basic Linear Algebra Subprograms) function, an activation function, or the like.

In step S202, the main unit divides the data block to be calculated into a distribution data block and a broadcast data block according to the operation command.

Step S202 may be implemented as follows.

If the operation command is a multiplication command, the multiplier data block is determined to be the broadcast data block and the multiplicand data block is determined to be the distribution data block.

In step S2031, the main unit splits the distribution data block to obtain a plurality of basic data blocks and distributes the plurality of basic data blocks to the plurality of basic units.

In step S2032, the main unit broadcasts the broadcast data block to the plurality of basic units.

Optionally, steps S2031 and S2032 may be executed repeatedly. When the amount of data is relatively large, the main unit splits the distribution data block to obtain a plurality of basic data blocks, splits each basic data block into m basic data sub-blocks, and likewise splits the broadcast data block into m broadcast data sub-blocks. In each round the main unit distributes one basic data sub-block and broadcasts one broadcast data sub-block. The basic data sub-block and the broadcast data sub-block are data blocks on which parallel neural network computation can be performed. For example, when a 1000 * 1000 matrix A is multiplied by a 1000 * 1000 matrix B, the basic data block may be row z of matrix A, the basic data sub-block may be the first 20 columns of row z of matrix A, and the broadcast data sub-block may be the first 20 rows of column z of matrix B.
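The sub-block scheme works because an inner product over 1000 elements is the sum of the inner products over its 20-element pieces, which here gives m = 50 rounds. The sketch below checks this on made-up data in plain Python; the function names `inner` and `subblock_inner` and the sample vectors are assumptions for illustration.

```python
def inner(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def subblock_inner(row, col, width):
    """Accumulate the inner product sub-block by sub-block (hypothetical helper)."""
    total = 0
    for start in range(0, len(row), width):
        # One round: a basic data sub-block meets a broadcast data sub-block.
        total += inner(row[start:start + width], col[start:start + width])
    return total


row_z = list(range(1000))  # stand-in for one row of matrix A
col_z = [1] * 1000         # stand-in for one column of matrix B

print(subblock_inner(row_z, col_z, 20) == inner(row_z, col_z))  # True
```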

The basic data block in step S203 may be the smallest data block on which an inner product operation can be performed. For matrix multiplication, for example, the basic data block may be one row of a matrix. For a convolution operation, the basic data block may be the weights of one convolution kernel.

For the manner of distribution in step S203, refer to the description of the embodiments below; it is not repeated here. Likewise, for the manner of broadcasting the broadcast data block, refer to the description of the embodiments below.

ステップＳ２０４１で、チップ装置の基本ユニットは、基本データブロックとブロードキャストデータブロックに対して内積演算を行い、演算結果（中間結果かもしれない）を得る。 In step S2041, the basic unit of the chip device performs an inner product calculation on the basic data block and the broadcast data block, and obtains a calculation result (which may be an intermediate result).

ステップＳ２０４２で、演算結果が中間結果でなければ、演算結果をメインユニットに返す。 In step S2042, if the calculation result is not an intermediate result, the calculation result is returned to the main unit.

上記ステップＳ２０４における返し方式については、以下の実施形態の説明を参照してもよく、ここで贅言しない。 Regarding the return method in step S204, the following description of the embodiment may be referred to, and no verbosity is given here.

ステップＳ２０５で、メインユニットは、演算結果を処理して計算予定データブロックと演算コマンドとのコマンド結果とを得る。 In step S205, the main unit processes the calculation result and obtains the calculation schedule data block and the command result of the calculation command.

上記ステップＳ２０５の処理方式は、累算、配列等でもよく、本開示は上記処理の具体的な方式に限定されなく、その具体的な方式は例えば非線形変換等を含み、異なる演算コマンドに応じて構成されてもよい。 The processing method of step S205 may be accumulation, array, or the like, and the present disclosure is not limited to a specific method of the above processing, and the specific method includes, for example, a non-linear transformation, and responds to different arithmetic commands. It may be configured.

In the technical solution provided by the present disclosure, when an operation is performed, the main unit receives external data containing the data block to be computed and the operation command, and extracts both. According to the operation command, it determines the distribution data block and the broadcast data block within the data block to be computed, divides the distribution data block into multiple basic data blocks, broadcasts the broadcast data block to the basic units, and distributes the basic data blocks to the basic units. Each basic unit performs an inner product operation on its basic data blocks and the broadcast data block to obtain an operation result and returns that result to the main unit, which derives the command result of the operation command from the returned results.

The key technical point is as follows. In a neural network, a large share of the computation consists of inner product operations between data blocks; these operations are expensive and take a long time. The embodiments of the present disclosure therefore use the operation command and the data to be computed to first distinguish the distribution data block from the broadcast data block. The broadcast data block is the data block that every inner product operation must use, and the distribution data block is the data block that can be divided for the inner product operation.

Take matrix multiplication as an example. The data blocks to be computed are matrix A and matrix B, and the operation command is the matrix multiplication command (A*B). In matrix multiplication the multiplicand matrix A can be divided into multiple basic data blocks and the multiplier matrix B can be broadcast, so by the rules of matrix multiplication matrix A is determined to be the distribution data block and matrix B the broadcast data block. By the definition of matrix multiplication, each row of the multiplicand matrix A must form inner products with the multiplier matrix B, so the technical solution of the present application divides matrix A into M basic data blocks, each of which may be one row of matrix A. The time-consuming inner product operations of a matrix multiplication are thus carried out by multiple basic units, which compute their results in parallel and shorten the calculation time. A shorter calculation time also shortens the operating time of the chip device and thereby reduces power consumption.
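The division of labour described here (rows of A distributed, B broadcast, results gathered by the main unit) can be modelled in software. The sketch below is only an illustration under stated assumptions: the function name `main_unit_matmul` is hypothetical, rows are dealt out round-robin (the text does not prescribe an order), and the basic units, which run in parallel in hardware, are simulated by an ordinary loop.

```python
def main_unit_matmul(a, b, num_units):
    """Model A*B with rows of A as basic data blocks and B as the broadcast block."""
    # Distribution (unicast): deal the rows of A out to the units round-robin.
    assignments = [[] for _ in range(num_units)]
    for row_index, row in enumerate(a):
        assignments[row_index % num_units].append((row_index, row))

    # Broadcast: every unit sees the same B; columns are extracted once.
    b_columns = [list(column) for column in zip(*b)]

    # Each loop iteration stands in for one basic unit working independently.
    result = [None] * len(a)
    for unit_blocks in assignments:
        for row_index, row in unit_blocks:
            result[row_index] = [
                sum(x * y for x, y in zip(row, column)) for column in b_columns
            ]

    # The main unit has gathered the rows back in their original order.
    return result


product = main_unit_matmul([[1, 2], [3, 4], [5, 6]], [[7, 8], [9, 10]], num_units=2)
```

Because each row's inner products depend only on that row and on B, no unit needs data held by another unit, which is what allows the work to proceed in parallel.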

The effect of the technical solution provided by the present disclosure is illustrated below with a practical example. FIG. 2a is a schematic diagram of multiplying matrix A by vector B. As shown in FIG. 2a, matrix A has M rows and L columns, and vector B has L rows. Let t1 be the time an arithmetic unit needs to compute the inner product of one row of matrix A with vector B. On a CPU or GPU, the next row can be computed only after the current row is finished, so the total time is T0 = M*t1. In the technical solution provided by the embodiments of the present disclosure, assume there are M basic units. Matrix A is divided into M basic data blocks, each being one row of matrix A, and the M basic units perform their inner product operations simultaneously, so the inner product time is t1 and the total time is T1 = t1 + t2 + t3, where t2 is the time the main unit needs to divide the data and t3 is the time needed to process the inner product results into the command result. Because data division and result processing involve very little computation, t2 and t3 are very short, and therefore T0 >> T1. Adopting the technical solution of the embodiments of the present disclosure can thus greatly shorten the calculation time. As for the power consumed in processing the data, T0 >> T1 means the chip device provided by the present disclosure works for a much shorter time, and experiments have shown that a chip device consumes far less energy over a very short working time than over a long one, so the solution also saves energy.
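The timing comparison can be written compactly as follows. The symbols are those of the text; the ratio and its approximation are a worked consequence of the two definitions, not a formula stated in the original.

```latex
\begin{align*}
T_0 &= M\,t_1, \\
T_1 &= t_1 + t_2 + t_3, \\
\frac{T_0}{T_1} &= \frac{M\,t_1}{t_1 + t_2 + t_3} \approx M
  \quad \text{when } t_2 + t_3 \ll t_1 .
\end{align*}
```

For instance, with M = 32 and a purely hypothetical overhead of t2 + t3 = 0.1*t1, the ratio is 32/1.1, or roughly 29.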

In step S203, the main unit can broadcast the broadcast data block to the basic units in several ways, specifically as follows.

Method A: the broadcast data block is broadcast to the basic units in a single broadcast. (Broadcast here means one-to-many data transmission, that is, the main unit sends the same data block to multiple (all or some) basic units at the same time.) For example, in matrix A * matrix B, matrix B is the broadcast data block and is broadcast to the basic units in one pass. As another example, in convolution the input data is the broadcast data block, and the input data block is broadcast to the basic units in one pass. The advantage of this method is that it reduces the amount of data transmitted between the main unit and the basic units, since all the broadcast data reaches the basic units in a single broadcast.

Method B: the broadcast data block is divided into multiple partial broadcast data blocks, which are broadcast to the basic units over several broadcasts. For example, matrix B is broadcast to the basic units over several broadcasts, each carrying N columns of matrix B. The advantage of this method is that the basic units can be kept small. The registers in a basic unit have limited storage capacity, and if a matrix B with a large amount of data were delivered to the basic units all at once, each would need a relatively large register capacity to hold it. Because there are many basic units, enlarging the registers would inevitably raise the cost considerably, so the broadcast data block is instead broadcast in several passes. Each basic unit then needs to store only the part of the broadcast data block carried by the current broadcast, which reduces cost.
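The storage saving of Method B can be seen in a small sketch. The generator name `broadcast_in_batches` and the 3*6 sample matrix are hypothetical, chosen only to make the batch sizes easy to check.

```python
def broadcast_in_batches(b, columns_per_broadcast):
    """Yield (first_column_index, partial_block) pairs, one pair per broadcast."""
    num_columns = len(b[0])
    for start in range(0, num_columns, columns_per_broadcast):
        yield start, [row[start:start + columns_per_broadcast] for row in b]


# Hypothetical 3x6 broadcast data block, sent two columns at a time.
b = [[r * 10 + c for c in range(6)] for r in range(3)]
batches = list(broadcast_in_batches(b, 2))

# A basic unit only has to hold one partial block at a time.
peak_values_stored = max(len(block) * len(block[0]) for _, block in batches)
total_values = len(b) * len(b[0])
```

Here a unit holds at most 6 values at once instead of all 18, at the price of three broadcasts instead of one.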

Note that in step S203 the basic data blocks may also be distributed to the basic units using Method A or Method B above. The only differences are that the transmission is unicast and that the data transmitted are basic data blocks.

Step S204 may specifically be implemented as follows.

When the broadcast data block is broadcast by Method A and the basic data blocks are distributed by Method A (as shown in FIG. 3a), the basic unit performs inner product processing on a basic data block and the broadcast data block, that is, one row's inner product operation at a time, and sends the inner product result (one kind of operation result) to the main unit, which accumulates the inner product results. In practice, the basic unit may instead accumulate the inner product results itself and send the accumulated result (another kind of operation result) to the main unit. This reduces the amount of data transmitted between the main unit and the basic unit and increases calculation speed.

When the broadcast data block is broadcast by Method B, in one optional solution, each time the basic unit receives a partial broadcast data block it performs a partial inner product operation on the basic data block and the partial broadcast data block and sends the processing result to the main unit, which accumulates the processing results. In another optional solution, when the basic unit has received n basic data blocks, it reuses the broadcast data block, performs the inner product operation of that broadcast data block with each of the n basic data blocks to obtain n partial processing results, and sends the n processing results to the main unit, which accumulates each of them. The accumulation may of course also be performed in the basic unit.

In the cases above, the broadcast data block is generally very large, and so is the distribution data block. Because the chip device is a hardware configuration, the number of basic units could in theory be unlimited, but in practice it is limited, typically to a few dozen, although the number may keep changing (for example, increasing) as technology develops. In neural network matrix-by-matrix multiplication, however, matrix A may have thousands of rows and matrix B thousands of columns, so sending matrix B to the basic units in a single broadcast is not feasible. One implementation is therefore to broadcast part of matrix B each time, for example five columns at a time, starting with the first five. The same approach may be used for matrix A. Each basic unit performs a partial inner product calculation each time and stores the result in a register; after all inner product operations for a row have been executed, it accumulates the results for that row into one operation result and sends it to the main unit. This approach has the advantage of increasing calculation speed.
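The register-based accumulation inside a basic unit might look like the following. The `BasicUnit` class, its `receive` and `flush` methods, and the sample data are all hypothetical; the sketch keeps a running sum rather than storing each partial result separately, which yields the same final value.

```python
class BasicUnit:
    """Toy model of one basic unit with a single accumulation register."""

    def __init__(self):
        self.register = 0  # running sum of the partial inner products

    def receive(self, row_part, column_part):
        """Compute one partial inner product and add it to the register."""
        self.register += sum(x * y for x, y in zip(row_part, column_part))

    def flush(self):
        """Return the finished result to the main unit and clear the register."""
        result, self.register = self.register, 0
        return result


unit = BasicUnit()
row = [1, 2, 3, 4, 5, 6]
column = [6, 5, 4, 3, 2, 1]
for start in range(0, len(row), 2):  # the data arrives two elements at a time
    unit.receive(row[start:start + 2], column[start:start + 2])
row_result = unit.flush()
```

Only one value per row travels back to the main unit, instead of one value per partial inner product.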

Refer to FIG. 3, which provides a neural network calculation method. The calculation in this embodiment is described using matrix A * matrix B, which may be as in the matrix schematic diagram shown in FIG. 3a. For convenience of description, the method shown in FIG. 3 is executed in the chip device shown in FIG. 1b, which has 16 basic units. For ease of description and allocation, M is set to 32, N to 15, and L to 20 here. It should be understood that the computing device may have any number of basic units. As shown in FIG. 3, the method includes the following steps.

In step S301, the main unit receives matrix A, matrix B, and the multiplication command A*B.

In step S302, according to the multiplication command A*B, the main unit determines that matrix B is the broadcast data block and matrix A is the distribution data block, and divides matrix A into 32 basic data blocks, each being one row of matrix A.

In step S303, the main unit allocates the 32 basic data blocks evenly among the 16 basic units, so that each basic unit receives two basic data blocks. The two blocks given to a unit may be chosen in any order, provided no block is assigned twice.

Step S303 may also use other allocation methods. For example, when the data blocks cannot be divided evenly among the basic units, they may be allocated unevenly, or the data blocks that cannot be allocated evenly may be split further and then allocated evenly. The embodiments of the present disclosure do not limit how the basic data blocks are allocated to the basic units.
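One simple allocation rule that covers both the even and the uneven case is sketched here. It is only one of many possible choices, since the text deliberately leaves the allocation open; the function name `allocate_blocks` and the rule of giving the surplus blocks to the first units are assumptions.

```python
def allocate_blocks(num_blocks, num_units):
    """Assign block indices to units; the first `extra` units take one more block."""
    base, extra = divmod(num_blocks, num_units)
    allocation = []
    next_block = 0
    for unit in range(num_units):
        count = base + (1 if unit < extra else 0)
        allocation.append(list(range(next_block, next_block + count)))
        next_block += count
    return allocation


even = allocate_blocks(32, 16)    # the case of step S303: two blocks per unit
uneven = allocate_blocks(35, 16)  # three units receive three blocks, the rest two
```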

In step S304, the main unit extracts the partial data of the first few columns of matrix B (for example, the first five columns) and broadcasts them to the 16 basic units.

In step S305, each of the 16 basic units reuses the partial data of the first five columns, performing inner product and accumulation operations with its two basic data blocks. Together the units obtain 32*5 first-part results and send them to the main unit.

In step S306, the main unit extracts the partial data of the middle five columns of matrix B and broadcasts them to the 16 basic units.

In step S307, each of the 16 basic units reuses the partial data of the middle five columns, performing inner product and accumulation operations with its two basic data blocks. Together the units obtain 32*5 middle-part results and send them to the main unit.

In step S308, the main unit extracts the partial data of the last five columns of matrix B and broadcasts them to the 16 basic units.

In step S309, each of the 16 basic units reuses the partial data of the last five columns, performing inner product and accumulation operations with its two basic data blocks. Together the units obtain 32*5 last-part results and send them to the main unit.

In step S310, the main unit combines the 32*5 first-part results, the 32*5 middle-part results, and the 32*5 last-part results in first, middle, last order to obtain the 32*15 matrix C, which is the command result of matrix A * matrix B.
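Steps S301 to S310 can be traced end to end in a short simulation. This is a sketch under assumptions: the function name `run_fig3_flow` and the test matrices are hypothetical, the rows are assumed to divide evenly among the units as in step S303, and the 16 basic units are simulated sequentially although the hardware runs them in parallel.

```python
def run_fig3_flow(a, b, num_units=16, cols_per_broadcast=5):
    """Simulate the flow of FIG. 3: rows of A distributed, B broadcast in column batches."""
    m, n, inner = len(a), len(b[0]), len(b)
    rows_per_unit = m // num_units  # S303: assumes an even split (32 / 16 = 2)
    units = [
        list(range(u * rows_per_unit, (u + 1) * rows_per_unit))
        for u in range(num_units)
    ]
    c = [[0] * n for _ in range(m)]
    # S304, S306, S308: one broadcast per batch of columns of B.
    for start in range(0, n, cols_per_broadcast):
        stop = min(start + cols_per_broadcast, n)
        columns = [[b[k][j] for k in range(inner)] for j in range(start, stop)]
        # S305, S307, S309: every unit reuses the batch for each of its rows.
        for unit_rows in units:
            for i in unit_rows:
                for offset, column in enumerate(columns):
                    # S310: the main unit places each result at its position in C.
                    c[i][start + offset] = sum(x * y for x, y in zip(a[i], column))
    return c


# M = 32, L = 20, N = 15 as in the embodiment; the entries are arbitrary.
a = [[i + k for k in range(20)] for i in range(32)]
b = [[k - j for j in range(15)] for k in range(20)]
c = run_fig3_flow(a, b)
```

With these sizes the outer loop runs three times, matching the first, middle, and last batches of five columns.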

In the technical solution shown in FIG. 3, matrix A is divided into 32 basic data blocks and matrix B is then broadcast in batches, so the basic units produce the command result in batches. Because the inner product work is split across 16 basic units, the calculation time is greatly shortened, giving the advantages of short calculation time and low energy consumption.

Refer to FIG. 1a, which shows a chip device provided by the present disclosure. The chip device includes a main unit and basic units. The main unit is a hardware chip device, and each basic unit is also a hardware chip device.

The main unit is used to perform the consecutive operations in a neural network computation and to transfer data to and from the basic units.

The basic units are used to perform the parallel-accelerated operations of the neural network on the data transmitted by the main unit and to send the operation results to the main unit.

The parallel-accelerated operations include, but are not limited to, large-scale parallelizable operations such as multiplication between data blocks and convolution.

The consecutive operations include, but are not limited to, accumulation, matrix transposition, and data sorting.

With a main unit and multiple basic units, the main unit obtains the data block to be computed and the operation command and, according to the operation command, divides the data block to be computed into a distribution data block and a broadcast data block. It divides the distribution data block into multiple basic data blocks, distributes the basic data blocks to the basic units, and broadcasts the broadcast data block to the basic units. Each basic unit performs an inner product operation on the basic data block and the broadcast data block to obtain an operation result and sends the result to the main unit. The main unit processes the operation results to obtain the command result for the data block to be computed and the operation command.

Optionally, the chip device further includes a branch unit, which is placed between the main unit and the basic units and is used to forward data.

Optionally, the main unit is specifically used to broadcast the broadcast data block to the basic units in a single broadcast.

Optionally, the basic unit is specifically used to perform inner product processing on the basic data block and the broadcast data block to obtain inner product results, accumulate those results into an operation result, and send the operation result to the main unit.

Optionally, when the operation results are inner product results, the main unit is used to accumulate them into accumulated results and arrange the accumulated results to obtain the command result for the data block to be computed and the operation command.

Optionally, the main unit is specifically used to divide the broadcast data block into multiple partial broadcast data blocks and broadcast them to the basic units over several broadcasts.

Optionally, the basic unit is specifically used to perform inner product processing once on a partial broadcast data block and the basic data block to obtain an inner product result, accumulate the inner product results into a partial operation result, and send the partial operation result to the main unit.

Optionally, the basic unit is specifically used to reuse a partial broadcast data block n times, performing the inner product operation of that partial broadcast data block with each of n basic data blocks to obtain n partial processing results, accumulate each of the n partial processing results to obtain n partial operation results, and send the n partial operation results to the main unit. Here n is an integer greater than or equal to 2.

The embodiments of the present disclosure further provide a method of using the chip device shown in FIG. 1a. The method may specifically be used to perform one or any combination of matrix-by-matrix multiplication, matrix-by-vector multiplication, convolution, and fully connected operations.

Specifically, the main unit may also perform neural network operation steps such as pooling and normalization operations, for example batch normalization and LRN.

The embodiments of the present application also provide a chip that includes the chip device shown in FIG. 1a or FIG. 1b.

The embodiments of the present application further provide a smart device that includes the above chip, in which the chip device shown in FIG. 1a or FIG. 1b is integrated. Smart devices include, but are not limited to, smartphones, tablet computers, personal digital assistants, smart watches, smart cameras, smart TVs, and smart refrigerators. These devices are listed for illustration only, and the embodiments of the present application are not limited to any particular form of device.

For the matrix-by-matrix multiplication above, refer to the description of the embodiment shown in FIG. 3; it is not repeated here.

A fully connected operation is performed using the chip device.

When the input data of the fully connected layer is a vector of length L (for example vector B in "Fully connected 1 - single sample" shown in FIG. 3a), that is, when the input of the neural network is a single sample, the output of the fully connected layer is a vector of length M, and the weights of the fully connected layer form an M*L matrix (for example matrix A in "FIG. 3b Fully connected 1 - single sample"), the weight matrix of the fully connected layer is taken as matrix A (the distribution data block), the input data is taken as vector B (the broadcast data block), and the operation is performed according to the method shown in FIG. 2. The specific calculation method may be as follows.

When the input data of the fully connected layer is a matrix (that is, the neural network processes a batch of samples: the input represents N samples, each a vector of length L, so the input data is an L*N matrix, for example matrix B in "FIG. 3b Fully connected 1 - multiple samples"), the output of the fully connected layer for each sample is a vector of length M, so the output data is an M*N matrix, such as the result matrix in "FIG. 3a Fully connected 1 - multiple samples", and the weights of the fully connected layer form an M*L matrix (for example matrix A in "FIG. 3a Fully connected 1 - multiple samples"). Either the weight matrix is taken as matrix A (the distribution data block) and the input data matrix as matrix B (the broadcast data block), or the weight matrix is taken as matrix B (the broadcast data block) and the input data as matrix A (the distribution data block), and the operation is performed according to the method shown in FIG. 2 described above.
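Both fully connected cases reduce to the same product of an M*L weight matrix with an L*N input matrix, with N = 1 for a single sample. The sketch below illustrates only the shapes involved; the function name `fully_connected` and the numbers are hypothetical, and bias terms and activation functions, which the passage does not mention, are left out.

```python
def fully_connected(weights, inputs):
    """Multiply an M x L weight matrix by an L x N input matrix (samples as columns)."""
    input_columns = list(zip(*inputs))  # one tuple of length L per sample
    return [
        [sum(w * x for w, x in zip(weight_row, column)) for column in input_columns]
        for weight_row in weights
    ]


weights = [[1, 0, 2], [0, 1, 1]]  # M = 2, L = 3

# Single sample: the input is an L x 1 column, the output an M x 1 column.
single = fully_connected(weights, [[1], [2], [3]])

# Batch of N = 2 samples: the input is L x N, the output M x N.
batch = fully_connected(weights, [[1, 4], [2, 5], [3, 6]])
```

The first column of the batch result equals the single-sample result, since each sample is processed independently of the others.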

When the chip device is used to perform an artificial neural network operation, the input data of the convolution layer, the pooling layer, and the normalization layer (for example BN (batch normalization) or LRN (local response normalization)) are as shown in "FIG. 3d Convolution 2 - input data". (For clarity, the three-dimensional data block representing each sample is drawn with C = 5, H = 10, and W = 12 as an example; in actual use the sizes of N, C, H, and W are not limited to the values shown in FIG. 3d.) Each three-dimensional data block in FIG. 3d represents the input data of this layer for one sample. The three dimensions of each block are C, H, and W, and there are N such blocks in total.

上記のニューラルネットワーク層の計算を行う際、メインユニットは入力データを受信した後、各入力データのサンプルに対して、メインユニットのデータ並べ替え回路を用いて、入力データをある順序で並べる。当該順序は任意の順序であってもよい。 When performing the calculation of the neural network layer, after receiving the input data, the main unit arranges the input data in a certain order for each input data sample by using the data sorting circuit of the main unit. The order may be any order.

Optionally, the order is one in which the C-dimension coordinate shown in the schematic diagram above changes fastest, such as NHWC or NWHC. Here C is the innermost dimension of the data block, N is the outermost dimension, and H and W are the intermediate dimensions. Because the data along C is then adjacent, it becomes easier to increase the parallelism of the operation and to process multiple feature maps in parallel.
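As an illustration of what "C changes fastest" means, the sketch below reorders a flat buffer into NHWC. The source says the incoming order may be arbitrary; NCHW is assumed here purely for the example, and the function name `nchw_to_nhwc` is hypothetical.

```python
def nchw_to_nhwc(data, N, C, H, W):
    """Reorder a flat list stored in NCHW order into NHWC order.

    After the reorder, the C values belonging to one (n, h, w) position
    sit next to each other, i.e. C is the innermost dimension.
    """
    out = [0] * (N * C * H * W)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    src = ((n * C + c) * H + h) * W + w   # NCHW offset
                    dst = ((n * H + h) * W + w) * C + c   # NHWC offset
                    out[dst] = data[src]
    return out


N, C, H, W = 1, 2, 2, 2
nchw = list(range(N * C * H * W))      # channel 0 holds 0..3, channel 1 holds 4..7
nhwc = nchw_to_nhwc(nchw, N, C, H, W)
```

In the result, the two channel values of each spatial position are adjacent (0 with 4, 1 with 5, and so on).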

The following explains how C, H, and W are to be understood for different neural network operations. For convolution and pooling, H and W are the dimensions along which the operation window slides (the window sliding along the W dimension is illustrated in "Convolution 3 - Slide A" of Fig. 3e and "Convolution 3 - Slide B" of Fig. 3f, and the window sliding along the H dimension is illustrated in Fig. 3g). The size of the operation window is the same as the size of one of the M convolution kernels. For the M convolution kernels shown in Fig. 3c, each convolution kernel is a 5*3*3 three-dimensional data block, so the operation window is also a 5*3*3 three-dimensional data block. For the KH and KW of the M convolution kernels shown in Fig. 3c, the dimension corresponding to KH is the H dimension of the input data, and the dimension corresponding to KW is the W dimension of the input data. The gray cubes in Figs. 3e, 3f, and 3g are the data used for the computation at each position of the sliding operation window. The window may slide first along H and then along W, or first along W and then along H.

Specifically, for convolution, the operation at each sliding-window position is an inner product between the data block shown by the gray cubes and each of the M convolution kernel data blocks shown in "Fig. 3c Convolution 1 - Convolution Kernel"; convolution therefore outputs one value per convolution kernel for each sliding-window position. For pooling, the operation at each sliding-window position selects the maximum, computes the average, or the like over the H and W dimensions of the data block shown by the gray cubes (in the example of the figure, over the nine values of the gray data block that lie in the same plane); pooling therefore outputs C values for each sliding-window position. C is the remaining dimension, other than H and W, of the three-dimensional data block of a single sample. N indicates that the operation of this layer is performed on N samples at the same time.

For the normalization algorithm LRN, the C dimension is defined as follows. Each basic LRN operation selects one contiguous data block along the C dimension (that is, a Y*1*1 data block), where Y is the extent in the C dimension and is less than or equal to the size of the C dimension, the first 1 is the extent in the H dimension, and the second 1 is the extent in the W dimension. The other two dimensions are defined as the H and W dimensions; that is, for the three-dimensional data block of each sample, each LRN normalization operation is performed on a contiguous run of data with the same W coordinate and the same H coordinate but different C coordinates. For the normalization algorithm BN, the mean and the variance (or standard deviation) are computed over all values that have the same C coordinate in the three-dimensional data blocks of the N samples.
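The statement that pooling yields C values per sliding-window position can be illustrated with a small sketch. The layout `sample[h][w][c]` (C innermost), the function name `max_pool_window`, and the test data are assumptions for the example only.

```python
def max_pool_window(sample, h0, w0, kh, kw):
    """Max pooling over one window of a single sample.

    sample[h][w][c] holds one sample with C as the innermost dimension.
    The window covers rows h0..h0+kh-1 and columns w0..w0+kw-1.
    Returns one maximum per channel, i.e. a list of C values.
    """
    C = len(sample[0][0])
    return [
        max(sample[h][w][c]
            for h in range(h0, h0 + kh)
            for w in range(w0, w0 + kw))
        for c in range(C)
    ]


# 4 x 4 sample with 2 channels: channel 0 is h*10+w, channel 1 is its negative.
sample = [[[h * 10 + w, -(h * 10 + w)] for w in range(4)] for h in range(4)]
pooled = max_pool_window(sample, h0=1, w0=1, kh=3, kw=3)
```

The maximum is taken over the nine values in each channel plane of the 3*3 window, matching the description of the gray data block in the figure.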

In Figs. 3c to 3g, each cube represents one numerical value, which may also be called a weight. The numbers used in the schematic diagrams are examples only; in practice, each dimension may have any size (including the case where a dimension is 1, in which case the four-dimensional data block automatically becomes a three-dimensional data block. For example, when the number of samples computed at the same time is 1, the input data is a three-dimensional data block; likewise, when the number of convolution kernels is 1, the convolution kernel data is a three-dimensional data block). The chip device is used to perform the convolution operation between input data B and convolution kernel A.

For a convolution layer, the weights (all the convolution kernels) are as shown in "Fig. 3c Convolution 1 - Convolution Kernel". The number of convolution kernels is M, and each convolution kernel consists of C matrices of KH rows and KW columns, so the weight of the convolution layer can be expressed as a four-dimensional data block whose four dimensions are M, C, KH, and KW. As shown in "Fig. 3d Convolution 2 - Input Data", the input data of the convolution layer is a four-dimensional data block consisting of N three-dimensional data blocks, each of which consists of C feature matrices of H rows and W columns (that is, a data block whose four dimensions are N, C, H, and W).

The main unit distributes the weight of each of the M convolution kernels to one of the K base units, which stores it in its on-chip cache and/or registers (here the M convolution kernels form the distribution data block, and each convolution kernel may be one basic data block; in practical applications the basic data block may also be changed to a smaller unit, such as one plane matrix of one convolution kernel). The distribution may be carried out as follows. When the number of convolution kernels M ≤ K, the weight of one convolution kernel is distributed to each of M base units. When M > K, the weights of one or more convolution kernels are distributed to each base unit (the set of convolution kernel weights distributed to the i-th base unit is Ai, containing Mi convolution kernels in total).

Each base unit, for example the i-th base unit, stores the received convolution kernel weights Ai distributed by the main unit in its registers and/or on-chip cache. Each part of the input data (that is, a sliding window as shown in Fig. 3e, 3f, or 3g) is transmitted to every base unit by broadcast (the broadcast may use Method A or Method B described above). The weights of an operation window may be broadcast to all base units over several broadcasts; specifically, a part of the weights of the operation window may be broadcast each time, for example one plane matrix each time. Taking Fig. 3e as an example, one KH*KW matrix of a C plane may be broadcast each time; in practical applications, the first n rows or the first n columns of the KH*KW matrix of a C plane may also be broadcast at one time. The present disclosure does not limit how the partial data is transmitted or arranged. The arrangement of the input data may be converted to any dimension order, after which the main unit broadcasts each part of the input data to the base units in that order. Optionally, the convolution kernels, which are the distribution data described above, may be transmitted in a manner similar to that used for the operation windows of the input data, which is not repeated here. Optionally, the arrangement of the input data is converted so that C is the innermost loop; the data along C then becomes adjacent, which increases the parallelism of the convolution operation and makes it easier to process multiple feature maps in parallel. Optionally, the arrangement of the input data is converted to the dimension order NHWC or NWHC.

Each base unit, for example the i-th base unit, computes the inner product of each convolution kernel in the weights Ai with the corresponding part of the received broadcast data (that is, the operation window). The data of the corresponding part of the weights Ai may be read directly from the on-chip cache, or may first be read into registers for reuse. The results of the inner product operations of each base unit are accumulated and transmitted back to the main unit. The partial sum obtained from each inner product operation may be transmitted back to the main unit and accumulated there. Alternatively, it may be stored and accumulated in the registers and/or on-chip cache of the base unit and transmitted back to the main unit after the accumulation is complete. Alternatively, depending on the case, some partial sums may be accumulated in the registers and/or on-chip cache of the base unit and others transmitted to the main unit for accumulation, with the result returned to the main unit after the accumulation is complete.
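The distribute-then-broadcast flow for convolution can be sketched as follows. This is a software illustration under assumptions, not the hardware design: the round-robin kernel assignment, stride 1 with no padding, the C-innermost layouts `x[h][w][c]` and `kernels[m][kh][kw][c]`, and the function names are all hypothetical.

```python
def distribute_kernels(m, k):
    """Kernel indices held by each of k base units (round-robin; an assumption).

    With m <= k some units hold one kernel and the rest hold none;
    with m > k each unit holds one or more kernels.
    """
    return [list(range(u, m, k)) for u in range(k)]


def conv_distributed(x, kernels, k):
    """Convolve one sample x[h][w][c] with kernels[m][kh][kw][c].

    Stride 1, no padding. Returns out[h][w][m].
    """
    H, W, C = len(x), len(x[0]), len(x[0][0])
    M, KH, KW = len(kernels), len(kernels[0]), len(kernels[0][0])
    held = distribute_kernels(M, k)
    out = [[[0] * M for _ in range(W - KW + 1)] for _ in range(H - KH + 1)]
    for h0 in range(H - KH + 1):
        for w0 in range(W - KW + 1):
            # "Broadcast": every base unit receives the same operation window.
            for unit in range(k):
                for m in held[unit]:
                    acc = 0  # inner product accumulated inside the base unit
                    for i in range(KH):
                        for j in range(KW):
                            for c in range(C):
                                acc += x[h0 + i][w0 + j][c] * kernels[m][i][j][c]
                    out[h0][w0][m] = acc  # result returned to the main unit
    return out


x = [[[1], [2], [3]], [[4], [5], [6]], [[7], [8], [9]]]   # H = W = 3, C = 1
kernels = [
    [[[1], [1]], [[1], [1]]],   # kernel 0: all ones
    [[[1], [0]], [[0], [1]]],   # kernel 1: main diagonal
]
out = conv_distributed(x, kernels, k=3)   # M = 2 <= K = 3, one unit stays idle
```

Here the accumulation happens inside each base unit before the value is returned, which corresponds to one of the accumulation options described above; accumulating partial sums in the main unit would be an equally valid variant.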

Method for implementing BLAS (Basic Linear Algebra Subprograms) functions using the chip device
GEMM
GEMM refers to the matrix-matrix multiplication operation in the BLAS library. Its usual form is C = alpha * op(A) * op(B) + beta * C, where A and B are the two input matrices, C is the output matrix, alpha and beta are scalars, and op denotes some operation applied to matrix A or B. In addition, several auxiliary integer parameters describe the width and height of matrices A and B.

当該装置を使用してＧＥＭＭ計算を実現するステップは次のとおりである。 The steps to realize the GEMM calculation using the device are as follows.

The respective op operation is performed on input matrix A and input matrix B. The op operation may be a matrix transpose, or it may be another operation such as a nonlinear function or pooling. The matrix op operation is implemented using the vector operation function of the main unit. If the op of a matrix is empty, the main unit performs no operation on that matrix.

ｏｐ（Ａ）とｏｐ（Ｂ）との間のマトリックス乗算は、図２に示す方法によって行われる。 The matrix multiplication between op (A) and op (B) is performed by the method shown in FIG.

メインユニットのベクトル演算機能を用いて、ｏｐ（Ａ）＊ｏｐ（Ｂ）の結果の中の各値にａｌｐｈａをかける操作を行う。 Using the vector calculation function of the main unit, the operation of multiplying each value in the result of op (A) * op (B) by alpha is performed.

メインユニットのベクトル演算機能を用いて、マトリックスａｌｐｈａ＊ｏｐ（Ａ）＊ｏｐ（Ｂ）とｂｅｔａ＊Ｃとの対応する位置の加算ステップを実現する。 Using the vector calculation function of the main unit, the addition step of the corresponding positions of the matrix alpha * op (A) * op (B) and beta * C is realized.
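The four GEMM steps above can be summarized in a short sketch. It mirrors only the order of operations; the function names and the choice of transpose as the example op are assumptions, and the plain nested loops stand in for the method of Fig. 2.

```python
def transpose(X):
    """Example op: matrix transpose."""
    return [list(row) for row in zip(*X)]


def identity(X):
    """Empty op: the matrix is left unchanged."""
    return X


def gemm(alpha, A, B, beta, C, op_a=identity, op_b=identity):
    """Return alpha * op(A) * op(B) + beta * C for matrices given as lists of rows."""
    A2, B2 = op_a(A), op_b(B)                                  # step 1: apply op
    P = [[sum(a * b for a, b in zip(row, col)) for col in zip(*B2)]
         for row in A2]                                        # step 2: multiply
    P = [[alpha * v for v in row] for row in P]                # step 3: scale by alpha
    return [[p + beta * c for p, c in zip(prow, crow)]
            for prow, crow in zip(P, C)]                       # step 4: add beta * C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C0 = [[1, 1], [1, 1]]
result = gemm(2, A, B, 3, C0, op_a=transpose)
```

With `op_a=transpose`, op(A) is [[1, 3], [2, 4]], so the product is [[26, 30], [38, 44]] before scaling.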

GEMV
GEMV refers to the matrix-vector multiplication operation in the BLAS library. Its usual form is C = alpha * op(A) * B + beta * C, where A is the input matrix, B is the input vector, C is the output vector, alpha and beta are scalars, and op denotes some operation applied to matrix A.

前記装置を使用してＧＥＭＶ計算を実現するステップは以下の通りである。 The steps to realize the GEMV calculation using the device are as follows.

The corresponding op operation is performed on input matrix A. The chip device completes the matrix-vector multiplication between matrix op(A) and vector B using the method shown in Fig. 2. Using the vector operation function of the main unit, each value of the result of op(A) * B is multiplied by alpha. Using the vector operation function of the main unit, the vectors alpha * op(A) * B and beta * C are added at corresponding positions.
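The GEMV steps follow the same pattern as GEMM and can be sketched as below. The function name `gemv` and the default identity op are assumptions for illustration; `B` and `C` are vectors, as in the formula above.

```python
def gemv(alpha, A, B, beta, C, op=lambda X: X):
    """Return alpha * op(A) * B + beta * C, with B and C given as flat lists."""
    A2 = op(A)                                                # apply op to matrix A
    prod = [sum(a * b for a, b in zip(row, B)) for row in A2]  # matrix-vector product
    return [alpha * p + beta * c for p, c in zip(prod, C)]     # scale and add


A = [[1, 2], [3, 4]]
B = [1, 1]
C0 = [10, 20]
y = gemv(2, A, B, 0.5, C0)
```

The product op(A) * B is [3, 7]; scaling by alpha = 2 and adding beta * C = [5.0, 10.0] gives the final vector.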

Method for implementing an activation function using the chip device
An activation function usually refers to a nonlinear operation performed on each element of a data block (which may be a vector or a multidimensional matrix). For example, the activation function may be y = max(m, x), where x is the input value, y is the output value, and m is a constant. The activation function may also be y = tanh(x) or y = sigmoid(x), where x is the input value and y is the output value. The activation function may be a piecewise linear function, or any function that takes one value as input and produces one value as output.

To implement an activation function, the chip device uses the vector computation function of the main unit to take a vector as input and compute its activation vector. The main unit applies the activation function to each value of the input vector (the activation function takes one numerical value as input and produces one numerical value as output) and writes each output value to the corresponding position of the output vector.
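The element-wise application described here can be sketched as follows. The helper names `max_with_constant`, `sigmoid`, and `apply_activation` are hypothetical; `math.tanh` and `math.exp` come from the Python standard library.

```python
import math


def max_with_constant(x, m=0.0):
    """y = max(m, x), with m a constant (m = 0 gives the common ReLU)."""
    return max(m, x)


def sigmoid(x):
    """y = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_activation(vec, fn):
    """Apply a one-in, one-out function to every element of the input vector.

    Each output is written to the same position as its input.
    """
    return [fn(v) for v in vec]


inputs = [-1.0, 0.5, 2.0]
relu_out = apply_activation(inputs, max_with_constant)
tanh_out = apply_activation(inputs, math.tanh)
```

Any function of one value to one value can be passed as `fn`, which reflects the statement that the activation function may be arbitrary.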

上記入力ベクトルのソースは、チップ装置の外部データ、およびチップ装置の分岐ユニットによって転送された基本ユニットの計算結果データを含むが、これらに限定されない。 The source of the input vector includes, but is not limited to, external data of the chip device and calculation result data of the basic unit transferred by the branch unit of the chip device.

Specifically, the calculation result data may be the result of a matrix-vector multiplication, or the result of a matrix-matrix multiplication. The input data may also be the result obtained after the main unit has applied an offset.

Implementing the offset (bias) operation using the chip device
The main unit can add two vectors or two matrices. The main unit can also add a vector to each row, or to each column, of a matrix.
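The two offset variants, adding a vector to every row or to every column of a matrix, can be sketched as below. The function names are hypothetical and the matrices are plain lists of rows.

```python
def add_vector_to_rows(matrix, v):
    """Add v (one entry per column) to every row of the matrix."""
    return [[m + b for m, b in zip(row, v)] for row in matrix]


def add_vector_to_columns(matrix, v):
    """Add v (one entry per row) to every column of the matrix."""
    return [[m + b for m in row] for row, b in zip(matrix, v)]


X = [[1, 2, 3], [4, 5, 6]]
by_rows = add_vector_to_rows(X, [10, 20, 30])
by_cols = add_vector_to_columns(X, [100, 200])
```

Which variant applies depends on whether samples are stored as rows or as columns of the matrix being offset.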

任意選択で、上記マトリックスは、装置がマトリックスとマトリックスの乗法演算を実行する結果からのものでもよい。マトリックスは、装置がマトリックスにベクトルをかける演算を実行する結果からのものでもよい。マトリックスは、装置のメインユニットが外部から受信したデータでもよい。ベクトルは、装置のメインユニットが外部から受信したデータであってもよい。 Optionally, the matrix may be from the result of the device performing matrix-matrix multiplication operations. The matrix may be the result of the device performing an operation that multiplies the matrix by a vector. The matrix may be data received from the outside by the main unit of the device. The vector may be data received from the outside by the main unit of the device.

Note that the input data and calculation result data above are only examples; in actual applications, data of other types and from other sources is also possible, and the embodiments of the present disclosure do not limit the source or representation of such data.

前述の方法の実施形態では、簡潔さのために、それらはすべて一連の動作の組み合わせとして説明されているが、当業者は、本開示が説明された動作シーケンスによって限定されないことを理解されたい。なぜなら、特定のステップは、本開示に従って他のシーケンスでまたは同時に実行され得るからである。さらに、当業者はまた、本明細書に記載されている実施形態は任意選択可能な実施形態であり、関連する動作およびモジュールは必ずしも本開示によって必要とされないことを理解されたい。 In embodiments of the methods described above, for brevity, they are all described as a combination of actions, but those skilled in the art will appreciate that the present disclosure is not limited by the sequence of actions described. This is because certain steps can be performed in other sequences or at the same time in accordance with the present disclosure. Further, one of ordinary skill in the art will appreciate that the embodiments described herein are optional embodiments and that relevant actions and modules are not necessarily required by the present disclosure.

上記の実施形態では、様々な実施形態の説明がそれぞれの重点があり、ある実施形態で詳述されていない部分は、他の実施形態の関連する説明を参照してもよい。 In the above embodiments, the description of the various embodiments has its own emphasis, and the parts not detailed in one embodiment may refer to the relevant description of the other embodiments.

It should be understood that, in the embodiments provided herein, the disclosed device may also be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Further, the mutual coupling, direct coupling, or communication connection shown or described may be an indirect coupling or communication connection through some interface, device, or unit, and may be electrical or of another form.

また、本開示の各実施形態における各機能ユニットは、１つの処理ユニットに集積されていても、物理的に別々に存在していても、２つ以上の装置が集積されていてもよい。上記の集積されたユニット／モジュールはハードウェアの形で実施される。例えば、ハードウェアは、デジタル回路、アナログ回路などを含む回路としてもよい。ハードウェア構造の物理的実現は、物理装置を含むが、これらに限定されなく、物理装置はトランジスタ、メモリスタなどを含むがこれらに限定されない。計算装置内の計算モジュールは、ＣＰＵ、ＧＰＵ、ＦＰＧＡ、ＤＳＰ、ＡＳＩＣなどの任意の適切なハードウェアプロセッサとしてもよい。記憶装置は、ＲＲＡＭ（登録商標）、ＤＲＡＭ、ＳＲＡＭ、ＥＤＲＡＭ、ＨＢＭ、ＨＭＣなどの任意の適切な磁気記憶媒体または光磁気記憶媒体としてもよい。 Further, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, physically separated, or two or more devices may be integrated. The above integrated units / modules are implemented in the form of hardware. For example, the hardware may be a circuit including a digital circuit, an analog circuit, and the like. The physical realization of the hardware structure includes, but is not limited to, physical devices, and physical devices include, but are not limited to, transistors, memristors, and the like. The computing module in the computing device may be any suitable hardware processor such as CPU, GPU, FPGA, DSP, ASIC. The storage device may be any suitable magnetic storage medium or optical magnetic storage medium such as RRAM®, DRAM, SRAM, EDRAM, HBM, HMC.

説明されたユニットは、物理的に分離されていてもいなくてもよく、すなわち一箇所に配置されてもよく、または複数のネットワークユニットにわたって分散されてもよい。実施形態の解決策の目的を達成するために、実際の必要性に応じていくつかまたはすべてのユニットを選択してもよい。 The described units may or may not be physically separated, i.e. they may be located in one place or distributed across multiple network units. Some or all units may be selected depending on the actual need to achieve the objectives of the solution of the embodiment.

以上、本開示の実施形態について詳細に説明したが、具体的な例を用いて本開示の原理および実施形態を本明細書に記載したが、実施形態の上記の説明は、本開示の方法およびその核となる概念の理解をサポートすることのみを目的とする。同時に、当業者であれば、本開示の概念によって、実施形態および本出願の応用範囲において変更することがある。まとめて、上記の明細書の内容は本開示を限定するものとして解釈されるべきではない。 Although the embodiments of the present disclosure have been described in detail above, the principles and embodiments of the present disclosure have been described herein using specific examples. It is only intended to support the understanding of its core concepts. At the same time, one of ordinary skill in the art may make changes in the embodiments and the scope of application of the present application by the concept of the present disclosure. Taken together, the content of the above specification should not be construed as limiting this disclosure.

Claims

A convolution calculation method applied to a chip device, the method comprising:
the chip device receiving input data and a weight;
the chip device converting the arrangement of the input data so that C is innermost, to obtain input data whose innermost dimension is C; and
the chip device performing a convolution operation on the weight and the input data whose innermost dimension is C, to obtain a result of the convolution operation,
wherein the input data is four-dimensional data whose four dimensions are N, H, W, and C, the C dimension being the dimension in the height direction of the four-dimensional data,
wherein the chip device includes a master circuit and k slave circuits, and the step of performing the convolution operation on the weight and the input data whose innermost dimension is C, to obtain the result of the convolution operation, specifically comprises:
the master circuit dividing the weight into a plurality of basic data blocks and distributing the plurality of basic data blocks to the k slave circuits, each slave circuit storing the basic data blocks distributed to it;
the master circuit broadcasting each piece of partial data of the input data to the k slave circuits, each of the k slave circuits performing an operation on the partial data and the distributed basic data blocks to obtain an operation result and transmitting the operation result to the master circuit; and
the master circuit obtaining the result of the convolution operation from all the operation results of the k slave circuits,
wherein k is the number of slave circuits and is an integer greater than or equal to 2.

The method according to claim 1, wherein
the step of converting the arrangement of the input data so that C is innermost, to obtain input data whose innermost dimension is C, specifically comprises:
converting the arrangement of the input data so that C is innermost, to obtain input data NHWC or input data NWHC.

The method according to claim 1 or 2.
The weight is a four-dimensional data block, and the four dimensions of the weight are M, C, KH, and KW.
A convolution calculation method characterized by that.

The method according to claim 3, wherein
when the value of the weight in the M dimension is m, the weight includes m convolution kernels, and the master circuit dividing the weight into a plurality of basic data blocks specifically comprises:
the master circuit dividing the m convolution kernels into m basic data blocks.

The method according to claim 3, wherein
the master circuit dividing the weight into a plurality of basic data blocks specifically comprises:
the master circuit dividing m convolution kernels into m * c basic data blocks, where c is the value of the weight in the C dimension and is an integer greater than or equal to 1.

The method according to claim 4.
Specifically, the master circuit divides the weight into a plurality of basic data blocks.
When m> k, the master circuit divides m convolution kernels into m basic data blocks and distributes one or more of m basic data blocks to k slave circuits.
When m ≦ k, the master circuit includes a step of dividing m convolution kernels into m basic data blocks and delivering one of the m basic data blocks to k slave circuits.
A convolution calculation method characterized by that.

The method according to claim 4 or 5, wherein
the master circuit broadcasting each piece of partial data of the input data to the k slave circuits specifically comprises:
the master circuit cutting, from the input data NHWC or the input data NWHC, an input data block of the same size as the basic data block as one piece of partial data; cutting the input data with the input data block as the basic cutting unit, moving by a step size, to obtain each piece of partial data; and broadcasting one or more pieces of the partial data to the k slave circuits.

The method according to claim 7, wherein
the k slave circuits performing an operation on each piece of partial data and the distributed basic data blocks to obtain operation results, and the master circuit obtaining the result of the convolution operation from all the operation results of the k slave circuits, specifically comprises:
each of the k slave circuits multiplying the element values of one piece of partial data by the element values at the corresponding positions of a distributed basic data block to obtain a plurality of multiplication results and transmitting the plurality of multiplication results to the master circuit, the master circuit accumulating the multiplication results transmitted by each slave circuit to obtain a plurality of convolution results, and the master circuit sorting the plurality of convolution results to obtain the result of the convolution operation; or
each of the k slave circuits multiplying the element values of one piece of partial data by the element values at the corresponding positions of a distributed basic data block to obtain a plurality of multiplication results, accumulating the plurality of multiplication results to obtain a convolution result, and transmitting the convolution result to the master circuit, the master circuit sorting all the convolution results to obtain the result of the convolution operation.

The method according to any one of claims 1 to 6.
The chip device further comprises a branch circuit, the branch circuit connecting the master circuit to a plurality of slave circuits, and the method further comprises.
The branch circuit comprises transferring data between the master circuit and a plurality of slave circuits.
A convolution calculation method characterized by that.

The method according to any one of claims 1 to 6.
The master circuit includes one or any combination of a vector arithmetic circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transpose circuit, a direct memory access circuit, or a data sorting circuit.
A convolution calculation method characterized by that.

The method according to any one of claims 1 to 6.
The slave circuit includes one of, or any combination of, an inner product arithmetic circuit and an accumulator circuit.
A convolution calculation method characterized by that.

It ’s a chip device,
The chip device is used to receive input data and weights, the input data is four-dimensional data, the four dimensions are N, H, W, C , and the C dimension is the four dimensions. The height dimension of the data
The chip device is further used to convert the arrangement of input data so that C is on the innermost side to obtain input data on which C is on the innermost side.
It said chip unit perform convolutional arithmetic operations on the input data and the weighting innermost is C, which we used so as to obtain the calculation result of the convolution,
The chip device further includes a master circuit and k slave circuits.
The master circuit is specifically used to divide the weight into a plurality of basic data blocks, distribute the plurality of basic data blocks to the k slave circuits, and broadcast each partial data of the input data to the k slave circuits,
The slave circuit is specifically used to store the distributed basic data block, perform an operation on each partial data and the distributed basic data block to obtain an operation result, and transmit the operation result to the master circuit,
The master circuit is specifically used to obtain the calculation result of the convolution operation according to all the operation results of the k slave circuits, where k is the number of slave circuits and k is an integer of 2 or more.
A chip device characterized by that.

The chip device according to claim 12.
The chip device is specifically used to convert the arrangement of the input data so that C is innermost, to obtain input data NHWC or input data NWHC.
A chip device characterized by that.
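
The rearrangement that places C innermost can be illustrated in software. The following is only a minimal sketch, not the claimed circuit: it assumes the data arrives as nested Python lists indexed `[n][c][h][w]` (an NCHW starting layout is an assumption, since the claims do not state the original arrangement), and the function name `nchw_to_nhwc` is hypothetical.

```python
def nchw_to_nhwc(x):
    """Rearrange nested lists indexed [n][c][h][w] into [n][h][w][c].

    Assumption: the source layout is NCHW; the claims only require that
    C ends up as the innermost dimension.
    """
    n_dim = len(x)
    c_dim = len(x[0])
    h_dim = len(x[0][0])
    w_dim = len(x[0][0][0])
    return [
        [
            [
                # Gather the C values for one (h, w) position so that
                # they become adjacent in the innermost list.
                [x[n][c][h][w] for c in range(c_dim)]
                for w in range(w_dim)
            ]
            for h in range(h_dim)
        ]
        for n in range(n_dim)
    ]


# One sample with C=2, H=2, W=3; each value encodes its own (c, h, w) index.
x = [[[[100 * c + 10 * h + w for w in range(3)]
       for h in range(2)]
      for c in range(2)]]
y = nchw_to_nhwc(x)
```

Swapping the `h` and `w` comprehensions would give the NWHC arrangement named in the same claim.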

The chip device according to claim 13.
The weight is a four-dimensional data block, and the four dimensions of the weight are M, C, KH, and KW.
A chip device characterized by that.

The chip device according to claim 12.
If the value of the weight in the M dimension is m, then the weight comprises m convolution kernels, and the master circuit is specifically used to divide the m convolution kernels into m basic data blocks.
A chip device characterized by that.

The chip device according to claim 12.
The master circuit is specifically used to divide the m convolution kernels into m * c basic data blocks, where c is the value in the C dimension and c is an integer of 1 or more.
A chip device characterized by that.
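
One possible reading of the m * c division is one basic data block per pair of kernel index and C index. The sketch below follows that reading as an assumption, not as the patent's definition. The weight is taken as nested lists indexed `[m][c][kh][kw]`, and `split_kernels_by_c` is a hypothetical helper name.

```python
def split_kernels_by_c(weight):
    """Split m kernels into m * c basic data blocks.

    Assumption: each block is the KH x KW plane of one kernel at one
    C index, tagged with its (kernel index, C index) pair.
    """
    blocks = []
    for kernel_index, kernel in enumerate(weight):
        for c_index, plane in enumerate(kernel):
            blocks.append(((kernel_index, c_index), plane))
    return blocks


# m = 2 kernels, c = 3, KH = KW = 1; values encode (kernel, C) as 10*m + c.
weight = [[[[10 * mi + ci]] for ci in range(3)] for mi in range(2)]
blocks = split_kernels_by_c(weight)
```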

The chip device according to claim 15.
When m> k, the master circuit specifically divides m convolution kernels into m basic data blocks and distributes one or more of m basic data blocks to k slave circuits.
When m ≦ k, the master circuit specifically divides m convolution kernels into m basic data blocks and distributes one of the m basic data blocks to k slave circuits.
A chip device characterized by that.
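
The two cases m > k and m ≦ k can be sketched with a single assignment routine. This is a minimal illustration only: the round-robin policy and the name `distribute_blocks` are assumptions, since the claim does not fix how blocks are matched to slave circuits.

```python
def distribute_blocks(kernels, k):
    """Assign m basic data blocks (one per kernel) to k slave circuits.

    Assumption: round-robin assignment. When m > k some slaves hold
    several blocks; when m <= k each slave holds at most one block.
    """
    if k < 2:
        raise ValueError("k must be an integer of 2 or more")
    slaves = [[] for _ in range(k)]
    for index, block in enumerate(kernels):
        # Keep the kernel index with the block so results can be ordered later.
        slaves[index % k].append((index, block))
    return slaves


more_blocks = distribute_blocks(["b0", "b1", "b2", "b3", "b4"], k=2)  # m > k
fewer_blocks = distribute_blocks(["b0", "b1"], k=3)                   # m <= k
```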

The chip device according to claim 15 or 16.
The master circuit is specifically used to cut an input data block having the same size as the basic data block from the input data NHWC or the input data NWHC as one partial data of each partial data, to use the input data block as a basic cutting unit and cut the input data according to the step size to obtain each partial data, and to broadcast one or more partial data of each partial data to the k slave circuits.
A chip device characterized by that.
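
Cutting the input with a block-sized window that advances by the step size can be sketched as follows. The sketch assumes a single sample stored as nested lists indexed `[h][w][c]`, no padding, and the same step size in both directions. The function name `cut_partial_data` and these simplifications are illustrative assumptions.

```python
def cut_partial_data(image_hwc, kh, kw, stride):
    """Cut KH x KW x C windows out of one [h][w][c] sample.

    Assumptions: no padding, and the same step size in H and W.
    Each window is returned with its (top, left) position.
    """
    h_dim = len(image_hwc)
    w_dim = len(image_hwc[0])
    parts = []
    for top in range(0, h_dim - kh + 1, stride):
        for left in range(0, w_dim - kw + 1, stride):
            window = [
                [list(image_hwc[top + i][left + j]) for j in range(kw)]
                for i in range(kh)
            ]
            parts.append(((top, left), window))
    return parts


# 3 x 3 sample with C = 1; the value at (h, w) is 3*h + w.
image = [[[3 * h + w] for w in range(3)] for h in range(3)]
parts = cut_partial_data(image, kh=2, kw=2, stride=1)
```

Each entry of `parts` corresponds to one partial data that the master circuit would broadcast to the slave circuits.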

The chip device according to claim 18.
The slave circuit is specifically used to perform a multiplication operation on the element values of one partial data and the element values at the corresponding positions of the distributed basic data block to obtain a plurality of multiplication results, and to send the plurality of multiplication results to the master circuit,
The master circuit is specifically used to accumulate the plurality of multiplication results transmitted by each slave circuit to obtain a plurality of convolution results, and to sort the plurality of convolution results to obtain the calculation result of the convolution operation.
A chip device characterized by that.

The chip device according to claim 18.
The slave circuit is specifically used to perform a multiplication operation on the element values of one partial data and the element values at the corresponding positions of the distributed basic data block to obtain a plurality of multiplication results, to accumulate the plurality of multiplication results to obtain a convolution result, and to send the convolution result to the master circuit, and the master circuit is used to sort all the convolution results to obtain the calculation result of the convolution operation.
A chip device characterized by that.
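
The variant in which each slave circuit accumulates its own products, and the master circuit only sorts the returned results, can be sketched as below. This is an illustration under stated assumptions: partial data and basic data blocks are flattened to equal-length lists, each result carries a (kernel index, window index) tag, and the names `slave_run` and `master_sort` are hypothetical.

```python
def slave_run(kernel_index, block, partials):
    """Multiply element-wise and accumulate, once per partial data.

    Returns tagged results (kernel_index, window_index, value) so the
    master can restore the output order.
    """
    results = []
    for window_index, partial in enumerate(partials):
        products = [a * b for a, b in zip(partial, block)]
        results.append((kernel_index, window_index, sum(products)))
    return results


def master_sort(tagged_results, m, num_windows):
    """Order results by kernel index, then window index, as an m x windows grid."""
    ordered = sorted(tagged_results)
    return [
        [ordered[i * num_windows + j][2] for j in range(num_windows)]
        for i in range(m)
    ]


partials = [[1, 2, 3, 4], [2, 3, 4, 5]]   # two flattened windows
blocks = [[1, 0, 0, 1], [1, 1, 1, 1]]     # two flattened kernels
# Results arrive out of order: the slave holding kernel 1 reports first.
arrived = slave_run(1, blocks[1], partials) + slave_run(0, blocks[0], partials)
output = master_sort(arrived, m=2, num_windows=2)
```

In the other claimed variant the slave circuits would return the `products` lists and the master circuit would perform the `sum` itself; the sorted output is the same.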

The chip device according to any one of claims 13 to 17.
The chip device further includes a branch circuit, which connects the master circuit to a plurality of slave circuits.
The branch circuit is used to transfer data between the master circuit and a plurality of slave circuits.
A chip device characterized by that.

The chip device according to any one of claims 13 to 17.
The master circuit includes one or any combination of a vector arithmetic circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transpose circuit, a direct memory access circuit, or a data sorting circuit.
A chip device characterized by that.

The chip device according to any one of claims 13 to 17.
The slave circuit includes one or any combination of an inner product arithmetic circuit and an accumulator circuit.
A chip device characterized by that.