JP4947983B2

JP4947983B2 - Arithmetic processing system

Info

Publication number: JP4947983B2
Application number: JP2006023630A
Authority: JP
Inventors: 修野村; 隆森江; 圭祐是角
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2006-01-31
Filing date: 2006-01-31
Publication date: 2012-06-06
Anticipated expiration: 2026-01-31
Also published as: JP2007206887A

Description

The present invention relates to an LSI chip and an arithmetic processing system.

Arithmetic processing systems that execute hierarchical arithmetic processing by connecting a plurality of substrates on which arithmetic circuits are mounted are conventionally known (Patent Document 1). Such an arithmetic processing system is described with reference to FIG. 23. FIG. 23 is a diagram schematically illustrating a configuration example of an arithmetic processing system formed by connecting a plurality of substrates on which arithmetic circuits are mounted.

In FIG. 23, a plurality of neuron mimic elements 20 with a learning function, which correspond to arithmetic circuits, are connected in a hierarchical network. In this arithmetic processing system, which also includes control means for controlling the neuron mimic elements 20, the neuron mimic elements 20 are divided among and mounted on substrates 55a and 55b, which are provided for layers A2 and A3, respectively. This configuration forms an arithmetic processing system that serves as a neurocomputer system.

Also known is an LSI (large-scale integration) chip that performs a convolution operation between two-dimensionally distributed data, such as image data, and a kernel having two-dimensionally distributed weights (Non-Patent Document 1). The processing flow of such an LSI chip is described with reference to FIG. 24. FIG. 24 is a diagram schematically illustrating the data flow in an LSI chip that performs a convolution operation on two-dimensional data.

Image data and kernel data are stored in separate SRAMs (Image SRAM and Kernel SRAM). The image data is input to the PWM product-sum operation circuit (PWM-MAC) through a shift register (SR) and a digital/PWM converter (D/P). The kernel data is input to the PWM-MAC through an SR, a D/P, and a PWM/analog converter (P/A). Here, the SR converts the serial data output from the SRAM into parallel data. The D/P converts a digital signal into a PWM signal, and the P/A converts a PWM signal into an analog voltage. The PWM-MAC performs the convolution operations in parallel. PWM is an abbreviation for pulse-width modulation.

The output of the PWM-MAC passes through a PWM/digital converter (P/D) and is accumulated by a digital accumulator (ACC). The accumulated result passes through a multiplexer (MUX), is nonlinearly converted by a look-up table (LUT), and is stored in the Image SRAM again. Repeating this processing detects features in the image data.

Patent Document 1: JP-A-5-233582
Non-Patent Document 1: K. Korekado et al., "An Image Filtering Processor for Face/Object Recognition Using Merged/Mixed Analog-Digital Architecture", in 2005 Symposium on VLSI Circuits, Digest of Technical Papers, pp. 220-223, Kyoto, Japan, June 2005.

However, in the arithmetic processing system disclosed in Patent Document 1, wiring must be provided for every connection between neuron mimic elements on different substrates, so as the number of connected elements grows it becomes difficult to implement all of the wiring. In particular, when a convolution operation with a kernel having a two-dimensional weight distribution is performed on two-dimensionally distributed data such as image data, a large number of elements must be connected in a complicated manner, which makes it difficult to implement every connection between the neuron mimic elements.

Furthermore, with the LSI chip disclosed in Non-Patent Document 1, it is difficult to perform arithmetic processing on two-dimensional data whose size exceeds the memory size of the Image SRAM included in the LSI chip.

The present invention has been made in view of the above problems, and its object is to provide a technique that can efficiently execute arithmetic processing on two-dimensional data exceeding the memory size of a memory circuit block built into an LSI chip.

An arithmetic processing system according to the present invention has the following configuration. That is, it is
an arithmetic processing system that is formed by connecting a plurality of LSI chips and performs a convolution operation between two-dimensional target data and a two-dimensional kernel,
wherein each of the LSI chips comprises:
a target data memory that holds the target data;
a register that holds the target data required for an operation;
a kernel memory that holds the kernel;
an arithmetic circuit that performs convolution processing based on the target data held in the register and the kernel;
input/output wiring that is connected to an output line from the target data memory to the register and through which the target data is input from and output to an adjacent LSI chip; and
a switch circuit that switches the connection of the input/output wiring to the adjacent LSI chip,
wherein the target data is distributed among and held in the target data memories of the plurality of adjacent LSI chips, and
wherein, in each of the LSI chips, when target data is output from the target data memory of that LSI chip to the register of that LSI chip, target data that is required for the operation in the arithmetic circuit of an adjacent LSI chip but is not present in the target data memory of that adjacent LSI chip is output to the register on the same LSI chip as the target data memory holding the data and, by switching the connection of the input/output wiring with the switch circuit, is simultaneously output through the input/output wiring to the register of the adjacent LSI chip.

According to the present invention, it is possible to provide a technique that can efficiently execute arithmetic processing on two-dimensional data exceeding the memory size of a memory circuit block built into an LSI chip.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings. The constituent elements described in these embodiments are merely examples and are not intended to limit the scope of the present invention to them alone.

<<First Embodiment>>
(Arithmetic processing system)
FIG. 1 is a diagram schematically showing the arithmetic processing system according to this embodiment. As shown in FIG. 1, the arithmetic processing system consists of four LSI chips 1 to 4 arranged in a row. The number of LSI chips can be changed according to the size of the data to be processed; this embodiment shows the case of four chips as an example. Each LSI chip is connected to its adjacent LSI chips by wiring 5, which carries data in and out. In this embodiment, the width of the data input and output between chips is 6 bits, but other data widths may be used, in which case the number of wires changes with the data width. In FIG. 1, the 6-bit wiring and the corresponding LSI chip pins are each abbreviated to a single wire and a single pin. Wiring other than the input/output wiring 5 is omitted from FIG. 1 because it is not needed to describe this embodiment.

(Example of arithmetic processing)
Next, an example of the arithmetic processing realized by this arithmetic processing system is described with reference to FIG. 2. FIG. 2 is a diagram schematically showing processing that detects the position of a face using a hierarchical convolutional neural network.

As shown in FIG. 2, the arithmetic processing in this embodiment inputs image data 7 to the first layer 8 and hierarchically repeats a convolution operation with a kernel 6 having predetermined two-dimensionally distributed weights. In layers 1 to 6, a convolution operation with a kernel 6 is performed on the group of results produced by the convolution in the preceding layer, in the same way as in the first layer. The operation executed by one arithmetic element in a given layer is expressed by the following equation (1), where o is an output value of the previous stage and w is a kernel weight.
u = Σ w · o   (1)
When the convolution is performed hierarchically, the results calculated in each layer become the input values for the convolution in the next stage. Each layer may also calculate several different result groups as its convolution output. As shown in FIG. 2, these different results correspond to detection results for several different features, and they are obtained by changing the weight distribution of the kernel. Also as shown in FIG. 2, a convolution may take several result groups of the previous layer as its input values.
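Equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the patented circuit: the function names `element_output` and `convolve_layer` are hypothetical, the sum is taken without flipping the kernel (a plain weighted sum, as equation (1) is written), and no padding is applied.

```python
def element_output(window, kernel):
    """Equation (1): u = sum of w * o over one kernel-sized window."""
    return sum(w * o
               for w_row, o_row in zip(kernel, window)
               for w, o in zip(w_row, o_row))


def convolve_layer(prev, kernel):
    """Slide the kernel over the previous layer's outputs (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(prev) - kh + 1
    out_w = len(prev[0]) - kw + 1
    return [[element_output([row[x:x + kw] for row in prev[y:y + kh]], kernel)
             for x in range(out_w)]
            for y in range(out_h)]


# Tiny example: a 3x3 "previous layer" and a 2x2 kernel.
prev = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
result = convolve_layer(prev, kernel)  # 2x2 output, one u per position
```

Feeding `result` back in as `prev` with a different kernel would correspond to moving to the next layer, and using several kernels on the same input would correspond to detecting several features.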

The arithmetic processing in this embodiment assumes face position detection using a hierarchical convolutional neural network, a known technique for detecting features in natural images whose data is distributed two-dimensionally, but it is not limited to this. Any other algorithm may be executed as long as it performs a convolution operation between two-dimensionally distributed data and a kernel having predetermined two-dimensionally distributed weights.

(Circuit configuration of the LSI chip)
Next, the circuit configuration of the LSI chips 1 to 4 that make up the arithmetic processing system is described with reference to FIG. 3. FIG. 3 is a block diagram schematically showing the circuit configuration of an LSI chip. As shown in FIG. 3, the LSI chip in this embodiment has the following circuit blocks.
- Memory circuit block (MEM) 10
- Kernel data memory circuit block (KMEM) 11
- Registers (REG) 12, 14
- Digital-PWM conversion circuit blocks (D/P) 13, 15
- PWM-analog conversion circuit (P/A) 17
- Arithmetic circuit block (PWM-MAC) 16
- Accumulation circuit block (ACC) 19
- PWM-digital conversion circuit block (P/D) 18
- Multiplexer (MUX) 20
- Look-up table (LUT) 21
Wiring is omitted from FIG. 3.

(Processing flow)
Next, the flow of processing executed by these circuit blocks is described with reference to FIG. 4. FIG. 4 is a block diagram schematically showing the flow of data between the circuit blocks.

The image data or the operation results of the previous layer are stored in the memory circuit block (MEM) 10. The data for the kernel weights is stored in the kernel data memory circuit block (KMEM) 11.

The kernel weight data held in the kernel data memory circuit block (KMEM) 11 is input to the register (REG) 12 and converted into parallel data. It is then converted into PWM signals by 51 digital-PWM conversion circuit blocks (D/P) 13. The PWM signals are converted into analog signals by the PWM-analog conversion block (P/A) 17 and input to the arithmetic circuit block (PWM-MAC) 16.

The image data or previous-layer operation results held in the memory circuit block (MEM) 10 are input to the arithmetic circuit block (PWM-MAC) 16 through the register (REG) 14 and the digital-PWM conversion circuit block (D/P) 15. At this time, the data output from the memory circuit block (MEM) 10 is also output to the adjacent LSI chip. In addition, as shown in FIG. 4, the register (REG) 14 receives image data or previous-layer operation results output from an adjacent LSI chip. The processing for exchanging data with adjacent LSI chips and for inputting data to the register (REG) 14 is described in detail later.

The register (REG) 14 converts the serial data output from the memory circuit block (MEM) 10 into parallel data. The digital-PWM conversion circuit blocks (D/P) 13 and 15 convert digital signals into PWM signals, and the PWM-analog conversion circuit block (P/A) 17 converts PWM signals into analog voltages. The arithmetic circuit block (PWM-MAC) 16 performs the convolution operations in parallel on the input kernel weight data and the image data or previous-layer operation results.

The result of the arithmetic circuit block 16 is output as a PWM signal, converted into a digital signal by the PWM-digital conversion circuit block (P/D) 18, and then accumulated by the accumulation circuit block (ACC) 19. The accumulated result passes through the multiplexer (MUX) 20, is nonlinearly converted by the look-up table (LUT) 21, and is held in the memory circuit block (MEM) 10 again.
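The accumulate-then-convert stage at the end of this flow can be sketched as follows. The text does not give the accumulator range or the contents of the LUT, so `ACC_MIN`, `ACC_MAX`, and the rectifying table used here are purely illustrative assumptions, as are all function names.

```python
ACC_MIN, ACC_MAX = -32, 31          # hypothetical accumulator range

# Hypothetical LUT contents: a rectifying non-linearity, one entry per
# possible accumulator value. The real table is not given in the text.
LUT = [max(0, v) for v in range(ACC_MIN, ACC_MAX + 1)]


def accumulate(partial_results):
    """ACC: sum the digitized partial results coming from the P/D block."""
    total = 0
    for value in partial_results:
        total += value
    return total


def lut_convert(acc_value):
    """LUT: clamp to the table range, then look up the converted value."""
    clamped = min(max(acc_value, ACC_MIN), ACC_MAX)
    return LUT[clamped - ACC_MIN]


def output_value(partial_results):
    """ACC followed by LUT; the result would be written back to MEM."""
    return lut_convert(accumulate(partial_results))
```

Implementing the non-linearity as a table means that any transfer function can be substituted by changing the table contents, without touching the accumulation logic.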

Repeating the above processing as many times as there are features and as many times as there are layers realizes the arithmetic processing of the hierarchical convolutional neural network.

(Circuit blocks)
Next, among the circuit blocks that make up the LSI chip, the detailed circuit configuration of the arithmetic circuit block (PWM-MAC) 16 is described with reference to FIG. 5. FIG. 5 is a diagram showing the circuit configuration of the arithmetic circuit block 16.

Each arithmetic circuit 25 in the arithmetic circuit block 16 of this embodiment (the shaded part in the figure) has 51 switched current sources (SCS) 23 and one integration capacitor (C) 24 (for reasons of space, only five SCSs 23 are shown in FIG. 5). The arithmetic circuit block 16 consists of 80 arithmetic circuits 25, which matches the number of memory cells in one column of one memory circuit block (MEM) 10 (for reasons of space, only three arithmetic circuits are shown in FIG. 5). As shown in FIG. 5, the arithmetic circuits 25 in the arithmetic circuit block 16 share the PWM signals PI_i, which correspond to the input image data or the operation results of the previous layer, and the analog voltages VW_j, which correspond to the kernel weight data. Here, i = 1, ..., 130 and j = 1, ..., 51. The PWM signals PI_i are input from the register (REG) 14 via the D/P 15, and the analog voltages VW_j are input from the register (REG) 12 via the D/P 13 and the P/A 17.

This embodiment describes, as an example, a configuration in which one LSI performs a convolution operation using a kernel of size 51 × 51 and obtains an output of size 80 × 80. To obtain an 80 × 80 output from a convolution with a 51 × 51 kernel, the two-dimensional input data for one LSI must be 130 × 130 (since 80 + 51 - 1 = 130). The memory circuit block (MEM) 10 consists of 80 × 80 memory cells.
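The size relation above is simple enough to check directly. The following sketch only restates the arithmetic from the text (80 + 51 - 1 = 130) and derives how many input values per row cannot come from the chip's own 80-wide memory; the function names are illustrative.

```python
def required_input_size(output_size, kernel_size):
    """Input width needed for an unpadded convolution: out + kernel - 1."""
    return output_size + kernel_size - 1


def values_from_neighbours(output_size, kernel_size, memory_size):
    """Input values per row that the chip's own memory cannot supply."""
    return required_input_size(output_size, kernel_size) - memory_size


input_size = required_input_size(80, 51)        # 130
missing = values_from_neighbours(80, 51, 80)    # 50
```

The 50 missing values per row are the ones that have to be supplied from outside the chip's own memory, which is what the multi-chip arrangement addresses.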

In this case, the arithmetic circuit 25 operates according to the following procedure.
(1) The input PWM signals PI are input to the switched current sources (SCS) 23.
(2) The PWM signals are converted into charge on the integration capacitor (C) 24, which performs a weighted addition using the analog voltages VW_j.
(3) The voltage V_k across the integration capacitor (C) 24 is compared with a linear ramp signal V_ref and thereby converted into the PWM signal PO_k.
That is, the following equations hold.
PO_1 = PI_1 · VW_1 + ... + PI_51 · VW_51
...
PO_80 = PI_80 · VW_1 + ... + PI_130 · VW_51
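Read as a pattern, the equations above say that output k is the weighted sum of 51 consecutive inputs starting at input k. The sketch below models that numerically with ordinary numbers in place of PWM pulse widths and analog voltages; `mac_block` is a hypothetical name and the code uses 0-based indices, so `po[0]` corresponds to PO_1.

```python
def mac_block(pi, vw, n_circuits):
    """Each circuit k computes sum_j pi[k + j] * vw[j], all sharing pi and vw."""
    if len(pi) != n_circuits + len(vw) - 1:
        raise ValueError("need n_circuits + len(vw) - 1 input values")
    return [sum(pi[k + j] * vw[j] for j in range(len(vw)))
            for k in range(n_circuits)]


# Small case that is easy to check by hand.
small = mac_block([1, 2, 3, 4], [1, 10], 3)

# Full-size case from the text: 130 inputs, 51 weights, 80 circuits.
full = mac_block([1] * 130, [1] * 51, 80)
```

In the small case each output pairs one input with weight 1 and the next input with weight 10, which mirrors how PO_1 through PO_80 slide across PI_1 through PI_130.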

The charge on the integration capacitor (C) 24 is reset by applying a High signal to RST before the operation starts.

Next, the other circuit blocks that make up the LSI chip in this embodiment are described.

The PWM-analog conversion circuit (P/A) 17 converts the kernel weight data, already converted into PWM signals, into the analog voltages VW_j and outputs them to the arithmetic circuit block 16. The PWM-analog conversion circuit (P/A) 17 consists of one switched current source, an integration capacitor, and a source follower buffer. The switched current source converts the PWM signal into charge on the integration capacitor, and the source follower buffer outputs the voltage of the integration capacitor.
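The conversions between pulse width and voltage described here can be illustrated with an ideal-component model. This is only a sketch under stated assumptions: an ideal constant current source charging an ideal capacitor (Q = I·t, V = Q/C) and an ideal linear ramp for the reverse conversion. The component values are invented for the example and do not come from the text.

```python
def pwm_to_voltage(pulse_width_s, current_a, capacitance_f):
    """Ideal integrator: a current I flowing for time t gives V = I * t / C."""
    return current_a * pulse_width_s / capacitance_f


def voltage_to_pwm(voltage_v, ramp_slope_v_per_s):
    """Ideal ramp comparison: the ramp reaches V after V / slope seconds."""
    return voltage_v / ramp_slope_v_per_s


# Hypothetical values: 1 uA source, 1 pF capacitor, 100 ns pulse.
v = pwm_to_voltage(100e-9, 1e-6, 1e-12)      # about 0.1 V
# Hypothetical ramp of 1 V per microsecond.
t = voltage_to_pwm(v, 1e6)                   # about 100 ns
```

With these values the round trip returns the original pulse width, which is the property that lets a PWM signal be carried through an analog stage and recovered afterwards.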

In this embodiment, the memory circuit block (MEM) 10 and the kernel data memory circuit block (KMEM) 11 are SRAMs. The REG 12 and 14, D/P 13 and 15, ACC 19, P/D 18, MUX 20, and LUT 21 are digital circuits and may have any circuit configuration that provides the functions described above, so their detailed description is omitted.

(Convolution operation)
The execution flow of the convolution operation performed in the arithmetic circuit block 16 is described with reference to FIG. 6. FIG. 6 is a diagram schematically showing the convolution operation performed by the arithmetic circuit block 16.

As described above, the maximum image size that one LSI chip operates on in this embodiment is 130 × 130 pixels (although, as described later, not all of the 130 × 130 pixels are held in the memory of that LSI). The maximum size of one side of the kernel is 51 pixels, equal to the number of switched current sources (SCS) 23 in one arithmetic circuit.

Under these conditions, the 130 pixels belonging to one row of the image data, or of the operation result for one feature of the previous layer, are input to the 80 arithmetic circuits 25 simultaneously. In other words, each of the 80 arithmetic circuits 25 obtains, from the REG 14 via the D/P 15, the 51 values it needs out of the 130 pixels in one row of the two-dimensional data. Each arithmetic circuit 25 then performs, in parallel with the others, the convolution of its input data with one row of kernel weight data. This parallel operation is repeated 51 times to cover the 51 rows of the kernel, and repeated twice more so that the positive and negative kernel values are processed separately. Therefore, to determine the results for the 80 memory cells in one row of the memory circuit block 10, the arithmetic circuit block 16 executes the parallel operation 51 × 2 times. To determine the results for all 80 rows, these 51 × 2 operations are executed 80 times.
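The operation counts in this schedule, and the idea of handling positive and negative kernel values in separate passes, can be sketched as below. The text only says that the operation is split by sign; combining the two passes by subtraction, as `signed_row_pass` does, is an assumption made for this illustration, and all names are hypothetical.

```python
def row_pass(pixels, weights, n_out):
    """One parallel operation: every output k sums pixels[k + j] * weights[j]."""
    return [sum(pixels[k + j] * weights[j] for j in range(len(weights)))
            for k in range(n_out)]


def signed_row_pass(pixels, weights, n_out):
    """Two passes with non-negative weights, combined by subtraction (assumed)."""
    positive = [max(w, 0) for w in weights]
    negative = [max(-w, 0) for w in weights]
    pos_out = row_pass(pixels, positive, n_out)
    neg_out = row_pass(pixels, negative, n_out)
    return [p - n for p, n in zip(pos_out, neg_out)]


def parallel_operation_counts(kernel_rows, output_rows, sign_passes=2):
    """Parallel operations per output row, and for the whole output."""
    per_row = kernel_rows * sign_passes
    return per_row, per_row * output_rows


per_row, total = parallel_operation_counts(51, 80)   # 102 and 8160
```

Splitting by sign lets each pass use only non-negative weights, at the cost of doubling the number of parallel operations.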

(Parallel operation)
As described above, obtaining 80 × 80 data by a convolution with a 51 × 51 kernel requires 130 × 130 two-dimensional input data. However, the size of the memory circuit block (MEM) 10 is 80 × 80. Consequently, as shown in FIG. 6, of the 130 pixels input to the arithmetic circuits 25, the data in the portion beyond the 80 × 80 pixel size (the hatched part in the figure) is not held in the LSI chip that performs the operation. In the configuration according to this embodiment, therefore, LSI chips are connected in parallel, each chip obtains the pixel data it needs from its adjacent LSI chips, and the convolution of the two-dimensional data is performed using that pixel data. Parallel operation with a plurality of LSI chips is described below, taking as an example the case where four of the above LSI chips are connected in a row and operate on an image of 320 × 240 pixels.

FIG. 7 is a diagram schematically showing how four LSI chips 1 to 4 perform a convolution with a kernel on 320 × 240 two-dimensional data. In FIG. 7, 711 to 714 denote the regions operated on by the four LSI chips 1 to 4 connected in a row. Each of the LSI chips 1 to 4 performs the convolution on the data at the corresponding position in its own region 711 to 714, in synchronization with the operations of the other LSI chips.

Reference numerals 701 to 704 denote the kernels of the LSI chips 1 to 4, that is, the ranges of two-dimensional data that the LSI chips 1 to 4 need in order to perform their operations. As shown in FIG. 7, when the image to be processed is larger than one memory circuit block 10 (80 × 80), the kernel extends into the region of image data or operation results held in the memory circuit block 10 of an adjacent LSI chip. In FIG. 7, the protruding portions are hatched.

In the configuration according to this embodiment, the LSI chips 1 to 4 are connected to their neighbors so that operations can use the image data in the protruding portions. Adjacent LSI chips 1 to 4 exchange the operation target data that they do not hold in their own built-in memory (the data in the portion where the kernel protrudes) and thereby complement one another.

Next, with reference to FIG. 8, the exchange of data between adjacent chips is described for the case where, during the processing of each LSI chip 1 to 4, the kernel extends into the data region held by an adjacent LSI chip. FIG. 8 schematically shows the memory read order of the input rows and the flow of data when the data to be operated on is read from the image data or previous-layer results held in the memory circuit block 10 of each chip and passed to the registers of adjacent chips. FIG. 8 shows only the 80 × 80 region being processed in the memory circuit block 10 of each LSI chip 1 to 4. The position being calculated in the subsequent layer is overlaid as the target row.

FIG. 8 shows that, to handle the case where the kernel protrudes, the input-row data in the protruding portion (the black part in the figure) is taken in from the adjacent LSI chip. For example, to perform the operation at position 801, the LSI chip 1 takes in data 811 from the LSI chip 2. The memory read order and the flow of data into the registers in this case are described in detail below.

In this embodiment, data is read one item at a time starting from the left side of each memory circuit block (the input rows in the figure). Therefore, as shown in FIG. 8(a), the data held on the left side of the memory circuit is first passed to the LSI chip adjacent on the left. For example, data 811 is passed to the LSI chip 1. In other words, the exchange in (a) corresponds to the case where the kernel of each LSI chip protrudes on its right side. The data passed across is used to calculate the target row of the subsequent layer. Although omitted from FIG. 8, the same data is also input to the register 14 of the LSI chip whose memory circuit block 10 holds it.

As the read position in the memory circuit block 10 then moves to the right, the kernel no longer protrudes, and only data from the memory circuit block 10 built into each LSI chip is input to the register 14.

Then, as the read position in the memory circuit moves further to the right, the kernel of each LSI chip comes to protrude on its left side. Therefore, as shown in FIG. 8(b), the data held on the right side of the memory circuit is transferred to the LSI chip adjacent on the right. For example, data 812 is transferred to LSI chip 2. The transferred data is used for the calculation of the target row in the subsequent layer. Although omitted from FIG. 8, the transferred data is at the same time also input to the register 14 of the LSI chip that contains the memory circuit block 10 holding that data.

As described above, in the configuration according to the present embodiment, the data required for the calculation in each LSI chip (130 items) is correctly input to and held in the register (REG) 14 of each of the four LSI chips. The configuration can therefore operate on two-dimensional data larger than what the memory of a single LSI chip can store. Furthermore, because the memory read position is common to all four LSI chips 1 to 4, read control is simple. The kernel weight data used in the calculation is also common to LSI chips 1 to 4, which likewise simplifies control.
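
The way each register ends up holding 130 items can be illustrated with a short Python sketch. It is only a software model under stated assumptions, not the circuit itself: the kernel is assumed to be centered, so that 25 items are borrowed from each side, and positions with no neighboring chip are assumed to be filled with zeros. The function name `fill_registers` is hypothetical.

```python
CHIP_WIDTH = 80      # memory cells per row in one chip (from the text)
KERNEL = 51          # kernel width (from the text)
HALO = KERNEL // 2   # assumption: centered kernel, 25 borrowed items per side


def fill_registers(row, n_chips=4):
    """Return, for each chip, the 130 values its register holds for one input row."""
    assert len(row) == CHIP_WIDTH * n_chips
    # Assumption: where no neighboring chip exists, the missing items are zeros.
    padded = [0] * HALO + list(row) + [0] * HALO
    registers = []
    for chip in range(n_chips):
        start = chip * CHIP_WIDTH  # padded index of (own first item - HALO)
        registers.append(padded[start:start + CHIP_WIDTH + 2 * HALO])
    return registers


# One 320-item input row spread over four chips.
row = list(range(1, 321))
regs = fill_registers(row)
```

Every chip uses the same slice length and the same relative read position, which mirrors the statement that the read position is common to all four chips.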

(Data input/output)
Next, the method of inputting and outputting data to and from adjacent LSI chips is described with reference again to FIG. 4. As shown in FIG. 4, data exchange between adjacent LSI chips involves two cases: data is input from the left LSI chip to the right LSI chip, and data is input from the right chip to the left chip. In the present embodiment, tri-state buffers 26 are therefore used to switch the direction of data input/output between left and right. By controlling the two tri-state buffers 26 with control signals of opposite phase, the data output direction can be switched between the left side and the right side.
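
The opposite-phase control of the two tri-state buffers can be modeled in a few lines of Python. This is a behavioral sketch only; the names `tristate` and `drive_outputs` are hypothetical, and the high-impedance state is represented by `None`.

```python
def tristate(value, enable):
    """Model a tri-state buffer: drive the value when enabled, else high impedance."""
    return value if enable else None


def drive_outputs(data, send_right):
    """Two buffers driven by opposite-phase control: exactly one side is driven."""
    right = tristate(data, send_right)
    left = tristate(data, not send_right)
    return left, right
```

Because the second enable is always the inverse of the first, the two outputs can never be driven at the same time.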

In addition, by using tri-state buffers 27 as shown in the block diagram of FIG. 9, the wiring for inputting data to and outputting data from the LSI chip can be shared. FIG. 9 illustrates an example configuration in which the wiring for data input from and output to adjacent LSI chips is shared.

In this case, when the output from the left LSI chip is taken in, the wiring is switched so that data is output to the right LSI chip; when the output from the right LSI chip is taken in, the wiring is switched so that data is output to the left LSI chip. With this method, the wiring between adjacent LSI chips consists of only one set of 6-bit wiring 28, as shown in FIG. 10, so the number of wires can be reduced.

Note that other wiring structures, buffer circuits, and similar configurations may be used, as long as data can be input and output between adjacent LSI chips as described above. The input/output wiring may be shared or distributed either inside the chip or outside the chip. FIG. 10 shows the case where the wiring is shared inside the chip.

As described above, the present embodiment discloses an LSI chip that performs a convolution operation between two-dimensional target data and a two-dimensional kernel. This LSI chip includes the MEM 10 (target data memory) that holds the target data, the KMEM 11 (kernel memory) that holds the kernel, the PWM-MAC 16 (arithmetic circuit), and input/output wiring for inputting and outputting the target data to and from external LSI chips. Here, the PWM-MAC 16 performs the convolution operation based on the target data and the kernel. Any target data that the arithmetic circuit needs for its calculation but that is not present in the MEM 10 of its own LSI chip is input, via the input/output wiring, from the MEM 10 of an external LSI chip.

Therefore, the configuration according to the present embodiment can efficiently perform arithmetic processing on two-dimensional data that exceeds the memory size of the memory circuit block built into an LSI chip. When arithmetic processing is performed on large two-dimensional data (such as image data), it can be executed without excessively increasing the number of wires. Moreover, constructing the arithmetic processing system by connecting a plurality of identical LSI chips makes it possible to reduce the area of each LSI chip and thus raise the manufacturing yield. Furthermore, by changing the number of connected LSI chips, convolution operations can be performed on target data of various sizes without changing the circuit configuration of the LSI chip.

In addition, because the arithmetic processing in the arithmetic circuit is executed in parallel, two-dimensional data can be processed efficiently. In the configuration according to the present embodiment, the LSI chips are arranged in a single row, so convolution operations can be performed on two-dimensional data of various sizes with a small circuit scale.

Note that the numbers of elements, image sizes, and the like given in the description of the circuit blocks and their components are examples for explaining the present embodiment, and the invention is clearly not limited to them.

<< Second Embodiment >>
In the configuration according to the first embodiment, the arithmetic circuit block 16 was implemented with analog circuits. The present embodiment describes the case where the arithmetic circuit block is implemented with digital circuits. Apart from the changes that accompany this move to digital circuits, the configuration according to the present embodiment is identical to that of the first embodiment. Therefore, only the parts that differ from the first embodiment are described here; the description of the remaining parts, which are the same as in the first embodiment, is omitted.

(Circuit configuration of LSI chip)
FIG. 11 is a block diagram schematically showing the circuit configuration of an LSI chip that constitutes the arithmetic processing system of the present embodiment. As shown in FIG. 11, the LSI chip of the present embodiment has the following circuit blocks.
・Memory circuit block (MEM) 10.
・Kernel data memory circuit block (KMEM) 11.
・Registers (REG) 12, 14.
・Digital arithmetic circuit block (D-MAC) 29.
・Accumulation circuit block (ACC) 19.
・Multiplexer (MUX) 20.
・Look-up table (LUT) 21.
Wiring is omitted from FIG. 11.

(Process flow)
Next, the flow of processing executed by the above circuit blocks is described with reference to FIG. 12. FIG. 12 is a block diagram schematically showing the flow of data between the circuit blocks.

The kernel weight data held in the kernel data memory circuit block (KMEM) 11 is input to the register (REG) 12, converted into parallel data, and then input to the arithmetic circuit block (D-MAC) 29. The image data or the calculation result of the preceding layer is input to the arithmetic circuit block (D-MAC) 29 via the register (REG) 14. At this time, the image data or preceding-layer result output from the memory circuit block (MEM) 10 is simultaneously output to the adjacent LSI chips as well. As shown in FIG. 12, the register (REG) 14 also receives the image data or preceding-layer results output from the adjacent LSI chips. The register (REG) 14 converts the serial data output by the memory circuit block (MEM) 10 into parallel data.
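
The serial-to-parallel conversion performed by the registers can be pictured as a shift register. The following Python class is a minimal behavioral sketch; the class name `ShiftRegister` and its methods are hypothetical, and the register is assumed to start out filled with zeros.

```python
from collections import deque


class ShiftRegister:
    """Serial-in, parallel-out register of fixed width."""

    def __init__(self, width):
        # Assumption: the register initially holds zeros.
        self.cells = deque([0] * width, maxlen=width)

    def shift_in(self, value):
        """Shift one serial item in; the oldest item falls off the other end."""
        self.cells.append(value)

    def parallel_out(self):
        """Read all cells at once as parallel data."""
        return list(self.cells)


reg = ShiftRegister(4)
for item in [1, 2, 3, 4, 5]:
    reg.shift_in(item)
```

In the embodiment the register would be 130 items wide; a width of 4 is used here only to keep the example small.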

The arithmetic circuit block (D-MAC) 29 is composed of digital multiplier circuits and performs the convolution operation in parallel on the input kernel weight data and the image data or preceding-layer results. The outputs of the arithmetic circuit block (D-MAC) 29 are accumulated by the accumulation circuit block (ACC) 19. The accumulated result passes through the multiplexer (MUX) 20, is nonlinearly converted by the look-up table (LUT) 21, and is stored back in the memory circuit block (MEM) 10.
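
The text does not say which nonlinear function the look-up table implements or how the accumulated value is scaled. Purely as an illustration, the sketch below assumes 6-bit codes, a sigmoid-shaped table, and clamping of the accumulated value to the table range. All names and the `gain` parameter are hypothetical.

```python
import math

BITS = 6             # assumption: 6-bit data, matching the inter-chip data width
LEVELS = 1 << BITS   # 64 table entries


def build_lut(gain=0.15):
    """Build a hypothetical sigmoid-shaped table mapping 6-bit codes to 6-bit codes."""
    mid = (LEVELS - 1) / 2
    table = []
    for code in range(LEVELS):
        y = 1.0 / (1.0 + math.exp(-gain * (code - mid)))
        table.append(round(y * (LEVELS - 1)))
    return table


def apply_lut(table, accumulated):
    """Clamp the accumulated value to the table range (an assumption), then look it up."""
    index = max(0, min(LEVELS - 1, accumulated))
    return table[index]


table = build_lut()
```

A table lookup replaces the nonlinear calculation with a single memory read, which is the usual reason for using a LUT at this stage.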

(Arithmetic circuit block)
Next, among the circuit blocks constituting the LSI chip described above, the detailed circuit configuration of the arithmetic circuit block (D-MAC) 29 is described with reference to FIG. 13. FIG. 13 shows the circuit configuration of the arithmetic circuit block 29.

In the present embodiment, each arithmetic circuit 31 in the arithmetic circuit block (D-MAC) 29 has 51 digital multiplier circuits (MUL) 30, and each multiplier circuit 30 is connected to the accumulation circuit block (ACC) 19 in the following stage. For reasons of space, however, only five MULs 30 are shown in FIG. 5. The arithmetic circuit block (D-MAC) 29 is composed of 80 arithmetic circuits 31, which matches the number of memory cells in one column of one memory circuit block (for reasons of space, only three arithmetic circuits are shown in FIG. 5). As shown in FIG. 13, the arithmetic circuits 31 in the arithmetic circuit block 29 share the digital signals DI_i, which correspond to the input two-dimensional data, and the digital voltages DW_j, which correspond to the kernel weight data, where i = 1, ..., 130 and j = 1, ..., 51. The input digital signals DI_i are input from the register (REG) 14, and the digital voltages DW_j are input from the register (REG) 12.

In the present embodiment as well, the description uses as an example a configuration in which a single LSI performs a convolution operation with a kernel of size 51 × 51 and obtains an output of size 80 × 80. As in the first embodiment, obtaining an 80 × 80 output from a convolution with a 51 × 51 kernel requires two-dimensional input data of size 130 × 130 for each LSI. The memory circuit block (MEM) 10 is composed of 80 × 80 memory cells.
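
The relationship between these three sizes can be checked with a one-line calculation. The sketch assumes a convolution with unit stride in which the kernel must lie entirely within the input; the function name `input_size` is hypothetical.

```python
def input_size(output_size, kernel_size):
    """Input extent needed per axis so that every output position sees a full kernel."""
    return output_size + kernel_size - 1


required = input_size(80, 51)        # 80 + 51 - 1
borrowed_per_axis = required - 80    # items that must come from neighboring chips
```

With the figures from the text this gives 130 items per axis, of which 80 come from the chip's own memory and the remaining 50 from adjacent chips.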

In this case, the arithmetic circuit 29 and the accumulation circuit 19 connected in the following stage operate according to the following procedure.
(1) The digital signals DI and DW are input to the multiplier circuits (MUL) 30.
(2) Each multiplier circuit (MUL) 30 outputs the product of DI and DW, which is input to the accumulation circuit block (ACC) 19.
(3) The accumulation circuit block (ACC) 19 accumulates the inputs from up to 51 multiplier circuits (MUL) 30.
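
Steps (1) to (3) amount to a multiply-accumulate operation. The Python sketch below models them for one kernel row. How the 130 shared inputs are distributed over the 80 arithmetic circuits is not spelled out in this passage, so the sliding-window mapping in `row_outputs` is an assumption, and both function names are hypothetical.

```python
N_MUL = 51        # multiplier circuits per arithmetic circuit (from the text)
N_CIRCUITS = 80   # arithmetic circuits per block (from the text)


def mac(di_window, dw):
    """Steps (1)-(3): multiply each DI/DW pair, then accumulate the products."""
    assert len(di_window) == len(dw) <= N_MUL
    total = 0
    for di, dw_j in zip(di_window, dw):
        total += di * dw_j   # MUL output fed into ACC
    return total


def row_outputs(di, dw):
    """Assumed mapping: circuit k sees DI[k : k + len(dw)]; all circuits share DW."""
    n_out = len(di) - len(dw) + 1
    return [mac(di[k:k + len(dw)], dw) for k in range(n_out)]


out = row_outputs(list(range(130)), [1] * 51)
```

With 130 inputs and 51 weights the sketch produces 80 results, one per arithmetic circuit, which is consistent with the block sizes given above.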

As described above, the configuration of the present embodiment makes it possible to carry out digitally the operations performed by the arithmetic circuit of the first embodiment. Apart from the changes that accompany the move to a digital arithmetic circuit block, the arithmetic processing system of the present embodiment is identical to the configuration of the first embodiment. Accordingly, the method of connecting the LSI chips in a single row and inputting and outputting the target data is the same as in the first embodiment, and its description is omitted. As in the first embodiment, the data input/output with adjacent LSI chips may use shared input and output wiring. Other circuit configurations may also be used as long as they provide the same function.

The above description gave an example in which the accumulation circuit accumulates the 51 multiplication results in parallel, but the configuration is not limited to this. For example, as shown in FIG. 14, the 51 multiplier circuits may be connected to the accumulation circuit block by a bus, and the 51 multiplication results may be accumulated serially by pipeline processing. The arithmetic circuit block may take any other configuration as well, provided it is a digital circuit that can perform the same operation.
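
The two accumulation styles give the same result and differ only in how many steps they take. The sketch below makes that concrete; it ignores pipeline latency and bus arbitration, and the function names are hypothetical.

```python
def parallel_accumulate(products):
    """All products arrive at once and are summed in a single step."""
    return sum(products)


def serial_accumulate(products):
    """One product is taken from the bus per cycle; returns (total, cycles used)."""
    total = 0
    cycles = 0
    for p in products:
        total += p
        cycles += 1
    return total, cycles


products = [2 * i for i in range(51)]
total, cycles = serial_accumulate(products)
```

The serial variant needs one cycle per product but only one adder input, which is the trade-off the bus-based configuration makes.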

<< Third Embodiment >>
In the configurations according to the first and second embodiments, the LSI chips were connected in a single row. The present embodiment describes a configuration that can perform arithmetic processing at even higher speed by arranging and connecting a plurality of LSI chips in a plane.

(Arithmetic processing system)
FIG. 15 schematically shows the arithmetic processing system according to the present embodiment. As shown in FIG. 15, the arithmetic processing system is composed of 12 LSI chips 01 to 12 arranged in 3 rows and 4 columns. The number of LSI chips can be changed according to the size of the data to be processed; the present embodiment shows the case of 12 chips. Each LSI chip is connected to its adjacent LSI chips by wiring 5 for data input and output. In the present embodiment the data input and output between chips is 6 bits wide, but other data sizes may be used, in which case the number of wires changes according to the data size. Wiring other than the input and output wiring described above is not shown in FIG. 15 because it is not needed for the description of the present embodiment.

As in the first embodiment, the arithmetic processing performed by the arithmetic processing system of the present embodiment inputs image data to the first layer and hierarchically repeats convolution operations with kernels having predetermined two-dimensionally distributed weights. The details of the arithmetic processing are therefore the same as in the first embodiment, and their description is omitted.

(Circuit configuration of LSI chip)
Next, the circuit configuration of the LSI chips 01 to 12 constituting the above arithmetic processing system is described with reference to FIG. 16. FIG. 16 is a block diagram schematically showing the circuit configuration of the LSI chip.

As shown in FIG. 16, the LSI chip of the present embodiment has the following circuit blocks.
・Memory circuit block (MEM) 10.
・Kernel data memory circuit block (KMEM) 11.
・Registers (REG) 12, 14.
・Digital-PWM conversion circuit blocks (D/P) 13, 15.
・PWM-analog conversion circuit (P/A) 17.
・Arithmetic circuit block (PWM-MAC) 16.
・Accumulation circuit block (ACC) 19.
・PWM-digital conversion circuit block (P/D) 18.
・Multiplexer (MUX) 20.
・Look-up table (LUT) 21.
・Selector (SEL) 32.
The configuration of each of the above circuit blocks and the flow of processing are the same as in the first embodiment, except that, as described later, in the present embodiment data is input from and output to adjacent chips via the selector 32. Their description is therefore omitted. The execution flow of the convolution operation by the arithmetic circuit block of the LSI chip is also the same as in the first embodiment, and its description is likewise omitted.

(Parallel operation)
FIG. 17 schematically shows how a convolution operation with a kernel is performed on 320 × 240 two-dimensional data using the 12 LSI chips 01 to 12. As shown in FIGS. 15 and 17, in the present embodiment the row of four LSI chips used in the first embodiment is replicated in three rows. As a result, the vertical calculation steps that the first embodiment executes sequentially can be completed in one third of the steps. Each of the LSI chips 01 to 12 performs the convolution operation on the data at the corresponding position in its own calculation region, in synchronization with the operations of the other LSI chips.
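
The division of the 320 × 240 data among the 12 chips can be sketched as a simple tile lookup. The sketch assumes that each chip holds one 80 × 80 tile and that the chips are numbered row by row from the top left; this numbering is inferred from the chip numbers used in the examples of this section and is not stated explicitly. The function name `chip_for_pixel` is hypothetical.

```python
TILE = 80   # each chip holds an 80 x 80 region (from the text)
COLS = 4    # chips per row (from the text)
ROWS = 3    # rows of chips (from the text)


def chip_for_pixel(x, y):
    """Return the chip number (1..12) holding pixel (x, y), assuming row-major numbering."""
    if not (0 <= x < TILE * COLS and 0 <= y < TILE * ROWS):
        raise ValueError("pixel outside the 320 x 240 data")
    return (y // TILE) * COLS + (x // TILE) + 1
```

Because the three rows of chips work on their tiles at the same time, each chip only has to step through the 80 rows of its own tile.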

In the present embodiment, as shown in FIG. 17, the arithmetic processing system consists of LSI chips arranged in 3 rows and 4 columns for image data of 320 × 240 pixels. This gives rise to cases in which the kernel overlaps the pixel data regions that four LSI chips hold in their memory circuits. In the configuration according to the present embodiment, adjacent LSI chips therefore supply each other with the target data that a chip does not hold in its built-in memory (the data for the portion where the kernel protrudes). In the present embodiment, adjacent LSI chips include pairs of chips that are diagonally adjacent.

The following describes in detail how data is input and output between adjacent chips when, in the arithmetic processing of each LSI chip, the kernel protrudes into the data region held by an adjacent LSI chip. In the present embodiment, the kernel can protrude beyond the memory circuit of a chip in the following three ways.
(1) The kernel protrudes to the left or right.
(2) The kernel protrudes to the left or right, and to the bottom, lower left, or lower right.
(3) The kernel protrudes to the left or right, and to the top, upper left, or upper right.
Of these, in case (1) data is exchanged only between horizontally adjacent LSI chips, so the flow of data between the LSI chips is the same as in the first embodiment. Its description is therefore omitted.
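
Which neighboring regions the kernel reaches into can be worked out from the kernel position alone. The sketch below assumes a centered 51 × 51 kernel on an 80 × 80 tile, with x increasing to the right and y increasing downward; the function name `overlapped_neighbors` and the offset convention are hypothetical.

```python
TILE = 80        # tile size per chip (from the text)
HALF = 51 // 2   # assumption: centered kernel, 25 cells on each side


def overlapped_neighbors(cx, cy):
    """Return the set of (dx, dy) neighbor offsets that the kernel reaches into.

    (cx, cy) is the kernel center in tile-local coordinates.
    dx = -1 / +1 means left / right; dy = -1 / +1 means up / down.
    """
    dxs = {0}
    if cx - HALF < 0:
        dxs.add(-1)
    if cx + HALF > TILE - 1:
        dxs.add(1)
    dys = {0}
    if cy - HALF < 0:
        dys.add(-1)
    if cy + HALF > TILE - 1:
        dys.add(1)
    # Drop (0, 0), which is the chip's own tile.
    return {(dx, dy) for dx in dxs for dy in dys} - {(0, 0)}
```

Under these assumptions the kernel touches at most three neighboring tiles at once, so together with the chip's own tile it spans at most four chips.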

Cases (2) and (3) are described next with reference to FIGS. 18 and 19. In cases (2) and (3), the processing for the portion of the kernel that protrudes to the left or right is the same as in case (1): data is exchanged only between horizontally adjacent LSI chips, and the flow of data between the LSI chips is the same as in the first embodiment. A detailed description of that portion is therefore omitted in the present embodiment.

FIGS. 18 and 19 show the memory read order of the input row and the flow of data when the data to be processed is read from the image data or preceding-layer results held in the memory circuit block of each chip and is input to and output from the registers of adjacent chips. FIG. 18 relates to case (2) above, and FIG. 19 relates to case (3) above. FIGS. 18 and 19 show only the 80 × 80 region that is the target of the calculation within the memory circuit block of each LSI chip. The position being calculated in the subsequent layer is overlaid on the figures as the target row.

In case (2), data is read out one item at a time from the left side of each memory block (the input row in FIG. 18). Therefore, as shown in FIG. 18(a), the data held on the upper left side of the memory circuit is first transferred to the LSI chips adjacent on the upper left and above. For example, data 1811 is transferred to LSI chip 01 for the calculation at position 1801, and data 1811 is transferred to LSI chip 02 for the calculation at position 1802. The transferred data is used for the calculation of the target row in the subsequent layer. That is, the data exchange in (a) corresponds to the case where the kernel of each LSI chip protrudes on its lower right side and lower side. Although omitted from FIG. 18, the transferred data is at the same time also input to the register (REG) 14 of the LSI chip that contains the memory circuit block (MEM) 10 holding that data.

Next, as shown in FIG. 18(b), as the read position in the memory circuit moves to the right (the input row in FIG. 18), the data held on the upper center of the memory circuit is transferred to the LSI chip adjacent above. For example, data 1812 is transferred to LSI chip 02 for the calculation at position 1803. The transferred data is used for the calculation of the target row in the subsequent layer. That is, the data exchange in (b) corresponds to the case where the kernel of each LSI chip protrudes on its lower side. Although omitted from FIG. 18, the transferred data is at the same time also input to the register (REG) 14 of the LSI chip that contains the memory circuit block (MEM) 10 holding that data.

Then, as the read position in the memory circuit moves further to the right (the input row in FIG. 18), the kernel of each LSI chip comes to protrude on its lower left side and lower side. Therefore, as shown in FIG. 18(c), the data held on the upper right side of the memory circuit is transferred to the LSI chips adjacent on the upper right and above. For example, data 1813 is transferred to LSI chip 01 for the calculation at position 1804, and data 1813 is transferred to LSI chip 02 for the calculation at position 1805. The transferred data is used for the calculation of the target row in the subsequent layer. Although omitted from FIG. 18, the transferred data is at the same time also input to the register (REG) 14 of the LSI chip that contains the memory circuit block (MEM) 10 holding that data.

Case (3) is described next with reference to FIG. 19. In case (3), data is read out one item at a time from the left side of each memory block (the input row in FIG. 19). Therefore, as shown in FIG. 19(a), the data held on the lower left side of the memory circuit is first transferred to the LSI chips adjacent on the lower left and below. For example, data 1911 is transferred to LSI chip 05 for the calculation at position 1901, and data 1911 is transferred to LSI chip 06 for the calculation at position 1902. The transferred data is used for the calculation of the target row in the subsequent layer. That is, the data exchange in (a) corresponds to the case where the kernel of each LSI chip protrudes on its upper right side and upper side. Although omitted from FIG. 19, the transferred data is at the same time also input to the register (REG) 14 of the LSI chip that contains the memory circuit block (MEM) 10 holding that data.

Next, as shown in FIG. 19(b), as the read position in the memory circuit moves to the right (the input row in the figure), the data held on the upper center of the memory circuit is transferred to the LSI chip adjacent below. For example, data 1912 is transferred to LSI chip 06 for the calculation at position 1903. The transferred data is used for the calculation of the target row in the subsequent layer. That is, the data exchange in (b) corresponds to the case where the kernel of each LSI chip protrudes on its upper side. Although omitted from FIG. 19, the transferred data is at the same time also input to the register (REG) 14 of the LSI chip that contains the memory circuit block (MEM) 10 holding that data.

Then, as the read position in the memory circuit moves further to the right (the input row in the figure), the kernel of each LSI chip comes to protrude on its upper left side and upper side. Therefore, as shown in FIG. 19(c), the data held on the lower right side of the memory circuit is transferred to the LSI chips adjacent on the lower right and below. For example, data 1913 is transferred to LSI chip 05 for the calculation at position 1904, and data 1913 is transferred to LSI chip 06 for the calculation at position 1905. The transferred data is used for the calculation of the target row in the subsequent layer. Although omitted from FIG. 19, the transferred data is at the same time also input to the register (REG) 14 of the LSI chip that contains the memory circuit block (MEM) 10 holding that data.

The above description concerns the case in which an LSI chip has eight adjacent LSI chips. When an LSI chip has fewer than eight adjacent chips, some of the data exchanges described above may not occur. In that case, however, the corresponding input/output wiring simply has no adjacent LSI chip connected to it, and no special circuit configuration or control is required.
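The exchange pattern described above can be sketched in a few lines of Python. The sketch is illustrative only: it assumes an odd-sized square kernel centered on the output position, and the names `neighbors_needing`, `connected`, `tile_h`, `tile_w` and `radius` are hypothetical and do not come from the specification. Given the position of a datum in a chip's local memory, it returns the directions of the adjacent chips whose kernels overhang onto that datum. Directions with no chip attached are simply dropped, mirroring the remark that no special control is needed.

```python
def neighbors_needing(row, col, tile_h, tile_w, radius):
    """Directions (d_row, d_col) of adjacent chips whose kernel reaches this datum.

    Assumption: the kernel is (2 * radius + 1) square and centered on the output
    position, with radius no larger than the tile. (1, 0) is the chip below,
    (0, 1) is the chip to the right.
    """
    d_rows = [0]
    if row < radius:
        d_rows.append(-1)          # kernel of the chip above overhangs downward
    if row >= tile_h - radius:
        d_rows.append(1)           # kernel of the chip below overhangs upward
    d_cols = [0]
    if col < radius:
        d_cols.append(-1)          # kernel of the chip on the left overhangs rightward
    if col >= tile_w - radius:
        d_cols.append(1)           # kernel of the chip on the right overhangs leftward
    return {(dr, dc) for dr in d_rows for dc in d_cols} - {(0, 0)}


def connected(chip_row, chip_col, grid_rows, grid_cols, directions):
    """Keep only the directions in which a chip is actually attached."""
    return {
        (dr, dc)
        for dr, dc in directions
        if 0 <= chip_row + dr < grid_rows and 0 <= chip_col + dc < grid_cols
    }
```

Under these assumptions, with an 8 x 8 tile and a 3 x 3 kernel (`radius=1`), a datum in the bottom-right corner of the tile is needed by the lower, right and lower-right neighbors, whereas a datum in the middle of the tile is needed by none.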

In this way, the 130 pieces of data to be operated on are correctly input to and held in the shift register of each of the 12 LSI chips. Because the memory read position is common to all 12 LSI chips, read control is simple. Control is simplified further because the kernel weight data used in the operation is also common to all the LSI chips.

FIG. 20 is a block diagram to which the flow of data to and from adjacent LSI chips has been added. As shown in FIG. 20, each LSI chip is connected to up to eight adjacent LSI chips via the selector 32. That is, when data is exchanged between adjacent chips as described above, switching the selector outputs the data to the appropriate adjacent chip and accepts input from the appropriate adjacent chip. Other wiring structures and selector configurations may be used, provided that data can be exchanged between adjacent LSI chips as described above.
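As a rough behavioral model of the selector, the following Python sketch treats each chip as a shift register plus a table of attached neighbors. It is a simplification under stated assumptions and not the circuit of FIG. 20: the class names `ShiftRegister` and `Chip`, the methods `attach` and `read_out`, and the direction strings are all hypothetical. A value read from memory enters the chip's own register and, through the selected directions, the registers of the adjacent chips in the same step.

```python
from collections import deque


class ShiftRegister:
    """Serial-in, parallel-out register of fixed length."""

    def __init__(self, length):
        self.cells = deque([0] * length, maxlen=length)

    def shift_in(self, value):
        self.cells.append(value)       # the oldest value falls off the other end

    def parallel_out(self):
        return list(self.cells)


class Chip:
    """One LSI chip: a register plus wiring to whichever neighbors are attached."""

    def __init__(self, reg_length):
        self.reg = ShiftRegister(reg_length)
        self.neighbors = {}            # direction name -> Chip

    def attach(self, direction, chip):
        self.neighbors[direction] = chip

    def read_out(self, value, select=()):
        """Model one memory read with the selector set to `select`."""
        self.reg.shift_in(value)       # the chip's own register always receives the data
        for direction in select:
            neighbor = self.neighbors.get(direction)
            if neighbor is not None:   # unconnected wiring: nothing happens
                neighbor.reg.shift_in(value)
```

For example, if chip `a` has only a `"down"` neighbor `b` attached, `a.read_out(5, select=("down", "right"))` places 5 in the registers of both `a` and `b`, and the unconnected `"right"` direction is ignored.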

As described above, connecting LSI chips in a matrix and operating them in parallel, as in this embodiment, makes it possible to process large images at higher speed. In addition, by changing the number of connected LSI chips, convolution operations can be executed on target data of various sizes without changing the circuit configuration of the LSI chip.
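The claim that the same chip can serve different data sizes can be checked with a small software model. The Python sketch below is a minimal illustration under assumptions that are not taken from the specification: the image divides evenly over the chip grid, the kernel is odd-sized, centered, and no larger in radius than a tile, and positions with no chip attached read as zero. All function names are hypothetical. Each "chip" reads only its own tile and the tiles of its directly adjacent chips, and the assembled result is compared with an ordinary whole-image convolution.

```python
def conv_same(img, kernel):
    """Reference: zero-padded 2-D correlation over the whole image."""
    h, w, k = len(img), len(img[0]), len(kernel)
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - r, x + j - r
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * img[yy][xx]
            out[y][x] = acc
    return out


def split_tiles(img, grid_rows, grid_cols):
    """Distribute the image over a grid of chip memories (equal-sized tiles)."""
    th, tw = len(img) // grid_rows, len(img[0]) // grid_cols
    tiles = {
        (cr, cc): [row[cc * tw:(cc + 1) * tw] for row in img[cr * th:(cr + 1) * th]]
        for cr in range(grid_rows)
        for cc in range(grid_cols)
    }
    return tiles, th, tw


def read(tiles, th, tw, cr, cc, y, x):
    """Read local position (y, x) of chip (cr, cc).

    Positions outside the tile are fetched from the adjacent chip,
    or read as 0 when no chip is attached on that side.
    """
    if y < 0:
        cr, y = cr - 1, y + th
    elif y >= th:
        cr, y = cr + 1, y - th
    if x < 0:
        cc, x = cc - 1, x + tw
    elif x >= tw:
        cc, x = cc + 1, x - tw
    tile = tiles.get((cr, cc))
    return tile[y][x] if tile is not None else 0


def conv_tiled(img, kernel, grid_rows, grid_cols):
    """Every chip convolves its own tile, borrowing edge data from neighbors."""
    tiles, th, tw = split_tiles(img, grid_rows, grid_cols)
    k = len(kernel)
    r = k // 2
    out = [[0] * len(img[0]) for _ in range(len(img))]
    for cr, cc in tiles:
        for y in range(th):
            for x in range(tw):
                out[cr * th + y][cc * tw + x] = sum(
                    kernel[i][j] * read(tiles, th, tw, cr, cc, y + i - r, x + j - r)
                    for i in range(k)
                    for j in range(k)
                )
    return out
```

Changing `grid_rows` and `grid_cols` changes only how many tiles exist; the per-chip loop is identical for every grid, which is the software analogue of leaving the chip's circuit configuration unchanged.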

<< Fourth Embodiment >>
In the configuration according to the third embodiment, the arithmetic circuit block 16 was implemented as an analog circuit. This embodiment describes the case where the arithmetic circuit block is implemented as a digital circuit. Apart from the changes that follow from digitizing the arithmetic circuit block, the configuration of this embodiment is identical to that of the third embodiment. Accordingly, only the differences from the third embodiment are described here, and the description of everything else is omitted.

(Circuit configuration of the LSI chip)
FIG. 21 is a block diagram schematically showing the circuit configuration of an LSI chip constituting the arithmetic processing system of this embodiment. As shown in FIG. 21, the LSI chip of this embodiment has the following circuit blocks.
・Memory circuit block (MEM) 10.
・Kernel data memory circuit block (KMEM) 11.
・Registers (REG) 12, 14.
・Digital arithmetic circuit block (D-MAC) 29.
・Selector (SEL) 32.
・Accumulation circuit block (ACC) 19.
・Multiplexer (MUX) 20.
・Look-up table (LUT) 21.
Wiring is omitted from FIG. 21.

(Processing flow)
Next, the flow of processing executed by these circuit blocks is described with reference to FIG. 22. FIG. 22 is a block diagram schematically showing the flow of data between the circuit blocks.

The digital data representing the kernel weights held in the kernel data memory circuit block (KMEM) 11 is input to the register (REG) 12, converted into parallel data, and then input to the arithmetic circuit block (D-MAC) 29. The image data, or the operation results of the preceding layer, is input to the arithmetic circuit block (D-MAC) 29 via the register (REG) 14. At the same time, the image data or preceding-layer results output from the memory circuit block (MEM) 10 are also output, via the selector (SEL) 32, to the appropriate adjacent LSI chip. As shown in FIG. 22, the register (REG) 14 also receives image data or preceding-layer results output from adjacent LSI chips. The register (REG) 14 converts the serial data output by the memory circuit block (MEM) 10 into parallel data.

The arithmetic circuit block (D-MAC) 29 consists of digital multiplier circuits (MUL) 30 and performs the convolution operation in parallel on the input kernel weight data and the image data or preceding-layer results. The outputs of the arithmetic circuit are accumulated by the accumulation circuit block (ACC) 19. The accumulated result passes through the multiplexer (MUX) 20, undergoes nonlinear conversion by the look-up table (LUT) 21, and is stored back in the memory circuit block (MEM) 10.
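The multiply, accumulate and look-up sequence can be illustrated with a short Python sketch. It is a behavioral approximation under assumptions, not the circuit itself: this section does not specify the nonlinear function, the table depth, or the word widths, so the 256-entry table, the index offset of 128, the sigmoid-like curve, and all function names below are hypothetical choices made only for illustration.

```python
import math

LUT_SIZE = 256      # hypothetical table depth
LUT_OFFSET = 128    # hypothetical: table index = accumulated value + offset


def build_lut():
    """Tabulate a sigmoid-like curve scaled to 0..255 (illustrative choice)."""
    return [
        round(255 / (1 + math.exp(-(i - LUT_OFFSET) / 16)))
        for i in range(LUT_SIZE)
    ]


def d_mac(weights, window):
    """Parallel multipliers (MUL): one product per weight/data pair."""
    return [w * d for w, d in zip(weights, window)]


def accumulate(products, running=0):
    """Accumulation block (ACC): add the products to a running total."""
    return running + sum(products)


def lut_lookup(lut, value):
    """Nonlinear conversion: clamp the value into the table range and look it up."""
    index = min(max(value + LUT_OFFSET, 0), LUT_SIZE - 1)
    return lut[index]


def process_position(weights, window, lut):
    """One output value: multiply in parallel, accumulate, then convert."""
    return lut_lookup(lut, accumulate(d_mac(weights, window)))
```

In this model, the value returned by `process_position` is what would be written back to the memory circuit block as input for the next layer. The `running` argument of `accumulate` allows a kernel larger than one register's worth of data to be summed over several passes, which is one plausible reason for a separate accumulation stage.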

As described above, the configuration of this embodiment allows the operations executed by the arithmetic circuit of the third embodiment to be carried out digitally. Among the circuit blocks constituting the LSI chip, the detailed circuit configuration of the arithmetic circuit block (D-MAC) 29 is the same as that described in the second embodiment, so a detailed description is omitted. Apart from the changes that follow from digitizing the arithmetic circuit block, the arithmetic processing system of this embodiment is identical to that of the third embodiment. The method of connecting the LSI chips in a matrix and inputting and outputting the data to be operated on is therefore the same as in the third embodiment, and its description is omitted.

FIG. 1 is a diagram schematically showing an arithmetic processing system.
FIG. 2 is a diagram schematically showing processing that detects the position of a face using a hierarchical convolutional neural network.
FIG. 3 is a block diagram schematically showing the circuit configuration of an LSI chip.
FIG. 4 is a block diagram schematically showing the flow of data between circuit blocks.
FIG. 5 is a diagram showing the circuit configuration of an arithmetic circuit block.
FIG. 6 is a diagram schematically showing a convolution operation performed by the arithmetic circuit block.
FIG. 7 is a diagram schematically showing how four LSI chips are used to perform a convolution operation between two-dimensional data and a kernel.
FIG. 8 is a diagram schematically showing the memory read order and the flow of data when data to be operated on is read from the memory circuit blocks of adjacent LSI chips.
FIG. 9 is a diagram showing an example of a configuration in which the input/output wiring to adjacent LSI chips is shared.
FIG. 10 is a diagram showing an example of a configuration in which wiring is shared inside the chip.
FIG. 11 is a block diagram schematically showing the circuit configuration of an LSI chip.
FIG. 12 is a block diagram schematically showing the flow of data between circuit blocks.
FIG. 13 is a diagram showing the circuit configuration of an arithmetic circuit block.
FIG. 14 is a diagram showing the circuit configuration of an arithmetic circuit block.
FIG. 15 is a diagram schematically showing an arithmetic processing system.
FIG. 16 is a block diagram schematically showing the circuit configuration of an LSI chip.
FIG. 17 is a diagram schematically showing how 12 LSI chips are used to perform a convolution operation between two-dimensional data and a kernel.
FIG. 18 is a diagram schematically showing the memory read order and the flow of data when data to be operated on is read from the memory circuit blocks of adjacent LSI chips.
FIG. 19 is a diagram schematically showing the memory read order and the flow of data when data to be operated on is read from the memory circuit blocks of adjacent LSI chips.
FIG. 20 is a block diagram schematically showing the flow of data between circuit blocks.
FIG. 21 is a block diagram schematically showing the circuit configuration of an LSI chip constituting the arithmetic processing system.
FIG. 22 is a block diagram schematically showing the flow of data between circuit blocks.
FIG. 23 is a diagram schematically showing a configuration example of an arithmetic processing system formed by connecting a plurality of substrates on which arithmetic circuits are mounted.
FIG. 24 is a diagram schematically showing the flow of data in an LSI chip that performs a convolution operation on two-dimensional data.

Claims

An arithmetic processing system that connects a plurality of LSI chips and performs a convolution operation between two-dimensional target data and a two-dimensional kernel, wherein
each of the LSI chips comprises:
a target data memory that holds the target data;
a register that holds the target data required for the operation;
a kernel memory that holds the kernel;
an arithmetic circuit that performs convolution operation processing based on the kernel and the target data held in the register;
input/output wiring that is connected to an output line from the target data memory to the register and is used to input and output the target data to and from the adjacent LSI chips; and
a switch circuit that switches the connection to the adjacent LSI chips in the input/output wiring,
the target data is distributed among and held in the target data memories provided in the plurality of adjacent LSI chips, and
in each of the plurality of LSI chips, when the target data is output from the target data memory provided in that LSI chip to the register provided in that LSI chip, target data that is required for the operation in the arithmetic circuit of an adjacent LSI chip but is not present in the target data memory provided in that adjacent LSI chip is output to the register provided in the same LSI chip as the target data memory holding the data and, by switching the connection in the input/output wiring with the switch circuit, is simultaneously output via the input/output wiring to the register provided in the adjacent LSI chip.

The arithmetic processing system according to claim 1, wherein the LSI chips are arranged in one row.

The arithmetic processing system according to claim 1, wherein the LSI chips are arranged in a matrix format.