JP2022170512A

JP2022170512A - Neural network generation device, neural network operation device, edge device, method for controlling neural network, and software generation program

Info

Publication number: JP2022170512A
Application number: JP2021076688A
Authority: JP
Inventors: 拓之徳永; Hiroyuki Tokunaga
Original assignee: Leap Mind Inc
Current assignee: Leap Mind Inc
Priority date: 2021-04-28
Filing date: 2021-04-28
Publication date: 2022-11-10
Also published as: WO2022230906A1

Abstract

PROBLEM TO BE SOLVED: To generate and control a neural network that can be embedded in an embedded device such as an IoT device and can operate with high performance. SOLUTION: A neural network generation device generates a neural network execution model for computing a neural network. The neural network execution model converts input data into first quantized data quantized by a first quantization means and second quantized data quantized by a second quantization means different from the first quantization means. SELECTED DRAWING: Figure 14

Description

The present invention relates to a neural network generation device, a neural network operation device, an edge device, a neural network control method, and a software generation program.

In recent years, convolutional neural networks (CNNs) have been used as models for image recognition and similar tasks. A convolutional neural network has a multilayer structure including convolution layers and pooling layers, and requires a large number of operations such as convolution operations. Various methods have been devised to speed up the operations of a convolutional neural network (Patent Document 1, etc.).

Patent Document 1: JP 2018-077829 A

Meanwhile, image recognition and similar tasks using convolutional neural networks are also used in embedded devices such as IoT devices. To operate a convolutional neural network efficiently in an embedded device, it is desirable to generate circuits and models that perform neural network operations matched to the hardware configuration of the embedded device. A control method that operates these circuits and models with high efficiency and at high speed is also desired, as is a software generation program that generates software for operating these circuits and models with high efficiency and at high speed.

In view of the above circumstances, an object of the present invention is to provide: a neural network generation device that generates circuits and models that perform neural network operations, can be embedded in embedded devices such as IoT devices, and can operate with high efficiency and at high speed; a neural network operation device that performs neural network operations and can operate with high efficiency and at high speed; an edge device including the neural network operation device; a neural network control method for operating circuits and models that perform neural network operations with high efficiency and at high speed; and a software generation program that generates software for operating such circuits and models with high efficiency and at high speed.

In order to solve the above problems, the present invention proposes the following means.
A neural network generation device according to a first aspect of the present invention is a neural network generation device that generates a neural network execution model for computing a neural network, wherein the neural network execution model converts input data into first quantized data obtained by quantizing the input data with a first quantization means and second quantized data obtained by quantizing the input data with a second quantization means different from the first quantization means.

The neural network generation device, neural network operation device, edge device, neural network control method, and software generation program of the present invention can generate and control a neural network that can be embedded in embedded devices such as IoT devices and can operate with high performance.

FIG. 1 is a diagram showing a neural network generation device according to a first embodiment.
FIG. 2 is a diagram showing the inputs and outputs of the calculation unit of the neural network generation device.
FIG. 3 is a diagram showing an example of a convolutional neural network.
FIG. 4 is a diagram explaining the convolution operation performed by a convolution layer of the convolutional neural network.
FIG. 5 is a diagram showing an example of a neural network execution model.
FIG. 6 is a timing chart showing an operation example of the neural network execution model.
FIG. 7 is a control flowchart of the neural network generation device.
FIG. 8 is an internal block diagram of a generated convolution operation circuit.
FIG. 9 is an internal block diagram of the multiplier of the convolution operation circuit.
FIG. 10 is an internal block diagram of a sum-of-products operation unit of the multiplier.
FIG. 11 is an internal block diagram of the accumulator circuit of the convolution operation circuit.
FIG. 12 is an internal block diagram of an accumulator unit of the accumulator circuit.
FIG. 13 is a state transition diagram of the control circuit of the convolution operation circuit.
FIG. 14 is a block diagram of the input conversion unit of the convolution operation circuit.
FIG. 15 is a diagram showing an example of a first threshold group and a second threshold group.
FIG. 16 is a diagram showing data output from the input conversion unit.
FIG. 17 is a diagram showing another example of the first threshold group and the second threshold group.
FIG. 18 is a diagram showing another example of the first threshold group and the second threshold group.
FIG. 19 is a diagram showing another example of the first threshold group and the second threshold group.
FIG. 20 is a block diagram of a modification of the input conversion unit.
FIG. 21 is a diagram showing an example of a first threshold group, a second threshold group, and a third threshold group in the modification.
FIG. 22 is a diagram explaining data division and data expansion in the convolution operation.

(First embodiment)
A first embodiment of the present invention will be described with reference to FIGS. 1 to 22.
FIG. 1 is a diagram showing a neural network generation device 300 according to this embodiment.

[Neural network generation device 300]
The neural network generation device 300 is a device that generates a trained neural network execution model 100 that can be embedded in an embedded device such as an IoT device. The neural network execution model 100 is a software or hardware model generated so that a convolutional neural network 200 (hereinafter referred to as "CNN 200") can be computed on an embedded device.

The neural network generation device 300 is a device (computer) capable of executing programs and equipped with hardware such as a processor, for example a CPU (Central Processing Unit), and a memory. The functions of the neural network generation device 300 are realized by executing a neural network generation program and a software generation program on the neural network generation device 300. The neural network generation device 300 includes a storage unit 310, a calculation unit 320, a data input unit 330, a data output unit 340, a display unit 350, and an operation input unit 360.

The storage unit 310 stores hardware information HW, network information NW, a learning data set DS, the neural network execution model 100 (hereinafter referred to as "NN execution model 100"), and learned parameters PM. The hardware information HW, the learning data set DS, and the network information NW are input data supplied to the neural network generation device 300. The NN execution model 100 and the learned parameters PM are output data produced by the neural network generation device 300. The "trained NN execution model 100" includes the NN execution model 100 and the learned parameters PM.

The hardware information HW is information about the embedded device on which the NN execution model 100 is operated (hereinafter referred to as "operation target hardware"). The hardware information HW includes, for example, the device type, device constraints, memory configuration, bus configuration, operating frequency, power consumption, and manufacturing process type of the operation target hardware. The device type is, for example, ASIC (Application Specific Integrated Circuit) or FPGA (Field-Programmable Gate Array). The device constraints are, for example, the upper limit on the number of arithmetic units included in the operation target device and the upper limit on the circuit scale. The memory configuration covers the memory type, the number of memories, the memory capacity, and the input/output data width. The bus configuration covers the bus type, the bus width, the bus communication standard, the devices connected to the same bus, and the like. When multiple variations of the NN execution model 100 exist, the hardware information HW also includes information on which variation of the NN execution model 100 is to be used.
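
As an illustration only, the items listed for the hardware information HW could be grouped into a record like the one below. The class names, field names, and the `fits_device_constraint` helper are hypothetical; the description states only that HW is supplied in a predetermined data format and does not define that format.

```python
from dataclasses import dataclass


@dataclass
class MemoryConfig:
    """Hypothetical grouping of the memory-configuration items."""
    kind: str                # memory type, e.g. "SRAM"
    count: int               # number of memories
    capacity_bytes: int      # memory capacity
    data_width_bits: int     # input/output data width


@dataclass
class HardwareInfo:
    """Hypothetical container for the hardware information HW."""
    device_type: str             # e.g. "ASIC" or "FPGA"
    max_arithmetic_units: int    # device constraint: upper limit on arithmetic units
    memory: MemoryConfig
    bus_width_bits: int
    operating_frequency_hz: int
    model_variation: str = "default"   # which NN execution model variation to use


def fits_device_constraint(hw: HardwareInfo, requested_units: int) -> bool:
    """Return True if a design using `requested_units` respects the device constraint."""
    return requested_units <= hw.max_arithmetic_units


hw = HardwareInfo(
    device_type="FPGA",
    max_arithmetic_units=512,
    memory=MemoryConfig(kind="SRAM", count=2, capacity_bytes=64 * 1024, data_width_bits=64),
    bus_width_bits=64,
    operating_frequency_hz=200_000_000,
)
```

A generator could consult such a record when deciding, for example, how many arithmetic units to instantiate.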

The network information NW is basic information about the CNN 200. The network information NW includes, for example, the network configuration of the CNN 200, input data information, output data information, and quantization information. The input data information includes the input data type, such as image or audio, and the input data size.

The learning data set DS includes learning data D1 used for training and test data D2 used for inference tests.

FIG. 2 is a diagram showing the inputs and outputs of the calculation unit 320.
The calculation unit 320 has an execution model generation unit 321, a learning unit 322, an inference unit 323, a hardware generation unit 324, and a software generation unit 325. The NN execution model 100 input to the calculation unit 320 may have been generated by a device other than the neural network generation device 300.

The execution model generation unit 321 generates the NN execution model 100 based on the hardware information HW and the network information NW. The NN execution model 100 is a software or hardware model generated so that the CNN 200 can be computed on the operation target hardware. The software includes software that controls the hardware model. The hardware model may be at the behavioral level, at RTL (Register Transfer Level), a netlist representing connections between gates and circuit modules, or a combination of these.

The learning unit 322 generates the learned parameters PM using the NN execution model 100 and the learning data D1. The inference unit 323 performs an inference test using the NN execution model 100 and the test data D2.

The hardware generation unit 324 generates a neural network hardware model 400 based on the hardware information HW and the NN execution model 100. The neural network hardware model 400 is a hardware model that can be implemented on the operation target hardware. The neural network hardware model 400 is optimized for the operation target hardware based on the hardware information HW. The neural network hardware model 400 may be at RTL (Register Transfer Level), a netlist representing connections between gates and circuit modules, or a combination of these. The neural network hardware model 400 may also be a parameter list or configuration file required to implement the NN execution model 100 in hardware. The parameter list or configuration file is used in combination with a separately generated NN execution model 100.

In the following description, the neural network hardware model 400 implemented on the operation target hardware is referred to as "neural network hardware 600".

The software generation unit 325 generates software 500 for operating the neural network hardware 600 based on the network information NW and the NN execution model 100. The software 500 includes software that transfers the learned parameters PM to the neural network hardware 600 as needed.

The data input unit 330 receives the hardware information HW, the network information NW, and other data required to generate the trained NN execution model 100. The hardware information HW, the network information NW, and the like are input, for example, as data written in a predetermined data format. The input hardware information HW, network information NW, and the like are stored in the storage unit 310. The hardware information HW, the network information NW, and the like may be entered or changed by the user through the operation input unit 360.

The generated trained NN execution model 100 is output to the data output unit 340. For example, the generated NN execution model 100 and the learned parameters PM are output to the data output unit 340.

The display unit 350 has a known monitor such as an LCD display. The display unit 350 can display GUI (Graphical User Interface) images generated by the calculation unit 320, a console screen for accepting commands, and the like. When the calculation unit 320 requires information from the user, the display unit 350 can display a message prompting the user to enter the information through the operation input unit 360 and the GUI images needed to enter it.

The operation input unit 360 is a device through which the user enters instructions for the calculation unit 320 and other units. The operation input unit 360 is a known input device such as a touch panel, a keyboard, or a mouse. Input to the operation input unit 360 is transmitted to the calculation unit 320.

All or part of the functions of the calculation unit 320 are realized by one or more processors, such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit), executing a program stored in a program memory. However, all or part of the functions of the calculation unit 320 may instead be realized by hardware (for example, circuitry) such as an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a PLD (Programmable Logic Device). All or part of the functions of the calculation unit 320 may also be realized by a combination of software and hardware.

All or part of the functions of the calculation unit 320 may be realized using an external accelerator, such as a CPU, a GPU, or other hardware provided in an external device such as a cloud server. The calculation unit 320 can increase its computation speed by also using, for example, a high-performance GPU or dedicated hardware on a cloud server.

The storage unit 310 is realized by a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a ROM (Read-Only Memory), a RAM (Random Access Memory), or the like. All or part of the storage unit 310 may be provided in an external device such as a cloud server and connected to the calculation unit 320 and other units via a communication line.

[Convolutional neural network (CNN) 200]
Next, the CNN 200 will be described. FIG. 3 is a diagram showing an example of the CNN 200. The network information NW of the CNN 200 is information about the configuration of the CNN 200 described below. The CNN 200 uses low-bit weights w and quantized input data a, which makes it easy to embed in embedded devices.

The CNN 200 is a multilayer network including an input layer 205, convolution layers 210 that perform convolution operations, quantization operation layers 220 that perform quantization operations, and an output layer 230. In at least part of the CNN 200, the convolution layers 210 and the quantization operation layers 220 are connected alternately. The CNN 200 is a model widely used for image recognition and video recognition. The CNN 200 may further include layers with other functions, such as fully connected layers.

When the input data supplied to the CNN 200 has a format different from that of the input data a to the convolution layer 210, for example a 32-bit floating-point type, the input layer 205 performs data conversion such as type conversion, quantization, data division, and data shaping. In the following description, input data supplied to the CNN 200 whose format differs from that of the input data a to the convolution layer 210 is referred to as "input data b". The input layer 205 converts the input data b into the input data a.

FIG. 4 is a diagram explaining the convolution operation performed by the convolution layer 210.
The convolution layer 210 performs a convolution operation on the input data a using the weights w. The convolution layer 210 performs a sum-of-products operation that takes the input data a and the weights w as inputs.

The input data a to the convolution layer 210 (also called activation data or a feature map) is multidimensional data such as image data. In this embodiment, the input data a is a three-dimensional tensor with elements (x, y, c). The convolution layer 210 of the CNN 200 performs a convolution operation on low-bit input data a. In this embodiment, the elements of the input data a are 2-bit unsigned integers (0, 1, 2, 3). The elements of the input data a may instead be, for example, 4-bit or 8-bit unsigned integers.

The weights w of the convolution layer 210 (also called filters or kernels) are multidimensional data whose elements are learnable parameters. In this embodiment, the weights w form a four-dimensional tensor with elements (i, j, c, d). The weights w comprise d three-dimensional tensors with elements (i, j, c) (hereinafter each referred to as a "weight wo"). The weights w in the trained CNN 200 are learned data. The convolution layer 210 of the CNN 200 performs the convolution operation using low-bit weights w. In this embodiment, the elements of the weights w are 1-bit signed integers (0, 1), where the value "0" represents +1 and the value "1" represents -1.
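
The 1-bit weight encoding described above (stored bit 0 means +1, stored bit 1 means -1) can be sketched as follows. The function names and the packing order (least significant bit first) are assumptions made for illustration; the description does not specify how weight bits are packed in memory.

```python
from typing import List


def decode_weight_bit(bit: int) -> int:
    """Map a stored 1-bit weight to its signed value: 0 -> +1, 1 -> -1."""
    if bit not in (0, 1):
        raise ValueError("a 1-bit weight must be 0 or 1")
    return 1 - 2 * bit


def unpack_weights(packed: int, count: int) -> List[int]:
    """Decode `count` weights from an integer, least significant bit first (assumed order)."""
    return [decode_weight_bit((packed >> k) & 1) for k in range(count)]
```

For example, under the assumed bit order, `unpack_weights(0b0110, 4)` yields the signed weights `[1, -1, -1, 1]`.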

The convolution layer 210 performs the convolution operation shown in Equation 1 and outputs output data f. In Equation 1, s denotes the stride. The region indicated by the dotted line in FIG. 4 is one of the regions ao of the input data a to which a weight wo is applied (hereinafter referred to as an "application region ao"). The elements of the application region ao are represented by (x+i, y+j, c).
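
Equation 1 is not reproduced in this text. The sketch below assumes the usual strided form, in which f(x, y, d) is the sum over i, j, c of a(s·x+i, s·y+j, c) · w(i, j, c, d), with no padding and with the 1-bit weight encoding described above (0 means +1, 1 means -1). The function name `conv2d` and the nested-list tensor layout are illustrative choices, not taken from the source.

```python
from typing import List

Tensor3 = List[List[List[int]]]          # input data a, indexed [x][y][c]
Tensor4 = List[List[List[List[int]]]]    # weights w, indexed [i][j][c][d], stored as bits


def conv2d(a: Tensor3, w: Tensor4, s: int = 1) -> Tensor3:
    """Strided convolution without padding (assumed form of Equation 1)."""
    X, Y, C = len(a), len(a[0]), len(a[0][0])
    I, J, D = len(w), len(w[0]), len(w[0][0][0])
    out_x = (X - I) // s + 1
    out_y = (Y - J) // s + 1
    f = [[[0] * D for _ in range(out_y)] for _ in range(out_x)]
    for x in range(out_x):
        for y in range(out_y):
            for d in range(D):
                acc = 0
                # Sum of products over one application region ao and one weight wo.
                for i in range(I):
                    for j in range(J):
                        for c in range(C):
                            signed_w = 1 - 2 * w[i][j][c][d]   # bit 0 -> +1, bit 1 -> -1
                            acc += a[s * x + i][s * y + j][c] * signed_w
                f[x][y][d] = acc
    return f


# 3x3 single-channel input with 2-bit values, one 2x2 filter whose bits are all 0 (+1).
a = [[[1], [2], [3]], [[0], [1], [2]], [[3], [0], [1]]]
w = [[[[0]], [[0]]], [[[0]], [[0]]]]
f = conv2d(a, w, s=1)
```

Because the activations are small unsigned integers and the weights are only +1 or -1, each product reduces to adding or subtracting an activation, which is what makes this kind of layer inexpensive in hardware.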

The quantization operation layer 220 performs quantization and related operations on the output of the convolution operation produced by the convolution layer 210. The quantization operation layer 220 has a pooling layer 221, a Batch Normalization layer 222, an activation function layer 223, and a quantization layer 224.

The pooling layer 221 compresses the output data f of the convolution layer 210 by applying an operation such as average pooling (Equation 2) or MAX pooling (Equation 3) to the output data f of the convolution operation. In Equations 2 and 3, u denotes the input tensor, v denotes the output tensor, and T denotes the size of the pooling region. In Equation 3, max is a function that outputs the maximum value of u over the combinations of i and j contained in T.
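
Equations 2 and 3 are not reproduced in this text. As a sketch, the code below assumes non-overlapping T×T pooling windows, so that v(x, y, c) is either the mean or the maximum of u(T·x+i, T·y+j, c) over 0 ≤ i, j < T. The function name `pool2d` and the `mode` parameter are illustrative.

```python
from typing import List

Tensor3 = List[List[List[float]]]   # indexed [x][y][c]


def pool2d(u: Tensor3, T: int, mode: str = "max") -> Tensor3:
    """Average or MAX pooling over non-overlapping T x T regions (assumed form)."""
    X, Y, C = len(u), len(u[0]), len(u[0][0])
    v = []
    for x in range(X // T):
        row = []
        for y in range(Y // T):
            cell = []
            for c in range(C):
                window = [u[T * x + i][T * y + j][c] for i in range(T) for j in range(T)]
                if mode == "max":
                    cell.append(max(window))
                elif mode == "average":
                    cell.append(sum(window) / (T * T))
                else:
                    raise ValueError("mode must be 'max' or 'average'")
            row.append(cell)
        v.append(row)
    return v
```

With T = 2, each spatial dimension of the output is half that of the input, which is the compression the pooling layer provides.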

The Batch Normalization layer 222 normalizes the data distribution of the output data of the quantization operation layer 220 or the pooling layer 221 by an operation such as that shown in Equation 4. In Equation 4, u denotes the input tensor, v denotes the output tensor, α denotes the scale, and β denotes the bias. In the trained CNN 200, α and β are learned constant vectors.

The activation function layer 223 applies an activation function such as ReLU (Equation 5) to the output of the quantization operation layer 220, the pooling layer 221, or the Batch Normalization layer 222. In Equation 5, u is the input tensor and v is the output tensor. In Equation 5, max is a function that outputs the largest of its arguments.
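
Equations 4 and 5 are not reproduced in this text. The sketch below chains the two steps under stated assumptions: Batch Normalization is taken to be a per-channel affine map v = α(c)·u + β(c) (the actual Equation 4 may arrange scale and bias differently), and ReLU is v = max(0, u), which follows from the description of max above. The function name is illustrative.

```python
from typing import List

Tensor3 = List[List[List[float]]]   # indexed [x][y][c]


def batch_norm_relu(u: Tensor3, alpha: List[float], beta: List[float]) -> Tensor3:
    """Per-channel scale and bias (assumed form of Equation 4) followed by ReLU."""
    return [
        [
            [max(0, alpha[c] * value + beta[c]) for c, value in enumerate(cell)]
            for cell in row
        ]
        for row in u
    ]
```

Since α and β are constants after training, both steps are element-wise and need no information from neighboring elements.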

The quantization layer 224 quantizes the output of the pooling layer 221 or the activation function layer 223 based on quantization parameters, for example as shown in Equation 6. The quantization shown in Equation 6 reduces the input tensor u to 2 bits. In Equation 6, q(c) is a vector of quantization parameters. In the trained CNN 200, q(c) is a learned constant vector. The inequality sign "≦" in Equation 6 may be replaced with "<".
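
Equation 6 is not reproduced in this text. One plausible reading, consistent with a 2-bit output and with the remark about "≦", is a comparison against three ascending thresholds per channel. The sketch below assumes q(c) holds such thresholds (th0, th1, th2); the threshold names and the function name are hypothetical.

```python
from typing import Sequence


def quantize_2bit(u: float, thresholds: Sequence[float]) -> int:
    """Map u to 0..3 by comparing it with three ascending thresholds (assumed form)."""
    th0, th1, th2 = thresholds
    if u <= th0:
        return 0
    if u <= th1:
        return 1
    if u <= th2:
        return 2
    return 3
```

Replacing each `<=` with `<` gives the variant mentioned in the text; the two differ only for inputs exactly equal to a threshold.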

The output layer 230 is a layer that outputs the result of the CNN 200 using an identity function, a softmax function, or the like. The layer preceding the output layer 230 may be a convolution layer 210 or a quantization operation layer 220.

In the CNN 200, the quantized output data of the quantization layer 224 is input to the convolution layer 210, so the load of the convolution operation in the convolution layer 210 is smaller than in other convolutional neural networks that do not perform quantization.

[Neural network execution model 100 (NN execution model 100)]
Next, the NN execution model 100 will be described. FIG. 5 is a diagram showing an example of the NN execution model 100. The NN execution model 100 is a software or hardware model generated so that the CNN 200 can be computed on the operation target hardware. The software includes software that controls the hardware model. The hardware model may be at the behavioral level, at RTL (Register Transfer Level), a netlist representing connections between gates and circuit modules, or a combination of these.

The NN execution model 100 includes a first memory 1, a second memory 2, a DMA controller 3 (hereinafter also referred to as "DMAC 3"), a convolution operation circuit 4, a quantization operation circuit 5, and a controller 6. The NN execution model 100 is characterized in that the convolution operation circuit 4 and the quantization operation circuit 5 form a loop via the first memory 1 and the second memory 2.

The first memory 1 is a rewritable memory, for example a volatile memory such as an SRAM (Static RAM). Data is written to and read from the first memory 1 via the DMAC 3 and the controller 6. The first memory 1 is connected to the input port of the convolution operation circuit 4, and the convolution operation circuit 4 can read data from the first memory 1. The first memory 1 is also connected to the output port of the quantization operation circuit 5, and the quantization operation circuit 5 can write data to the first memory 1. An external processor EP can input data to and output data from the NN execution model 100 by writing data to and reading data from the first memory 1.

The second memory 2 is a rewritable memory, for example a volatile memory such as an SRAM (Static RAM). Data is written to and read from the second memory 2 via the DMAC 3 and the controller 6. The second memory 2 is connected to the input port of the quantization operation circuit 5, and the quantization operation circuit 5 can read data from the second memory 2. The second memory 2 is also connected to the output port of the convolution operation circuit 4, and the convolution operation circuit 4 can write data to the second memory 2. The external processor EP can input data to and output data from the NN execution model 100 by writing data to and reading data from the second memory 2.

ＤＭＡＣ３は、外部バスＥＢに接続されており、ＤＲＡＭなどの外部メモリＥＭと第一メモリ１との間のデータ転送を行う。また、ＤＭＡＣ３は、ＤＲＡＭなどの外部メモリＥＭと第二メモリ２との間のデータ転送を行う。また、ＤＭＡＣ３は、ＤＲＡＭなどの外部メモリＥＭと畳み込み演算回路４との間のデータ転送を行う。また、ＤＭＡＣ３は、ＤＲＡＭなどの外部メモリＥＭと量子化演算回路５との間のデータ転送を行う。 The DMAC 3 is connected to the external bus EB and performs data transfer between the external memory EM such as DRAM and the first memory 1 . The DMAC 3 also transfers data between the external memory EM such as a DRAM and the second memory 2 . The DMAC 3 also transfers data between the external memory EM such as DRAM and the convolution circuit 4 . The DMAC 3 also transfers data between the external memory EM such as a DRAM and the quantization arithmetic circuit 5 .

畳み込み演算回路４は、学習済みのＣＮＮ２００の畳み込み層２１０における畳み込み演算を行う回路である。畳み込み演算回路４は、第一メモリ１に格納された入力データａを読み出し、入力データａに対して畳み込み演算を実施する。畳み込み演算回路４は、畳み込み演算の出力データｆ（以降、「畳み込み演算出力データ」ともいう）を第二メモリ２に書き込む。 The convolution operation circuit 4 is a circuit that performs convolution operation in the convolution layer 210 of the trained CNN 200 . The convolution operation circuit 4 reads the input data a stored in the first memory 1 and performs a convolution operation on the input data a. The convolution operation circuit 4 writes output data f of the convolution operation (hereinafter also referred to as “convolution operation output data”) to the second memory 2 .

量子化演算回路５は、学習済みのＣＮＮ２００の量子化演算層２２０における量子化演算の少なくとも一部を行う回路である。量子化演算回路５は、第二メモリ２に格納された畳み込み演算の出力データｆを読み出し、畳み込み演算の出力データｆに対して量子化演算（プーリング、ＢａｔｃｈＮｏｒｍａｌｉｚａｔｉｏｎ、活性化関数、および量子化のうち少なくとも量子化を含む演算）を行う。量子化演算回路５は、量子化演算の出力データ（以降、「量子化演算出力データ」ともいう）оｕｔを第一メモリ１に書き込む。 The quantization operation circuit 5 is a circuit that performs at least part of the quantization operation in the quantization operation layer 220 of the trained CNN 200. The quantization operation circuit 5 reads the output data f of the convolution operation stored in the second memory 2 and performs a quantization operation on it (an operation that includes at least quantization among pooling, batch normalization, activation function, and quantization). The quantization operation circuit 5 writes the output data out of the quantization operation (hereinafter also referred to as "quantization operation output data") to the first memory 1.

コントローラ６は、外部バスＥＢに接続されており、外部プロセッサＥＰのスレーブとして動作する。コントローラ６は、パラメータレジスタや状態レジスタを含むレジスタ６１を有している。パラメータレジスタは、ＮＮ実行モデル１００の動作を制御するレジスタである。状態レジスタはセマフォＳを含むＮＮ実行モデル１００の状態を示すレジスタである。外部プロセッサＥＰは、コントローラ６を経由して、レジスタ６１にアクセスできる。 The controller 6 is connected to the external bus EB and operates as a slave to the external processor EP. The controller 6 has registers 61 including parameter registers and status registers. A parameter register is a register that controls the operation of the NN execution model 100 . The state register is a register that indicates the state of the NN execution model 100 including the semaphore S. An external processor EP can access the registers 61 via the controller 6 .

コントローラ６は、内部バスＩＢを介して、第一メモリ１と、第二メモリ２と、ＤＭＡＣ３と、畳み込み演算回路４と、量子化演算回路５と、接続されている。外部プロセッサＥＰは、コントローラ６を経由して、各ブロックに対してアクセスできる。例えば、外部プロセッサＥＰは、コントローラ６を経由して、ＤＭＡＣ３や畳み込み演算回路４や量子化演算回路５に対する命令を指示することができる。また、ＤＭＡＣ３や畳み込み演算回路４や量子化演算回路５は、内部バスＩＢを介して、コントローラ６が有する状態レジスタ（セマフォＳを含む）を更新できる。状態レジスタ（セマフォＳを含む）は、ＤＭＡＣ３や畳み込み演算回路４や量子化演算回路５と接続された専用配線を介して更新されるように構成されていてもよい。 The controller 6 is connected to the first memory 1, the second memory 2, the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5 via the internal bus IB. The external processor EP can access each block via the controller 6. For example, the external processor EP can issue instructions to the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5 via the controller 6. The DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5 can also update the status register (including the semaphore S) of the controller 6 via the internal bus IB. The status register (including the semaphore S) may instead be configured to be updated via dedicated wiring connected to the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5.

ＮＮ実行モデル１００は、第一メモリ１や第二メモリ２等を有するため、ＤＲＡＭなどの外部メモリＥＭからのＤＭＡＣ３によるデータ転送において、重複するデータのデータ転送の回数を低減できる。これにより、メモリアクセスにより発生する消費電力を大幅に低減することができる。 Since the NN execution model 100 has the first memory 1, the second memory 2, etc., it is possible to reduce the number of data transfers of overlapping data in the data transfer by the DMAC 3 from the external memory EM such as DRAM. As a result, power consumption caused by memory access can be greatly reduced.

図６は、ＮＮ実行モデル１００の動作例を示すタイミングチャートである。ＮＮ実行モデル１００は、複数のレイヤの多層構造であるＣＮＮ２００の演算を、ループ状に形成された回路により演算する。ＮＮ実行モデル１００は、ループ状の回路構成により、ハードウェア資源を効率的に利用できる。以下、図６に示すニューラルネットワークハードウェア６００の動作例を説明する。 FIG. 6 is a timing chart showing an operation example of the NN execution model 100. The NN execution model 100 performs the computations of the CNN 200, which has a multilayer structure of a plurality of layers, using circuits formed in a loop. The looped circuit configuration allows the NN execution model 100 to use hardware resources efficiently. An operation example of the neural network hardware 600 shown in FIG. 6 is described below.

ＤＭＡＣ３は、レイヤ１（図３参照）の入力データａを第一メモリ１に格納する。ＤＭＡＣ３は、畳み込み演算回路４が行う畳み込み演算の順序にあわせて、レイヤ１の入力データａを分割して第一メモリ１に転送してもよい。 The DMAC 3 stores the input data a of layer 1 (see FIG. 3) in the first memory 1 . The DMAC 3 may divide the input data a of the layer 1 according to the order of the convolution operation performed by the convolution operation circuit 4 and transfer the divided data to the first memory 1 .

畳み込み演算回路４は、第一メモリ１に格納されたレイヤ１（図３参照）の入力データａを読み出す。畳み込み演算回路４は、レイヤ１の入力データａに対してレイヤ１の畳み込み演算を行う。レイヤ１の畳み込み演算の出力データｆは、第二メモリ２に格納される。 The convolution operation circuit 4 reads the input data a of layer 1 (see FIG. 3) stored in the first memory 1 . The convolution operation circuit 4 performs a layer 1 convolution operation on layer 1 input data a. The output data f of the layer 1 convolution operation is stored in the second memory 2 .

量子化演算回路５は、第二メモリ２に格納されたレイヤ１の出力データｆを読み出す。量子化演算回路５は、レイヤ１の出力データｆに対してレイヤ２の量子化演算を行う。レイヤ２の量子化演算の出力データоｕｔは、第一メモリ１に格納される。 The quantization arithmetic circuit 5 reads the layer 1 output data f stored in the second memory 2 . A quantization operation circuit 5 performs a layer 2 quantization operation on layer 1 output data f. The output data out of the layer 2 quantization operation are stored in the first memory 1 .

畳み込み演算回路４は、第一メモリ１に格納されたレイヤ２の量子化演算の出力データを読み出す。畳み込み演算回路４は、レイヤ２の量子化演算の出力データоｕｔを入力データａとしてレイヤ３の畳み込み演算を行う。レイヤ３の畳み込み演算の出力データｆは、第二メモリ２に格納される。 The convolution operation circuit 4 reads the output data of the layer 2 quantization operation stored in the first memory 1 . The convolution operation circuit 4 performs a layer 3 convolution operation using the output data out of the layer 2 quantization operation as input data a. The output data f of the layer 3 convolution operation is stored in the second memory 2 .

畳み込み演算回路４は、第一メモリ１に格納されたレイヤ２Ｍ－２（Ｍは自然数）の量子化演算の出力データоｕｔを読み出す。畳み込み演算回路４は、レイヤ２Ｍ－２の量子化演算の出力データоｕｔを入力データａとしてレイヤ２Ｍ－１の畳み込み演算を行う。レイヤ２Ｍ－１の畳み込み演算の出力データｆは、第二メモリ２に格納される。 The convolution operation circuit 4 reads the output data out of the layer 2M-2 quantization operation (M is a natural number) stored in the first memory 1. The convolution operation circuit 4 performs the layer 2M-1 convolution operation using the output data out of the layer 2M-2 quantization operation as input data a. The output data f of the layer 2M-1 convolution operation is stored in the second memory 2.

量子化演算回路５は、第二メモリ２に格納されたレイヤ２Ｍ－１の出力データｆを読み出す。量子化演算回路５は、２Ｍ－１レイヤの出力データｆに対してレイヤ２Ｍの量子化演算を行う。レイヤ２Ｍの量子化演算の出力データоｕｔは、第一メモリ１に格納される。 The quantization arithmetic circuit 5 reads the layer 2M-1 output data f stored in the second memory 2 . The quantization operation circuit 5 performs a layer 2M quantization operation on the output data f of the 2M-1 layer. The output data out of the layer 2M quantization operation are stored in the first memory 1 .

畳み込み演算回路４は、第一メモリ１に格納されたレイヤ２Ｍの量子化演算の出力データоｕｔを読み出す。畳み込み演算回路４は、レイヤ２Ｍの量子化演算の出力データоｕｔを入力データａとしてレイヤ２Ｍ＋１の畳み込み演算を行う。レイヤ２Ｍ＋１の畳み込み演算の出力データｆは、第二メモリ２に格納される。 The convolution operation circuit 4 reads the output data out of the layer 2M quantization operation stored in the first memory 1 . The convolution operation circuit 4 performs a layer 2M+1 convolution operation using the output data out of the layer 2M quantization operation as input data a. The output data f of the layer 2M+1 convolution operation are stored in the second memory 2 .

畳み込み演算回路４と量子化演算回路５とが交互に演算を行い、図３に示すＣＮＮ２００の演算を進めていく。ＮＮ実行モデル１００は、畳み込み演算回路４が時分割によりレイヤ２Ｍ－１とレイヤ２Ｍ＋１の畳み込み演算を実施する。また、ＮＮ実行モデル１００は、量子化演算回路５が時分割によりレイヤ２Ｍ－２とレイヤ２Ｍの量子化演算を実施する。そのため、ＮＮ実行モデル１００は、レイヤごとに別々の畳み込み演算回路４と量子化演算回路５を実装する場合と比較して、回路規模が著しく小さい。 The convolution calculation circuit 4 and the quantization calculation circuit 5 alternately perform calculations to advance the calculation of the CNN 200 shown in FIG. In the NN execution model 100, the convolution circuit 4 performs the convolution calculations of layer 2M-1 and layer 2M+1 by time division. In addition, in the NN execution model 100, the quantization operation circuit 5 performs quantization operations for layer 2M-2 and layer 2M by time division. Therefore, the NN execution model 100 has a significantly smaller circuit scale than a case where separate convolution operation circuits 4 and quantization operation circuits 5 are implemented for each layer.
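The alternating, time-shared use of the two circuits can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: `run_layers`, `conv_fn` and `quant_fn` are hypothetical names that do not appear in the specification, and the two memories are modeled as plain variables rather than SRAM.

```python
def run_layers(input_a, layers):
    """Model the loop: first memory -> convolution -> second memory -> quantization -> first memory.

    layers is a list of (conv_fn, quant_fn) pairs, one pair per
    (convolution layer, quantization operation layer). Both callables are
    hypothetical stand-ins for the single shared circuits 4 and 5.
    """
    first_memory = input_a   # stands in for first memory 1 (input data a)
    second_memory = None     # stands in for second memory 2 (output data f)
    for conv_fn, quant_fn in layers:
        # Convolution operation circuit 4: reads first memory, writes second memory.
        second_memory = conv_fn(first_memory)
        # Quantization operation circuit 5: reads second memory, writes first memory.
        first_memory = quant_fn(second_memory)
    return first_memory
```

Because the same two callables' worth of hardware is reused for every layer pair, the model needs no per-layer circuit, which mirrors the circuit-scale argument made above.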

［ニューラルネットワーク生成装置３００の動作］ [Operation of Neural Network Generation Device 300]
次に、ニューラルネットワーク生成装置３００の動作（ニューラルネットワーク制御方法）を、図７に示すニューラルネットワーク生成装置３００の制御フローチャートに沿って説明する。ニューラルネットワーク生成装置３００は初期化処理（ステップＳ１０）を実施した後、ステップＳ１１を実行する。 Next, the operation of the neural network generation device 300 (neural network control method) will be described with reference to the control flowchart of the neural network generation device 300 shown in FIG. 7. After executing the initialization process (step S10), the neural network generation device 300 executes step S11.

＜ハードウェア情報取得工程（Ｓ１１）＞ <Hardware Information Acquisition Step (S11)>
ステップＳ１１において、ニューラルネットワーク生成装置３００は、動作対象ハードウェアのハードウェア情報ＨＷを取得する（ハードウェア情報取得工程）。ニューラルネットワーク生成装置３００は、例えば、データ入力部３３０に入力されたハードウェア情報ＨＷを取得する。ニューラルネットワーク生成装置３００は、表示部３５０にハードウェア情報ＨＷの入力に必要なＧＵＩ画像を表示させ、使用者にハードウェア情報ＨＷを操作入力部３６０から入力させることでハードウェア情報ＨＷを取得してもよい。 In step S11, the neural network generation device 300 acquires the hardware information HW of the target hardware (hardware information acquisition step). The neural network generation device 300 acquires, for example, the hardware information HW input to the data input unit 330. Alternatively, the neural network generation device 300 may acquire the hardware information HW by displaying a GUI image required for entering the hardware information HW on the display unit 350 and having the user enter the hardware information HW from the operation input unit 360.

ハードウェア情報ＨＷは、具体的には、第一メモリ１および第二メモリ２として割り当てるメモリのメモリ種別やメモリ容量や入出力データ幅を有する。 The hardware information HW specifically has memory types, memory capacities, and input/output data widths of the memories to be allocated as the first memory 1 and the second memory 2 .

取得されたハードウェア情報ＨＷは、記憶部３１０に記憶される。次に、ニューラルネットワーク生成装置３００は、ステップＳ１２を実行する。 The acquired hardware information HW is stored in the storage unit 310 . Next, the neural network generation device 300 executes step S12.

＜ネットワーク情報取得工程（Ｓ１２）＞ <Network Information Acquisition Step (S12)>
ステップＳ１２において、ニューラルネットワーク生成装置３００は、ＣＮＮ２００のネットワーク情報ＮＷを取得する（ネットワーク情報取得工程）。ニューラルネットワーク生成装置３００は、例えば、データ入力部３３０に入力されたネットワーク情報ＮＷを取得する。ニューラルネットワーク生成装置３００は、表示部３５０にネットワーク情報ＮＷの入力に必要なＧＵＩ画像を表示させ、使用者にネットワーク情報ＮＷを操作入力部３６０から入力させることでネットワーク情報ＮＷを取得してもよい。 In step S12, the neural network generation device 300 acquires the network information NW of the CNN 200 (network information acquisition step). The neural network generation device 300 acquires, for example, the network information NW input to the data input unit 330. Alternatively, the neural network generation device 300 may acquire the network information NW by displaying a GUI image required for entering the network information NW on the display unit 350 and having the user enter the network information NW from the operation input unit 360.

ネットワーク情報ＮＷは、具体的には、入力層２０５や出力層２３０を含むネットワーク構成と、重みｗや入力データａのビット幅を含む畳み込み層２１０の構成と、量子化情報を含む量子化演算層２２０の構成と、を有する。 Specifically, the network information NW includes the network configuration including the input layer 205 and the output layer 230, the configuration of the convolution layer 210 including the bit widths of the weight w and the input data a, and the configuration of the quantization operation layer 220 including quantization information.

取得されたネットワーク情報ＮＷは、記憶部３１０に記憶される。次に、ニューラルネットワーク生成装置３００は、ステップＳ１３を実行する。 The acquired network information NW is stored in storage unit 310 . Next, the neural network generation device 300 executes step S13.

＜ニューラルネットワーク実行モデル生成工程（Ｓ１３）＞ <Neural Network Execution Model Generation Step (S13)>
ステップＳ１３において、ニューラルネットワーク生成装置３００の実行モデル生成部３２１は、ハードウェア情報ＨＷとネットワーク情報ＮＷとに基づいてＮＮ実行モデル１００を生成する（ニューラルネットワーク実行モデル生成工程）。 In step S13, the execution model generation unit 321 of the neural network generation device 300 generates the NN execution model 100 based on the hardware information HW and the network information NW (neural network execution model generation step).

ニューラルネットワーク実行モデル生成工程（ＮＮ実行モデル生成工程）は、例えば、畳み込み演算回路生成工程（Ｓ１３－１）と、量子化演算回路生成工程（Ｓ１３－２）と、ＤＭＡＣ生成工程（Ｓ１３－３）と、を有する。 The neural network execution model generation step (NN execution model generation step) includes, for example, a convolution operation circuit generation step (S13-1), a quantization operation circuit generation step (S13-2), and a DMAC generation step (S13-3).

＜畳み込み演算回路生成工程（Ｓ１３－１）＞ <Convolution Operation Circuit Generation Step (S13-1)>
実行モデル生成部３２１は、ハードウェア情報ＨＷとネットワーク情報ＮＷとに基づいてＮＮ実行モデル１００の畳み込み演算回路４を生成する（畳み込み演算回路生成工程）。実行モデル生成部３２１は、ネットワーク情報ＮＷとして入力された重みｗや入力データａのビット幅などの情報から、畳み込み演算回路４のハードウェアモデルを生成する。ハードウェアモデルは、ビヘイビアレベルであってもよく、ＲＴＬ（Register Transfer Level）であってもよく、ゲートや回路モジュール間の接続を表すネットリストであってもよく、それらの組み合わせであってもよい。以下、生成される畳み込み演算回路４のハードウェアモデルの一例を説明する。 The execution model generation unit 321 generates the convolution operation circuit 4 of the NN execution model 100 based on the hardware information HW and the network information NW (convolution operation circuit generation step). The execution model generation unit 321 generates a hardware model of the convolution operation circuit 4 from information such as the bit widths of the weight w and the input data a input as the network information NW. The hardware model may be at the behavioral level, at the RTL (Register Transfer Level), a netlist representing connections between gates and circuit modules, or a combination thereof. An example of the generated hardware model of the convolution operation circuit 4 is described below.

図８は、生成される畳み込み演算回路４の内部ブロック図である。 FIG. 8 is an internal block diagram of the generated convolution operation circuit 4.
畳み込み演算回路４は、重みメモリ４１と、乗算器４２と、アキュムレータ回路４３と、ステートコントローラ４４と、入力変換部４９と、を有する。畳み込み演算回路４は、乗算器４２およびアキュムレータ回路４３に対する専用のステートコントローラ４４を有しており、命令コマンドが入力されると、外部のコントローラを必要とせずに畳み込み演算を実施できる。 The convolution operation circuit 4 has a weight memory 41, a multiplier 42, an accumulator circuit 43, a state controller 44, and an input conversion unit 49. The convolution operation circuit 4 has a dedicated state controller 44 for the multiplier 42 and the accumulator circuit 43, so that once an instruction command is input, it can perform the convolution operation without requiring an external controller.

重みメモリ４１は、畳み込み演算に用いる重みｗが格納されるメモリであり、例えばＳＲＡＭ（ＳｔａｔｉｃＲＡＭ）などで構成された揮発性のメモリ等の書き換え可能なメモリである。ＤＭＡＣ３は、ＤＭＡ転送により、畳み込み演算に必要な重みｗを重みメモリ４１に書き込む。 The weight memory 41 is a memory that stores the weight w used in the convolution operation, and is a rewritable memory such as a volatile memory such as an SRAM (Static RAM). The DMAC 3 writes the weight w required for the convolution operation into the weight memory 41 by DMA transfer.

図９は、乗算器４２の内部ブロック図である。 FIG. 9 is an internal block diagram of the multiplier 42.
乗算器４２は、入力データａの各要素と重みｗの各要素とを乗算する。入力データａの各要素は、入力データａが分割されたデータであり、Ｂｃ個の要素を持つベクトルデータである（例えば、後述する「入力ベクトルＡ」）。また、重みｗの各要素は、重みｗが分割されたデータであり、Ｂｃ×Ｂｄ個の要素を持つマトリクスデータである（例えば、後述する「重みマトリクスＷ」）。乗算器４２は、Ｂｃ×Ｂｄ個の積和演算ユニット４７を有し、入力ベクトルＡと重みマトリクスＷとの乗算を並列して実施できる。 The multiplier 42 multiplies each element of the input data a by each element of the weight w. Each element of the input data a is data obtained by dividing the input data a, and is vector data having Bc elements (for example, the "input vector A" described later). Each element of the weight w is data obtained by dividing the weight w, and is matrix data having Bc×Bd elements (for example, the "weight matrix W" described later). The multiplier 42 has Bc×Bd product-sum operation units 47 and can multiply the input vector A by the weight matrix W in parallel.

乗算器４２は、乗算に必要な入力ベクトルＡと重みマトリクスＷを、第一メモリ１および重みメモリ４１から読み出して乗算を実施する。乗算器４２は、Ｂｄ個の積和演算結果Ｏ（ｄｉ）を出力する。 The multiplier 42 reads out the input vector A and the weight matrix W required for multiplication from the first memory 1 and the weight memory 41 to carry out the multiplication. The multiplier 42 outputs Bd sum-of-products operation results O(di).

図１０は、積和演算ユニット４７の内部ブロック図である。 FIG. 10 is an internal block diagram of the product-sum operation unit 47.
積和演算ユニット４７は、入力ベクトルＡの要素Ａ（ｃｉ）と、重みマトリクスＷの要素Ｗ（ｃｉ，ｄｉ）との乗算を実施する。また、積和演算ユニット４７は、乗算結果と他の積和演算ユニット４７の乗算結果Ｓ（ｃｉ，ｄｉ）と加算する。積和演算ユニット４７は、加算結果Ｓ（ｃｉ＋１，ｄｉ）を出力する。ｃｉは０から(Ｂｃ－１)までのインデックスである。ｄｉは０から(Ｂｄ－１)までのインデックスである。要素Ａ（ｃｉ）は、２ビットの符号なし整数（０，１，２，３）である。要素Ｗ（ｃｉ，ｄｉ）は、１ビットの符号付整数（０，１）であり、値「０」は＋１を表し、値「１」は－１を表す。 The product-sum operation unit 47 multiplies the element A(ci) of the input vector A by the element W(ci, di) of the weight matrix W. The product-sum operation unit 47 then adds its multiplication result to the result S(ci, di) received from another product-sum operation unit 47, and outputs the sum S(ci+1, di). ci is an index from 0 to (Bc-1). di is an index from 0 to (Bd-1). The element A(ci) is a 2-bit unsigned integer (0, 1, 2, 3). The element W(ci, di) is a 1-bit signed integer (0, 1), where the value "0" represents +1 and the value "1" represents -1.

積和演算ユニット４７は、反転器（インバータ）４７ａと、セレクタ４７ｂと、加算器４７ｃと、を有する。積和演算ユニット４７は、乗算器を用いず、反転器４７ａおよびセレクタ４７ｂのみを用いて乗算を行う。セレクタ４７ｂは、要素Ｗ（ｃｉ，ｄｉ）が「０」の場合、要素Ａ（ｃｉ）の入力を選択する。セレクタ４７ｂは、要素Ｗ（ｃｉ，ｄｉ）が「１」の場合、要素Ａ（ｃｉ）を反転器により反転させた補数を選択する。要素Ｗ（ｃｉ，ｄｉ）は、加算器４７ｃのＣａｒｒｙ－ｉｎにも入力される。加算器４７ｃは、要素Ｗ（ｃｉ，ｄｉ）が「０」のとき、Ｓ（ｃｉ，ｄｉ）に要素Ａ（ｃｉ）を加算した値を出力する。加算器４７ｃは、要素Ｗ（ｃｉ，ｄｉ）が「１」のとき、Ｓ（ｃｉ，ｄｉ）から要素Ａ（ｃｉ）を減算した値を出力する。 The sum-of-products operation unit 47 has an inverter (inverter) 47a, a selector 47b, and an adder 47c. The sum-of-products operation unit 47 performs multiplication using only the inverter 47a and the selector 47b without using a multiplier. The selector 47b selects the input of the element A(ci) when the element W(ci, di) is "0". If the element W(ci, di) is "1", the selector 47b selects the complement of the element A(ci) inverted by an inverter. Element W(ci, di) is also input to Carry-in of adder 47c. The adder 47c outputs a value obtained by adding the element A(ci) to S(ci, di) when the element W(ci, di) is "0". The adder 47c outputs a value obtained by subtracting the element A(ci) from S(ci, di) when the element W(ci, di) is "1".
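The multiplier-free arithmetic described above relies on the two's-complement identity S - A = S + (~A) + 1, where the "+1" is supplied by feeding W(ci, di) into the carry-in. The following Python sketch illustrates that behavior. It is a sketch under stated assumptions: the function names `mac_unit` and `multiply` and the 16-bit internal width are hypothetical and are not taken from the specification.

```python
def mac_unit(a, w_bit, s_in, width=16):
    """One product-sum operation unit 47 (internal width is an assumption).

    a     : unsigned input element A(ci), e.g. 0..3
    w_bit : weight element W(ci, di); 0 means +1, 1 means -1
    s_in  : running sum S(ci, di) from the previous unit
    """
    mask = (1 << width) - 1
    inverted = ~a & mask                          # inverter 47a
    operand = inverted if w_bit else (a & mask)   # selector 47b
    total = (s_in + operand + w_bit) & mask       # adder 47c, w_bit is the carry-in
    if total & (1 << (width - 1)):                # reinterpret as a signed value
        total -= 1 << width
    return total                                  # S(ci+1, di)


def multiply(vec_a, mat_w):
    """Chain Bc units per output column to obtain the Bd results O(di)."""
    bc, bd = len(mat_w), len(mat_w[0])
    outputs = []
    for di in range(bd):
        s = 0
        for ci in range(bc):
            s = mac_unit(vec_a[ci], mat_w[ci][di], s)
        outputs.append(s)
    return outputs
```

For example, with A = [1, 2, 3] and weight bits [[0, 1], [0, 0], [1, 1]], the two columns evaluate to 1 + 2 - 3 and -1 + 2 - 3 respectively.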

図１１は、アキュムレータ回路４３の内部ブロック図である。 FIG. 11 is an internal block diagram of the accumulator circuit 43.
アキュムレータ回路４３は、乗算器４２の積和演算結果Ｏ（ｄｉ）を第二メモリ２にアキュムレートする。アキュムレータ回路４３は、Ｂｄ個のアキュムレータユニット４８を有し、Ｂｄ個の積和演算結果Ｏ（ｄｉ）を並列して第二メモリ２にアキュムレートできる。 The accumulator circuit 43 accumulates the product-sum operation results O(di) of the multiplier 42 in the second memory 2. The accumulator circuit 43 has Bd accumulator units 48 and can accumulate the Bd product-sum operation results O(di) in the second memory 2 in parallel.

図１２は、アキュムレータユニット４８の内部ブロック図である。 FIG. 12 is an internal block diagram of the accumulator unit 48.
アキュムレータユニット４８は、加算器４８ａと、マスク部４８ｂとを有している。加算器４８ａは、積和演算結果Ｏの要素Ｏ（ｄｉ）と、第二メモリ２に格納された式１に示す畳み込み演算の途中経過である部分和と、を加算する。加算結果は、要素あたり１６ビットである。加算結果は、要素あたり１６ビットに限定されず、例えば要素あたり１５ビットや１７ビットであってもよい。 The accumulator unit 48 has an adder 48a and a mask unit 48b. The adder 48a adds the element O(di) of the product-sum operation result O to the partial sum stored in the second memory 2, which is the intermediate result of the convolution operation shown in Equation 1. The addition result is 16 bits per element. The addition result is not limited to 16 bits per element and may be, for example, 15 bits or 17 bits per element.

加算器４８ａは、加算結果を第二メモリ２の同一アドレスに書き込む。マスク部４８ｂは、初期化信号ｃｌｅａｒがアサートされた場合に、第二メモリ２からの出力をマスクし、要素Ｏ（ｄｉ）に対する加算対象をゼロにする。初期化信号ｃｌｅａｒは、第二メモリ２に途中経過の部分和が格納されていない場合にアサートされる。 The adder 48a writes the addition result back to the same address in the second memory 2. When the initialization signal clear is asserted, the mask unit 48b masks the output from the second memory 2 so that zero is added to the element O(di). The initialization signal clear is asserted when no intermediate partial sum is stored in the second memory 2.
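The read-modify-write behavior of one accumulator unit can be illustrated as follows. This is a minimal sketch: the function name `accumulate` and the list-based memory are hypothetical, and the limited bit width of the addition result (16 bits per element in this embodiment) is deliberately not modeled.

```python
def accumulate(second_memory, addr, o_di, clear):
    """One accumulator unit 48 acting on one address of second memory 2.

    second_memory : list standing in for second memory 2 (an assumption)
    addr          : address holding the partial sum for this output element
    o_di          : product-sum operation result O(di) from the multiplier 42
    clear         : initialization signal; True when no partial sum is stored yet
    """
    # Mask unit 48b: force the value read from memory to zero while clear is asserted.
    partial = 0 if clear else second_memory[addr]
    # Adder 48a: add and write the result back to the same address.
    second_memory[addr] = partial + o_di
    return second_memory[addr]
```

Asserting `clear` on the first accumulation means stale contents of the memory never need to be zeroed explicitly before a new convolution starts.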

乗算器４２およびアキュムレータ回路４３による畳み込み演算が完了すると、第二メモリに、Ｂｄ個の要素を持つ出力データｆ（ｘ，ｙ，ｄｏ）が格納される。 When the convolution operation by the multiplier 42 and the accumulator circuit 43 is completed, the output data f(x, y, do) having Bd elements is stored in the second memory.

ステートコントローラ４４は、乗算器４２およびアキュムレータ回路４３のステートを制御する。また、ステートコントローラ４４は、内部バスＩＢを介してコントローラ６と接続されている。ステートコントローラ４４は、命令キュー４５と制御回路４６とを有する。 State controller 44 controls the states of multiplier 42 and accumulator circuit 43 . Also, the state controller 44 is connected to the controller 6 via an internal bus IB. The state controller 44 has an instruction queue 45 and a control circuit 46 .

命令キュー４５は、畳み込み演算回路４用の命令コマンドＣ４が格納されるキューであり、例えばＦＩＦＯメモリで構成される。命令キュー４５には、内部バスＩＢ経由で命令コマンドＣ４が書き込まれる。 The instruction queue 45 is a queue in which the instruction command C4 for the convolution operation circuit 4 is stored, and is composed of a FIFO memory, for example. An instruction command C4 is written to the instruction queue 45 via the internal bus IB.

制御回路４６は、命令コマンドＣ４をデコードし、命令コマンドＣ４に基づいて乗算器４２およびアキュムレータ回路４３を制御するステートマシンである。制御回路４６は、論理回路により実装されていてもよいし、ソフトウェアによって制御されるＣＰＵによって実装されていてもよい。 The control circuit 46 is a state machine that decodes the instruction command C4 and controls the multiplier 42 and the accumulator circuit 43 based on the instruction command C4. The control circuit 46 may be implemented by a logic circuit or by a CPU controlled by software.

図１３は、制御回路４６のステート遷移図である。 FIG. 13 is a state transition diagram of the control circuit 46.
制御回路４６は、命令キュー４５に命令コマンドＣ４が入力されると（Ｎｏｔｅｍｐｔｙ）、アイドルステートＳ１からデコードステートＳ２に遷移する。 When an instruction command C4 is input to the instruction queue 45 (Not empty), the control circuit 46 transitions from the idle state S1 to the decode state S2.

制御回路４６は、デコードステートＳ２において、命令キュー４５から出力される命令コマンドＣ４をデコードする。また、制御回路４６は、コントローラ６のレジスタ６１に格納されたセマフォＳを読み出し、命令コマンドＣ４において指示された乗算器４２やアキュムレータ回路４３の動作を実行可能であるかを判定する。実行不能である場合（Ｎｏｔｒｅａｄｙ）、制御回路４６は実行可能となるまで待つ（Ｗａｉｔ）。実行可能である場合（ｒｅａｄｙ）、制御回路４６はデコードステートＳ２から実行ステートＳ３に遷移する。 In the decode state S2, the control circuit 46 decodes the instruction command C4 output from the instruction queue 45. The control circuit 46 also reads the semaphore S stored in the register 61 of the controller 6 and determines whether the operations of the multiplier 42 and the accumulator circuit 43 specified by the instruction command C4 can be executed. If they cannot be executed (Not ready), the control circuit 46 waits until they can (Wait). If they can be executed (ready), the control circuit 46 transitions from the decode state S2 to the execution state S3.

制御回路４６は、実行ステートＳ３において、乗算器４２やアキュムレータ回路４３を制御して、乗算器４２やアキュムレータ回路４３に命令コマンドＣ４において指示された動作を実施させる。制御回路４６は、乗算器４２やアキュムレータ回路４３の動作が終わると、命令キュー４５から実行を終えた命令コマンドＣ４を取り除くとともに、コントローラ６のレジスタ６１に格納されたセマフォＳを更新する。制御回路４６は、命令キュー４５に命令がある場合（Ｎｏｔｅｍｐｔｙ）、実行ステートＳ３からデコードステートＳ２に遷移する。制御回路４６は、命令キュー４５に命令がない場合（ｅｍｐｔｙ）、実行ステートＳ３からアイドルステートＳ１に遷移する。 In the execution state S3, the control circuit 46 controls the multiplier 42 and the accumulator circuit 43 so that they perform the operation specified by the instruction command C4. When the multiplier 42 and the accumulator circuit 43 have finished, the control circuit 46 removes the completed instruction command C4 from the instruction queue 45 and updates the semaphore S stored in the register 61 of the controller 6. If the instruction queue 45 still holds an instruction (Not empty), the control circuit 46 transitions from the execution state S3 to the decode state S2. If the instruction queue 45 holds no instruction (empty), the control circuit 46 transitions from the execution state S3 to the idle state S1.
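The three-state behavior of the control circuit 46 can be sketched as a small transition function. This is an illustrative model only: the names `State`, `step`, `ready` and `execute` are hypothetical, the semaphore check is abstracted into the `ready` callback, and the semaphore update is assumed to happen inside the `execute` callback.

```python
from collections import deque
from enum import Enum


class State(Enum):
    IDLE = 1      # idle state S1
    DECODE = 2    # decode state S2
    EXECUTE = 3   # execution state S3


def step(state, queue, ready, execute):
    """Advance the control circuit by one transition.

    queue   : deque standing in for the FIFO instruction queue 45
    ready   : callable(cmd) -> bool, models the semaphore S check
    execute : callable(cmd), models driving multiplier 42 / accumulator 43
    """
    if state is State.IDLE:
        # Leave idle as soon as the queue is not empty.
        return State.DECODE if queue else State.IDLE
    if state is State.DECODE:
        # Wait in the decode state until the decoded command is executable.
        return State.EXECUTE if ready(queue[0]) else State.DECODE
    # EXECUTE: run the command, then remove it from the queue.
    execute(queue.popleft())
    return State.DECODE if queue else State.IDLE
```

Driving `step` repeatedly with a queue of two commands visits DECODE, EXECUTE, DECODE, EXECUTE and then returns to IDLE.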

実行モデル生成部３２１は、ネットワーク情報ＮＷとして入力された重みｗや入力データａのビット幅などの情報から、畳み込み演算回路４における演算器の仕様やサイズ（ＢｃやＢｄ）を決定する。ハードウェア情報ＨＷとして生成するＮＮ実行モデル１００（ニューラルネットワークハードウェアモデル４００、ニューラルネットワークハードウェア６００）のハードウェア規模が含まれる場合、実行モデル生成部３２１は、指定された規模にあわせて畳み込み演算回路４における演算器の仕様やサイズ（ＢｃやＢｄ）を調整する。 The execution model generation unit 321 determines the specifications and sizes (Bc and Bd) of the arithmetic units in the convolution operation circuit 4 from information such as the bit widths of the weight w and the input data a input as the network information NW. When the hardware information HW includes the hardware scale of the NN execution model 100 (neural network hardware model 400, neural network hardware 600) to be generated, the execution model generation unit 321 adjusts the specifications and sizes (Bc and Bd) of the arithmetic units in the convolution operation circuit 4 to match the specified scale.

［入力変換部４９］ [Input Conversion Unit 49]
図１４は、入力変換部４９のブロック図である。 FIG. 14 is a block diagram of the input conversion unit 49.
入力変換部４９は、入力データａの要素より多ビット（例えば８ビット以上）の要素を含む入力データｂを入力データａに変換する。入力変換部４９は、ＣＮＮ２００の畳み込み層２１０の前に連結された入力層２０５の少なくとも一部に相当する。入力変換部４９は、第一変換部４９１と、第二変換部４９２と、閾値メモリ４９５と、を有する。 The input conversion unit 49 converts input data b, whose elements have more bits (for example, 8 bits or more) than the elements of the input data a, into the input data a. The input conversion unit 49 corresponds to at least part of the input layer 205 connected before the convolution layer 210 of the CNN 200. The input conversion unit 49 has a first conversion unit 491, a second conversion unit 492, and a threshold memory 495.

なお、入力変換部４９は、必ずしもハードウェアとして実装されるものでなくてもよい。後述するソフトウェア生成工程（Ｓ１７）において事前処理として入力データｂの変換処理を行ってもよい。 Note that the input conversion unit 49 does not necessarily have to be implemented as hardware. Conversion processing of the input data b may be performed as a pre-processing in the software generation step (S17) described later.

ここで、入力変換部４９の説明においては、説明を簡略化するために入力データｂがｃ軸方向の要素数が１である画像データ（すなわちｘｙ平面における２次元画像）であるとする。また、画像データは、ｘ軸方向およびｙ軸方向の各要素が８ビット（０－２５５）である行列データ構造を備えるとする。入力データｂは、入力変換部４９により、畳み込み演算回路に入力可能な入力データａに変換される。 Here, in the description of the input conversion unit 49, for the sake of simplicity, it is assumed that the input data b is image data having one element in the c-axis direction (that is, a two-dimensional image on the xy plane). It is also assumed that the image data comprises a matrix data structure with 8 bits (0-255) for each element in the x and y directions. The input data b is converted by the input conversion unit 49 into input data a that can be input to the convolution circuit.

第一変換部４９１は、第一閾値群（第一閾値グループ）ＴＧ１を用いて入力データｂを第一量子化データａ１に変換する。第一閾値群ＴＧ１は１個以上の閾値であり、閾値は入力データｂの取りうる範囲（０－２５５）における所定の値である。第一変換部４９１は、式６と同様の方法で入力データｂと第一閾値群ＴＧ１とを比較し、比較結果をエンコードすることにより、入力データｂを第一量子化データａ１に変換する（第一量子化手段）。 The first conversion unit 491 converts the input data b into first quantized data a1 using a first threshold value group (first threshold value group) TG1. The first threshold group TG1 is one or more thresholds, and the thresholds are predetermined values within the range (0-255) that the input data b can take. The first conversion unit 491 compares the input data b with the first threshold value group TG1 in the same manner as in Equation 6, and encodes the comparison result to convert the input data b into the first quantized data a1 ( first quantization means).

本実施形態において、第一閾値群ＴＧ１は３個の閾値である。第一変換部４９１は、要素が８ビットである入力データｂを、要素が２ビットである第一量子化データａ１に量子化する。 In this embodiment, the first threshold group TG1 is three thresholds. The first conversion unit 491 quantizes the input data b having 8-bit elements into first quantized data a1 having 2-bit elements.

第二変換部４９２は、第二閾値群（第二閾値グループ）ＴＧ２を用いて入力データｂを第二量子化データａ２に変換する。第二閾値群ＴＧ２は１個以上の閾値であり、閾値は入力データｂの取りうる範囲（０－２５５）における所定の値である。第二変換部４９２は、式６と同様の方法で入力データｂと第二閾値群ＴＧ２とを比較し、比較結果をエンコードすることにより、入力データｂを第二量子化データａ２に変換する（第二量子化手段）。 The second conversion unit 492 converts the input data b into second quantized data a2 using a second threshold value group (second threshold value group) TG2. The second threshold group TG2 is one or more thresholds, and the thresholds are predetermined values within the range (0-255) that the input data b can take. The second conversion unit 492 compares the input data b with the second threshold value group TG2 in the same manner as in Equation 6, and encodes the comparison result to convert the input data b into the second quantized data a2 ( second quantization means).

本実施形態において、第二閾値群ＴＧ２は３個の閾値である。第二変換部４９２は、要素が８ビットである入力データｂを、要素が２ビットである第二量子化データａ２に量子化する。 In this embodiment, the second threshold group TG2 is three thresholds. The second conversion unit 492 quantizes the input data b having 8-bit elements into second quantized data a2 having 2-bit elements.

すなわち、入力変換部４９は、異なる二種類の閾値群（第一閾値群ＴＧ１と第二閾値群ＴＧ２）に基づいて、入力データｂを二種類の量子化データ（第一量子化データａ１と第二量子化データａ２）に変換する。 That is, the input conversion unit 49 converts the input data b into two types of quantized data (the first quantized data a1 and the second quantized data a2) based on two different threshold groups (the first threshold group TG1 and the second threshold group TG2).

閾値メモリ４９５は、第一変換部４９１での演算に用いる第一閾値群ＴＧ１と、第二変換部４９２での演算に用いる第二閾値群ＴＧ２と、を記憶するメモリであり、例えばＳＲＡＭ（ＳｔａｔｉｃＲＡＭ）などで構成された揮発性のメモリ等の書き換え可能なメモリである。例えば、ＤＭＡＣ３は、ＤＭＡ転送により、第一閾値群ＴＧ１と第二閾値群ＴＧ２を閾値メモリ４９５に書き込む。 The threshold memory 495 stores the first threshold group TG1 used for the calculation in the first conversion unit 491 and the second threshold group TG2 used for the calculation in the second conversion unit 492, and is a rewritable memory such as a volatile memory composed of, for example, SRAM (Static RAM). For example, the DMAC 3 writes the first threshold group TG1 and the second threshold group TG2 to the threshold memory 495 by DMA transfer.

第二閾値群ＴＧ２は、第一閾値群ＴＧ１と異なる閾値群である。第二閾値群ＴＧ２は、第一閾値群ＴＧ１と比較して、少なくとも一部の閾値が異なっている。本実施形態においては、第二閾値群ＴＧ２は、第一閾値群ＴＧ１と比較して、全ての閾値が異なっている。 The second threshold group TG2 is a threshold group different from the first threshold group TG1. The second threshold group TG2 differs from the first threshold group TG1 in at least some thresholds. In the present embodiment, the second threshold group TG2 differs from the first threshold group TG1 in all thresholds.

図１５は、第一閾値群ＴＧ１および第二閾値群ＴＧ２の例を示す図である。 FIG. 15 is a diagram showing examples of the first threshold group TG1 and the second threshold group TG2.
第一閾値群ＴＧ１は、閾値３６．４、閾値１０９．３、および閾値１８２．１である。第二閾値群ＴＧ２は、閾値７２．８、閾値１４５．７、および閾値２１８．６である。第二閾値群ＴＧ２の閾値の平均値は、第一閾値群ＴＧ１の閾値の平均値より大きい。なお、これらの閾値は、四捨五入等で丸められた整数であってもよい。 The first threshold group TG1 consists of the thresholds 36.4, 109.3, and 182.1. The second threshold group TG2 consists of the thresholds 72.8, 145.7, and 218.6. The average of the thresholds in the second threshold group TG2 is greater than the average of the thresholds in the first threshold group TG1. These thresholds may also be integers obtained by rounding, for example to the nearest integer.

第一閾値群ＴＧ１および第二閾値群ＴＧ２に含まれる閾値は、入力データｂの取りうる範囲（０－２５５）を、７個の領域に略等分に分割する。第一閾値群ＴＧ１および第二閾値群ＴＧ２に含まれる閾値は、同一のステップ幅（約３６．４（≒２５５／７））で配列している。第一閾値群ＴＧ１に含まれる閾値と、第二閾値群ＴＧ２に含まれる閾値は、入力データｂの取りうる範囲（０－２５５）において、交互に配列している。なお、第一閾値群ＴＧ１に含まれる閾値と第二閾値群ＴＧ２に含まれる閾値は、少なくとも一部の閾値のみが交互に配列されていてもよい。 The thresholds included in the first threshold group TG1 and the second threshold group TG2 divide the possible range (0-255) of the input data b into seven approximately equal regions. Taken together, the thresholds of the two groups are spaced at the same step width (approximately 36.4 (≈255/7)). The thresholds of the first threshold group TG1 and those of the second threshold group TG2 alternate across the possible range (0-255) of the input data b. Alternatively, only some of the thresholds of the first threshold group TG1 and the second threshold group TG2 may alternate.
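The effect of interleaving the two threshold groups can be checked with a short Python sketch. It uses the threshold values quoted from FIG. 15. The comparison rule is an assumption: Equation 6 is not reproduced in this section, so the hypothetical helper `quantize` simply counts how many thresholds lie strictly below the input value and uses that count as the 2-bit code.

```python
# Threshold values as listed for FIG. 15.
TG1 = [36.4, 109.3, 182.1]   # first threshold group
TG2 = [72.8, 145.7, 218.6]   # second threshold group


def quantize(b, thresholds):
    """Map an 8-bit value b to a 2-bit code 0..3.

    Assumed encoding: the code is the number of thresholds that b exceeds.
    """
    return sum(b > t for t in thresholds)


# Each converter alone distinguishes 4 levels, but the pair (a1, a2)
# identifies which of the 7 interleaved regions the input falls into.
for b in (0, 50, 80, 120, 150, 200, 255):
    a1, a2 = quantize(b, TG1), quantize(b, TG2)
    print(b, a1, a2, a1 + a2)
```

Under this assumed encoding, the sum a1 + a2 equals the index (0 to 6) of the region containing b, which is one way to see why the two 2-bit codes together retain more gradation information than either code alone.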

図１６は、入力変換部４９から出力されるデータ（入力データａ）を示す図である。
入力変換部４９は、第一量子化データａ１と第二量子化データａ２とをｃ軸方向に並べて連結して、入力データａとして出力する。乗算器４２は、入力変換部４９から出力される入力データａに対して畳み込み演算を実施する。 FIG. 16 is a diagram showing data (input data a) output from the input conversion unit 49.
The input conversion unit 49 arranges the first quantized data a1 and the second quantized data a2 side by side in the c-axis direction, concatenates them, and outputs the result as input data a. The multiplier 42 performs a convolution operation on the input data a output from the input conversion unit 49.

入力データｂのｃ軸方向の要素数が２以上である場合、入力変換部４９は、例えばｃ軸方向の要素ごとに上記と同様の変換を行う。この場合、入力データａのｃ軸方向の要素数は、入力データｂのｃ軸方向の要素数の２倍になる。一例として、入力データｂのc軸方向の要素として色成分であるＲＧＢの３要素を含む場合、入力データａのｃ軸方向の要素数は２倍の６となる。なお、入力変換部４９は、例えばＲＧＢの３要素のうちの１要素（例えばＧ）に対応する入力データａのｃ軸方向の要素数のみを選択的に増やすように変換してもよい。 When the number of elements in the c-axis direction of the input data b is two or more, the input conversion unit 49 performs, for example, the same conversion as described above for each element in the c-axis direction. In this case, the number of elements in the c-axis direction of the input data a is twice that of the input data b. As an example, when the input data b includes the three RGB color components as elements in the c-axis direction, the number of elements in the c-axis direction of the input data a doubles to six. Note that the input conversion unit 49 may instead perform the conversion so as to selectively increase only the number of c-axis elements of the input data a that correspond to one of the three RGB elements (for example, G).
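As a rough illustration of the per-channel conversion and c-axis concatenation described above, the following Python sketch converts a tiny 2x2 RGB input into six 2-bit channels. The names `convert_input` and `quantize`, the `b[c][y][x]` layout, the ordering of the output channels, and the comparison rule are all assumptions made for this example; the threshold values are those given for FIG. 15.

```python
TG1 = (36.4, 109.3, 182.1)   # first threshold group (FIG. 15 values)
TG2 = (72.8, 145.7, 218.6)   # second threshold group (FIG. 15 values)


def quantize(value, thresholds):
    """Assumed encoding: count of thresholds that value meets or exceeds."""
    return sum(value >= t for t in thresholds)


def convert_input(b, groups=(TG1, TG2)):
    """b is indexed as b[c][y][x]. Each input channel yields one output channel
    per threshold group, so the output has len(groups) * len(b) channels."""
    a = []
    for channel in b:
        for thresholds in groups:
            a.append([[quantize(v, thresholds) for v in row] for row in channel])
    return a


rgb = [
    [[0, 100], [200, 255]],   # R
    [[40, 80], [120, 160]],   # G
    [[10, 110], [150, 220]],  # B
]
a = convert_input(rgb)  # 6 channels: R/TG1, R/TG2, G/TG1, G/TG2, B/TG1, B/TG2
```

To widen only one component, as the last sentence of the paragraph allows, a caller could pass a single threshold group for R and B and both groups for G; that variant is not shown here.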

畳み込み演算回路４は、量子化された入力データａを畳み込み演算の入力とするため、乗算器４２等の構成を小規模化できる。一方、入力データｂが入力データａに量子化されることにより、入力データａの精度（画像データの場合、階調数）が低下する。しかしながら、畳み込み演算回路４は、異なる二種類の閾値群（第一閾値群ＴＧ１と第二閾値群ＴＧ２）に基づいて、入力データｂを二種類の量子化データ（第一量子化データａ１と第二量子化データａ２）に変換して、二種類の量子化データをｃ軸方向に連結する。そのため、畳み込み演算回路４は、小規模化された乗算器４２等の構成を維持しつつ、量子化に伴う入力データａの精度低下の影響を低減できる。 Since the convolution operation circuit 4 takes the quantized input data a as the input to the convolution operation, the multiplier 42 and related components can be made smaller. On the other hand, quantizing the input data b into the input data a lowers the precision of the input data a (the number of gradations in the case of image data). However, the convolution operation circuit 4 converts the input data b into two types of quantized data (the first quantized data a1 and the second quantized data a2) based on two different threshold groups (the first threshold group TG1 and the second threshold group TG2), and concatenates the two types of quantized data in the c-axis direction. Therefore, the convolution operation circuit 4 can reduce the influence of the loss of precision of the input data a caused by quantization while keeping the multiplier 42 and related components small.

畳み込み演算回路４に入力される入力データａにおけるｃ軸方向の要素数が小さい場合、畳み込み演算回路４において並列化（マルチチャンネル化）されたハードウェアリソース（例えば畳み込み演算回路４の乗算器４２）を有効に活用できない場合がある。畳み込み演算回路４は、上記の方法により二種類の量子化データをｃ軸方向に並べて連結することにより、ｃ軸方向の要素数を増やし、並列化（マルチチャンネル化）されたハードウェアリソースを有効に活用して高速化を図れる。 When the number of elements in the c-axis direction of the input data a input to the convolution operation circuit 4 is small, the parallelized (multichannel) hardware resources of the convolution operation circuit 4 (for example, the multiplier 42 of the convolution operation circuit 4) may not be used effectively. By arranging and concatenating the two types of quantized data in the c-axis direction as described above, the convolution operation circuit 4 increases the number of elements in the c-axis direction, makes effective use of the parallelized (multichannel) hardware resources, and can thereby achieve higher speed.

図１５に示す第一閾値群ＴＧ１に含まれる閾値は、入力データｂの取りうる範囲（０－２５５）において略同一のステップ幅で配列している。図１５に示す第二閾値群ＴＧ２に含まれる閾値は、入力データｂの取りうる範囲（０－２５５）において略同一のステップ幅で配列している。その結果、入力データｂの特徴に関わらず、上述した入力データａの精度低下の影響が低減される。また、量子化に伴う入力データａの精度の低下の影響は、第一閾値群ＴＧ１に含まれる閾値と第二閾値群ＴＧ２に含まれる閾値とを互い違いに補間的に設定することより低減される。 The thresholds included in the first threshold group TG1 shown in FIG. 15 are arranged with substantially the same step width within the possible range (0-255) of the input data b. The thresholds included in the second threshold group TG2 shown in FIG. 15 are likewise arranged with substantially the same step width within the possible range (0-255) of the input data b. As a result, the influence of the loss of precision of the input data a described above is reduced regardless of the characteristics of the input data b. The influence of the loss of precision of the input data a caused by quantization is further reduced by setting the thresholds of the first threshold group TG1 and the thresholds of the second threshold group TG2 alternately, so that each group interpolates between the thresholds of the other.

図１５に示す第一閾値群ＴＧ１に含まれる閾値と、第二閾値群ＴＧ２に含まれる閾値は、入力データｂの取りうる範囲（０－２５５）において、交互に配列している。そのため、第一閾値群ＴＧ１と第二閾値群ＴＧ２はいずれも、入力データｂの取りうる範囲（０－２５５）を比較的均等に分割する。その結果、二種類の量子化データ（第一量子化データａ１と第二量子化データａ２）は、入力データｂ本来の特徴を継承しやすく、畳み込み演算において特徴量を抽出しやすい。 The thresholds included in the first threshold value group TG1 and the threshold values included in the second threshold value group TG2 shown in FIG. 15 are arranged alternately within the possible range (0-255) of the input data b. Therefore, both the first threshold value group TG1 and the second threshold value group TG2 divide the possible range (0-255) of the input data b relatively evenly. As a result, the two types of quantized data (the first quantized data a1 and the second quantized data a2) tend to inherit the original features of the input data b, making it easier to extract features in the convolution operation.

なお、本実施形態は、入力変換部４９における量子化手段として式６で示すような閾値を用いる手法を例示したが、量子化手段はこれに限定されない。量子化手段は、例えば複数のルックアップテーブルを用いて行ってもよい。 Although the present embodiment exemplified the method of using the threshold value shown in Equation 6 as the quantization means in the input conversion section 49, the quantization means is not limited to this. The quantization means may for example be performed using a plurality of lookup tables.

図１７は、第一閾値群ＴＧ１および第二閾値群ＴＧ２の他の例を示す図である。
図１７に示す第一閾値群ＴＧ１は、閾値３６．４、閾値７２．８、および閾値１０９．３である。図１７に示す第二閾値群ＴＧ２は、閾値１４５．７、閾値１８２．１、および閾値２１８．６である。なお、これらの閾値は、四捨五入等で丸められた整数であってもよい。 FIG. 17 is a diagram showing another example of the first threshold value group TG1 and the second threshold value group TG2.
The first threshold group TG1 shown in FIG. 17 consists of the thresholds 36.4, 72.8, and 109.3. The second threshold group TG2 shown in FIG. 17 consists of the thresholds 145.7, 182.1, and 218.6. Note that these thresholds may be integers obtained by rounding, for example to the nearest integer.

図１７に示す例においても、第一閾値群ＴＧ１および第二閾値群ＴＧ２に含まれる閾値は、入力データｂの取りうる範囲（０－２５５）を、７個の領域に略等分に分割する。そのため、図１５に示す閾値群による量子化と同様に、入力データａの精度低下の影響が低減される。 In the example shown in FIG. 17 as well, the thresholds included in the first threshold group TG1 and the second threshold group TG2 divide the possible range (0-255) of the input data b into seven substantially equal regions. Therefore, as with quantization using the threshold groups shown in FIG. 15, the influence of the loss of precision of the input data a is reduced.

図１７に示す第一閾値群ＴＧ１の３個の閾値は、最大値２５５よりも最小値０に近い。そのため、第一閾値群ＴＧ１に基づいて変換された第一量子化データａ１は、入力データｂの取りうる範囲（０－２５５）において最小値０に近い領域のデータの特徴を継承しやすい。 The three threshold values of the first threshold value group TG1 shown in FIG. 17 are closer to the minimum value 0 than the maximum value 255. Therefore, the first quantized data a1 transformed based on the first threshold value group TG1 tends to inherit the characteristics of the data in the area close to the minimum value 0 in the possible range (0-255) of the input data b.

図１７に示す第二閾値群ＴＧ２の３個の閾値は、最小値０よりも最大値２５５に近い。そのため、第二閾値群ＴＧ２に基づいて変換された第二量子化データａ２は、入力データｂの取りうる範囲（０－２５５）において最大値２５５に近い領域のデータの特徴を継承しやすい。 The three threshold values of the second threshold value group TG2 shown in FIG. 17 are closer to the maximum value of 255 than the minimum value of 0. Therefore, the second quantized data a2 transformed based on the second threshold value group TG2 tends to inherit the characteristics of the data in the area close to the maximum value 255 in the possible range (0-255) of the input data b.

図１８は、第一閾値群ＴＧ１および第二閾値群ＴＧ２の他の例を示す図である。
図１８に示す第一閾値群ＴＧ１は、閾値３６．４、閾値７２．８、および閾値１４５．７である。図１８に示す第二閾値群ＴＧ２は閾値１０９．３、閾値１８２．１、および閾値２１８．６である。なお、これらの閾値は、四捨五入等で丸められた整数であってもよい。 FIG. 18 is a diagram showing another example of the first threshold value group TG1 and the second threshold value group TG2.
The first threshold group TG1 shown in FIG. 18 consists of the thresholds 36.4, 72.8, and 145.7. The second threshold group TG2 shown in FIG. 18 consists of the thresholds 109.3, 182.1, and 218.6. Note that these thresholds may be integers obtained by rounding, for example to the nearest integer.

図１８に示す例においても、第一閾値群ＴＧ１および第二閾値群ＴＧ２に含まれる閾値は、入力データｂの取りうる範囲（０－２５５）を、７個の領域に略等分に分割する。そのため、図１５に示す閾値群による量子化と同様に、入力データａの精度低下の影響が低減される。また、量子化に伴う入力データａの精度の低下の影響は、第一閾値群ＴＧ１に含まれる閾値と第二閾値群ＴＧ２に含まれる閾値とを互い違いに補間的に設定することより低減されている。言い換えれば、第一閾値群ＴＧ１に含まれる各閾値の平均値と第二閾値群ＴＧ２に含まれる各閾値の平均値とが異なるように設定することで、それぞれの閾値を互い違いに補間的に設定することを可能としている。 In the example shown in FIG. 18 as well, the thresholds included in the first threshold group TG1 and the second threshold group TG2 divide the possible range (0-255) of the input data b into seven substantially equal regions. Therefore, as with quantization using the threshold groups shown in FIG. 15, the influence of the loss of precision of the input data a is reduced. The influence of the loss of precision of the input data a caused by quantization is also reduced by setting the thresholds of the first threshold group TG1 and the thresholds of the second threshold group TG2 alternately, so that each group interpolates between the thresholds of the other. In other words, setting the average of the thresholds in the first threshold group TG1 to differ from the average of the thresholds in the second threshold group TG2 makes it possible to set the thresholds alternately in this interpolating manner.

なお、第一閾値群ＴＧ１および第二閾値群ＴＧ２に含まれる閾値は、入力データｂの取りうる範囲（０－２５５）を略等分に分割しなくてもよい。入力変換部４９は、例えば、入力データｂのヒストグラム等から得たデータの分布情報に基づいて、第一閾値群ＴＧ１および第二閾値群ＴＧ２に含まれる閾値を設定してもよい。例えば、データ変換後の量子化データ（第一量子化データａ１と第二量子化データａ２）に偏りがなくなる様に閾値群を設定すれば、入力データａの精度低下の影響が低減されやすい。 The thresholds included in the first threshold value group TG1 and the second threshold value group TG2 do not have to divide the possible range (0-255) of the input data b into substantially equal parts. The input conversion unit 49 may set threshold values included in the first threshold value group TG1 and the second threshold value group TG2, for example, based on data distribution information obtained from a histogram of the input data b. For example, if the threshold group is set so that the quantized data after data conversion (the first quantized data a1 and the second quantized data a2) are not biased, the influence of the accuracy deterioration of the input data a can be reduced.

図１９は、第一閾値群ＴＧ１および第二閾値群ＴＧ２の他の例を示す図である。
第一閾値群ＴＧ１および第二閾値群ＴＧ２に含まれる閾値は、入力データｂの取りうる範囲（０－２５５）の全域に分布していなくてもよい。入力変換部４９は、図１９に示すように、入力データｂの取りうる範囲（０－２５５）において最小値および最大値を設定して、閾値を分布させる有効範囲ＶＲを設定してもよい。入力変換部４９は、例えば、入力データｂのヒストグラム等から得たデータの分布情報に基づいて、データが多く分布する領域を、閾値を分布させる有効範囲ＶＲとして設定してもよい。データ変化（画像データの場合、階調変化）が多い範囲を有効範囲ＶＲと設定することで、データ変換後の量子化データ（第一量子化データａ１と第二量子化データａ２）が入力データｂの特徴を継承しやすい。 FIG. 19 is a diagram showing another example of the first threshold value group TG1 and the second threshold value group TG2.
The thresholds included in the first threshold group TG1 and the second threshold group TG2 need not be distributed over the entire possible range (0-255) of the input data b. As shown in FIG. 19, the input conversion unit 49 may set a minimum value and a maximum value within the possible range (0-255) of the input data b, thereby defining an effective range VR over which the thresholds are distributed. For example, based on data distribution information obtained from a histogram or the like of the input data b, the input conversion unit 49 may set a region in which much of the data is distributed as the effective range VR. By setting a range with many data changes (gradation changes in the case of image data) as the effective range VR, the quantized data after conversion (the first quantized data a1 and the second quantized data a2) more readily inherits the features of the input data b.

入力変換部４９は、入力データｂがカメラから取得した画像データである場合、パターンノイズなどのノイズが含まれる領域を有効範囲ＶＲから除いてもよい。入力データｂに含まれるノイズの影響が排除され、かつ、入力データａの精度低下の影響が低減される。また、画像データにおける黒レベルを０よりも大きい値に設定する場合、入力変換部４９は黒レベル以下の領域を有効範囲ＶＲから除いてもよい。さらに、画像データに対してデジタルゲインを乗算する演算などの追加演算を行う場合、入力変換部４９は追加演算により階調が失われる領域を有効範囲ＶＲから予め除いてもよい。一例として、追加演算で２倍のデジタルゲインを乗算する場合、入力変換部４９は入力データｂの有効な範囲を前半領域（０－１２７）として後半領域（１２８－２５５）を有効範囲ＶＲから除いてもよい。なお、有効範囲ＶＲは、分割された領域であってもよい。 When the input data b is image data obtained from a camera, the input conversion unit 49 may exclude from the effective range VR a region that contains noise such as pattern noise. This eliminates the influence of the noise contained in the input data b and also reduces the influence of the loss of precision of the input data a. When the black level of the image data is set to a value greater than 0, the input conversion unit 49 may exclude the region at or below the black level from the effective range VR. Furthermore, when an additional operation such as multiplying the image data by a digital gain is performed, the input conversion unit 49 may exclude in advance from the effective range VR the region whose gradation is lost by the additional operation. As an example, when the additional operation multiplies by a digital gain of 2, the input conversion unit 49 may treat the first half (0-127) as the valid range of the input data b and exclude the second half (128-255) from the effective range VR. Note that the effective range VR may be divided into a plurality of regions.

なお、有効範囲ＶＲは、入力データｂの取りうる範囲（０－２５５）と同じであってもよい。 Note that the valid range VR may be the same as the possible range (0-255) of the input data b.
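The following Python sketch shows one way the effective range VR could be derived from the data distribution and then used to place the thresholds. It is an illustration under stated assumptions only: the text does not specify how VR is chosen from the histogram, so the quantile-based rule, the `low_frac` and `high_frac` parameters, and both function names are hypothetical.

```python
def effective_range(samples, low_frac=0.01, high_frac=0.99):
    """Assumed policy: take VR as the span between two quantiles of the
    observed input values, discarding sparse tails of the distribution."""
    ordered = sorted(samples)
    last = len(ordered) - 1
    return ordered[int(last * low_frac)], ordered[int(last * high_frac)]


def thresholds_in_range(vr_min, vr_max, n_groups=2, per_group=3):
    """Distribute interleaved thresholds evenly inside [vr_min, vr_max]."""
    step = (vr_max - vr_min) / (n_groups * per_group + 1)
    bounds = [vr_min + step * i for i in range(1, n_groups * per_group + 1)]
    return [bounds[g::n_groups] for g in range(n_groups)]


# Example: input values concentrated in the middle of the 0-255 range.
samples = list(range(64, 192))
vr_min, vr_max = effective_range(samples)
tg1, tg2 = thresholds_in_range(vr_min, vr_max)
```

Passing `vr_min=0` and `vr_max=255` to `thresholds_in_range` reproduces the full-range case in which VR coincides with the possible range of the input data b.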

なお、第一閾値群ＴＧ１および第二閾値群ＴＧ２に含まれる閾値は、２のべき乗であってもよい。入力変換部４９において、閾値と入力データｂとの比較やデータ変換に必要な回路の規模が低減される。 Note that the thresholds included in the first threshold value group TG1 and the second threshold value group TG2 may be powers of two. In the input conversion unit 49, the scale of the circuit required for comparison between the threshold value and the input data b and data conversion is reduced.

［入力変換部４９の変形例］
図２０は、入力変換部４９の変形例である入力変換部４９Ｂのブロック図である。
入力変換部４９Ｂは、入力データａの要素より多ビット（例えば８ビット以上）の要素を含む入力データｂを入力データａに変換する。入力変換部４９Ｂは、ＣＮＮ２００の畳み込み層２１０の前に連結された入力層２０５の少なくとも一部に相当する。入力変換部４９Ｂは、第一変換部４９１と、第二変換部４９２と、第三変換部４９３と、閾値メモリ４９５と、を有する。 [Modified example of input converter 49]
FIG. 20 is a block diagram of an input conversion unit 49B, which is a modification of the input conversion unit 49.
The input conversion unit 49B converts input data b, whose elements have more bits (for example, 8 bits or more) than the elements of the input data a, into the input data a. The input conversion unit 49B corresponds to at least part of the input layer 205 connected before the convolution layer 210 of the CNN 200. The input conversion unit 49B has a first conversion unit 491, a second conversion unit 492, a third conversion unit 493, and a threshold memory 495.

第三変換部４９３は、第三閾値群（第三閾値グループ）ＴＧ３を用いて入力データｂを第三量子化データａ３に変換する。第三閾値群ＴＧ３は１個以上の閾値であり、閾値は入力データｂの取りうる範囲（０－２５５）における所定の値である。第三変換部４９３は、式６と同様の方法で入力データｂと第三閾値群ＴＧ３とを比較し、比較結果をエンコードすることにより、入力データｂを第三量子化データａ３に変換する（第三量子化手段）。 The third transform unit 493 transforms the input data b into third quantized data a3 using a third threshold value group (third threshold value group) TG3. The third threshold group TG3 is one or more thresholds, and the thresholds are predetermined values within the range (0-255) that the input data b can take. The third conversion unit 493 compares the input data b with the third threshold value group TG3 in the same manner as in Equation 6, and encodes the comparison result to convert the input data b into the third quantized data a3 ( third quantization means).

本実施形態において、第三閾値群ＴＧ３は３個の閾値である。第三変換部４９３は、要素が８ビットである入力データｂを、要素が２ビットである第三量子化データａ３に量子化する。 In this embodiment, the third threshold group TG3 is three thresholds. The third conversion unit 493 quantizes the input data b having 8-bit elements into third quantized data a3 having 2-bit elements.

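すなわち、入力変換部４Ｂは、異なる三種類の閾値群（第一閾値群ＴＧ１と第二閾値群ＴＧ２と第三閾値群ＴＧ３）に基づいて、入力データｂを三種類の量子化データ（第一量子化データａ１と第二量子化データａ２と第三量子化データａ３）に変換する。 That is, the input conversion unit 49B converts the input data b into three types of quantized data (the first quantized data a1, the second quantized data a2, and the third quantized data a3) based on three different threshold groups (the first threshold group TG1, the second threshold group TG2, and the third threshold group TG3).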

閾値メモリ４９５は、第一変換部４９１での演算に用いる第一閾値群ＴＧ１と、第二変換部４９２での演算に用いる第二閾値群ＴＧ２と、に加えて第三閾値群ＴＧ３を記憶する。 The threshold memory 495 stores a third threshold group TG3 in addition to a first threshold group TG1 used for calculation in the first conversion unit 491 and a second threshold group TG2 used for calculation in the second conversion unit 492. .

第三閾値群ＴＧ３は、第一閾値群ＴＧ１と異なる閾値群であり、第二閾値群ＴＧ２と異なる閾値群である。本実施形態においては、第三閾値群ＴＧ３は、第一閾値群ＴＧ１および第二閾値群ＴＧ２と比較して、全ての閾値が異なっている。 The third threshold group TG3 is a threshold group different from the first threshold group TG1 and a threshold group different from the second threshold group TG2. In this embodiment, the third threshold group TG3 differs in all thresholds from the first threshold group TG1 and the second threshold group TG2.

図２１は、第一閾値群ＴＧ１、第二閾値群ＴＧ２および第三閾値群ＴＧ３の例を示す図である。第一閾値群ＴＧ１は、閾値２５．５、閾値１０２．０、および閾値１７８．５である。第二閾値群ＴＧ２は、閾値５１．０、閾値１２７．５、および閾値２０４．５である。第三閾値群ＴＧ３は、閾値７６．５、閾値１５３．０、および閾値２２９．５である。なお、これらの閾値は、四捨五入等で丸められた整数であってもよい。 FIG. 21 is a diagram showing examples of the first threshold group TG1, the second threshold group TG2, and the third threshold group TG3. The first threshold group TG1 is a threshold of 25.5, a threshold of 102.0 and a threshold of 178.5. The second threshold group TG2 is a threshold of 51.0, a threshold of 127.5, and a threshold of 204.5. The third threshold group TG3 is a threshold of 76.5, a threshold of 153.0, and a threshold of 229.5. Note that these threshold values may be rounded integers.

第一閾値群ＴＧ１、第二閾値群ＴＧ２および第三閾値群ＴＧ３に含まれる閾値は、入力データｂの取りうる範囲（０－２５５）を、１０個の領域に略等分に分割する。第一閾値群ＴＧ１および第二閾値群ＴＧ２に含まれる閾値は、同一のステップ幅（２５．５．（＝２５５／１０））で配列している。 The thresholds included in the first threshold group TG1, the second threshold group TG2, and the third threshold group TG3 divide the possible range (0-255) of the input data b into ten substantially equal regions. The thresholds included in the first threshold group TG1 and the second threshold group TG2 are arranged with the same step width (25.5 (= 255/10)).

入力変換部４９Ｂは、第一量子化データａ１と第二量子化データａ２と第三量子化データａ３とをｃ軸方向に並べて連結して、入力データａとして出力する。乗算器４２は、入力変換部４９から出力される入力データａに対して畳み込み演算を実施する。 The input conversion unit 49B arranges and connects the first quantized data a1, the second quantized data a2, and the third quantized data a3 in the c-axis direction, and outputs them as the input data a. The multiplier 42 performs a convolution operation on the input data a output from the input conversion section 49 .

図２１に示す第一閾値群ＴＧ１に含まれる閾値は、入力データｂの取りうる範囲（０－２５５）において略同一のステップ幅で配列している。図２１に示す第二閾値群ＴＧ２に含まれる閾値は、入力データｂの取りうる範囲（０－２５５）において略同一のステップ幅で配列している。図２１に示す第三閾値群ＴＧ３に含まれる閾値は、入力データｂの取りうる範囲（０－２５５）において略同一のステップ幅で配列している。その結果、入力データｂの特徴に関わらず、上述した入力データａの精度低下の影響が低減される。 The thresholds included in the first threshold value group TG1 shown in FIG. 21 are arranged with substantially the same step width in the possible range (0-255) of the input data b. The thresholds included in the second threshold value group TG2 shown in FIG. 21 are arranged with substantially the same step width in the possible range (0-255) of the input data b. The thresholds included in the third threshold value group TG3 shown in FIG. 21 are arranged with substantially the same step width in the possible range (0-255) of the input data b. As a result, regardless of the characteristics of the input data b, the influence of the aforementioned decrease in accuracy of the input data a is reduced.

入力変換部４９Ｂを含む畳み込み演算回路４は、小規模化された乗算器４２等の構成を維持しつつ、量子化に伴う入力データａの精度低下の影響を低減できる。 The convolution operation circuit 4 including the input conversion unit 49B can reduce the influence of the accuracy deterioration of the input data a due to quantization while maintaining the configuration of the multiplier 42 and the like which are reduced in size.

なお、入力変換部４９Ｂは、異なる四種類以上の閾値群に基づいて、入力データｂを四種類以上の量子化データに変換してもよい。異なるＮ種類の閾値群に基づいて量子化データに変換する場合、Ｎ種類の閾値群に含まれる閾値のステップ幅ｓｔｅｐは、例えば、式７により算出できる。ここで、min_valは入力データｂの取りうる範囲の下限で、max_valは入力データｂの取りうる範囲の上限であり、ｋは、入力データａのビット数である。なお、min_valおよびmax_valは、有効範囲ＶＲの下限および上限と合わせてもよい。 Note that the input conversion unit 49B may convert the input data b into four or more types of quantized data based on four or more different threshold groups. When converting into quantized data based on N different threshold groups, the step width step of the thresholds included in the N threshold groups can be calculated by Equation 7, for example. Here, min_val is the lower limit of the range that the input data b can take, max_val is the upper limit of the range that the input data b can take, and k is the number of bits of the input data a. Note that min_val and max_val may be matched with the lower limit and upper limit of the effective range VR.
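Equation 7 itself is not reproduced in this passage. The Python sketch below uses a step-width formula that is only an assumed reconstruction: it is chosen because it reproduces the step widths stated for FIG. 15 (two groups, about 36.4) and FIG. 21 (three groups, 25.5) when each group holds 2^k - 1 thresholds for k-bit output. The function name and signature are hypothetical.

```python
def threshold_step(min_val, max_val, n_groups, k_bits):
    """Assumed form of the step width for N interleaved threshold groups:
    N * (2**k - 1) thresholds split [min_val, max_val] into that many + 1 regions."""
    thresholds_per_group = 2 ** k_bits - 1
    return (max_val - min_val) / (n_groups * thresholds_per_group + 1)


step_two_groups = threshold_step(0, 255, n_groups=2, k_bits=2)    # 255 / 7
step_three_groups = threshold_step(0, 255, n_groups=3, k_bits=2)  # 255 / 10
step_four_groups = threshold_step(0, 255, n_groups=4, k_bits=2)   # 255 / 13
```

If min_val and max_val are instead set to the bounds of the effective range VR, the same function yields the step width inside VR.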

＜量子化演算回路生成工程（Ｓ１３－２）＞
実行モデル生成部３２１は、ハードウェア情報ＨＷとネットワーク情報ＮＷとに基づいてＮＮ実行モデル１００の量子化演算回路５を生成する（量子化演算回路生成工程）。実行モデル生成部３２１は、ネットワーク情報ＮＷとして入力された量子化情報から、量子化演算回路５のハードウェアモデルを生成する。ハードウェアモデルは、ビヘイビアレベルであってもよく、ＲＴＬ（Register Transfer Level）であってもよく、ゲートや回路モジュール間の接続を表すネットリストであってもよく、それらの組み合わせであってもよい。 <Quantization Operation Circuit Generation Step (S13-2)>
The execution model generation unit 321 generates the quantization operation circuit 5 of the NN execution model 100 based on the hardware information HW and the network information NW (quantization operation circuit generation step). The execution model generation unit 321 generates a hardware model of the quantization operation circuit 5 from the quantization information input as the network information NW. The hardware model may be at the behavior level, may be at the RTL (Register Transfer Level), may be a netlist representing connections between gates and circuit modules, or may be a combination of these.

＜ＤＭＡＣ生成工程（Ｓ１３－３）＞
実行モデル生成部３２１は、ハードウェア情報ＨＷとネットワーク情報ＮＷとに基づいてＮＮ実行モデル１００のＤＭＡＣ３を生成する（ＤＭＡＣ生成工程）。実行モデル生成部３２１は、ネットワーク情報ＮＷとして入力された情報から、ＤＭＡＣ３のハードウェアモデルを生成する。ハードウェアモデルは、ビヘイビアレベルであってもよく、ＲＴＬ（Register Transfer Level）であってもよく、ゲートや回路モジュール間の接続を表すネットリストであってもよく、それらの組み合わせであってもよい。 <DMAC generation step (S13-3)>
The execution model generation unit 321 generates the DMAC 3 of the NN execution model 100 based on the hardware information HW and the network information NW (DMAC generation step). The execution model generation unit 321 generates a hardware model of the DMAC 3 from the information input as the network information NW. The hardware model may be at the behavior level, may be at the RTL (Register Transfer Level), may be a netlist representing connections between gates and circuit modules, or may be a combination of these.

＜学習工程（Ｓ１４）＞
ステップＳ１４において、ニューラルネットワーク生成装置３００の学習部３２２および推論部３２３は、学習データセットＤＳを用いて、生成されたＮＮ実行モデル１００の学習パラメータを学習する（学習工程）。学習工程（Ｓ１４）は、例えば、学習済みパラメータ生成工程（Ｓ１４－１）と、推論テスト工程（Ｓ１４－２）と、を有する。 <Learning step (S14)>
In step S14, the learning unit 322 and the inference unit 323 of the neural network generating device 300 learn learning parameters of the generated NN execution model 100 using the learning data set DS (learning step). The learning step (S14) includes, for example, a learned parameter generation step (S14-1) and an inference test step (S14-2).

＜学習工程：学習済みパラメータ生成工程（Ｓ１４－１）＞
学習部３２２は、ＮＮ実行モデル１００および学習データＤ１を用いて、学習済みパラメータＰＭを生成する。学習済みパラメータＰＭは、学習済みの重みｗ、量子化パラメータｑおよび入力変換部４９の閾値群（第一閾値群ＴＧ１および第二閾値群ＴＧ２）である。 <Learning Step: Learned Parameter Generation Step (S14-1)>
The learning unit 322 generates learned parameters PM using the NN execution model 100 and the learning data D1. The learned parameters PM are the learned weights w, the quantization parameters q, and the threshold groups (the first threshold group TG1 and the second threshold group TG2) of the input conversion unit 49.

例えば、ＮＮ実行モデル１００が画像認識を実施するＣＮＮ２００の実行モデルである場合、学習データＤ１は入力画像と教師データＴとの組み合わせである。入力画像は、ＣＮＮ２００に入力される入力データａである。教師データＴは、画像に撮像された被写体の種類や、画像における検出対象物の有無や、画像における検出対象物の座標値などである。 For example, when the NN execution model 100 is the execution model of the CNN 200 that performs image recognition, the learning data D1 is a combination of the input image and the teacher data T. An input image is input data a input to the CNN 200 . The teacher data T includes the type of subject captured in the image, the presence or absence of the detection target in the image, the coordinate values of the detection target in the image, and the like.

学習部３２２は、公知の技術である誤差逆伝播法などによる教師あり学習によって、学習済みパラメータＰＭを生成する。学習部３２２は、入力画像に対するＮＮ実行モデル１００の出力と、入力画像に対応する教師データＴと、の差分Ｅを損失関数（誤差関数）により求め、差分Ｅが小さくなるように学習対象である重みｗおよび量子化パラメータｑを更新する。 The learning unit 322 generates a learned parameter PM by supervised learning using a well-known technique such as the error backpropagation method. The learning unit 322 obtains the difference E between the output of the NN execution model 100 for the input image and the teacher data T corresponding to the input image using a loss function (error function). Update the weights w and the quantization parameter q.

例えば重みｗを更新する場合、重みｗに関する損失関数の勾配が用いられる。勾配は、例えば損失関数を微分することにより算出される。誤差逆伝播法を用いる場合、勾配は逆伝番（ｂａｃｋｗａｒｄ）により算出される。 For example, when updating weight w, the gradient of the loss function with respect to weight w is used. The slope is calculated, for example, by differentiating the loss function. When using backpropagation, the gradient is calculated by backward propagation.

学習部３２２は、勾配を算出して重みｗを更新する際において、畳み込み演算に関連する演算を高精度化する。具体的には、ＮＮ実行モデル１００が使用する低ビットの重みｗ（例えば１ビット）より高精度な３２ビットの浮動小数点型の重みｗが学習に使用される。また、ＮＮ実行モデル１００の畳み込み演算回路４において実施する畳み込み演算が高精度化される。 The learning unit 322 increases the precision of operations related to the convolution operation when calculating the gradient and updating the weight w. Specifically, a 32-bit floating-point weight w that is more accurate than the low-bit weight w (for example, 1 bit) used by the NN execution model 100 is used for learning. Further, the precision of the convolution operation performed in the convolution operation circuit 4 of the NN execution model 100 is improved.

学習部３２２は、勾配を算出して重みｗを更新する際において、活性化関数に関連する演算を高精度化する。具体的には、ＮＮ実行モデル１００の量子化演算回路５において実施するＲｅＬＵ関数などの活性化関数より高精度なシグモンド関数が学習に使用される。 When calculating the gradient and updating the weight w, the learning unit 322 increases the precision of the operations related to the activation function. Specifically, a sigmoid function, which is more precise than an activation function such as the ReLU function implemented in the quantization operation circuit 5 of the NN execution model 100, is used for learning.

一方、学習部３２２は、順伝搬（ｆоｒｗａｒｄ）により入力画像に対する出力データを算出する際においては、畳み込み演算および活性化関数に関連する演算を高精度化せず、ＮＮ実行モデル１００に基づいた演算を実施する。重みｗを更新する際に用いられた高精度な重みｗは、ルックアップテーブル等により低ビット化される。 On the other hand, when calculating the output data for the input image by forward propagation, the learning unit 322 does not increase the precision of the convolution operation or of the operations related to the activation function, and instead performs the operations as defined by the NN execution model 100. The high-precision weight w used when updating the weight is reduced to a low bit width by a lookup table or the like.

学習部３２２は、勾配を算出して重みｗを更新する際において、畳み込み演算および活性化関数に関連する演算を高精度化することにより、演算における中間データの精度低下を防止して、高い推論精度を実現できる学習済みパラメータＰＭを生成できる。 By increasing the precision of the convolution operation and of the operations related to the activation function when calculating the gradient and updating the weight w, the learning unit 322 prevents the loss of precision of intermediate data in these operations and can generate learned parameters PM that achieve high inference accuracy.

一方、学習部３２２は、入力画像に対する出力データを算出する際において、順伝搬（ｆоｒｗａｒｄ）の演算を高精度化せず、ＮＮ実行モデル１００に基づいた演算を実施する。そのため、学習部３２２が算出した出力データと、生成された学習済みパラメータＰＭを用いたＮＮ実行モデル１００の出力データと、が一致する。 On the other hand, when calculating the output data for the input image, the learning unit 322 performs calculation based on the NN execution model 100 without increasing the accuracy of the forward propagation calculation. Therefore, the output data calculated by the learning unit 322 and the output data of the NN execution model 100 using the generated learned parameter PM match.
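The split between a low-bit forward pass and a high-precision weight update can be illustrated with a deliberately tiny Python sketch. It is not the learning unit 322 itself: the single scalar weight, the sign-based 1-bit conversion (the text mentions a lookup table or the like), the squared-error loss, and the pass-through treatment of the gradient (commonly called a straight-through estimator) are all assumptions made for illustration.

```python
def binarize(w):
    """Assumed 1-bit weight used by the execution model: +1.0 or -1.0."""
    return 1.0 if w >= 0 else -1.0


def train_step(w_float, x, target, lr=0.1):
    """One update: forward with the low-bit weight, update the float weight."""
    w_low = binarize(w_float)          # forward pass matches the execution model
    y = w_low * x
    grad_y = 2.0 * (y - target)        # derivative of (y - target) ** 2 w.r.t. y
    grad_w = grad_y * x                # gradient passed straight through binarize
    return w_float - lr * grad_w       # high-precision weight is what gets updated


w = 0.3                                # high-precision (float) weight kept for learning
for _ in range(20):
    w = train_step(w, x=1.0, target=-1.0)

deployed_weight = binarize(w)          # low-bit weight handed to the execution model
```

Because the forward pass already uses the low-bit weight, the output computed during learning is the same as the output the execution model would produce with `deployed_weight`, which mirrors the agreement described in the paragraph above.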

学習部３２２は、重みｗおよび量子化パラメータｑに加えて、入力変換部４９の閾値群の学習を行う。学習部３２２は、誤差逆伝播法などによる教師あり学習によって、入力データに対するＮＮ実行モデル１００の出力と、入力データに対応する教師データＴと、の差分Ｅを損失関数（誤差関数）により求め、差分Ｅが小さくなるように閾値群に含まれる少なくとも一つ以上の閾値を更新する。学習前の入力変換部４９の閾値群の初期値は、図１５などで示したように、第一閾値群ＴＧ１に含まれる閾値と第二閾値群ＴＧ２に含まれる閾値とが互い違いに補間的に設定されている。閾値群の各閾値は学習を繰り返すことによって更新される。なお、閾値群の各閾値は、学習の際には小数点を含む形式とし、学習終了時に四捨五入等で丸められた整数としてもよい。 The learning unit 322 learns the threshold value group of the input conversion unit 49 in addition to the weight w and the quantization parameter q. The learning unit 322 obtains the difference E between the output of the NN execution model 100 for the input data and the teacher data T corresponding to the input data by supervised learning such as the error backpropagation method, using a loss function (error function), At least one or more thresholds included in the threshold group are updated so that the difference E becomes smaller. As shown in FIG. 15 and the like, the initial values of the threshold value group of the input conversion unit 49 before learning are alternately interpolated between the threshold values included in the first threshold value group TG1 and the threshold values included in the second threshold value group TG2. is set. Each threshold in the threshold group is updated by repeating learning. It should be noted that each threshold value in the threshold value group may have a format including a decimal point at the time of learning, and may be a rounded integer at the end of learning.

なお、損失関数の勾配は出力層から入力層に向けて徐々に消失していく。本実施形態において、入力変換部４９は最も入力層に近い。よって、入力変換部４９の閾値群における学習時の勾配の変化量は、他の学習対象のパラメータと比較して小さい。そのため、入力変換部４９の閾値群の初期値を適切に設定している場合、学習部３２２は、入力変換部４９の閾値群を学習対象としなくてもよい。 Note that the gradient of the loss function gradually disappears from the output layer toward the input layer. In this embodiment, the input converter 49 is closest to the input layer. Therefore, the amount of change in the gradient during learning in the threshold value group of the input conversion unit 49 is small compared to other parameters to be learned. Therefore, when the initial values of the threshold value group of the input conversion unit 49 are appropriately set, the learning unit 322 does not need to set the threshold value group of the input conversion unit 49 as learning targets.

なお、本実施形態において、学習部３２２が入力変換部４９の閾値群の学習を行う例を示したが、学習部３２２は入力変換部４９の閾値群に加えて、入力データｂの有効範囲ＶＲを学習の対象としてもよい。 Although the present embodiment shows an example in which the learning unit 322 learns the threshold groups of the input conversion unit 49, the learning unit 322 may also treat the effective range VR of the input data b as a learning target in addition to the threshold groups of the input conversion unit 49.

なお、学習部３２２は、学習時において使用された入力データｂから、推論時における入力データｂの取りうる範囲を推測して、有効範囲ＶＲを学習してもよい。学習時において使用された入力データｂは、推論時における入力データｂと同条件であることが好ましい。そのため、学習部３２２は、学習時において使用された入力データｂから、推論時における入力データｂの取りうる範囲をある程度予測できる。 Note that the learning unit 322 may learn the effective range VR by estimating the possible range of the input data b at the time of inference from the input data b used at the time of learning. The input data b used during learning preferably has the same conditions as the input data b used during inference. Therefore, the learning unit 322 can predict to some extent the possible range of the input data b at the time of inference from the input data b used at the time of learning.

＜学習工程：推論テスト工程（Ｓ１４－２）＞
推論部３２３は、学習部３２２が生成した学習済みパラメータＰＭ、ＮＮ実行モデル１００およびテストデータＤ２を用いて推論テストを実施する。例えば、ＮＮ実行モデル１００が画像認識を実施するＣＮＮ２００の実行モデルである場合、テストデータＤ２は、学習データＤ１同様に入力画像と教師データＴとの組み合わせである。 <Learning Step: Inference Test Step (S14-2)>
The inference unit 323 performs an inference test using the learned parameters PM generated by the learning unit 322, the NN execution model 100, and the test data D2. For example, if the NN execution model 100 is the execution model of the CNN 200 that performs image recognition, the test data D2 is a combination of the input image and the teacher data T, similar to the learning data D1.

The inference unit 323 displays the progress and the result of the inference test on the display unit 350. The result of the inference test is, for example, the accuracy rate on the test data D2.
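As a small illustration of the accuracy rate mentioned above, the sketch below compares predicted labels with teacher labels. The function name `accuracy_rate` and the representation of predictions and teacher data T as flat label sequences are assumptions for this example only.

```python
def accuracy_rate(predicted_labels, teacher_labels):
    """Fraction of test samples whose predicted label equals the teacher label."""
    if len(predicted_labels) != len(teacher_labels):
        raise ValueError("both sequences must have the same length")
    if not teacher_labels:
        raise ValueError("test data must not be empty")
    correct = sum(p == t for p, t in zip(predicted_labels, teacher_labels))
    return correct / len(teacher_labels)
```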

<Confirmation step (S15)>
In step S15, the inference unit 323 of the neural network generation device 300 causes the display unit 350 to display a message prompting the user to enter, via the operation input unit 360, a confirmation regarding the result, together with the GUI images needed for that input. The user enters, via the operation input unit 360, whether the result of the inference test is acceptable. If the input indicates that the user accepts the result of the inference test, the neural network generation device 300 proceeds to step S16. If the input indicates that the user does not accept the result, the neural network generation device 300 performs step S12 again. The neural network generation device 300 may instead return to step S11 and have the user re-enter the hardware information HW.

<Output step (S16)>
In step S16, the hardware generation unit 324 of the neural network generation device 300 generates the neural network hardware model 400 based on the hardware information HW and the NN execution model 100.

<Software generation step (S17)>
In step S17, the software generation unit 325 of the neural network generation device 300 generates, based on the network information NW, the NN execution model 100, and the like, the software 500 that operates the neural network hardware 600 (the neural network hardware model 400 implemented on the target hardware). The software 500 includes software that transfers the learned parameters PM to the neural network hardware 600 as needed.

The software generation step (S17) includes, for example, an input data conversion step (S17-1), an input data division step (S17-2), a network division step (S17-3), and an allocation step (S17-4).

<Input data conversion step (S17-1)>
When the input conversion unit 49 is not implemented as hardware in the convolution operation circuit 4, the software generation unit 325 converts the input data b in advance, as preprocessing, to generate the converted input data a. The conversion method used in the input data conversion step is the same as the conversion method used by the input conversion unit 49.
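The following sketch illustrates what such a software-side conversion could look like when two threshold groups are used. It follows only what the claims state (two quantization means with different threshold groups inside the effective range, thresholds arranged alternately and evenly, results concatenated). The function names, the choice of three thresholds per group, and the "count the thresholds reached" rule are assumptions for illustration and are not taken from the description of the input conversion unit 49.

```python
from bisect import bisect_right


def make_interleaved_thresholds(vr_min, vr_max, per_group=3, groups=2):
    """Place evenly spaced thresholds in [vr_min, vr_max] and deal them
    out alternately to the threshold groups (hypothetical layout)."""
    total = per_group * groups
    step = (vr_max - vr_min) / (total + 1)
    all_thresholds = [vr_min + step * (k + 1) for k in range(total)]
    return [all_thresholds[g::groups] for g in range(groups)]


def quantize(value, thresholds):
    """Return the number of thresholds the value has reached,
    i.e. an integer in 0..len(thresholds)."""
    return bisect_right(thresholds, value)


def convert_input(values, threshold_groups):
    """Quantize each element with every threshold group and
    concatenate the per-group results for that element."""
    return [[quantize(v, tg) for tg in threshold_groups] for v in values]
```

With three thresholds per group, each quantized value fits in 2 bits, so both quantized outputs have the same bit width, which matches the equal-bit-count condition in the claims.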

<Input data division step (S17-2): data division>
The software generation unit 325 divides the input data a of the convolution operation of the convolution layer 210 into partial tensors based on the capacity of the memories allocated as the first memory 1 and the second memory 2, the specifications and sizes (Bc and Bd) of the arithmetic units, and the like. The method of division into partial tensors and the number of divisions are not particularly limited. A partial tensor is formed, for example, by dividing the input data a(x+i, y+j, c) into a(x+i, y+j, co).

FIG. 22 is a diagram illustrating data division and data expansion for the convolution operation.
In the data division of the convolution operation, the variable c in Equation 1 is divided into blocks of size Bc, as shown in Equation 8. Likewise, the variable d in Equation 1 is divided into blocks of size Bd, as shown in Equation 9. In Equation 8, co is an offset and ci is an index from 0 to (Bc-1). In Equation 9, do is an offset and di is an index from 0 to (Bd-1). The sizes Bc and Bd may be equal.
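Equations 8 and 9 are not reproduced in this text. Based on the element ranges given for the input vector A and the weight matrix W (from co×Bc to co×Bc+(Bc−1), and from do×Bd to do×Bd+(Bd−1)), the split can be assumed to take the form c = co×Bc + ci and d = do×Bd + di. The helper names below are hypothetical, and the sketch only illustrates that assumed decomposition.

```python
def split_index(index, block_size):
    """Return (offset, inner) such that index == offset * block_size + inner
    and 0 <= inner < block_size."""
    return divmod(index, block_size)


def merge_index(offset, inner, block_size):
    """Inverse of split_index: rebuild the original channel index."""
    return offset * block_size + inner
```

For example, with Bc = 4 the channel index c = 10 splits into co = 2 and ci = 2.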

The input data a(x+i, y+j, c) in Equation 1 is divided in the c-axis direction into blocks of size Bc and is represented as the divided input data a(x+i, y+j, co). In the following description, the divided input data a is also referred to as the "divided input data a".

The weight w(i, j, c, d) in Equation 1 is divided into blocks of size Bc in the c-axis direction and size Bd in the d-axis direction and is represented as the divided weight w(i, j, co, do). In the following description, the divided weight w is also referred to as the "divided weight w".

The output data f(x, y, do) divided into blocks of size Bd is obtained by Equation 10. The final output data f(x, y, d) can be calculated by combining the divided output data f(x, y, do).
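The sketch below checks numerically that a convolution computed block by block gives the same result as the undivided one. Equations 1 and 10 are not reproduced in this text, so two assumptions are made here: Equation 1 is taken to be the usual sum over i, j, and c of a(x+i, y+j, c) × w(i, j, c, d), and the blocked form is taken to accumulate partial sums over the offsets co. The function names are hypothetical, and C and D are assumed to be multiples of Bc and Bd.

```python
def conv_direct(a, w):
    """Undivided convolution: f[x][y][d] = sum over i, j, c of
    a[x+i][y+j][c] * w[i][j][c][d] (assumed form of Equation 1)."""
    Ki, Kj, C, D = len(w), len(w[0]), len(w[0][0]), len(w[0][0][0])
    X, Y = len(a) - Ki + 1, len(a[0]) - Kj + 1
    return [[[sum(a[x + i][y + j][c] * w[i][j][c][d]
                  for i in range(Ki) for j in range(Kj) for c in range(C))
              for d in range(D)] for y in range(Y)] for x in range(X)]


def conv_blocked(a, w, Bc, Bd):
    """Same convolution with c split into blocks of size Bc and d into
    blocks of size Bd; partial sums are accumulated over the offsets co."""
    Ki, Kj, C, D = len(w), len(w[0]), len(w[0][0]), len(w[0][0][0])
    X, Y = len(a) - Ki + 1, len(a[0]) - Kj + 1
    f = [[[0] * D for _ in range(Y)] for _ in range(X)]
    for do in range(D // Bd):          # output-channel block
        for co in range(C // Bc):      # input-channel block
            for x in range(X):
                for y in range(Y):
                    for di in range(Bd):
                        d = do * Bd + di
                        f[x][y][d] += sum(
                            a[x + i][y + j][co * Bc + ci]
                            * w[i][j][co * Bc + ci][d]
                            for i in range(Ki) for j in range(Kj)
                            for ci in range(Bc))
    return f
```

Because each (do, co) block touches only Bc input channels and Bd output channels at a time, the working set can be sized to the memories allocated as the first memory 1 and the second memory 2.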

<Input data division step (S17-3): data expansion>
The software generation unit 325 expands the divided input data a and the divided weights w onto the convolution operation circuit 4 of the NN execution model 100.

The divided input data a(x+i, y+j, co) is expanded into vector data having Bc elements. The elements of the divided input data a are indexed by ci (0 ≤ ci < Bc). In the following description, the divided input data a expanded into vector data for each i and j is also referred to as the "input vector A". The elements of the input vector A run from the divided input data a(x+i, y+j, co×Bc) to the divided input data a(x+i, y+j, co×Bc+(Bc−1)).

The divided weight w(i, j, co, do) is expanded into matrix data having Bc×Bd elements. The elements of the divided weight w expanded into matrix data are indexed by ci and di (0 ≤ di < Bd). In the following description, the divided weight w expanded into matrix data for each i and j is also referred to as the "weight matrix W". The elements of the weight matrix W run from the divided weight w(i, j, co×Bc, do×Bd) to the divided weight w(i, j, co×Bc+(Bc−1), do×Bd+(Bd−1)).

Vector data is calculated by multiplying the input vector A by the weight matrix W. The output data f(x, y, do) is obtained by shaping the vector data calculated for each i, j, and co into a three-dimensional tensor. Expanding the data in this way allows the convolution operation of the convolution layer 210 to be carried out as multiplications of vector data by matrix data.
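The expansion can be sketched as follows: one input vector A (Bc elements) and one weight matrix W (Bc×Bd elements) are built for each i, j, and co, and their products are combined into the Bd output values of one block. The function names are hypothetical, and summing the per-(i, j, co) product vectors is an assumption about how the vectors are combined, since Equation 10 is not reproduced in this text.

```python
def vec_mat(A, W):
    """Multiply a length-Bc vector by a Bc x Bd matrix (length-Bd result)."""
    return [sum(A[ci] * W[ci][di] for ci in range(len(A)))
            for di in range(len(W[0]))]


def input_vector(a, x, y, i, j, co, Bc):
    """Input vector A: a(x+i, y+j, co*Bc) .. a(x+i, y+j, co*Bc + Bc - 1)."""
    return [a[x + i][y + j][co * Bc + ci] for ci in range(Bc)]


def weight_matrix(w, i, j, co, do, Bc, Bd):
    """Weight matrix W: rows indexed by ci, columns indexed by di."""
    return [[w[i][j][co * Bc + ci][do * Bd + di] for di in range(Bd)]
            for ci in range(Bc)]


def output_block(a, w, x, y, do, Bc, Bd):
    """Accumulate the vector-matrix products over i, j, co to obtain
    the Bd output values f(x, y, do*Bd) .. f(x, y, do*Bd + Bd - 1)."""
    Ki, Kj, C = len(w), len(w[0]), len(w[0][0])
    acc = [0] * Bd
    for i in range(Ki):
        for j in range(Kj):
            for co in range(C // Bc):
                part = vec_mat(input_vector(a, x, y, i, j, co, Bc),
                               weight_matrix(w, i, j, co, do, Bc, Bd))
                acc = [s + p for s, p in zip(acc, part)]
    return acc
```

Each call to `vec_mat` corresponds to one multiplication of an input vector A by a weight matrix W, which is the unit of work the text assigns to the convolution operation circuit 4.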

<Allocation step (S17-4)>
The software generation unit 325 generates the software 500 that allocates the divided operations to the neural network hardware 600 and has it execute them (allocation step). The generated software 500 includes the instruction command C4. If the input data b was converted in the input data conversion step (S17-1), the software 500 includes the converted input data a.

As described above, the neural network generation device 300, the neural network control method, and the software generation program according to the present embodiment make it possible to generate and control a neural network that can be embedded in an embedded device such as an IoT device and operated with high performance.

The first embodiment of the present invention has been described in detail above with reference to the drawings, but the specific configuration is not limited to this embodiment and includes design changes and the like that do not depart from the gist of the present invention. The components shown in the above embodiment and modifications may also be combined as appropriate.

(Modification 1-1)
In the above embodiment, the first memory 1 and the second memory 2 are separate memories, but they are not limited to this form. The first memory 1 and the second memory 2 may be, for example, a first memory area and a second memory area within the same memory.

(Modification 1-2)
For example, the data input to the NN execution model 100 or the neural network hardware 600 described in the above embodiment is not limited to a single format and may consist of still images, moving images, audio, text, numerical values, or combinations thereof. The data input to the NN execution model 100 or the neural network hardware 600 is also not limited to measurement results from physical quantity measuring instruments that can be mounted on the edge device provided with the neural network hardware 600, such as optical sensors, thermometers, Global Positioning System (GPS) measuring instruments, angular velocity measuring instruments, and anemometers. Different kinds of information may be combined, such as peripheral information received from peripheral devices via wired or wireless communication (base station information, information on vehicles, ships, and the like, weather information, and congestion information), financial information, and personal information.

(Modification 1-3)
The edge device provided with the neural network hardware 600 is assumed to be a battery-driven communication device such as a mobile phone, a smart device such as a personal computer, or a mobile device such as a digital camera, a game console, or a robot product, but is not limited to these. Benefits not found in prior examples can also be obtained by using it in products with strong requirements for a limit on the peak power that can be supplied, for example via Power over Ethernet (PoE), for reduced heat generation, or for long operating times. For example, applying it to in-vehicle cameras mounted on vehicles or ships, or to surveillance cameras installed in public facilities or on roads, not only enables long-duration recording but also contributes to weight reduction and higher durability. Similar effects can be obtained by applying it to display devices such as televisions and displays, medical equipment such as medical cameras and surgical robots, and work robots used at manufacturing or construction sites.

The program of the above-described embodiment may be recorded on a computer-readable recording medium, and the embodiment may be realized by having a computer system read and execute the program recorded on that medium. The "computer system" referred to here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into computer systems. The "computer-readable recording medium" may further include media that hold the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and media that hold the program for a certain period, such as volatile memory inside a computer system acting as the server or client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

The effects described in this specification are merely explanatory or illustrative and are not limiting. That is, the technology according to the present disclosure may achieve, in addition to or instead of the above effects, other effects that are apparent to those skilled in the art from the description of this specification.

The present invention can be applied to the generation of neural networks.

300 Neural network generation device
200 Convolutional neural network (CNN)
100 Neural network execution model (NN execution model)
400 Neural network hardware model
500 Software
600 Neural network hardware (neural network operation device)
1 First memory
2 Second memory
3 DMA controller (DMAC)
4 Convolution operation circuit
42 Multiplier
43 Accumulator circuit
49 Input conversion unit
491 First conversion unit
492 Second conversion unit
493 Third conversion unit
495 Threshold memory
5 Quantization operation circuit
6 Controller
PM Learned parameters
DS Learning data set
HW Hardware information
NW Network information
TG1 First threshold group
TG2 Second threshold group
TG3 Third threshold group

Claims

1. A neural network generation device that generates a neural network execution model for operating a neural network, wherein
the neural network execution model converts input data into first quantized data obtained by quantizing the input data by first quantization means and second quantized data obtained by quantizing the input data by second quantization means different from the first quantization means.

2. The neural network generation device according to claim 1, wherein
the number of bits of the first quantized data is equal to the number of bits of the second quantized data, and
the neural network execution model performs a convolution operation on data in which the first quantized data and the second quantized data are concatenated.

3. The neural network generation device according to claim 1 or 2, wherein
the first quantization means converts the input data into the first quantized data using a first threshold group, and
the second quantization means converts the input data into the second quantized data using a second threshold group that is at least partially different from the first threshold group.

4. The neural network generation device according to claim 3, wherein
the thresholds of the first threshold group and the second threshold group are values included in an effective range set within the range that the input data can take.

5. The neural network generation device according to claim 4, wherein
the thresholds of the first threshold group and the second threshold group are values that divide the effective range substantially evenly.

6. The neural network generation device according to claim 4 or 5, wherein
the thresholds of the first threshold group and the thresholds of the second threshold group are values arranged alternately within the effective range.

7. The neural network generation device according to any one of claims 3 to 6, wherein
the thresholds of the first threshold group are values arranged with substantially the same step width.

8. The neural network generation device according to claim 3 or 4, wherein
the thresholds of the first threshold group and the second threshold group are values set based on distribution information of the input data.

9. A neural network operation device comprising:
an input conversion unit that converts input data into first quantized data obtained by quantizing the input data by first quantization means and second quantized data obtained by quantizing the input data by second quantization means different from the first quantization means; and
a convolution operation circuit that receives, as input, data in which the first quantized data and the second quantized data are concatenated.

10. The neural network operation device according to claim 9, wherein
the first quantization means converts the input data into the first quantized data using a first threshold group, and
the second quantization means converts the input data into the second quantized data using a second threshold group that is at least partially different from the first threshold group.

11. The neural network operation device according to claim 10, wherein
the thresholds of the first threshold group and the second threshold group are values included in an effective range set within the range that the input data can take, and
the thresholds of the first threshold group and the second threshold group are values that divide the effective range substantially evenly.

12. An edge device comprising:
the neural network operation device according to any one of claims 9 to 11; and
a power supply that operates the neural network operation device.

13. A neural network control method for controlling neural network hardware that operates a neural network, the method comprising:
a conversion step of converting input data into first quantized data obtained by quantizing the input data by first quantization means and second quantized data obtained by quantizing the input data by second quantization means different from the first quantization means; and
an operation step of performing a convolution operation on the first quantized data and the second quantized data.

14. The neural network control method according to claim 13, wherein
the conversion step is performed as preprocessing by a device other than the neural network hardware.

15. A software generation program that generates software for controlling neural network hardware that operates a neural network, the program generating the software that comprises:
a conversion step of converting input data into first quantized data obtained by quantizing the input data by first quantization means and second quantized data obtained by quantizing the input data by second quantization means different from the first quantization means; and
an operation step of performing a convolution operation on the first quantized data and the second quantized data.

16. A software generation program that generates software for controlling neural network hardware that operates a neural network, the program generating the software that comprises
an operation step of performing a convolution operation using first quantized data obtained by quantizing input data by first quantization means and second quantized data obtained by quantizing the input data by second quantization means different from the first quantization means.