JP2023006509A - Software generation device and software generation method

Info

Publication number: JP2023006509A
Application number: JP2021109142A
Authority: JP
Inventors: 南樹前田; Namiki Maeda
Original assignee: Leap Mind Inc
Current assignee: Leap Mind Inc
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2023-01-18

Abstract

PROBLEM TO BE SOLVED: To provide a software generation device and a software generation method capable of generating and controlling a neural network that can be incorporated into an embedded device such as an IoT device and operated at high performance.

SOLUTION: A neural network generation device generates software for controlling a neural network execution model (NN execution model). It includes a software generation part that analyzes information on a model including a plurality of layers operating in the NN execution model, determines lifetimes corresponding to the plurality of layers included in the model based on the analysis result, and generates the software based on the lifetimes.

SELECTED DRAWING: Figure 2

Description

The present invention relates to a software generation device and a software generation method.

In recent years, convolutional neural networks (CNNs) have been used as models for image recognition and similar tasks. A convolutional neural network has a multilayer structure with convolution layers and pooling layers, and requires a large number of operations such as convolution operations. Various computation methods have been devised to speed up the operations of convolutional neural networks (Patent Document 1, etc.).

Patent Document 1: JP 2018-077829 A

Meanwhile, image recognition and similar functions based on convolutional neural networks are also used in embedded devices such as IoT devices. To operate a convolutional neural network efficiently in an embedded device, it is desirable to generate circuits and models that perform neural network computations matched to the hardware configuration of the embedded device. A software generation method that operates these circuits and models efficiently and at high speed is also desired.

In view of the above circumstances, an object of the present invention is to provide a software generation device that generates circuits and models which perform neural network computations, can be incorporated into embedded devices such as IoT devices, and can operate efficiently and at high speed, and to provide a software generation method for operating such circuits and models efficiently and at high speed.

In order to solve the above problems, the present invention proposes the following means.
A neural network generation device according to a first aspect of the present invention is a software generation device that generates software for controlling a neural network circuit, and comprises: analysis means for analyzing information on a model including a plurality of layers operating in the neural network circuit; determination means for determining lifetimes corresponding to the plurality of layers included in the model based on the analysis result of the analysis means; and generation means for generating the software based on the lifetimes.

The software generation device and software generation method of the present invention can generate and control a neural network that can be incorporated into embedded devices such as IoT devices and operated at high performance.

FIG. 1 is a diagram showing a neural network generation device according to an embodiment.
FIG. 2 is a diagram showing the inputs and outputs of the calculation unit of the neural network generation device.
FIG. 3 is a diagram showing an example of a convolutional neural network.
FIG. 4 is a diagram explaining the convolution operation performed by a convolution layer of the convolutional neural network.
FIG. 5 is a diagram showing an example of a neural network execution model.
FIG. 6 is a timing chart showing an operation example of the neural network execution model.
FIG. 7 is a control flowchart of the neural network generation device.
FIG. 8 is an internal block diagram of the generated convolution operation circuit.
FIG. 9 is an internal block diagram of the multiplier of the convolution operation circuit.
FIG. 10 is an internal block diagram of the sum-of-products operation unit of the multiplier.
FIG. 11 is an internal block diagram of the accumulator circuit of the convolution operation circuit.
FIG. 12 is an internal block diagram of the accumulator unit of the accumulator circuit.
FIG. 13 is a state transition diagram of the control circuit of the convolution operation circuit.
FIG. 14 is an internal block diagram of the generated quantization operation circuit.
FIG. 15 is an internal block diagram of the vector operation circuit and the quantization circuit of the quantization operation circuit.
FIG. 16 is a block diagram of the arithmetic unit of the vector operation circuit.
FIG. 17 is an internal block diagram of the quantization unit of the quantization circuit.
FIG. 18 is an internal block diagram of the generated DMAC.
FIG. 19 is an internal block diagram of an edge device.
FIG. 20 is a diagram showing an example of a neural network and an example of conventional allocation.
FIG. 21 is a diagram explaining the allocation process.
FIG. 22 is a diagram showing an example of a neural network and an example of allocation.

(Embodiment)
An embodiment of the present invention will be described with reference to FIGS. 1 to 22.
FIG. 1 is a diagram showing a neural network generation device 300 according to this embodiment.

[Neural network generation device 300]
The neural network generation device 300 is a device that generates a trained neural network execution model 100 that can be incorporated into an embedded device such as an IoT device. The neural network execution model 100 is a software or hardware model generated so that a convolutional neural network 200 (hereinafter referred to as "CNN 200") can be computed in the embedded device.

The neural network generation device 300 is a program-executable device (computer) equipped with a processor such as a CPU (Central Processing Unit) and hardware such as memory. The functions of the neural network generation device 300 are realized by executing a neural network generation program on the device. The neural network generation device 300 includes a storage unit 310, a calculation unit 320, a data input unit 330, a data output unit 340, a display unit 350, and an operation input unit 360.

The storage unit 310 stores hardware information HW, network information NW, a learning data set DS, a neural network execution model 100 (hereinafter referred to as "NN execution model 100"), and learned parameters PM. The hardware information HW, the learning data set DS, and the network information NW are input data supplied to the neural network generation device 300. The NN execution model 100 and the learned parameters PM are output data produced by the neural network generation device 300. The "trained NN execution model 100" includes the NN execution model 100 and the learned parameters PM.

The hardware information HW is information about the embedded device on which the NN execution model 100 operates (hereinafter referred to as "operation target hardware"). The hardware information HW includes, for example, the device type, device constraints, memory configuration, bus configuration, operating frequency, power consumption, and manufacturing process type of the operation target hardware. The device type is, for example, ASIC (Application Specific Integrated Circuit) or FPGA (Field-Programmable Gate Array). The device constraints include the upper limit on the number of arithmetic units in the operation target device and the upper limit on circuit scale. The memory configuration covers the memory type, number of memories, memory capacity, and input/output data width. The bus configuration covers the bus type, bus width, bus communication standard, and the devices connected to the same bus. When multiple variations of the NN execution model 100 exist, the hardware information HW also includes information on which variation to use.
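As an illustration only, the items listed for the hardware information HW could be grouped into a structured record like the one below. The class names, field names, and all values are hypothetical; the source states only that HW is supplied in a predetermined data format and does not define that format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryConfig:
    kind: str               # memory type, e.g. "SRAM"
    count: int              # number of memories
    capacity_bytes: int     # capacity of each memory
    data_width_bits: int    # input/output data width


@dataclass
class BusConfig:
    kind: str               # bus type (hypothetical value below)
    width_bits: int         # bus width
    standard: str           # bus communication standard
    connected_devices: List[str] = field(default_factory=list)


@dataclass
class HardwareInfo:
    device_type: str            # "ASIC" or "FPGA"
    max_arithmetic_units: int   # device constraint: upper limit on arithmetic units
    max_circuit_scale: int      # device constraint: upper limit on circuit scale
    memory: MemoryConfig
    bus: BusConfig
    operating_frequency_hz: int
    power_budget_mw: float
    process: str                # manufacturing process type
    model_variation: str = "default"  # which NN execution model variation to use


# Hypothetical example values, not taken from the source.
hw = HardwareInfo(
    device_type="FPGA",
    max_arithmetic_units=1024,
    max_circuit_scale=200_000,
    memory=MemoryConfig(kind="SRAM", count=2, capacity_bytes=512 * 1024, data_width_bits=64),
    bus=BusConfig(kind="on-chip", width_bits=64, standard="example-standard",
                  connected_devices=["host CPU", "DRAM"]),
    operating_frequency_hz=200_000_000,
    power_budget_mw=500.0,
    process="example-process",
)
```

A record of this kind would let a generator check the device constraints before sizing the circuits, although the source does not describe such a check in this section.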

The network information NW is basic information about the CNN 200. It includes, for example, the network configuration, input data information, output data information, and quantization information of the CNN 200. The input data information includes the input data type, such as image or audio, and the input data size.

The learning data set DS includes learning data D1 used for training and test data D2 used for inference tests.

FIG. 2 is a diagram showing the inputs and outputs of the calculation unit 320.
The calculation unit 320 has an execution model generation unit 321, a learning unit 322, an inference unit 323, a hardware generation unit 324, and a software generation unit 325. The NN execution model 100 input to the calculation unit 320 may have been generated by a device other than the neural network generation device 300.

The execution model generation unit 321 generates the NN execution model 100 based on the hardware information HW and the network information NW. The NN execution model 100 is a software or hardware model generated so that the CNN 200 can be computed on the operation target hardware. The software includes software that controls the hardware model. The hardware model may be at the behavioral level, at RTL (Register Transfer Level), a netlist representing connections between gates and circuit modules, or a combination of these.

The learning unit 322 generates the learned parameters PM using the NN execution model 100 and the learning data D1. The inference unit 323 performs an inference test using the NN execution model 100 and the test data D2.

The hardware generation unit 324 generates a neural network hardware model 400 based on the hardware information HW and the NN execution model 100. The neural network hardware model 400 is a hardware model that can be implemented on the operation target hardware, and it is optimized for that hardware based on the hardware information HW. The neural network hardware model 400 may be at RTL (Register Transfer Level), a netlist representing connections between gates and circuit modules, or a combination of these. It may also be a parameter list or configuration file required to implement the NN execution model 100 in hardware; such a parameter list or configuration file is used in combination with a separately generated NN execution model 100.

In the following description, the neural network hardware model 400 implemented on the operation target hardware is referred to as "neural network hardware 600".

The software generation unit 325 generates software 500 that operates the neural network hardware 600, based on the network information NW or the NN execution model 100. The software 500 includes software that transfers the learned parameters PM to the neural network hardware 600 as needed. The software 500 may be in binary form as well as in source code form.

The hardware information HW, the network information NW, and other data needed to generate the trained NN execution model 100 are input to the data input unit 330. These are input, for example, as data written in a predetermined data format, and are stored in the storage unit 310. They may also be entered or changed by the user through the operation input unit 360.

The generated trained NN execution model 100 is output to the data output unit 340. For example, the generated NN execution model 100 and the learned parameters PM are output to the data output unit 340.

The display unit 350 has a known monitor such as an LCD display. The display unit 350 can display GUI (Graphical User Interface) images generated by the calculation unit 320, a console screen for accepting commands, and the like. When the calculation unit 320 requires information input from the user, the display unit 350 can display a message prompting the user to enter information through the operation input unit 360, as well as the GUI images needed for that input.

The operation input unit 360 is a device through which the user inputs instructions to the calculation unit 320 and other units. The operation input unit 360 is a known input device such as a touch panel, keyboard, or mouse. Input to the operation input unit 360 is transmitted to the calculation unit 320.

All or part of the functions of the calculation unit 320 are realized by one or more processors, such as a CPU (Central Processing Unit) or GPU (Graphics Processing Unit), executing a program stored in a program memory. However, all or part of the functions of the calculation unit 320 may instead be realized by hardware (for example, circuitry) such as an LSI (Large Scale Integration), ASIC (Application Specific Integrated Circuit), FPGA (Field-Programmable Gate Array), or PLD (Programmable Logic Device). All or part of the functions of the calculation unit 320 may also be realized by a combination of software and hardware.

All or part of the functions of the calculation unit 320 may be realized using external accelerators, such as CPUs, GPUs, or other hardware, provided in an external device such as a cloud server. For example, by also using a high-performance GPU or dedicated hardware on a cloud server, the calculation unit 320 can increase its computation speed.

The storage unit 310 is realized by flash memory, EEPROM (Electrically Erasable Programmable Read-Only Memory), ROM (Read-Only Memory), RAM (Random Access Memory), or the like. All or part of the storage unit 310 may be provided in an external device such as a cloud server and connected to the calculation unit 320 and other units via a communication line.

[Convolutional Neural Network (CNN) 200]
Next, the CNN 200 will be described. FIG. 3 is a diagram showing an example of the CNN 200. The network information NW of the CNN 200 is information on the configuration of the CNN 200 described below. The CNN 200 uses low-bit weights w and quantized input data a, which makes it easy to incorporate into embedded devices.

The CNN 200 is a multilayer network that includes convolution layers 210 that perform convolution operations, quantization operation layers 220 that perform quantization operations, and an output layer 230. In at least part of the CNN 200, the convolution layers 210 and the quantization operation layers 220 are connected alternately. The CNN 200 is a model widely used for image recognition and video recognition. The CNN 200 may further include layers with other functions, such as fully connected layers.

FIG. 4 is a diagram explaining the convolution operation performed by the convolution layer 210.
The convolution layer 210 performs a convolution operation on input data a using weights w. The convolution layer 210 performs a sum-of-products operation that takes the input data a and the weights w as inputs.

The input data a to the convolution layer 210 (also called activation data or a feature map) is multidimensional data such as image data. In this embodiment, the input data a is a three-dimensional tensor with elements (x, y, c). The convolution layer 210 of the CNN 200 performs the convolution operation on low-bit input data a. In this embodiment, the elements of the input data a are 2-bit unsigned integers (0, 1, 2, 3). The elements of the input data a may instead be, for example, 4-bit or 8-bit unsigned integers.

If the data input to the CNN 200 differs in format from the input data a of the convolution layer 210, for example because it is a 32-bit floating-point type, the CNN 200 may further have an input layer that performs type conversion and quantization before the convolution layer 210.

The weights w of the convolution layer 210 (also called filters or kernels) are multidimensional data whose elements are learnable parameters. In this embodiment, the weight w is a four-dimensional tensor with elements (i, j, c, d). The weight w consists of d three-dimensional tensors with elements (i, j, c) (hereinafter referred to as "weights wo"). The weights w of the trained CNN 200 are trained data. The convolution layer 210 of the CNN 200 performs the convolution operation using low-bit weights w. In this embodiment, the elements of the weight w are 1-bit signed integers (0, 1), where the value "0" represents +1 and the value "1" represents -1.
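The 1-bit weight encoding described above (stored value 0 means +1, stored value 1 means -1) can be sketched as a small decoding helper. The function name `decode_weight_bit` is hypothetical and is used here only for illustration.

```python
def decode_weight_bit(bit: int) -> int:
    """Map a stored 1-bit weight to its signed value: 0 -> +1, 1 -> -1."""
    if bit not in (0, 1):
        raise ValueError("weight bit must be 0 or 1")
    return 1 - 2 * bit


stored = [0, 1, 1, 0]
decoded = [decode_weight_bit(b) for b in stored]  # [1, -1, -1, 1]
```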

The convolution layer 210 performs the convolution operation shown in Equation 1 and outputs output data f. In Equation 1, s denotes the stride. The region indicated by the dotted line in FIG. 4 is one of the regions ao of the input data a to which a weight wo is applied (hereinafter referred to as "application region ao"). The elements of the application region ao are represented by (x+i, y+j, c).
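Equation 1 is not reproduced in this text, so the following is only a sketch under an assumed form: f(x, y, d) is the sum over i, j, c of a(s*x+i, s*y+j, c) * w(i, j, c, d), with no padding. The function name `conv2d` and the nested-list tensor layout are illustrative choices, not taken from the source.

```python
def conv2d(a, w, stride=1):
    """Sum-of-products convolution without padding.

    a is indexed as a[x][y][c]; w is indexed as w[i][j][c][d].
    Returns f indexed as f[x][y][d].
    """
    X, Y, C = len(a), len(a[0]), len(a[0][0])
    I, J, D = len(w), len(w[0]), len(w[0][0][0])
    out_x = (X - I) // stride + 1
    out_y = (Y - J) // stride + 1
    f = [[[0] * D for _ in range(out_y)] for _ in range(out_x)]
    for x in range(out_x):
        for y in range(out_y):
            for d in range(D):
                acc = 0
                # Accumulate over the application region of weight wo number d.
                for i in range(I):
                    for j in range(J):
                        for c in range(C):
                            acc += a[stride * x + i][stride * y + j][c] * w[i][j][c][d]
                f[x][y][d] = acc
    return f


# 3x3 input with one channel, elements are 2-bit unsigned values (0..3).
a = [[[0], [1], [2]],
     [[3], [0], [1]],
     [[2], [3], [0]]]
# 2x2 kernel, one input channel, one output channel, decoded weights of +1 or -1.
w = [[[[1]], [[-1]]],
     [[[-1]], [[1]]]]
f = conv2d(a, w, stride=1)  # [[[-4], [0]], [[4], [-4]]]
```

Because the decoded weights are only +1 or -1, each product reduces to adding or subtracting an activation, which is one reason low-bit networks suit small hardware.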

The quantization operation layer 220 performs quantization and related operations on the convolution output of the convolution layer 210. The quantization operation layer 220 has a pooling layer 221, a Batch Normalization layer 222, an activation function layer 223, and a quantization layer 224.

The pooling layer 221 compresses the output data f of the convolution layer 210 by applying an operation such as average pooling (Equation 2) or MAX pooling (Equation 3) to it. In Equations 2 and 3, u denotes the input tensor, v denotes the output tensor, and T denotes the size of the pooling region. In Equation 3, max is a function that outputs the maximum value of u over the combinations of i and j contained in T.
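Equations 2 and 3 are not reproduced in this text. The sketch below assumes non-overlapping T x T windows, so that v(x, y, c) is the mean or the maximum of u(T*x+i, T*y+j, c) for i and j in 0..T-1. The function name `pool2d` and its `mode` parameter are hypothetical.

```python
def pool2d(u, T, mode="max"):
    """Pool non-overlapping T x T windows of u[x][y][c], channel by channel."""
    X, Y, C = len(u), len(u[0]), len(u[0][0])
    out_x, out_y = X // T, Y // T
    v = [[[0] * C for _ in range(out_y)] for _ in range(out_x)]
    for x in range(out_x):
        for y in range(out_y):
            for c in range(C):
                window = [u[T * x + i][T * y + j][c]
                          for i in range(T) for j in range(T)]
                if mode == "max":
                    v[x][y][c] = max(window)
                else:  # average pooling
                    v[x][y][c] = sum(window) / (T * T)
    return v


u = [[[1], [3]],
     [[2], [6]]]
v_max = pool2d(u, T=2, mode="max")   # [[[6]]]
v_avg = pool2d(u, T=2, mode="avg")   # [[[3.0]]]
```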

The Batch Normalization layer 222 normalizes the data distribution of the output data of the quantization operation layer 220 or the pooling layer 221 by, for example, the operation shown in Equation 4. In Equation 4, u denotes the input tensor, v denotes the output tensor, α denotes the scale, and β denotes the bias. In the trained CNN 200, α and β are trained constant vectors.
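Equation 4 is not reproduced in this text, so its exact form is an assumption here. The sketch uses a per-channel affine form, v(x, y, c) = α(c) * (u(x, y, c) - β(c)), which matches the description of α as a scale vector and β as a bias vector; another affine arrangement such as α(c) * u + β(c) would be implemented the same way. The function name `batch_norm` is hypothetical.

```python
def batch_norm(u, alpha, beta):
    """Apply an assumed per-channel affine normalization to u[x][y][c].

    alpha[c] is the trained scale and beta[c] the trained bias of channel c.
    """
    return [[[alpha[c] * (value - beta[c]) for c, value in enumerate(pixel)]
             for pixel in row]
            for row in u]


u = [[[4, 10]]]            # one pixel with two channels
alpha = [0.5, 2.0]         # hypothetical trained scales
beta = [2, 1]              # hypothetical trained biases
v = batch_norm(u, alpha, beta)  # [[[1.0, 18.0]]]
```

Since α and β are constants after training, this step needs no batch statistics at inference time.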

The activation function layer 223 applies an activation function such as ReLU (Equation 5) to the output of the quantization operation layer 220, the pooling layer 221, or the Batch Normalization layer 222. In Equation 5, u is the input tensor and v is the output tensor. In Equation 5, max is a function that outputs the largest of its arguments.

The quantization layer 224 quantizes the output of the pooling layer 221 or the activation function layer 223 based on quantization parameters, for example as shown in Equation 6. The quantization shown in Equation 6 reduces the input tensor u to 2 bits. In Equation 6, q(c) is a vector of quantization parameters. In the trained CNN 200, q(c) is a trained constant vector. The inequality "≦" in Equation 6 may be "<".
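Equation 6 is not reproduced in this text. A minimal sketch of 2-bit quantization, assuming that q(c) holds three ascending thresholds per channel and that each value is compared with them using "≦", might look like the following. The function name `quantize_2bit` and the threshold values are hypothetical.

```python
def quantize_2bit(value, thresholds):
    """Reduce a value to 2 bits (0..3) by comparing it with three thresholds.

    thresholds is an ascending triple (th0, th1, th2) for one channel,
    standing in for the quantization parameter vector q(c).
    """
    th0, th1, th2 = thresholds
    if value <= th0:
        return 0
    if value <= th1:
        return 1
    if value <= th2:
        return 2
    return 3


q_c = (0.5, 1.5, 2.5)                      # hypothetical trained thresholds
values = [-1.0, 0.5, 1.0, 2.0, 9.0]
codes = [quantize_2bit(v, q_c) for v in values]  # [0, 0, 1, 2, 3]
```

Replacing `<=` with `<` gives the alternative noted for Equation 6; only values exactly equal to a threshold change their code.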

The output layer 230 is a layer that outputs the result of the CNN 200 using an identity function, a softmax function, or the like. The layer preceding the output layer 230 may be a convolution layer 210 or a quantization operation layer 220.

In the CNN 200, the quantized output data of the quantization layer 224 is input to the convolution layer 210, so the convolution load of the convolution layer 210 is smaller than in other convolutional neural networks that do not perform quantization.

[Neural network execution model 100 (NN execution model 100)]
Next, the NN execution model 100 will be described. FIG. 5 is a diagram showing an example of the NN execution model 100. The NN execution model 100 is a software or hardware model generated so that the CNN 200 can be computed on the operation target hardware. The software includes software that controls the hardware model. The hardware model may be at the behavioral level, at RTL (Register Transfer Level), a netlist representing connections between gates and circuit modules, or a combination of these.

The NN execution model 100 includes a first memory 1, a second memory 2, a DMA controller 3 (hereinafter also referred to as "DMAC 3"), a convolution operation circuit 4, a quantization operation circuit 5, and a controller 6. The NN execution model 100 is characterized in that the convolution operation circuit 4 and the quantization operation circuit 5 form a loop via the first memory 1 and the second memory 2.

The first memory 1 is a rewritable memory, for example a volatile memory such as SRAM (Static RAM). Data is written to and read from the first memory 1 via the DMAC 3 and the controller 6. The first memory 1 is connected to the input port of the convolution operation circuit 4, and the convolution operation circuit 4 can read data from the first memory 1. The first memory 1 is also connected to the output port of the quantization operation circuit 5, and the quantization operation circuit 5 can write data to the first memory 1. An external host CPU can input data to and output data from the NN execution model 100 by writing to and reading from the first memory 1.

The second memory 2 is a rewritable memory, for example a volatile memory such as SRAM (Static RAM). Data is written to and read from the second memory 2 via the DMAC 3 and the controller 6. The second memory 2 is connected to the input port of the quantization operation circuit 5, and the quantization operation circuit 5 can read data from the second memory 2. The second memory 2 is also connected to the output port of the convolution operation circuit 4, and the convolution operation circuit 4 can write data to the second memory 2. An external host CPU can input data to and output data from the NN execution model 100 by writing to and reading from the second memory 2.

The DMAC 3 is connected to the external bus EB and transfers data between an external memory such as DRAM and each of the first memory 1, the second memory 2, the convolution operation circuit 4, and the quantization operation circuit 5.

The convolution operation circuit 4 is a circuit that performs the convolution operation of the convolution layer 210 of the trained CNN 200. The convolution operation circuit 4 reads the input data a stored in the first memory 1 and performs the convolution operation on it. The convolution operation circuit 4 writes the output data f of the convolution operation (hereinafter also referred to as "convolution operation output data") to the second memory 2.

量子化演算回路５は、学習済みのＣＮＮ２００の量子化演算層２２０における量子化演算の少なくとも一部を行う回路である。量子化演算回路５は、第二メモリ２に格納された畳み込み演算の出力データｆを読み出し、畳み込み演算の出力データｆに対して量子化演算（プーリング、ＢａｔｃｈＮｏｒｍａｌｉｚａｔｉｏｎ、活性化関数、および量子化のうち少なくとも量子化を含む演算）を行う。量子化演算回路５は、量子化演算の出力データ（以降、「量子化演算出力データ」ともいう）оｕｔを第一メモリ１に書き込む。 The quantization operation circuit 5 is a circuit that performs at least part of the quantization operation in the quantization operation layer 220 of the trained CNN 200. The quantization operation circuit 5 reads the output data f of the convolution operation stored in the second memory 2 and performs a quantization operation on it (an operation that includes at least quantization among pooling, batch normalization, activation function, and quantization). The quantization operation circuit 5 writes the output data out of the quantization operation (hereinafter also referred to as “quantization operation output data”) to the first memory 1.

コントローラ６は、外部バスＥＢに接続されており、外部のホストＣＰＵのスレーブとして動作する。コントローラ６は、パラメータレジスタや状態レジスタを含むレジスタ６１を有している。パラメータレジスタは、ＮＮ実行モデル１００の動作を制御するレジスタである。状態レジスタはセマフォＳを含むＮＮ実行モデル１００の状態を示すレジスタである。外部ホストＣＰＵは、コントローラ６を経由して、レジスタ６１にアクセスできる。 The controller 6 is connected to the external bus EB and operates as a slave of an external host CPU. The controller 6 has registers 61 including parameter registers and status registers. A parameter register is a register that controls the operation of the NN execution model 100 . The state register is a register that indicates the state of the NN execution model 100 including the semaphore S. An external host CPU can access the register 61 via the controller 6 .

コントローラ６は、内部バスＩＢを介して、第一メモリ１と、第二メモリ２と、ＤＭＡＣ３と、畳み込み演算回路４と、量子化演算回路５と、接続されている。外部ホストＣＰＵは、コントローラ６を経由して、各ブロックに対してアクセスできる。例えば、外部ホストＣＰＵは、コントローラ６を経由して、ＤＭＡＣ３や畳み込み演算回路４や量子化演算回路５に対する命令を指示することができる。また、ＤＭＡＣ３や畳み込み演算回路４や量子化演算回路５は、内部バスＩＢを介して、コントローラ６が有する状態レジスタ（セマフォＳを含む）を更新できる。状態レジスタ（セマフォＳを含む）は、ＤＭＡＣ３や畳み込み演算回路４や量子化演算回路５と接続された専用配線を介して更新されるように構成されていてもよい。 The controller 6 is connected to the first memory 1, the second memory 2, the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5 via the internal bus IB. The external host CPU can access each block via the controller 6. For example, the external host CPU can issue instructions to the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5 via the controller 6. The DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5 can also update the status register (including the semaphore S) of the controller 6 via the internal bus IB. Alternatively, the status register (including the semaphore S) may be configured to be updated via dedicated wiring connected to the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5.

ＮＮ実行モデル１００は、第一メモリ１や第二メモリ２等を有するため、ＤＲＡＭなどの外部メモリからのＤＭＡＣ３によるデータ転送において、重複するデータのデータ転送の回数を低減できる。これにより、メモリアクセスにより発生する消費電力を大幅に低減することができる。 Since the NN execution model 100 has the first memory 1, the second memory 2, etc., it is possible to reduce the number of data transfers of overlapping data in the data transfer by the DMAC 3 from the external memory such as DRAM. As a result, power consumption caused by memory access can be greatly reduced.

図６は、ＮＮ実行モデル１００の動作例を示すタイミングチャートである。ＮＮ実行モデル１００は、複数のレイヤの多層構造であるＣＮＮ２００の演算を、ループ状に形成された回路により演算する。ＮＮ実行モデル１００は、ループ状の回路構成により、ハードウェア資源を効率的に利用できる。以下、図６に示すニューラルネットワークハードウェア６００の動作例を説明する。 FIG. 6 is a timing chart showing an operation example of the NN execution model 100. The NN execution model 100 performs the operations of the CNN 200, which has a multilayer structure of a plurality of layers, using circuits formed in a loop. The loop-shaped circuit configuration allows the NN execution model 100 to use hardware resources efficiently. An operation example of the neural network hardware 600 shown in FIG. 6 is described below.

ＤＭＡＣ３は、レイヤ１（図３参照）の入力データａを第一メモリ１に格納する。ＤＭＡＣ３は、畳み込み演算回路４が行う畳み込み演算の順序にあわせて、レイヤ１の入力データａを分割して第一メモリ１に転送してもよい。 The DMAC 3 stores the input data a of layer 1 (see FIG. 3) in the first memory 1 . The DMAC 3 may divide the input data a of the layer 1 according to the order of the convolution operation performed by the convolution operation circuit 4 and transfer the divided data to the first memory 1 .

畳み込み演算回路４は、第一メモリ１に格納されたレイヤ１（図３参照）の入力データａを読み出す。畳み込み演算回路４は、レイヤ１の入力データａに対してレイヤ１の畳み込み演算を行う。レイヤ１の畳み込み演算の出力データｆは、第二メモリ２に格納される。 The convolution operation circuit 4 reads the input data a of layer 1 (see FIG. 3) stored in the first memory 1 . The convolution operation circuit 4 performs a layer 1 convolution operation on layer 1 input data a. The output data f of the layer 1 convolution operation is stored in the second memory 2 .

量子化演算回路５は、第二メモリ２に格納されたレイヤ１の出力データｆを読み出す。量子化演算回路５は、レイヤ１の出力データｆに対してレイヤ２の量子化演算を行う。レイヤ２の量子化演算の出力データоｕｔは、第一メモリ１に格納される。 The quantization arithmetic circuit 5 reads the layer 1 output data f stored in the second memory 2 . A quantization operation circuit 5 performs a layer 2 quantization operation on layer 1 output data f. The output data out of the layer 2 quantization operation are stored in the first memory 1 .

畳み込み演算回路４は、第一メモリ１に格納されたレイヤ２の量子化演算の出力データを読み出す。畳み込み演算回路４は、レイヤ２の量子化演算の出力データоｕｔを入力データａとしてレイヤ３の畳み込み演算を行う。レイヤ３の畳み込み演算の出力データｆは、第二メモリ２に格納される。 The convolution operation circuit 4 reads the output data of the layer 2 quantization operation stored in the first memory 1 . The convolution operation circuit 4 performs a layer 3 convolution operation using the output data out of the layer 2 quantization operation as input data a. The output data f of the layer 3 convolution operation is stored in the second memory 2 .

畳み込み演算回路４は、第一メモリ１に格納されたレイヤ２Ｍ－２（Ｍは自然数）の量子化演算の出力データоｕｔを読み出す。畳み込み演算回路４は、レイヤ２Ｍ－２の量子化演算の出力データоｕｔを入力データａとしてレイヤ２Ｍ－１の畳み込み演算を行う。レイヤ２Ｍ－１の畳み込み演算の出力データｆは、第二メモリ２に格納される。 The convolution operation circuit 4 reads the output data out of the quantization operation of layer 2M-2 (M is a natural number) stored in the first memory 1. The convolution operation circuit 4 performs the layer 2M-1 convolution operation using the output data out of the layer 2M-2 quantization operation as the input data a. The output data f of the layer 2M-1 convolution operation is stored in the second memory 2.

量子化演算回路５は、第二メモリ２に格納されたレイヤ２Ｍ－１の出力データｆを読み出す。量子化演算回路５は、２Ｍ－１レイヤの出力データｆに対してレイヤ２Ｍの量子化演算を行う。レイヤ２Ｍの量子化演算の出力データоｕｔは、第一メモリ１に格納される。 The quantization arithmetic circuit 5 reads the layer 2M-1 output data f stored in the second memory 2 . The quantization operation circuit 5 performs a layer 2M quantization operation on the output data f of the 2M-1 layer. The output data out of the layer 2M quantization operation are stored in the first memory 1 .

畳み込み演算回路４は、第一メモリ１に格納されたレイヤ２Ｍの量子化演算の出力データоｕｔを読み出す。畳み込み演算回路４は、レイヤ２Ｍの量子化演算の出力データоｕｔを入力データａとしてレイヤ２Ｍ＋１の畳み込み演算を行う。レイヤ２Ｍ＋１の畳み込み演算の出力データｆは、第二メモリ２に格納される。 The convolution operation circuit 4 reads the output data out of the layer 2M quantization operation stored in the first memory 1 . The convolution operation circuit 4 performs a layer 2M+1 convolution operation using the output data out of the layer 2M quantization operation as input data a. The output data f of the layer 2M+1 convolution operation are stored in the second memory 2 .

畳み込み演算回路４と量子化演算回路５とが交互に演算を行い、図３に示すＣＮＮ２００の演算を進めていく。ＮＮ実行モデル１００は、畳み込み演算回路４が時分割によりレイヤ２Ｍ－１の畳み込み演算とレイヤ２Ｍ＋１を実施する。また、ＮＮ実行モデル１００は、量子化演算回路５が時分割によりレイヤ２Ｍ－２の量子化演算とレイヤ２Ｍを実施する。そのため、ＮＮ実行モデル１００は、レイヤごとに別々の畳み込み演算回路４と量子化演算回路５を実装する場合と比較して、回路規模が著しく小さい。 The convolution operation circuit 4 and the quantization operation circuit 5 perform operations alternately to advance the operations of the CNN 200 shown in FIG. 3. In the NN execution model 100, the convolution operation circuit 4 performs the layer 2M-1 and layer 2M+1 convolution operations by time division. Likewise, the quantization operation circuit 5 performs the layer 2M-2 and layer 2M quantization operations by time division. The NN execution model 100 therefore has a significantly smaller circuit scale than an implementation with a separate convolution operation circuit 4 and quantization operation circuit 5 for each layer.
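The alternating schedule described above can be illustrated with a short Python sketch. It is only a model of the sequence in the text, not the device's control software. The function name `layer_schedule` and the tuple layout are hypothetical. Odd-numbered layers are assigned to the convolution operation circuit 4 and even-numbered layers to the quantization operation circuit 5, with the first memory 1 and second memory 2 swapping roles as source and destination.

```python
def layer_schedule(num_layers):
    """Return (layer, circuit, source memory, destination memory) per layer.

    Hypothetical helper: odd layers are convolutions, even layers are
    quantization operations, as in the operation example of FIG. 6.
    """
    schedule = []
    for layer in range(1, num_layers + 1):
        if layer % 2 == 1:
            # Convolution reads input data a from memory 1, writes f to memory 2.
            schedule.append((layer, "convolution operation circuit 4",
                             "first memory 1", "second memory 2"))
        else:
            # Quantization reads f from memory 2, writes out to memory 1.
            schedule.append((layer, "quantization operation circuit 5",
                             "second memory 2", "first memory 1"))
    return schedule


for step in layer_schedule(4):
    print(step)
```

Each step's destination memory is the next step's source memory, which is why the two circuits can be reused for every layer in a loop.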

なお、図６においては、畳み込み演算回路４と量子化演算回路５とが交互に演算を行い、図３に示すＣＮＮ２００の演算を進めていく例を示したが、これに限られるものではない。例えば、量子化演算回路５が時分割によりレイヤ２Ｍの量子化演算を行うのと並行して、畳み込み演算回路４がレイヤ２Ｍ＋１を実施するよう制御してもよい。この動作により、より演算効率を高めることが可能となる。また、別の例として、連続しない二つのレイヤに対して並列した演算を行ってもよい。 Although FIG. 6 shows an example in which the convolution operation circuit 4 and the quantization operation circuit 5 alternately perform operations to advance the operation of the CNN 200 shown in FIG. 3, the present invention is not limited to this. For example, the convolution operation circuit 4 may be controlled to perform the layer 2M+1 in parallel with the quantization operation circuit 5 performing the layer 2M quantization operation by time division. This operation makes it possible to further improve the computational efficiency. As another example, parallel operations may be performed on two discontinuous layers.

［ニューラルネットワーク生成装置３００の動作］
次に、ニューラルネットワーク生成装置３００の動作（ニューラルネットワーク制御方法）を、図７に示すニューラルネットワーク生成装置３００の制御フローチャートに沿って説明する。ニューラルネットワーク生成装置３００は初期化処理（ステップＳ１０）を実施した後、ステップＳ１１を実行する。 [Operation of Neural Network Generation Device 300]
Next, the operation of the neural network generation device 300 (neural network control method) will be described with reference to the control flowchart of the neural network generation device 300 shown in FIG. After executing the initialization process (step S10), the neural network generation device 300 executes step S11.

＜ハードウェア情報取得工程（Ｓ１１）＞
ステップＳ１１において、ニューラルネットワーク生成装置３００は、動作対象ハードウェアのハードウェア情報ＨＷを取得する（ハードウェア情報取得工程）。ニューラルネットワーク生成装置３００は、例えば、データ入力部３３０に入力されたハードウェア情報ＨＷを取得する。ニューラルネットワーク生成装置３００は、表示部３５０にハードウェア情報ＨＷの入力に必要なＧＵＩ画像を表示させ、使用者にハードウェア情報ＨＷを操作入力部３６０から入力させることでハードウェア情報ＨＷを取得してもよい。 <Hardware Information Acquisition Step (S11)>
In step S11, the neural network generation device 300 acquires the hardware information HW of the target hardware (hardware information acquisition step). The neural network generation device 300 acquires, for example, the hardware information HW input to the data input unit 330. Alternatively, the neural network generation device 300 may acquire the hardware information HW by displaying on the display unit 350 a GUI image needed for entering the hardware information HW and having the user enter it from the operation input unit 360.

ハードウェア情報ＨＷは、具体的には、第一メモリ１および第二メモリ２として割り当てるメモリのメモリ種別やメモリ容量や入出力データ幅を有する。 The hardware information HW specifically has memory types, memory capacities, and input/output data widths of the memories to be allocated as the first memory 1 and the second memory 2 .

取得されたハードウェア情報ＨＷは、記憶部３１０に記憶される。次に、ニューラルネットワーク生成装置３００は、ステップＳ１２を実行する。 The acquired hardware information HW is stored in the storage unit 310 . Next, the neural network generation device 300 executes step S12.

＜ネットワーク情報取得工程（Ｓ１２）＞
ステップＳ１２において、ニューラルネットワーク生成装置３００は、ＣＮＮ２００のネットワーク情報ＮＷを取得する（ネットワーク情報取得工程）。ニューラルネットワーク生成装置３００は、例えば、データ入力部３３０に入力されたネットワーク情報ＮＷを取得する。ニューラルネットワーク生成装置３００は、表示部３５０にネットワーク情報ＮＷの入力に必要なＧＵＩ画像を表示させ、使用者にネットワーク情報ＮＷを操作入力部３６０から入力させることでネットワーク情報ＮＷを取得してもよい。 <Network information acquisition step (S12)>
In step S12, the neural network generation device 300 acquires the network information NW of the CNN 200 (network information acquisition step). The neural network generation device 300 acquires, for example, the network information NW input to the data input unit 330. Alternatively, the neural network generation device 300 may acquire the network information NW by displaying on the display unit 350 a GUI image needed for entering the network information NW and having the user enter it from the operation input unit 360.

ネットワーク情報ＮＷは、具体的には、入力層や出力層２３０を含むネットワーク構成と、重みｗや入力データａのビット幅を含む畳み込み層２１０の構成と、量子化情報を含む量子化演算層２２０の構成と、を有する。 Specifically, the network information NW includes the network configuration including the input layer and the output layer 230, the configuration of the convolution layer 210 including the bit widths of the weight w and the input data a, and the configuration of the quantization operation layer 220 including quantization information.

取得されたネットワーク情報ＮＷは、記憶部３１０に記憶される。次に、ニューラルネットワーク生成装置３００は、ステップＳ１３を実行する。 The acquired network information NW is stored in storage unit 310 . Next, the neural network generation device 300 executes step S13.

＜ニューラルネットワーク実行モデル生成工程（Ｓ１３）＞
ステップＳ１３において、ニューラルネットワーク生成装置３００の実行モデル生成部３２１は、ハードウェア情報ＨＷとネットワーク情報ＮＷとに基づいてＮＮ実行モデル１００を生成する（ニューラルネットワーク実行モデル生成工程）。 <Neural Network Execution Model Generation Step (S13)>
In step S13, the execution model generation unit 321 of the neural network generation device 300 generates the NN execution model 100 based on the hardware information HW and the network information NW (neural network execution model generation step).

ニューラルネットワーク実行モデル生成工程（ＮＮ実行モデル生成工程）は、例えば、畳み込み回路生成工程（Ｓ１３－１）と、量子化回路生成工程（Ｓ１３－２）と、ＤＭＡＣ生成工程（Ｓ１３－３）と、を有する。なお、ＮＮ実行モデル生成工程として、一部または全部の回路を事前に生成し記憶部３１０などに記憶されたものを用いることで、当該工程の一部または全部を省略してもよい。また、ハードウェア情報ＨＷまたはネットワーク情報ＮＷなどに基づいて、事前に作成しておいた回路から選択により、ＮＮ実行モデル生成工程を実現してもよい。 The neural network execution model generation step (NN execution model generation step) includes, for example, a convolution circuit generation step (S13-1), a quantization circuit generation step (S13-2), and a DMAC generation step (S13-3). Part or all of the NN execution model generation step may be omitted by using circuits that were generated in advance, in part or in whole, and stored in the storage unit 310 or the like. Alternatively, the NN execution model generation step may be carried out by selecting from circuits prepared in advance, based on the hardware information HW, the network information NW, or the like.

＜畳み込み回路生成工程（Ｓ１３－１）＞
実行モデル生成部３２１は、ハードウェア情報ＨＷとネットワーク情報ＮＷとに基づいてＮＮ実行モデル１００の畳み込み回路４を生成する（畳み込み回路生成工程）。実行モデル生成部３２１は、ネットワーク情報ＮＷとして入力された重みｗや入力データａのビット幅などの情報から、畳み込み演算回路４のハードウェアモデルを生成する。ハードウェアモデルは、ビヘイビアレベルであってもよく、ＲＴＬ（Register Transfer Level）であってもよく、ゲートや回路モジュール間の接続を表すネットリストであってもよく、それらの組み合わせであってもよい。以下、生成される畳み込み演算回路４のハードウェアモデルの一例を説明する。 <Convolution Circuit Generation Step (S13-1)>
The execution model generation unit 321 generates the convolution operation circuit 4 of the NN execution model 100 based on the hardware information HW and the network information NW (convolution circuit generation step). The execution model generation unit 321 generates a hardware model of the convolution operation circuit 4 from information such as the bit widths of the weight w and the input data a input as the network information NW. The hardware model may be at the behavioral level, at RTL (Register Transfer Level), a netlist representing connections between gates and circuit modules, or a combination of these. An example of the generated hardware model of the convolution operation circuit 4 is described below.

図８は、生成される畳み込み演算回路４の内部ブロック図である。
畳み込み演算回路４は、重みメモリ４１と、乗算器４２と、アキュムレータ回路４３と、ステートコントローラ４４と、を有する。畳み込み演算回路４は、乗算器４２およびアキュムレータ回路４３に対する専用のステートコントローラ４４を有しており、命令コマンドが入力されると、外部のコントローラを必要とせずに畳み込み演算を実施できる。 FIG. 8 is an internal block diagram of the generated convolution operation circuit 4.
The convolution circuit 4 has a weight memory 41 , a multiplier 42 , an accumulator circuit 43 and a state controller 44 . The convolution operation circuit 4 has a dedicated state controller 44 for the multiplier 42 and the accumulator circuit 43, and when an instruction command is input, the convolution operation can be performed without the need for an external controller.

重みメモリ４１は、畳み込み演算に用いる重みｗが格納されるメモリであり、例えばＳＲＡＭ（ＳｔａｔｉｃＲＡＭ）などで構成された揮発性のメモリ等の書き換え可能なメモリである。ＤＭＡＣ３は、ＤＭＡ転送により、畳み込み演算に必要な重みｗを重みメモリ４１に書き込む。 The weight memory 41 is a memory that stores the weight w used in the convolution operation, and is a rewritable memory such as a volatile memory such as an SRAM (Static RAM). The DMAC 3 writes the weight w required for the convolution operation into the weight memory 41 by DMA transfer.

図９は、乗算器４２の内部ブロック図である。
乗算器４２は、入力ベクトルＡと重みマトリクスＷとを乗算する。入力ベクトルＡは、入力データａが分割されたデータであり、Ｂｃ個の要素を持つベクトルデータである。また、重みマトリクスＷは、重みｗが分割されたデータであり、Ｂｃ×Ｂｄ個の要素を持つマトリクスデータである。乗算器４２は、Ｂｃ×Ｂｄ個の積和演算ユニット４７を有し、入力ベクトルＡと重みマトリクスＷとを乗算を並列して実施できる。 FIG. 9 is an internal block diagram of the multiplier 42.
The multiplier 42 multiplies the input vector A by the weight matrix W. The input vector A is data obtained by dividing the input data a, and is vector data having Bc elements. The weight matrix W is data obtained by dividing the weight w, and is matrix data having Bc×Bd elements. The multiplier 42 has Bc×Bd product-sum operation units 47 and can perform the multiplication of the input vector A by the weight matrix W in parallel.

乗算器４２は、乗算に必要な入力ベクトルＡと重みマトリクスＷを、第一メモリ１および重みメモリ４１から読み出して乗算を実施する。乗算器４２は、Ｂｄ個の積和演算結果Ｏ（ｄｉ）を出力する。 The multiplier 42 reads out the input vector A and the weight matrix W required for multiplication from the first memory 1 and the weight memory 41 to carry out the multiplication. The multiplier 42 outputs Bd sum-of-products operation results O(di).

図１０は、積和演算ユニット４７の内部ブロック図である。
積和演算ユニット４７は、入力ベクトルＡの要素Ａ（ｃｉ）と、重みマトリクスＷの要素Ｗ（ｃｉ，ｄｉ）との乗算を実施する。また、積和演算ユニット４７は、乗算結果と他の積和演算ユニット４７の乗算結果Ｓ（ｃｉ，ｄｉ）と加算する。積和演算ユニット４７は、加算結果Ｓ（ｃｉ＋１，ｄｉ）を出力する。ｃｉは０から(Ｂｃ－１)までのインデックスである。ｄｉは０から(Ｂｄ－１)までのインデックスである。要素Ａ（ｃｉ）は、２ビットの符号なし整数（０，１，２，３）である。要素Ｗ（ｃｉ，ｄｉ）は、１ビットの符号付整数（０，１）であり、値「０」は＋１を表し、値「１」は－１を表す。 FIG. 10 is an internal block diagram of the sum-of-products operation unit 47.
Sum-of-products unit 47 performs multiplication of input vector A element A(ci) with weight matrix W element W(ci, di). Further, the product-sum operation unit 47 adds the multiplication result and the multiplication result S(ci, di) of another product-sum operation unit 47 . The sum-of-products operation unit 47 outputs the addition result S(ci+1, di). ci is an index from 0 to (Bc-1). di is an index from 0 to (Bd-1). Element A(ci) is a 2-bit unsigned integer (0, 1, 2, 3). The element W(ci,di) is a 1-bit signed integer (0,1), where the value "0" represents +1 and the value "1" represents -1.

積和演算ユニット４７は、反転器（インバータ）４７ａと、セレクタ４７ｂと、加算器４７ｃと、を有する。積和演算ユニット４７は、乗算器を用いず、反転器４７ａおよびセレクタ４７ｂのみを用いて乗算を行う。セレクタ４７ｂは、要素Ｗ（ｃｉ，ｄｉ）が「０」の場合、要素Ａ（ｃｉ）の入力を選択する。セレクタ４７ｂは、要素Ｗ（ｃｉ，ｄｉ）が「１」の場合、要素Ａ（ｃｉ）を反転器により反転させた補数を選択する。要素Ｗ（ｃｉ，ｄｉ）は、加算器４７ｃのＣａｒｒｙ－ｉｎにも入力される。加算器４７ｃは、要素Ｗ（ｃｉ，ｄｉ）が「０」のとき、Ｓ（ｃｉ，ｄｉ）に要素Ａ（ｃｉ）を加算した値を出力する。加算器４７ｃは、Ｗ（ｃｉ，ｄｉ）が「１」のとき、Ｓ（ｃｉ，ｄｉ）から要素Ａ（ｃｉ）を減算した値を出力する。 The sum-of-products operation unit 47 has an inverter (inverter) 47a, a selector 47b, and an adder 47c. The sum-of-products operation unit 47 performs multiplication using only the inverter 47a and the selector 47b without using a multiplier. The selector 47b selects the input of the element A(ci) when the element W(ci, di) is "0". If the element W(ci, di) is "1", the selector 47b selects the complement of the element A(ci) inverted by an inverter. Element W(ci, di) is also input to Carry-in of adder 47c. The adder 47c outputs a value obtained by adding the element A(ci) to S(ci, di) when the element W(ci, di) is "0". The adder 47c outputs a value obtained by subtracting the element A(ci) from S(ci, di) when W(ci, di) is "1".
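The multiplier-free scheme described above relies on a standard two's-complement identity: inverting all bits of A(ci) and adding 1 through the carry-in yields -A(ci). The following Python sketch models one sum-of-products operation unit 47 and a chain of Bc such units per output column. The function names and the 16-bit adder width are assumptions made for illustration; the text does not state the width of the adder 47c.

```python
WIDTH = 16                 # assumed adder width (not specified for unit 47)
MASK = (1 << WIDTH) - 1


def mac_unit(a, w, s_in):
    """Model of unit 47: a is 2-bit unsigned, w is 0 (+1) or 1 (-1)."""
    # Selector 47b: pass a through, or pick the inverter 47a output.
    operand = a if w == 0 else (~a) & MASK
    # Adder 47c: w doubles as carry-in, so w == 1 gives s_in + ~a + 1 = s_in - a.
    return (s_in + operand + w) & MASK


def to_signed(value):
    """Interpret a WIDTH-bit pattern as a two's-complement integer."""
    return value - (1 << WIDTH) if value & (1 << (WIDTH - 1)) else value


def multiplier(A, W):
    """Chain Bc units per column di and return the Bd results O(di)."""
    Bc, Bd = len(A), len(W[0])
    results = []
    for di in range(Bd):
        s = 0
        for ci in range(Bc):
            s = mac_unit(A[ci], W[ci][di], s)
        results.append(to_signed(s))
    return results


# Column 0 uses weights (+1, -1, +1); column 1 uses (-1, -1, -1).
print(multiplier([3, 1, 2], [[0, 1], [1, 1], [0, 1]]))
```

In hardware the Bc×Bd units operate in parallel; the nested loops here only reproduce the arithmetic.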

図１１は、アキュムレータ回路４３の内部ブロック図である。
アキュムレータ回路４３は、乗算器４２の積和演算結果Ｏ（ｄｉ）を第二メモリ２にアキュムレートする。アキュムレータ回路４３は、Ｂｄ個のアキュムレータユニット４８を有し、Ｂｄ個の積和演算結果Ｏ（ｄｉ）を並列して第二メモリ２にアキュムレートできる。 FIG. 11 is an internal block diagram of the accumulator circuit 43.
The accumulator circuit 43 accumulates the sum-of-products operation result O(di) of the multiplier 42 in the second memory 2 . The accumulator circuit 43 has Bd accumulator units 48 and can accumulate Bd product-sum operation results O(di) in parallel in the second memory 2 .

図１２は、アキュムレータユニット４８の内部ブロック図である。
アキュムレータユニット４８は、加算器４８ａと、マスク部４８ｂとを有している。加算器４８ａは、積和演算結果Ｏの要素Ｏ（ｄｉ）と、第二メモリ２に格納された式１に示す畳み込み演算の途中経過である部分和と、を加算する。加算結果は、要素あたり１６ビットである。加算結果は、要素あたり１６ビットに限定されず、例えば要素あたり１５ビットや１７ビットであってもよい。 FIG. 12 is an internal block diagram of the accumulator unit 48.
The accumulator unit 48 has an adder 48a and a mask portion 48b. The adder 48 a adds the element O(di) of the sum-of-products operation result O and the partial sum, which is the intermediate progress of the convolution operation shown in Equation 1, stored in the second memory 2 . The addition result is 16 bits per element. The addition result is not limited to 16 bits per element, and may be, for example, 15 bits or 17 bits per element.

加算器４８ａは、加算結果を第二メモリ２の同一アドレスに書き込む。マスク部４８ｂは、初期化信号ｃｌｅａｒがアサートされた場合に、第二メモリ２からの出力をマスクし、要素Ｏ（ｄｉ）に対する加算対象をゼロにする。初期化信号ｃｌｅａｒは、第二メモリ２に途中経過の部分和が格納されていない場合にアサートされる。 The adder 48a writes the addition result to the same address in the second memory 2. When the initialization signal clear is asserted, the mask unit 48b masks the output from the second memory 2 so that the value added to the element O(di) is zero. The initialization signal clear is asserted when no intermediate partial sum is stored in the second memory 2.
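The read-modify-write behavior of the accumulator units 48 can be sketched as follows. This is a minimal model under stated assumptions: the second memory 2 is represented by a plain dictionary keyed by address, results wrap to signed 16 bits (the per-element width given in the text), and the function names are hypothetical.

```python
def wrap16(value):
    """Wrap an integer to a signed 16-bit two's-complement value."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def accumulate(second_memory, address, O, clear):
    """Add the Bd results O(di) to the partial sums stored at `address`.

    When `clear` is asserted, the mask unit 48b zeroes the value read back
    from memory, so stale contents do not leak into a new partial sum.
    """
    stored = second_memory.get(address, [0] * len(O))
    masked = [0] * len(O) if clear else stored
    second_memory[address] = [wrap16(o + p) for o, p in zip(O, masked)]


memory = {0: [99, 99]}                       # stale data from an earlier layer
accumulate(memory, 0, [4, -6], clear=True)   # first pass: start from zero
accumulate(memory, 0, [1, 2], clear=False)   # later pass: add to partial sum
print(memory[0])
```

Masking on the first pass avoids a separate step to zero the memory before each new output element is accumulated.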

乗算器４２およびアキュムレータ回路４３による畳み込み演算が完了すると、第二メモリに、Ｂｄ個の要素を持つ出力データｆ（ｘ，ｙ，ｄｏ）が格納される。 When the convolution operation by the multiplier 42 and the accumulator circuit 43 is completed, the output data f(x, y, do) having Bd elements is stored in the second memory.

ステートコントローラ４４は、乗算器４２およびアキュムレータ回路４３のステートを制御する。また、ステートコントローラ４４は、内部バスＩＢを介してコントローラ６と接続されている。ステートコントローラ４４は、命令キュー４５と制御回路４６とを有する。 State controller 44 controls the states of multiplier 42 and accumulator circuit 43 . Also, the state controller 44 is connected to the controller 6 via an internal bus IB. The state controller 44 has an instruction queue 45 and a control circuit 46 .

命令キュー４５は、畳み込み演算回路４用の命令コマンドＣ４が格納されるキューであり、例えばＦＩＦＯメモリで構成される。命令キュー４５には、内部バスＩＢ経由で命令コマンドＣ４が書き込まれる。 The instruction queue 45 is a queue in which the instruction command C4 for the convolution operation circuit 4 is stored, and is composed of a FIFO memory, for example. An instruction command C4 is written to the instruction queue 45 via the internal bus IB.

制御回路４６は、命令コマンドＣ４をデコードし、命令コマンドＣ４に基づいて乗算器４２およびアキュムレータ回路４３を制御するステートマシンである。制御回路４６は、論理回路により実装されていてもよいし、ソフトウェアによって制御されるＣＰＵによって実装されていてもよい。 The control circuit 46 is a state machine that decodes the instruction command C4 and controls the multiplier 42 and the accumulator circuit 43 based on the instruction command C4. The control circuit 46 may be implemented by a logic circuit or by a CPU controlled by software.

図１３は、制御回路４６のステート遷移図である。
制御回路４６は、命令キュー４５に命令コマンドＣ４が入力されると（Ｎｏｔｅｍｐｔｙ）、アイドルステートＳ１からデコードステートＳ２に遷移する。 FIG. 13 is a state transition diagram of the control circuit 46.
When the instruction command C4 is input to the instruction queue 45 (Not empty), the control circuit 46 transitions from the idle state S1 to the decode state S2.

制御回路４６は、デコードステートＳ２において、命令キュー４５から出力される命令コマンドＣ４をデコードする。また、制御回路４６は、コントローラ６のレジスタ６１に格納されたセマフォＳを読み出し、命令コマンドＣ４において指示された乗算器４２やアキュムレータ回路４３の動作を実行可能であるかを判定する。実行不能である場合（Ｎｏｔｒｅａｄｙ）、制御回路４６は実行可能となるまで待つ（Ｗａｉｔ）。実行可能である場合（ｒｅａｄｙ）、制御回路４６はデコードステートＳ２から実行ステートＳ３に遷移する。 In the decode state S2, the control circuit 46 decodes the instruction command C4 output from the instruction queue 45. The control circuit 46 also reads the semaphore S stored in the register 61 of the controller 6 and determines whether the operations of the multiplier 42 and the accumulator circuit 43 specified by the instruction command C4 can be executed. If they cannot be executed (Not ready), the control circuit 46 waits until they can (Wait). If they can be executed (ready), the control circuit 46 transitions from the decode state S2 to the execution state S3.

制御回路４６は、実行ステートＳ３において、乗算器４２やアキュムレータ回路４３を制御して、乗算器４２やアキュムレータ回路４３に命令コマンドＣ４において指示された動作を実施させる。制御回路４６は、乗算器４２やアキュムレータ回路４３の動作が終わると、命令キュー４５から実行を終えた命令コマンドＣ４を取り除くとともに、コントローラ６のレジスタ６１に格納されたセマフォＳを更新する。制御回路４６は、命令キュー４５に命令がある場合（Ｎｏｔｅｍｐｔｙ）、実行ステートＳ３からデコードステートＳ２に遷移する。制御回路４６は、命令キュー４５に命令がない場合（ｅｍｐｔｙ）、実行ステートＳ３からアイドルステートＳ１に遷移する。 In the execution state S3, the control circuit 46 controls the multiplier 42 and the accumulator circuit 43 so that they perform the operation specified by the instruction command C4. When the operations of the multiplier 42 and the accumulator circuit 43 are finished, the control circuit 46 removes the completed instruction command C4 from the instruction queue 45 and updates the semaphore S stored in the register 61 of the controller 6. If an instruction remains in the instruction queue 45 (Not empty), the control circuit 46 transitions from the execution state S3 to the decode state S2. If the instruction queue 45 holds no instruction (empty), the control circuit 46 transitions from the execution state S3 to the idle state S1.
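The three-state behavior described above (idle S1, decode S2, execution S3) can be modeled with a small transition function. This is an illustrative sketch only. The name `step`, the use of a `deque` for the instruction queue 45, and the boolean `ready` standing in for the semaphore check are assumptions; the semaphore update itself is omitted.

```python
from collections import deque

IDLE, DECODE, EXECUTE = "S1", "S2", "S3"


def step(state, queue, ready):
    """Return the next state of the control circuit 46.

    `queue` models the instruction queue 45. `ready` models the result of
    checking the semaphore S. A call in EXECUTE represents the moment the
    commanded operation has finished.
    """
    if state == IDLE:
        return DECODE if queue else IDLE       # Not empty -> decode
    if state == DECODE:
        return EXECUTE if ready else DECODE    # Not ready -> wait in S2
    queue.popleft()                            # remove the finished command
    return DECODE if queue else IDLE


queue = deque(["C4 (first)", "C4 (second)"])
state = IDLE
trace = []
for ready in (False, False, True, True, True, True):
    state = step(state, queue, ready)
    trace.append(state)
print(trace)
```

The wait in S2 is what lets the convolution operation circuit 4 run without an external controller: it stalls itself until the semaphore indicates that its input data is available.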

実行モデル生成部３２１は、ネットワーク情報ＮＷとして入力された重みｗや入力データａのビット幅などの情報から、畳み込み演算回路４における演算器の仕様やサイズ（ＢｃやＢｄ）を決定する。ハードウェア情報ＨＷとして生成するＮＮ実行モデル１００（ニューラルネットワークハードウェアモデル４００、ニューラルネットワークハードウェア６００）のハードウェア規模が含まれる場合、実行モデル生成部３２１は、指定された規模にあわせて畳み込み演算回路４における演算器の仕様やサイズ（ＢｃやＢｄ）を調整する。 The execution model generation unit 321 determines the specifications and sizes (Bc and Bd) of the arithmetic units in the convolution operation circuit 4 from information such as the bit widths of the weight w and the input data a input as the network information NW. When the hardware information HW includes the hardware scale of the NN execution model 100 to be generated (neural network hardware model 400, neural network hardware 600), the execution model generation unit 321 adjusts the specifications and sizes (Bc and Bd) of the arithmetic units in the convolution operation circuit 4 to match the specified scale.

＜量子化回路生成工程（Ｓ１３－２）＞
実行モデル生成部３２１は、ハードウェア情報ＨＷとネットワーク情報ＮＷとに基づいてＮＮ実行モデル１００の量子化回路５を生成する（量子化回路生成工程）。実行モデル生成部３２１は、ネットワーク情報ＮＷとして入力された量子化情報から、量子化回路５のハードウェアモデルを生成する。ハードウェアモデルは、ビヘイビアレベルであってもよく、ＲＴＬ（Register Transfer Level）であってもよく、ゲートや回路モジュール間の接続を表すネットリストであってもよく、それらの組み合わせであってもよい。以下、生成される量子化回路５のハードウェアモデルの一例を説明する。 <Quantization Circuit Generation Step (S13-2)>
The execution model generation unit 321 generates the quantization operation circuit 5 of the NN execution model 100 based on the hardware information HW and the network information NW (quantization circuit generation step). The execution model generation unit 321 generates a hardware model of the quantization operation circuit 5 from the quantization information input as the network information NW. The hardware model may be at the behavioral level, at RTL (Register Transfer Level), a netlist representing connections between gates and circuit modules, or a combination of these. An example of the generated hardware model of the quantization operation circuit 5 is described below.

図１４は、生成される量子化演算回路５の内部ブロック図である。
量子化演算回路５は、量子化パラメータメモリ５１と、ベクトル演算回路５２と、量子化回路５３と、ステートコントローラ５４と、を有する。量子化演算回路５は、ベクトル演算回路５２および量子化回路５３に対する専用のステートコントローラ５４を有しており、命令コマンドが入力されると、外部のコントローラを必要とせずに量子化演算を実施できる。 FIG. 14 is an internal block diagram of the generated quantization operation circuit 5.
The quantization operation circuit 5 has a quantization parameter memory 51, a vector operation circuit 52, a quantization circuit 53, and a state controller 54. The state controller 54 is dedicated to the vector operation circuit 52 and the quantization circuit 53, so once an instruction command is input, the quantization operation circuit 5 can perform the quantization operation without an external controller.

量子化パラメータメモリ５１は、量子化演算に用いる量子化パラメータｑが格納されるメモリであり、例えばＳＲＡＭ（ＳｔａｔｉｃＲＡＭ）などで構成された揮発性のメモリ等の書き換え可能なメモリである。ＤＭＡＣ３は、ＤＭＡ転送により、量子化演算に必要な量子化パラメータｑを量子化パラメータメモリ５１に書き込む。 The quantization parameter memory 51 is a memory that stores the quantization parameter q used in the quantization calculation, and is a rewritable memory such as a volatile memory such as an SRAM (Static RAM). The DMAC 3 writes the quantization parameter q required for the quantization calculation into the quantization parameter memory 51 by DMA transfer.

図１５は、ベクトル演算回路５２と量子化回路５３の内部ブロック図である。
ベクトル演算回路５２は、第二メモリ２に格納された出力データｆ（ｘ，ｙ，ｄｏ）に対して演算を行う。ベクトル演算回路５２は、Ｂｄ個の演算ユニット５７を有し、出力データｆ（ｘ，ｙ，ｄｏ）に対して並列にＳＩＭＤ演算を行う。 FIG. 15 is an internal block diagram of the vector operation circuit 52 and the quantization circuit 53.
The vector operation circuit 52 performs operations on the output data f(x, y, do) stored in the second memory 2. The vector operation circuit 52 has Bd arithmetic units 57 and performs SIMD operations on the output data f(x, y, do) in parallel.

図１６は、演算ユニット５７のブロック図である。
演算ユニット５７は、例えば、ＡＬＵ５７ａと、第一セレクタ５７ｂと、第二セレクタ５７ｃと、レジスタ５７ｄと、シフタ５７ｅと、を有する。演算ユニット５７は、公知の汎用ＳＩＭＤ演算回路が有する他の演算器等をさらに有してもよい。 FIG. 16 is a block diagram of the arithmetic unit 57.
The arithmetic unit 57 has, for example, an ALU 57a, a first selector 57b, a second selector 57c, a register 57d, and a shifter 57e. The arithmetic unit 57 may further include other calculators and the like that a known general-purpose SIMD arithmetic circuit has.

ベクトル演算回路５２は、演算ユニット５７が有する演算器等を組み合わせることで、出力データｆ（ｘ，ｙ，ｄｏ）に対して、量子化演算層２２０におけるプーリング層２２１や、ＢａｔｃｈＮｏｒｍａｌｉｚａｔｉｏｎ層２２２や、活性化関数層２２３の演算のうち少なくとも一つの演算を行う。 By combining the arithmetic elements of the arithmetic units 57, the vector operation circuit 52 performs, on the output data f(x, y, do), at least one of the operations of the pooling layer 221, the batch normalization layer 222, and the activation function layer 223 in the quantization operation layer 220.

演算ユニット５７は、レジスタ５７ｄに格納されたデータと第二メモリ２から読み出した出力データｆ（ｘ，ｙ，ｄｏ）の要素ｆ（ｄｉ）とをＡＬＵ５７ａにより加算できる。演算ユニット５７は、ＡＬＵ５７ａによる加算結果をレジスタ５７ｄに格納できる。演算ユニット５７は、第一セレクタ５７ｂの選択によりレジスタ５７ｄに格納されたデータに代えて「０」をＡＬＵ５７ａに入力することで加算結果を初期化できる。例えばプーリング領域が２×２である場合、シフタ５７ｅはＡＬＵ５７ａの出力を２ｂｉｔ右シフトすることで加算結果の平均値を出力できる。ベクトル演算回路５２は、Ｂｄ個の演算ユニット５７による上記の演算等を繰り返すことで、式２に示す平均プーリングの演算を実施できる。 The arithmetic unit 57 can add the data stored in the register 57d and the element f(di) of the output data f(x, y, do) read from the second memory 2 by the ALU 57a. The arithmetic unit 57 can store the addition result by the ALU 57a in the register 57d. The arithmetic unit 57 can initialize the addition result by inputting "0" to the ALU 57a instead of the data stored in the register 57d by selecting the first selector 57b. For example, when the pooling area is 2×2, the shifter 57e can output the average value of the addition result by shifting the output of the ALU 57a to the right by 2 bits. The vector operation circuit 52 can perform the average pooling operation shown in Equation 2 by repeating the above operations and the like by the Bd number of operation units 57 .
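The accumulate-then-shift procedure for average pooling can be sketched for a single arithmetic unit 57 as follows. The function name is hypothetical, and the sketch assumes a 2×2 pooling region, so dividing by 4 becomes a right shift by 2 bits. Python's `>>` on negative integers is an arithmetic shift that rounds toward negative infinity. Whether the shifter 57e rounds the same way is not stated in the text.

```python
def average_pool_2x2(window):
    """Average four elements f(di) using an add loop and a 2-bit right shift."""
    assert len(window) == 4
    register = 0                   # first selector 57b feeds 0 to initialize
    for f in window:
        register = register + f    # ALU 57a adds, result goes to register 57d
    return register >> 2           # shifter 57e divides the sum by 4


print(average_pool_2x2([4, 8, 12, 16]))
```

A pooling region whose size is a power of two keeps the division a pure shift, so the unit needs no divider.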

演算ユニット５７は、レジスタ５７ｄに格納されたデータと第二メモリ２から読み出した出力データｆ（ｘ，ｙ，ｄｏ）の要素ｆ（ｄｉ）とをＡＬＵ５７ａにより比較できる。
演算ユニット５７は、ＡＬＵ５７ａによる比較結果に応じて第二セレクタ５７ｃを制御して、レジスタ５７ｄに格納されたデータと要素ｆ（ｄｉ）の大きい方を選択できる。演算ユニット５７は、第一セレクタ５７ｂの選択により要素ｆ（ｄｉ）の取りうる値の最小値をＡＬＵ５７ａに入力することで比較対象を最小値に初期化できる。本実施形態において要素ｆ（ｄｉ）は１６ｂｉｔ符号付き整数であるので、要素ｆ（ｄｉ）の取りうる値の最小値は「０ｘ８０００」である。ベクトル演算回路５２は、Ｂｄ個の演算ユニット５７による上記の演算等を繰り返すことで、式３のＭＡＸプーリングの演算を実施できる。なお、ＭＡＸプーリングの演算ではシフタ５７ｅは第二セレクタ５７ｃの出力をシフトしない。 The arithmetic unit 57 can compare the data stored in the register 57d with the element f(di) of the output data f(x, y, do) read from the second memory 2 by the ALU 57a.
The arithmetic unit 57 can control the second selector 57c according to the result of comparison by the ALU 57a to select the larger one of the data stored in the register 57d and the element f(di). The arithmetic unit 57 can initialize the comparison target to the minimum value by inputting the minimum value of the possible values of the element f(di) to the ALU 57a by selecting the first selector 57b. Since the element f(di) is a 16-bit signed integer in this embodiment, the minimum possible value of the element f(di) is "0x8000". The vector operation circuit 52 can implement the MAX pooling operation of Equation 3 by repeating the above operations and the like by the Bd number of operation units 57 . Note that the shifter 57e does not shift the output of the second selector 57c in the MAX pooling calculation.
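The compare-and-select loop for MAX pooling can be modeled in the same way. The bit pattern 0x8000, read as a 16-bit two's-complement integer, is -32768, which is the initial comparison value used below. The function name is hypothetical.

```python
INT16_MIN = -0x8000    # bit pattern 0x8000 read as a 16-bit signed integer


def max_pool(window):
    """Return the largest element f(di) in the pooling window."""
    register = INT16_MIN               # first selector 57b supplies the minimum
    for f in window:
        if f > register:               # ALU 57a compares
            register = f               # second selector 57c keeps the larger
    return register                    # shifter 57e passes the value unshifted


print(max_pool([-5, -3, -9, -4]))
```

Initializing with the smallest representable value, rather than 0, is what makes the result correct when every element in the window is negative.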

演算ユニット５７は、レジスタ５７ｄに格納されたデータと第二メモリ２から読み出した出力データｆ（ｘ，ｙ，ｄｏ）の要素ｆ（ｄｉ）とをＡＬＵ５７ａにより減算できる。シフタ５７ｅはＡＬＵ５７ａの出力を左シフト（すなわち乗算）もしくは右シフト（すなわち除算）できる。ベクトル演算回路５２は、Ｂｄ個の演算ユニット５７による上記の演算等を繰り返すことで、式４のＢａｔｃｈＮｏｒｍａｌｉｚａｔｉｏｎの演算を実施できる。 The arithmetic unit 57 can subtract the data stored in the register 57d and the element f(di) of the output data f(x, y, do) read from the second memory 2 by the ALU 57a. Shifter 57e can left shift (ie, multiply) or right shift (ie, divide) the output of ALU 57a. The vector operation circuit 52 can perform the operation of Batch Normalization of Equation 4 by repeating the above operation and the like by the Bd number of operation units 57 .

演算ユニット５７は、第二メモリ２から読み出した出力データｆ（ｘ，ｙ，ｄｏ）の要素ｆ（ｄｉ）と第一セレクタ５７ｂにより選択された「０」とをＡＬＵ５７ａにより比較できる。演算ユニット５７は、ＡＬＵ５７ａによる比較結果に応じて要素ｆ（ｄｉ）と予めレジスタ５７ｄに格納された定数値「０」のいずれかを選択して出力できる。ベクトル演算回路５２は、Ｂｄ個の演算ユニット５７による上記の演算等を繰り返すことで、式５のＲｅＬＵ演算を実施できる。 The arithmetic unit 57 can compare the element f(di) of the output data f(x, y, do) read from the second memory 2 with "0" selected by the first selector 57b by the ALU 57a. The arithmetic unit 57 can select and output either the element f(di) or the constant value "0" previously stored in the register 57d according to the comparison result by the ALU 57a. The vector operation circuit 52 can perform the ReLU operation of Equation 5 by repeating the above operations and the like by the Bd number of operation units 57 .
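The ReLU step amounts to one comparison and one selection per element. The sketch below applies it across a list standing in for the Bd parallel arithmetic units 57; the function name is hypothetical, and the list comprehension only imitates lanes that operate in parallel in hardware.

```python
def relu_lanes(f):
    """Apply ReLU to each element f(di), one element per arithmetic unit 57."""
    # The ALU 57a compares f(di) with 0; the unit outputs f(di) when it is
    # positive and the constant 0 held in register 57d otherwise.
    return [value if value > 0 else 0 for value in f]


print(relu_lanes([-7, 0, 5, 120]))
```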

The vector operation circuit 52 can perform average pooling, MAX pooling, Batch Normalization, activation function operations, and combinations of these. Because the vector operation circuit 52 can perform general-purpose SIMD operations, it may also perform other operations needed in the quantization operation layer 220. The vector operation circuit 52 may also perform operations other than those of the quantization operation layer 220.
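As a rough software analogue of combining these operations, the sketch below applies a subtract-and-shift step followed by ReLU to each lane. The function names, the direction of the subtraction (element minus stored offset), and the convention that a negative `shift` means a right shift are all assumptions for illustration; the text does not specify them.

```python
def shift_scale(f_di, offset, shift):
    """Subtract an offset, then scale by a power of two using a shift.

    shift >= 0 is a left shift (multiply by 2**shift);
    shift < 0 is an arithmetic right shift (divide by 2**(-shift), rounding down).
    """
    diff = f_di - offset
    return diff << shift if shift >= 0 else diff >> -shift


def relu(value):
    """Select the value itself when it is positive, otherwise the constant 0."""
    return value if value > 0 else 0


def vector_op(lane_values, offset, shift):
    """Apply the same subtract/shift/ReLU sequence to every lane (SIMD style)."""
    return [relu(shift_scale(v, offset, shift)) for v in lane_values]
```

For example, `vector_op([10, 2, -6], 4, 1)` processes three lanes with one shared offset and shift, which mirrors how Bd units would run the same instruction on different elements.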

Note that the quantization operation circuit 5 need not include the vector operation circuit 52. In that case, the output data f(x, y, do) is input to the quantization circuit 53.

The quantization circuit 53 quantizes the output data of the vector operation circuit 52. As shown in FIG. 15, the quantization circuit 53 has Bd quantization units 58 and operates on the output data of the vector operation circuit 52 in parallel.

FIG. 17 is an internal block diagram of the quantization unit 58.
The quantization unit 58 quantizes the element in(di) of the output data of the vector operation circuit 52. The quantization unit 58 has a comparator 58a and an encoder 58b. The quantization unit 58 performs the operation of the quantization layer 224 in the quantization operation layer 220 (Equation 6) on the output data of the vector operation circuit 52 (16 bits/element). The quantization unit 58 reads the necessary quantization parameters q (th0, th1, th2) from the quantization parameter memory 51, and the comparator 58a compares the input in(di) with the quantization parameters q. The encoder 58b then quantizes the comparison result of the comparator 58a to 2 bits/element. Because α(c) and β(c) in Equation 4 are parameters that differ for each variable c, the quantization parameters q (th0, th1, th2), which reflect α(c) and β(c), differ for each in(di).

The quantization unit 58 compares the input in(di) with the three thresholds th0, th1, and th2 to classify it into one of four regions (for example, in ≤ th0, th0 < in ≤ th1, th1 < in ≤ th2, th2 < in), and encodes the classification result into 2 bits for output. Depending on how the quantization parameters q (th0, th1, th2) are set, the quantization unit 58 can also perform Batch Normalization and activation function operations together with quantization.
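The threshold comparison can be illustrated with a short sketch. The region boundaries follow the example in the text; the mapping of the four regions to the codes 0 to 3 and the function names are assumptions made for the example, since the actual encoding produced by the encoder 58b is not stated here.

```python
def quantize_2bit(value, th0, th1, th2):
    """Classify `value` into one of four regions and return a 2-bit code."""
    if value <= th0:
        return 0  # in <= th0
    if value <= th1:
        return 1  # th0 < in <= th1
    if value <= th2:
        return 2  # th1 < in <= th2
    return 3      # th2 < in


def quantize_vector(values, params):
    """Quantize each element with its own (th0, th1, th2) triple."""
    return [quantize_2bit(v, *q) for v, q in zip(values, params)]
```

Passing one threshold triple per element reflects the statement that the quantization parameters differ for each in(di).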

By performing quantization with the threshold th0 set according to β(c) of Equation 4 and the threshold differences (th1−th0) and (th2−th1) set according to α(c) of Equation 4, the quantization unit 58 can carry out the Batch Normalization operation of Equation 4 together with quantization. Increasing (th1−th0) and (th2−th1) makes α(c) smaller, and decreasing them makes α(c) larger.

The quantization unit 58 can perform the ReLU activation function operation together with the quantization of the input in(di). For example, the quantization unit 58 saturates the output value in the regions where in(di) ≤ th0 and th2 < in(di). By setting the quantization parameters q so that the output is nonlinear, the quantization unit 58 can perform the activation function operation together with quantization.

The state controller 54 controls the states of the vector operation circuit 52 and the quantization circuit 53. The state controller 54 is connected to the controller 6 via the internal bus IB. The state controller 54 has an instruction queue 55 and a control circuit 56.

The instruction queue 55 is a queue that stores the instruction commands C5 for the quantization operation circuit 5, and is implemented, for example, as a FIFO memory. The instruction commands C5 are written to the instruction queue 55 via the internal bus IB.

The control circuit 56 is a state machine that decodes the instruction command C5 and controls the vector operation circuit 52 and the quantization circuit 53 based on it. The control circuit 56 has the same configuration as the control circuit 46 of the state controller 44 of the convolution operation circuit 4.
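The interplay of a FIFO instruction queue and a decoding state machine can be sketched as below. This is a toy model only: the class name `StateController`, the three states (`idle`, `decode`, `execute`), and the string commands are assumptions for illustration and are not taken from the circuit description.

```python
from collections import deque


class StateController:
    """Toy model of an instruction queue feeding a decode/execute state machine."""

    def __init__(self):
        self.queue = deque()   # instruction queue (FIFO)
        self.state = "idle"
        self.current = None
        self.executed = []     # commands in the order they were carried out

    def push(self, command):
        """Write one instruction command into the queue."""
        self.queue.append(command)

    def step(self):
        """Advance the state machine by one transition."""
        if self.state == "idle":
            if self.queue:
                self.current = self.queue.popleft()  # oldest command first
                self.state = "decode"
        elif self.state == "decode":
            self.state = "execute"
        else:  # "execute"
            self.executed.append(self.current)
            self.current = None
            self.state = "idle"
        return self.state

    def run(self):
        """Step until the queue is empty and the machine is idle again."""
        while self.queue or self.state != "idle":
            self.step()
        return self.executed
```

Because commands are consumed in arrival order, a host can enqueue several commands ahead of time and let the circuit work through them without further intervention.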

The quantization operation circuit 5 writes quantization operation output data having Bd elements to the first memory 1. Equation 7 shows a preferable relationship between Bd and Bc, where n is an integer.

Based on the quantization information input as the network information NW, the execution model generation unit 321 determines, for the quantization operation circuit 5, whether a pooling operation is performed and of what type (average pooling, MAX pooling, etc.), whether a Batch Normalization operation is performed and by what method, whether an activation function operation is performed and by what method (ReLU operation, etc.), the quantization method (number of bits, etc.), and whether other operations are performed. When the hardware information HW includes the hardware scale of the NN execution model 100 to be generated (neural network hardware model 400, neural network hardware 600), the execution model generation unit 321 adjusts the configuration of the arithmetic units in the quantization operation circuit 5 to match the specified scale.

<DMAC generation step (S13-3)>
The execution model generation unit 321 generates the DMAC 3 of the NN execution model 100 based on the hardware information HW and the network information NW (DMAC generation step). The execution model generation unit 321 generates a hardware model of the DMAC 3 from the information input as the network information NW. The hardware model may be at the behavioral level, at the RTL (Register Transfer Level), a netlist representing connections between gates and circuit modules, or a combination of these. An example of the generated hardware model of the DMAC 3 is described below.

FIG. 18 is an internal block diagram of the generated DMAC 3.
The DMAC 3 has a data transfer circuit 31 and a state controller 32. Because the DMAC 3 has a state controller 32 dedicated to the data transfer circuit 31, it can perform DMA data transfers without an external controller once an instruction command is input.

The data transfer circuit 31 is connected to the external bus EB and performs DMA data transfers between an external memory such as a DRAM and the first memory 1. The data transfer circuit 31 also performs DMA data transfers between the external memory and the second memory 2, data transfers between the external memory and the convolution operation circuit 4, and data transfers between the external memory and the quantization operation circuit 5. The number of DMA channels of the data transfer circuit 31 is not limited. For example, the first memory 1 and the second memory 2 may each have a dedicated DMA channel.

The state controller 32 controls the state of the data transfer circuit 31. The state controller 32 is connected to the controller 6 via the internal bus IB. The state controller 32 has an instruction queue 33 and a control circuit 34.

The instruction queue 33 is a queue that stores the instruction commands C3 for the DMAC 3, and is implemented, for example, as a FIFO memory. One or more instruction commands C3 are written to the instruction queue 33 via the internal bus IB.

The control circuit 34 is a state machine that decodes the instruction commands C3 and controls the data transfer circuit 31 sequentially based on them. The control circuit 34 has the same configuration as the control circuit 46 of the state controller 44 of the convolution operation circuit 4.

The execution model generation unit 321 determines the number of DMA channels, the data bus width, and other properties of the DMAC 3 from the information input as the network information NW.

For example, the execution model generation unit 321 generates a DMAC 3 whose specifications (data bus width, etc.) match those of the external bus EB on the host side. Increasing the data bus width or the number of DMA channels raises the data transfer rate between the external memory and the first memory 1 or the second memory 2.

<Learning step (S14)>
In step S14, the learning unit 322 and the inference unit 323 of the neural network generation device 300 use the learning data set DS to learn the learning parameters of the generated NN execution model 100 (learning step). The learning step (S14) includes, for example, a learned parameter generation step (S14-1) and an inference test step (S14-2).

<Learning step: learned parameter generation step (S14-1)>
The learning unit 322 generates the learned parameters PM using the NN execution model 100 and the learning data D1. The learned parameters PM include the learned weight w, the quantization parameters q, and the like.

For example, when the NN execution model 100 is an execution model of a CNN 200 that performs image recognition, the learning data D1 is a combination of an input image and teacher data T. The input image is the input data a that is input to the CNN 200. The teacher data T is, for example, the type of subject captured in the image, the presence or absence of a detection target in the image, or the coordinate values of a detection target in the image.

The learning unit 322 generates the learned parameters PM by supervised learning using a well-known technique such as error backpropagation. The learning unit 322 uses a loss function (error function) to obtain the difference E between the output of the NN execution model 100 for an input image and the teacher data T corresponding to that input image, and updates the weight w and the quantization parameters q so that the difference E becomes smaller.

For example, when updating the weight w, the gradient of the loss function with respect to the weight w is used. The gradient is calculated, for example, by differentiating the loss function. When error backpropagation is used, the gradient is calculated by backward propagation (backward).
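A single gradient-based weight update can be written out for a toy case. The squared-error loss, the scalar model y = w·x, the learning rate, and the function names are assumptions chosen to keep the example small; the text does not specify the loss function used.

```python
def loss(w, x, t):
    """Squared error between the model output w * x and the teacher value t."""
    return (w * x - t) ** 2


def loss_gradient(w, x, t):
    """Derivative of the loss with respect to w, obtained by differentiation."""
    return 2.0 * (w * x - t) * x


def update_weight(w, x, t, learning_rate):
    """Move w against the gradient so that the loss becomes smaller."""
    return w - learning_rate * loss_gradient(w, x, t)
```

With w = 0, x = 2, and t = 4 the gradient is negative, so the update increases w and the loss decreases on the next evaluation.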

When calculating the gradient and updating the weight w, the learning unit 322 raises the precision of the operations related to the convolution operation. Specifically, a 32-bit floating-point weight w, which is more precise than the low-bit weight w (for example, 1 bit) used by the NN execution model 100, is used for learning. The convolution operation performed in the convolution operation circuit 4 of the NN execution model 100 is likewise carried out at higher precision.

When calculating the gradient and updating the weight w, the learning unit 322 also raises the precision of the operations related to the activation function. Specifically, a sigmoid function, which is more precise than an activation function such as the ReLU function implemented in the quantization operation circuit 5 of the NN execution model 100, is used for learning.

On the other hand, when calculating the output data for an input image by forward propagation (forward), the learning unit 322 does not raise the precision of the operations related to the convolution operation and the activation function, and instead performs operations based on the NN execution model 100. The high-precision weight w used when updating the weight w is reduced to a low bit width by a lookup table or the like.

By raising the precision of the operations related to the convolution operation and the activation function when calculating the gradient and updating the weight w, the learning unit 322 prevents loss of precision in the intermediate data of the operations and can generate learned parameters PM that achieve high inference accuracy.

On the other hand, when calculating the output data for an input image, the learning unit 322 does not raise the precision of the forward propagation (forward) operations and performs operations based on the NN execution model 100. As a result, the output data calculated by the learning unit 322 matches the output data of the NN execution model 100 that uses the generated learned parameters PM.
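The split between a low-bit forward pass and a high-precision update can be sketched for a single weight. Several choices here are assumptions for illustration: the 1-bit conversion uses the sign of the weight in place of the lookup table mentioned in the text, the loss is a squared error, and the gradient treats the low-bit conversion as the identity (an approach commonly called a straight-through estimator). The function names are hypothetical.

```python
def to_low_bit(w_float):
    """Reduce a floating-point weight to a 1-bit weight (+1.0 or -1.0)."""
    return 1.0 if w_float >= 0.0 else -1.0


def train_step(w_float, x, t, learning_rate):
    """One training step: low-bit forward pass, high-precision weight update."""
    y = to_low_bit(w_float) * x   # forward pass as the execution model would compute it
    grad = 2.0 * (y - t) * x      # gradient of (y - t)**2, low-bit step treated as identity
    return w_float - learning_rate * grad, y
```

Because the forward pass already uses the low-bit weight, the output `y` seen during training is the same value the deployed model would produce from the stored floating-point weight after conversion.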

<Learning step: inference test step (S14-2)>
The inference unit 323 performs an inference test using the learned parameters PM generated by the learning unit 322, the NN execution model 100, and the test data D2. For example, when the NN execution model 100 is an execution model of a CNN 200 that performs image recognition, the test data D2 is, like the learning data D1, a combination of an input image and teacher data T.

The inference unit 323 displays the progress and results of the inference test on the display unit 350. The result of the inference test is, for example, the accuracy rate on the test data D2.

<Confirmation step (S15)>
In step S15, the inference unit 323 of the neural network generation device 300 causes the display unit 350 to display a message prompting the user to enter a confirmation of the result via the operation input unit 360, together with the GUI image needed for the input. The user indicates via the operation input unit 360 whether the result of the inference test is acceptable. When the input indicates that the user accepts the result of the inference test, the neural network generation device 300 proceeds to step S16. When the input indicates that the user does not accept the result, the neural network generation device 300 performs step S12 again. The neural network generation device 300 may instead return to step S11 and have the user re-enter the hardware information HW.

<Output step (S16)>
In step S16, the hardware generation unit 324 of the neural network generation device 300 generates the neural network hardware model 400 and the like based on the hardware information HW and the NN execution model 100.

<Software generation step (S17)>
In step S17, the software generation unit 325 of the neural network generation device 300 generates, based on the network information NW, the NN execution model 100, and the like, software 500 that operates the neural network hardware 600 (the neural network hardware model 400 implemented on the target hardware). The software 500 includes software that transfers the learned parameters PM to the neural network hardware 600 as needed. The software 500 may include multiple forms of software for operating or controlling the neural network hardware 600, and may include instruction commands used for that operation or control as well as software for executing some of the operations of the NN execution model 100.

The software generation step (S17) includes, for example, a division step of dividing the input data into units suited to the operations, a network division step of dividing part of the CNN 200 to improve operation efficiency, and an allocation step of determining how the input data is placed in memory.

<Input data division step>
Based on the memory capacity of the memories allocated as the first memory 1 and the second memory 2, the specifications and sizes (Bc and Bd) of the arithmetic units, and the like, the software generation unit 325 divides the input data a of the convolution operation of the convolution layer 210 into partial tensors. The method of division into partial tensors and the number of divisions are not particularly limited. A partial tensor is formed, for example, by dividing the input data a(x+i, y+j, c) into a(x+i, y+j, co).

Furthermore, the software generation unit 325 expands the divided input data a and the weight w onto the convolution operation circuit 4 of the NN execution model 100.

The software generation unit 325 also divides the input data a into partial tensors so that, for example, a plurality of pieces of divided input data a (2·X·Y·Bc bits each) can be stored in the first memory 1. The software generation unit 325 divides the input data a for each layer. The unit that is easy for the neural network hardware 600 to operate on is determined based on the number of operations the neural network hardware 600 can perform in parallel, the capacity and bandwidth of the first memory 1 or the second memory 2, the power consumption, the operating frequency, and the like. For example, when many operations can be performed in parallel, a smaller number of divisions is preferable.
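The sizing arithmetic behind this division can be sketched as follows. The sketch assumes that division happens along the channel direction in blocks of at most Bc channels and that each element occupies 2 bits, which matches the 2·X·Y·Bc figure above; the function names and the idea of counting how many partial tensors fit into a given memory size are illustrative assumptions.

```python
def split_channels(num_channels, bc):
    """Return (start, stop) channel ranges, each at most `bc` channels wide."""
    return [(c, min(c + bc, num_channels)) for c in range(0, num_channels, bc)]


def partial_tensor_bits(x, y, bc, bits_per_element=2):
    """Size in bits of one partial tensor of X * Y * Bc elements."""
    return bits_per_element * x * y * bc


def tensors_that_fit(memory_bits, x, y, bc):
    """Number of whole partial tensors that fit into a memory of `memory_bits`."""
    return memory_bits // partial_tensor_bits(x, y, bc)
```

For instance, an 8 × 8 plane with Bc = 32 occupies 4096 bits per partial tensor, so a 16384-bit memory region holds four of them.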

<Network division step>
The software generation unit 325 divides the network (layers) of the CNN 200 and maps it onto the convolution operation circuit 4 and the quantization operation circuit 5, which are connected in a loop (network division step).

The software generation unit 325 divides the network (layers) of the CNN 200 across the entire CNN 200. It performs this division so that memory transfers between the first memory 1 and the external memory by the DMAC 3 are reduced as far as possible.

When the CNN 200 includes an operation that changes the tensor shape of the input data a, the network (layers) is also divided immediately before that operation. Operations that change the tensor shape of the input data a include, for example, an operation that shortens the depth direction (c direction) of the input data a and expands it in the plane direction (xy direction), and an operation that merges tensors (data).
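The rule of dividing the network before a shape-changing operation can be sketched with a simple list of layer names. The layer names, the set of shape-changing operations, and the function name are hypothetical; the sketch only shows the placement of the division points, not the mapping onto the circuits.

```python
def split_network(layers, shape_changing):
    """Split a layer sequence so that each shape-changing op starts a new segment."""
    segments = []
    current = []
    for name in layers:
        if name in shape_changing and current:
            segments.append(current)  # close the segment before the shape change
            current = []
        current.append(name)
    if current:
        segments.append(current)
    return segments
```

Called with `["conv0", "quant0", "depth_to_space", "conv1"]` and `{"depth_to_space"}`, the function places the boundary just before `depth_to_space`.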

<Allocation step>
The software generation unit 325 generates the software 500 that assigns the divided operations to the neural network hardware 600 and causes it to execute them (allocation step). In the allocation step, the software generation unit 325 generates placement information for placing data in memory.

Here, the software 500 generated by the software generation unit 325 is assumed to include a command group for controlling and operating the neural network hardware 600. In this case, the command group includes a plurality of instruction commands for operating each of the circuits shown in FIG. 5 and elsewhere at appropriate timing, based on the NN execution model 100 and the like. It also includes commands or software that run on a processor other than the neural network hardware 600, such as an external host CPU. The command group is stored in the external memory and is used by the external host CPU, either directly or via the controller 6, to control the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5.

FIG. 19 is a block diagram showing an example of an edge device in which the neural network hardware 600 is provided. The edge device includes an external host CPU 700, a buffer memory 710 included in the external host CPU 700, an external memory 720, and the neural network hardware 600, and these blocks are connected by the external bus EB. The command group is stored in the external memory 720, and the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5 are controlled by the external host CPU 700 or via the controller 6 included in the neural network hardware 600. In other words, the command group corresponds to part of the neural network hardware model 400. Executing the command group at appropriate timing therefore controls the NN execution model 100 appropriately on the edge device.

In FIG. 19, the input data processed by the CNN 200 is assumed to be multidimensional data such as image data. Holding such a large amount of data on the neural network hardware 600 would require a large memory, which is undesirable. The input data of each layer included in the CNN 200 is therefore held in the external memory 720 when the operation starts. Reading out the divided input data as needed keeps the circuit scale of the neural network hardware 600 small. The output data of each layer included in the CNN 200 is likewise held in the external memory 720 as an operation result. The output data held in the external memory 720 becomes the input data of the next layer.

In this embodiment, both when input data is supplied to the neural network hardware 600 and when output data is obtained from it, the data is read from and written to the external memory 720 via the buffer memory 710 included in the external host CPU 700. More specifically, the external host CPU 700 obtains from the external memory 720 the input data for the layer whose operation is about to be executed and stores it temporarily in the buffer memory 710 when that operation starts. The external host CPU 700 also stores the data output as the operation result of the target layer in the buffer memory 710 before writing it to the external memory 720. The buffer memory 710 is implemented, for example, as an SRAM that can be read and written at high speed.

When the input data processed by the neural network hardware 600 has a multidimensional data structure, as image data does, the amount of data processed in each layer is relatively large. In addition, because the CNN 200 includes a plurality of layers, data is input to the neural network hardware 600 repeatedly. As a result, reads and writes to the buffer memory 710 occur repeatedly in order to obtain the successive operation results.

Each piece of data used in an operation is held in the buffer memory 710, which can be read and written at high speed, and reusing it enables efficient operation. To raise the efficiency of the operations, it is preferable to keep a large amount of data in the buffer memory 710 for a long period, but this increases the usage of the buffer memory 710. Conventionally, therefore, the more layers the CNN 200 contained, the more capacity the buffer memory 710 required.

Increasing the capacity of the high-performance buffer memory 710 raises the cost and power consumption of the final product. To make effective use of the limited computing resources of an edge device or the like, it is therefore necessary not only to raise the efficiency of the operations but also to reduce buffer memory usage.

FIG. 20 shows an example of the CNN 200 (upper diagram) and a conventional example of allocation to the buffer memory 710 in the edge device on which it runs (lower diagram). The CNN 200 of this embodiment includes seven layers and one post-processing stage.

The input data of the CNN 200 in FIG. 20 is image data, and each layer includes the convolution operation and the quantization operation shown in FIG. 3. The input data is first input to layer L_0, and operations are then performed in each layer in turn. The output data of the final layer L_6 becomes the input of the post-processing P. Here, the post-processing P includes image processing for outputting a result image as output data. Examples of the image processing include black level adjustment, gain adjustment, and color adjustment. Processing other than image processing, such as bounding-box processing, may also be included. The post-processing P need not be executed entirely on the neural network hardware 600; it may be configured to run on a plurality of external processors such as the external host CPU 700. Executing the post-processing P on the external host CPU 700 increases the flexibility of the processing.

To execute the processing in each layer of the CNN 200 in FIG. 20, the external host CPU 700 supplies input data to the neural network hardware 600 and the like via the buffer memory 710 at appropriate timings. Specifically, at the start of the operation in layer L_0, the data D_0 required for the operation of layer L_0 is first read from the external memory 720 and stored in the buffer memory 710. The data D_0 is then held in the buffer memory 710 so that it can be supplied throughout the execution period of the operation in layer L_0.

Next, the operation in layer L_1 is started in parallel with the operation in layer L_0. At this start, the data D_1 required for the operation of layer L_1 is read from the external memory 720 and stored in the buffer memory 710. The data D_1 is then held in the buffer memory 710 so that it can be supplied throughout the execution period of the operation in layer L_1. In this embodiment, the operation in layer L_0 and the operation in layer L_1 are performed at least partly in parallel, so the period during which the data D_0 is held in the buffer memory 710 overlaps the period during which the data D_1 is held. The output data of layer L_0 may overwrite the used area of the data D_0, or may be written directly to the external memory 720.

As shown in FIG. 20, the operation of reading the corresponding data into the buffer memory 710 for each layer operation and holding it until that operation ends is performed for all layers of the CNN 200. Specifically, the data D_p is held until post-processing P, which follows the final layer L_6, ends, and the processing of the CNN 200 ends when the result of post-processing P is written to the external memory 720.

In the control of the buffer memory 710 corresponding to each layer of the CNN 200 shown in FIG. 20, the operation of layer L_n is executed in parallel with the operation of layer L_(n+1), so the data D_n must be held in the buffer memory together with the data D_(n+1) until the operation of layer L_(n+2) starts. In FIG. 20, the timing at which the data D_n is read is indicated by a dotted line, and the period during which the data D_n is held in the buffer memory 710 (hereinafter, the "lifetime") is indicated by a rectangle. In the CNN 200 shown in FIG. 20, operations in consecutive layers are performed in parallel, so the lifetime of the data D_n and the lifetime of the data D_(n+1) overlap in time. The lifetime of each piece of data corresponds to the operation period of the corresponding layer.

Because the data D_n and the data D_(n+1) are different data, they cannot be held in the same memory area of the buffer memory 710. A dedicated area must therefore be provided for each layer. With conventional allocation, as shown in FIG. 20, a large amount of data is placed in the buffer memory 710 so that the operation of each layer can be performed efficiently, and the required capacity of the buffer memory 710 therefore grows with the number of layers included in the CNN 200.

Next, the allocation according to this embodiment will be described in detail. In the allocation step included in the process of generating the software 500, the software generation unit 325 performs allocation on the buffer memory based on the temporally overlapping lifetimes in order to reduce the required usage of the buffer memory 710.

FIG. 21 is a flowchart for explaining the steps included in the allocation step in this embodiment. Each step of the flowchart is executed by the software generation unit 325.

When the processing starts, in step S21 the software generation unit 325 acquires information about the target CNN 200 based on the network information NW, the NN execution model 100, or the like. Specifically, it identifies each layer in the CNN 200 and acquires, for each identified layer, the content and order of the operations, the type and size of the input data, the input order of the input data, and so on. The processing then proceeds to the next step.

In step S22, the software generation unit 325 determines the lifetime for each layer in the CNN 200 based on the information about the CNN 200 acquired in step S21. As a specific determination method, for example, the software generation unit 325 creates, based on the acquired pieces of information, a directed acyclic graph corresponding to the CNN 200 in which the operation in each layer is one node. As one example, the directed acyclic graph is created with one or more operations divided in the network division step as one node. A topological order is then determined for the nodes. The topological order can be determined by repeating, on the created directed acyclic graph and taking edge direction into account, a procedure such as arbitrarily selecting one node with no incoming edges, assigning it a number, and removing that node together with the edges connected to it. In this embodiment, the pieces of information acquired by the software generation unit 325 correspond to the information about the neural network model, and the software generation unit 325 corresponds to an analysis unit that analyzes that information.
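The numbering procedure described for step S22 (select a node with no incoming edges, number it, remove it and its edges, and repeat) matches what is commonly known as Kahn's algorithm. The following is a minimal sketch under that assumption; the function name `topological_order`, the edge-list representation, and the example graph are hypothetical and are not taken from the embodiment.

```python
from collections import deque


def topological_order(num_nodes, edges):
    """Return node indices in a topological order (Kahn's algorithm).

    edges is a list of (src, dst) pairs of a directed graph.
    """
    successors = [[] for _ in range(num_nodes)]
    indegree = [0] * num_nodes
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Nodes with no incoming edges can be numbered immediately.
    ready = deque(n for n in range(num_nodes) if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        # "Remove" the node by discounting its outgoing edges.
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != num_nodes:
        raise ValueError("graph contains a cycle, so no topological order exists")
    return order


# Hypothetical example: layers L_0..L_6 in a chain, followed by post-processing P (node 7).
chain_edges = [(i, i + 1) for i in range(7)]
print(topological_order(8, chain_edges))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

For a network with branches, several valid orders may exist; the sketch returns one of them, which is consistent with the node being selected arbitrarily in the description.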

The software generation unit 325 further determines the start time and lifetime of the processing in each layer by using the topological order in the directed acyclic graph corresponding to the NN execution model 100 and the information about the CNN 200 acquired in step S21. The processing then proceeds to the next step. In this embodiment, the start time of the processing is determined based on the timing at which the necessary data is loaded into the buffer memory 710 or acquired from the neural network hardware 600. The start time need not be an absolute time and may be defined as a relative order among nodes. The lifetime is determined based on the start time of the processing and the input timing at which predetermined data is input (the last input timing when data is input multiple times). As an example, when a given node requires data to be input G times and the unit period required for one data input is T_G, the lifetime can be determined as G × T_G. In other words, the lifetime includes at least one timing for inputting data and can be determined from the order of operations between layers and the period required for each operation. The lifetime end time can be determined from the start time and the length of the lifetime.
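The relation between start time, lifetime length G × T_G, and end time can be illustrated with a short sketch. The `Lifetime` class, the `make_lifetime` helper, and all numeric values below are hypothetical and serve only to show the arithmetic; times are relative units, in line with the statement that the start time need not be absolute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lifetime:
    name: str
    start: int  # relative start time
    end: int    # start + G * T_G


def make_lifetime(name: str, start: int, num_inputs: int, unit_period: int) -> Lifetime:
    """Build a lifetime whose length is G x T_G and whose end is start + length."""
    if num_inputs < 1:
        raise ValueError("a lifetime includes at least one data input timing")
    return Lifetime(name, start, start + num_inputs * unit_period)


# Hypothetical pipeline: layer n starts at time 2n and needs G = 3 inputs of T_G = 1 each.
lifetimes = [
    make_lifetime(f"D_{n}", start=2 * n, num_inputs=3, unit_period=1)
    for n in range(3)
]
for lt in lifetimes:
    print(lt.name, lt.start, lt.end)
# D_0 0 3
# D_1 2 5
# D_2 4 7
```

With these assumed values, D_0 and D_1 overlap in time, while D_0 ends before D_2 starts.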

In step S23, the software generation unit 325 assigns groups based on the lifetime for each layer of the NN execution model 100 determined in step S22. As one example of group assignment, lifetimes whose periods do not overlap can be collected into the same group. More specifically, for any two lifetimes, the end time of one is compared with the start time of the other. If the end time of one lifetime comes earlier than the start time of the other, the two lifetimes are grouped together. This processing is performed for all lifetimes (step S24). As a result, a plurality of groups is formed, and every lifetime is assigned to one of the groups.

Here, as one example of a method of assigning lifetimes to groups, attention is paid to the end time and start time of any two lifetimes. If the end time of one lifetime comes earlier than the start time of the other, an edge can be drawn between the two lifetimes. The result of doing this for all lifetimes can be regarded as a directed graph without cycles. In other words, sets of mutually non-overlapping lifetimes correspond one-to-one to directed paths on the directed graph. The group assignment of lifetimes therefore corresponds to covering all vertices of the directed graph with mutually disjoint directed paths (hereinafter, a "path cover"). The grouping with the minimum number of groups can then be obtained from a minimum path cover of the directed graph.
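The description does not specify how the minimum path cover is computed. One standard approach, assumed here for illustration, reduces the problem to maximum bipartite matching: each matched edge links a lifetime to its successor on a path, and the number of paths equals the number of vertices minus the size of the matching. The sketch below uses a simple augmenting-path matching; the function name, the tuple layout `(name, start, end)`, and the example timings are hypothetical.

```python
def min_path_cover_groups(lifetimes):
    """Group lifetimes via a minimum path cover of the 'ends before starts' DAG.

    lifetimes: list of (name, start, end) tuples.
    Returns a list of groups, each a list of names in time order.
    """
    n = len(lifetimes)
    # Edge u -> v when lifetime u ends strictly before lifetime v starts.
    succ = [
        [v for v in range(n) if lifetimes[u][2] < lifetimes[v][1]]
        for u in range(n)
    ]
    match_next = [-1] * n  # match_next[u] = v: v directly follows u in a group
    match_prev = [-1] * n  # match_prev[v] = u: inverse of match_next

    def try_augment(u, seen):
        # Augmenting-path search for maximum bipartite matching.
        for v in succ[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_prev[v] == -1 or try_augment(match_prev[v], seen):
                match_prev[v] = u
                match_next[u] = v
                return True
        return False

    for u in range(n):
        try_augment(u, set())

    # Every vertex without a predecessor is the head of one path (one group).
    groups = []
    for head in range(n):
        if match_prev[head] == -1:
            path, cur = [], head
            while cur != -1:
                path.append(lifetimes[cur][0])
                cur = match_next[cur]
            groups.append(path)
    return groups


# Hypothetical timings: each lifetime overlaps only its immediate neighbours.
names = [f"D_{i}" for i in range(8)] + ["D_p"]
example = [(name, 2 * i, 2 * i + 3) for i, name in enumerate(names)]
for group in min_path_cover_groups(example):
    print(group)
```

For these assumed timings the cover consists of two groups. A minimum path cover is generally not unique, so the membership of each group produced by this sketch need not coincide with the grouping drawn in the figures, although the number of groups is minimal.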

In step S25, the software generation unit 325 determines the processing order of the groups determined by the assignment up to step S24. The processing order is preferably determined starting from the group that includes the layer with the earliest start time, but is not limited to this. For example, it may be determined by the number of assigned layers, the presence or absence of post-processing, or the like. When the assignment is complete, the processing of this flowchart ends. The reduction in memory usage obtained as an effect of this embodiment depends mainly on the number of groups and only weakly on the order of the groups. The processing order among groups can therefore be changed as appropriate. As one example, the number of groups also depends on the number of branches R in the neural network. Specifically, by performing the lifetime-based allocation according to this embodiment, the maximum number of groups can be kept to at most twice the number of branches R.

FIG. 22 shows an example of the CNN 200 (upper diagram) and an allocation example according to this embodiment (lower diagram) for the use of the buffer memory 710 in the edge device on which it operates. To simplify the description, the CNN 200 is assumed to have the same configuration as that shown in FIG. 20.

In this embodiment, as shown in FIG. 22, in the allocation step the software generation unit 325 assigns the lifetimes to a plurality of groups (for example, group 1 and group 2) in consideration of the temporally overlapping lifetimes, and then performs allocation on the buffer memory. More specifically, it analyzes each lifetime, compares the start and end times of the lifetimes, and groups them as a solution to the path cover problem. As an example, the end time of the data D_0 comes before the start time of the data D_2, so these two are assigned to the same group; the end time of the data D_0 does not come before the start time of the data D_1, so these two are assigned to different groups.

In this embodiment, the groups are assigned such that the data D_0, D_2, D_4, and D_6 form group 1 and the data D_1, D_3, D_5, D_7, and D_p form group 2.

As shown in FIG. 22, the memory usage of the buffer memory 710 depends on the number of groups and can be reduced significantly compared with the conventional example of FIG. 20. A further effect of the allocation according to this embodiment is that, because lifetimes are assigned to groups, memory usage can be kept low regardless of an increase in the number of layers. The lifetime-based allocation step of this embodiment is not limited to the form of the CNN 200 shown in FIG. 20 and elsewhere; it can also be applied when there are more layers, when two or more branches are included, and so on.
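The dependence of memory usage on the number of groups can be illustrated with a rough estimate. The sketch below assumes, as a simplification not stated in the description, that conventional allocation reserves a dedicated region for every piece of data, and that grouped allocation reserves one shared region per group sized to the largest member of that group. The data sizes are hypothetical; the group membership follows the two groups given above.

```python
def conventional_usage(sizes):
    """One dedicated buffer region per piece of data (assumed conventional scheme)."""
    return sum(sizes.values())


def grouped_usage(groups, sizes):
    """One shared region per group, sized to the largest member (assumed scheme)."""
    return sum(max(sizes[name] for name in group) for group in groups)


# Hypothetical sizes in KiB: every piece of data occupies 64 KiB.
sizes = {f"D_{i}": 64 for i in range(8)}
sizes["D_p"] = 64

groups = [
    ["D_0", "D_2", "D_4", "D_6"],         # group 1
    ["D_1", "D_3", "D_5", "D_7", "D_p"],  # group 2
]

print(conventional_usage(sizes))     # 576
print(grouped_usage(groups, sizes))  # 128
```

Under these assumptions, adding more layers to the chain increases the conventional estimate but leaves the grouped estimate unchanged as long as the number of groups and the largest data size per group stay the same.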

In the conventional allocation step for a CNN 200 having a plurality of layers, the choice of allocation relied on heuristics because of the complexity of the operations repeated within the neural network. It was therefore difficult to perform appropriate allocation for neural network operations and difficult to estimate buffer usage at design time. By contrast, computing a minimum path cover with the temporally overlapping lifetimes as nodes, as in the present invention, not only allows appropriate allocation to be performed in polynomial time but also allows buffer usage to be estimated at design time.

In this embodiment, an example has been described in which, in the allocation step of generating the software 500 that assigns the divided operations to the neural network hardware 600 for execution, the software generation unit 325 performs assignment on a per-layer basis, but the assignment is not limited to this. For example, pieces of divided data obtained by dividing one layer may be the targets.

In this embodiment, the target memory is the buffer memory 710 provided in the external host CPU 700, but other memories may be the target of the allocation step. For example, when a plurality of memories is provided on the external host CPU 700, or when a memory is provided in a circuit that relays between the external host CPU 700 and the neural network hardware 600, that memory may be the target.

As described above, the neural network generation device 300 according to this embodiment can generate and control a neural network that can be embedded in an embedded device such as an IoT device and operated with high performance.

An embodiment of the present invention has been described in detail above with reference to the drawings, but the specific configuration is not limited to this embodiment and includes design changes and the like within a scope that does not depart from the gist of the present invention. The constituent elements shown in the above embodiment and modifications may be combined as appropriate.

(Modification 1)
In the above embodiment, the first memory 1 and the second memory 2 are separate memories, but the form of the first memory 1 and the second memory 2 is not limited to this. The first memory 1 and the second memory 2 may be, for example, a first memory area and a second memory area in the same memory.

(Modification 2)
For example, the data input to the NN execution model 100 described in the above embodiment is not limited to a single format and may consist of still images, moving images, audio, text, numerical values, or combinations of these. The data input to the NN execution model 100 is not limited to measurement results from physical quantity measuring instruments that can be mounted on the edge device provided with the neural network hardware model 400, such as an optical sensor, a thermometer, a Global Positioning System (GPS) measuring instrument, an angular velocity measuring instrument, or an anemometer. Different kinds of information may be combined, such as peripheral information received from peripheral devices via wired or wireless communication (base station information, information on vehicles and ships, weather information, congestion information, and the like), financial information, and personal information.

(Modification 3)
The edge device provided with the NN execution model 100 is assumed to be a battery-driven communication device such as a mobile phone, a smart device such as a personal computer, or a mobile device such as a digital camera, a game device, or a robot product, but is not limited to these. Effects not found in other prior examples can also be obtained by using it in products with a strong demand for a limit on the peak power that can be supplied, as with Power over Ethernet (PoE), for reduced product heat generation, or for long operating time. For example, applying it to in-vehicle cameras mounted on vehicles, ships, and the like, or to surveillance cameras installed in public facilities, on roads, and the like, not only enables long-duration shooting but also contributes to weight reduction and higher durability. Similar effects can be obtained by applying it to display devices such as televisions and displays, medical equipment such as medical cameras and surgical robots, and work robots used at manufacturing and construction sites.

The program in the embodiment described above may be recorded on a computer-readable recording medium, and the embodiment may be realized by causing a computer system to read and execute the program recorded on that recording medium. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The "computer-readable recording medium" may further include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds the program for a certain period, such as a volatile memory inside a computer system serving as the server or client in that case. The program may realize only part of the functions described above, or may realize those functions in combination with a program already recorded in the computer system.

The effects described in this specification are merely explanatory or illustrative and are not limiting. That is, the technology according to the present disclosure may produce, together with or in place of the above effects, other effects that are apparent to those skilled in the art from the description of this specification.

The present invention can be applied to the generation of neural networks.

300 Neural network generation device
200 Convolutional neural network (CNN)
100 Neural network execution model (NN execution model)
400 Neural network hardware model
500 Software
600 Neural network hardware
1 First memory
2 Second memory
3 DMA controller (DMAC)
4 Convolution operation circuit
42 Multiplier
43 Accumulator circuit
5 Quantization operation circuit
52 Vector operation circuit
53 Quantization circuit
6 Controller
61 Register
PM Learned parameters
DS Learning data set
HW Hardware information
NW Network information

Claims

1. A software generation device for generating software for controlling a neural network circuit, the software generation device comprising:
analysis means for analyzing information about a model including a plurality of layers that operates in the neural network circuit;
determining means for determining lifetimes corresponding to the plurality of layers included in the model based on an analysis result of the analysis means; and
generating means for generating the software based on the lifetimes.

2. The software generation device according to claim 1, wherein the software includes a plurality of commands for controlling the neural network circuit.

3. The software generation device according to claim 1, wherein each lifetime includes one or more timings for inputting data to the neural network circuit in the corresponding layer.

4. The software generation device according to claim 1, wherein
the software includes arrangement information of the data held on the external memory, and
the generating means generates the arrangement information based on groups to which a plurality of the lifetimes are assigned.

5. The software generation device according to any one of claims 1 to 4, wherein
each lifetime corresponds to a computation period in the corresponding layer, and
at least some of the corresponding computation periods in the plurality of layers overlap in time.

6. The software generation device according to claim 1, wherein
the neural network circuit includes a convolution operation circuit that performs a convolution operation and a quantization operation circuit that performs a quantization operation based on a result of the convolution operation, and
the convolution operation circuit and the quantization operation circuit are formed in a loop.

7. A software generation method for generating software for controlling a neural network circuit, the software generation method comprising:
an analysis step of analyzing information about a model including a plurality of layers that operates in the neural network circuit;
a determination step of determining lifetimes corresponding to the plurality of layers included in the model based on the analysis result of the analysis means; and
a generating step of generating the software based on the lifetimes.