JP7178323B2 - LEARNING DEVICE, LEARNING METHOD AND LEARNING PROGRAM

Info

Publication number: JP7178323B2
Application number: JP2019096975A
Authority: JP
Inventors: 優大屋; 哲志八木; 慎河野; 仁中澤
Original assignee: Nippon Telegraph and Telephone Corp; Keio University
Current assignee: Nippon Telegraph and Telephone Corp; Keio University
Priority date: 2019-05-23
Filing date: 2019-05-23
Publication date: 2022-11-25
Anticipated expiration: 2039-05-23
Also published as: JP2020191006A

Description

The present invention relates to a learning device, a learning method, and a learning program.

Deep neural networks are models used in a variety of fields, including image and speech recognition. Such a model consists of a multi-layer neural network, and each neural network consists of multiple perceptrons.

A perceptron obtains a single value by computing the sum of products of its input signals and parameters called weights. To provide the input signal for the next layer, the perceptron then maps this value through a nonlinear function called an activation function and outputs the resulting signal value. A deep neural network performs these calculations in order from the input layer to the output layer, passing signals from layer to layer, and thereby obtains a predicted value for the input signal.

A known technique binarizes the parameter and signal values of a deep neural network to reduce memory consumption during computation (see, for example, Non-Patent Document 1). A deep neural network that performs its calculations with binarized parameter and signal values in this way is called a binary network.
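The memory saving mentioned above can be illustrated with a small sketch. It assumes one common encoding in which each +1/-1 value is stored as a single bit, so that a dot product reduces to an XOR followed by a bit count. The function names `pack_signs` and `binary_dot` are hypothetical and do not appear in the patent.

```python
def pack_signs(values):
    """Pack a sequence of +1/-1 values into one int, one bit per value (+1 -> 1, -1 -> 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("values must be +1 or -1")
        if v == 1:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    Positions where the bits differ contribute -1, all others +1,
    so the result is n - 2 * (number of differing bits).
    """
    differing = bin(a_bits ^ b_bits).count("1")
    return n - 2 * differing


# Illustrative values only (assumption, not from the patent).
w = [1, -1, -1, 1, 1, -1, 1, 1]
x = [1, 1, -1, -1, 1, -1, -1, 1]
packed = binary_dot(pack_signs(w), pack_signs(x), len(w))
plain = sum(wi * xi for wi, xi in zip(w, x))
print(packed, plain)
```

Both computations give the same value, but the packed form needs only one bit per weight and per signal.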

I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1, pp.4107-4115, 2016.
J. Xu, P. Wang, H. Yang, A. M. Lopez, Training a Binary Weight Object Detector by Knowledge Transfer for Autonomous Driving, arXiv preprint arXiv:1804.06332, 2018.

However, because the parameters and signals of a binary network are restricted to two values, noise entering the input layer may substantially change the predicted value obtained from the output layer. In other words, binary networks suffer from low robustness. An object of the present invention is therefore to solve this problem and provide a highly robust binary network.

To solve the above problem, the present invention comprises: a conversion unit that binarizes the weight values used in each layer of a deep neural network; and a calculation unit that, using an input value to the deep neural network whose weight values have been binarized and related information of the input value, outputs, by the information bottleneck method, the probabilistic mapping obtained when the input value is treated as a random variable, as a latent variable used in the deep neural network when predicting the related information of the input value from the input value.

According to the present invention, a highly robust binary network can be provided.

FIG. 1 is a diagram outlining the learning of a binary network by the learning device.
FIG. 2 is a diagram showing a configuration example of the learning device.
FIG. 3 is a flowchart showing an example of the processing procedure of the learning device.
FIG. 4 is a diagram showing an example of a computer that executes the learning program.

Hereinafter, modes for carrying out the present invention (embodiments) will be described with reference to the drawings. First, the binary network to be trained by the learning device of this embodiment will be described.

In forward propagation, the binary network computes the sum of products of each signal x^(l-1) input from layer (l-1) and the parameters w. The binary network then activates the result of this product-sum with the sign function sign to obtain the signal x^(l), and outputs x^(l) to the next layer. During the product-sum, the binary network binarizes the parameters w with the sign function sign (see equation (1)).
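The forward propagation described above can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function names are hypothetical, the example weights are made up, and mapping sign(0) to +1 is an assumed convention that the text does not specify.

```python
def sign(v):
    """Sign function returning +1 or -1; sign(0) = +1 is an assumed convention."""
    return 1 if v >= 0 else -1


def binary_layer_forward(x_prev, weights):
    """One layer: x^(l)_j = sign(sum_i sign(w_ji) * x^(l-1)_i)."""
    out = []
    for row in weights:
        # Binarize each weight with sign, then take the product-sum with the inputs.
        acc = sum(sign(w) * x for w, x in zip(row, x_prev))
        # Activate the product-sum with sign to get the signal for the next layer.
        out.append(sign(acc))
    return out


def binary_network_forward(x, layers):
    """Apply the layers in order from the input layer to the output layer."""
    for weights in layers:
        x = binary_layer_forward(x, weights)
    return x


# Illustrative real-valued weights (assumption): a 3 -> 2 -> 1 network.
x0 = [1, -1, 1]
layers = [
    [[0.7, -0.2, 0.1], [-1.3, 0.4, -0.6]],
    [[0.5, -0.9]],
]
y = binary_network_forward(x0, layers)
print(y)
```

Real-valued weights are kept as the stored parameters here and binarized only inside the product-sum, which matches the description that w is binarized "during the product-sum".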

[Overview]
Next, an outline of the learning of a binary network by the learning device will be described with reference to FIG. 1. Binary networks A and B shown in FIG. 1 are sub-networks included in the binary network to be trained. Binary network A uses parameters θ to compute z as a mapping of the input data x, and binary network B uses parameters φ to compute the predicted label y as a mapping of its input data z. The probability distribution of binary network A is p_θ(z|x), and that of binary network B is q_φ(y|z).

In this case, the learning device first binarizes the parameters θ and φ. The learning device then uses the information bottleneck method to obtain a probabilistic mapping z of the input data x to the binary network. The probability distribution r_θ(z) of the mapping z obtained here is common to each correct label of the input data, even if the input data x to the binary network contains noise. In other words, even when the input data x differ, a common probability distribution r_θ(z) appears for each correct label of the input data. The learning device 10 can therefore obtain a highly robust binary network.

[Configuration]
Next, the configuration of the learning device will be described with reference to FIG. 2. The learning device 10 includes an input/output unit 11, a control unit 12, and a storage unit 13. The input/output unit 11 handles the input and output of various kinds of information. For example, the input/output unit 11 receives various data used for learning, such as the initial values of the parameters w used in the binary network to be trained by the control unit 12.

The control unit 12 controls the learning device 10 as a whole. The control unit 12 includes a conversion unit 121 and a calculation unit 122. The conversion unit 121 binarizes the weight values used in each layer of the deep neural network (binary network). For example, the conversion unit 121 uses the sign function sign to binarize the weight values used in each layer of the deep neural network (binary network) to either +1 or -1.

The calculation unit 122 trains the binary network whose weight values have been binarized by the conversion unit 121, using the information bottleneck method. Using input values to the deep neural network with binarized weight values and the related information of those input values, the calculation unit 122 clusters one or more input values by the information bottleneck method so that input values with similar related information are grouped together. The calculation unit 122 then outputs the probabilistic mapping obtained when the input values in this clustering are treated as random variables, as the latent variable used in the deep neural network when predicting the related information of an input value from that input value. Details of the calculation unit 122 are described later.

The storage unit 13 stores the model of the binary network obtained through learning by the control unit 12. The model includes, for example, the values of the weights (parameters w) used in each layer of the binary network, the latent variable (z), the activation function, and similar information.

[Processing procedure]
The processing procedure of the learning device 10 will be described with reference to FIG. 3. For example, the conversion unit 121 of the learning device 10 binarizes the weight values used in each layer of the deep neural network (binary network) (S1). The calculation unit 122 then calculates the latent variable using the information bottleneck method for the binary network whose weight values were binarized in S1 (S2).

[Details of the calculation unit]
The calculation unit 122 will now be described in detail. For a binary network whose weight values have been binarized, the calculation unit 122 uses the input values to the binary network and the related information of those input values to cluster the input values with the information bottleneck so that input values with similar related information are grouped together. The related information is information associated with an input value. For example, when the input value is a word, its related information may be the topic of a document containing that word.

In this clustering, the calculation unit 122 outputs the probabilistic mapping to the cluster variable, obtained when the input value is treated as a discrete random variable, as the latent variable used in the binary network when predicting the related information of an input value from that input value.

In general, clustering with the information bottleneck is performed by minimizing the value of equation (2), using the variable X to be clustered, the cluster variable Z of X (a probabilistic mapping of X), and the related information Y of X. Here, I in equation (2) denotes mutual information. That is, the clustering is performed by finding a Z that makes the mutual information I(X;Z) between X and Z as small as possible and the mutual information I(Z;Y) between Z and Y as large as possible.
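The trade-off described above can be checked numerically on a toy problem. The sketch below assumes the common form I(X;Z) - β·I(Z;Y) for the quantity being minimized; equation (2) itself is not reproduced in this text, so the coefficient β, the function names, and the toy distributions are all assumptions for illustration.

```python
import math


def mutual_information(joint):
    """I(A;B) in nats for a joint distribution given as a 2-D list joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (pa[a] * pb[b]))
    return mi


def ib_objective(p_xy, p_z_given_x, beta):
    """I(X;Z) - beta * I(Z;Y) for discrete X, Y and a stochastic mapping p(z|x)."""
    n_x, n_y, n_z = len(p_xy), len(p_xy[0]), len(p_z_given_x[0])
    p_x = [sum(row) for row in p_xy]
    # p(x, z) = p(x) * p(z|x)
    p_xz = [[p_x[x] * p_z_given_x[x][z] for z in range(n_z)] for x in range(n_x)]
    # p(z, y) = sum_x p(x, y) * p(z|x), since Z depends on Y only through X
    p_zy = [[sum(p_xy[x][y] * p_z_given_x[x][z] for x in range(n_x))
             for y in range(n_y)] for z in range(n_z)]
    return mutual_information(p_xz) - beta * mutual_information(p_zy)


# Toy data (assumption): four equally likely inputs, the first two with label 0
# and the last two with label 1.
p_xy = [[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]]
by_label = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # one cluster per label
identity = [[1.0 if z == x else 0.0 for z in range(4)] for x in range(4)]  # one cluster per input
beta = 2.0
score_by_label = ib_objective(p_xy, by_label, beta)
score_identity = ib_objective(p_xy, identity, beta)
print(score_by_label, score_identity)
```

Grouping inputs by label keeps all of the information about Y while discarding the distinction between inputs that share a label, so it scores lower than the mapping that keeps every input separate.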

When the binary network to be trained by the learning device 10 predicts the label value y of input data x from that input data x, the calculation unit 122 treats the input data x as a discrete random variable and the label value y as the related information of x, and obtains the probabilistic mapping z (latent variable z) of the input data x that minimizes equation (3).

When r(z) is a variational approximation of the marginal distribution p(z), minimizing equation (3) is equivalent to minimizing equation (4).

Here, p_θ(z|x) is the probability distribution of z when x is given to the binary network with parameters θ, and q_φ(y|z) is the probability distribution of y when z is given to the binary network with parameters φ. p_θ(z|x) is obtained from the output values of the binary network with parameters θ, and q_φ(y|z) from the output values of the binary network with parameters φ. r_θ(z) is the prior distribution of z (the latent variable z) and is assumed to follow a Gaussian distribution N(μ, σ) with mean μ and variance σ.
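Equations (2) to (4) are not reproduced in this text. As a hedged sketch only, the usual information bottleneck objective and its variational bound can be written in the symbols defined above as follows. The trade-off coefficient β and the exact arrangement of terms are assumptions and may differ from the patent's own equations.

```latex
% Information bottleneck objective (assumed form, \beta > 0):
\min_{p(z \mid x)} \; I(X;Z) - \beta \, I(Z;Y)

% Variational bounds using r(z) \approx p(z) and q_\phi(y \mid z) \approx p(y \mid z):
I(X;Z) \le \mathbb{E}_{p(x)}\!\left[ \mathrm{KL}\!\left( p_\theta(z \mid x) \,\|\, r(z) \right) \right]
\qquad
I(Z;Y) \ge \mathbb{E}_{p(x,y)}\, \mathbb{E}_{p_\theta(z \mid x)}\!\left[ \log q_\phi(y \mid z) \right] + H(Y)

% Resulting training loss, up to the constant H(Y), after dividing by \beta:
\mathcal{L}(\theta, \phi) =
  \mathbb{E}_{p(x,y)}\!\left[
    \mathbb{E}_{p_\theta(z \mid x)}\!\left[ -\log q_\phi(y \mid z) \right]
    + \frac{1}{\beta} \, \mathrm{KL}\!\left( p_\theta(z \mid x) \,\|\, r(z) \right)
  \right]
```

The first term is the usual prediction loss of network B on the latent variable z, and the second is the KL divergence term that acts as a regularizer on network A.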

The calculation unit 122 trains the binary network while regularizing with the KL divergence term, as shown in equation (4). As a result, the model of the binary network becomes one in which the feature z is obtained from the input data x, so a common feature z is likely to be obtained even if the input data x contains noise. Consequently, when the binary network outputs the predicted label y of input data x from that input data, for example, a highly robust predicted label y can be output.
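A loss with a KL divergence regularizer of this kind can be sketched as follows. The sketch assumes that p_θ(z|x) is a diagonal Gaussian described by a mean and a standard deviation per dimension, that the prior r(z) is also a diagonal Gaussian, and that the prediction term is a cross-entropy. The weight `kl_weight` and all function names are hypothetical, and none of these choices are confirmed by the patent text.

```python
import math
import random


def kl_diag_gaussians(mu_p, sigma_p, mu_r, sigma_r):
    """KL( N(mu_p, sigma_p^2) || N(mu_r, sigma_r^2) ), summed over independent dimensions.

    The sigma arguments are standard deviations (assumption of this sketch).
    """
    total = 0.0
    for mp, sp, mr, sr in zip(mu_p, sigma_p, mu_r, sigma_r):
        total += math.log(sr / sp) + (sp ** 2 + (mp - mr) ** 2) / (2 * sr ** 2) - 0.5
    return total


def sample_latent(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def cross_entropy(probs, label):
    """-log q(y|z) evaluated at the correct label."""
    return -math.log(probs[label])


def regularized_loss(probs, label, mu_p, sigma_p, mu_r, sigma_r, kl_weight):
    """Prediction loss plus the weighted KL divergence regularizer."""
    kl = kl_diag_gaussians(mu_p, sigma_p, mu_r, sigma_r)
    return cross_entropy(probs, label) + kl_weight * kl


# Illustrative numbers only (assumption): a 2-dimensional latent variable.
mu_p, sigma_p = [0.5, -0.5], [1.0, 1.0]   # stand-in for the output of network A
mu_r, sigma_r = [0.0, 0.0], [1.0, 1.0]    # prior r(z)
z = sample_latent(mu_p, sigma_p, random.Random(0))
probs = [0.8, 0.2]                         # stand-in for the output of network B given z
loss = regularized_loss(probs, 0, mu_p, sigma_p, mu_r, sigma_r, kl_weight=0.1)
print(len(z), loss)
```

The KL term grows as the latent distribution for an individual input drifts away from the shared prior, which is what pushes inputs with the same label toward a common feature z.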

[Program]
The learning device 10 can be implemented by installing a program that realizes the functions described in the above embodiment on a desired information processing device (computer). For example, an information processing device can be made to function as the learning device 10 by having it execute the above program, provided as packaged software or online software. The information processing device referred to here includes desktop and notebook personal computers, rack-mounted server computers, and the like. The category also includes smartphones, mobile communication terminals such as mobile phones and PHS (Personal Handyphone System) terminals, and PDAs (Personal Digital Assistants). The learning device 10 may also be implemented on a cloud server.

An example of a computer that executes the above program (learning program) will be described with reference to FIG. 4. As shown in FIG. 4, the computer 1000 includes, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. A mouse 1110 and a keyboard 1120, for example, are connected to the serial port interface 1050. A display 1130, for example, is connected to the video adapter 1060.

As shown in FIG. 4, the hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. The various data and information described in the above embodiment are stored in, for example, the hard disk drive 1090 or the memory 1010.

The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the hard disk drive 1090 into the RAM 1012 as needed and executes each of the procedures described above.

The program module 1093 and the program data 1094 of the learning program are not limited to being stored in the hard disk drive 1090. For example, they may be stored on a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored on another computer connected via a network such as a LAN or a WAN (Wide Area Network) and read by the CPU 1020 via the network interface 1070.

10 learning device
11 input/output unit
12 control unit
13 storage unit
121 conversion unit
122 calculation unit

Claims

1. A learning device comprising:
a conversion unit that binarizes weight values used in each layer of a deep neural network; and
a calculation unit that, using an input value to the deep neural network whose weight values have been binarized and related information of the input value, outputs, by an information bottleneck method, a probabilistic mapping obtained when the input value is treated as a random variable, as a latent variable used in the deep neural network when predicting the related information of the input value from the input value.

2. The learning device according to claim 1, wherein the conversion unit converts the weight values used in each layer of the deep neural network to either +1 or -1 using a sign function.

3. A learning method performed by a learning device for a deep neural network, the method comprising:
a step of binarizing weight values used in each layer of the deep neural network; and
a step of, using an input value to the deep neural network whose weight values have been binarized and related information of the input value, outputting, by an information bottleneck method, a probabilistic mapping obtained when the input value is treated as a random variable, as a latent variable used in the deep neural network when predicting the related information of the input value from the input value.

4. A learning program causing a computer to execute:
a step of binarizing weight values used in each layer of a deep neural network; and
a step of, using an input value to the deep neural network whose weight values have been binarized and related information of the input value, outputting, by an information bottleneck method, a probabilistic mapping obtained when the input value is treated as a random variable, as a latent variable used in the deep neural network when predicting the related information of the input value from the input value.