JP2020191006A

JP2020191006A - Learning device, learning method, and learning program

Info

Publication number: JP2020191006A
Application number: JP2019096975A
Authority: JP
Inventors: 優大屋; Masaru Oya; 哲志八木; Tetsushi Yagi; 慎河野; Shin Kono; 仁中澤; Hitoshi Nakazawa
Original assignee: Nippon Telegraph and Telephone Corp; Keio University
Current assignee: Nippon Telegraph and Telephone Corp; Keio University
Priority date: 2019-05-23
Filing date: 2019-05-23
Publication date: 2020-11-26
Anticipated expiration: 2039-05-23
Also published as: JP7178323B2

Abstract

To provide a binary network with high robustness. SOLUTION: A learning device binarizes the weight values used in each layer of a deep neural network. Then, using an input value to the deep neural network whose weight values have been binarized and related information of the input value, the learning device applies the information bottleneck method and outputs the stochastic mapping obtained when the input value is treated as a random variable, as a latent variable used in the deep neural network when predicting the related information of the input value from the input value. SELECTED DRAWING: Figure 1

Description

The present invention relates to a learning device, a learning method, and a learning program.

Deep neural networks are models used in various fields, including image and speech recognition. Such a model is composed of a multi-layer neural network, and the neural network is composed of a plurality of perceptrons.

A perceptron obtains a single value by taking the sum of products of a plurality of input signals and their respective parameters, called weights. To provide the input signal for the next layer, the perceptron then maps the obtained value through a non-linear function called an activation function and outputs the resulting signal value. A deep neural network performs these calculations in order from the input layer to the output layer, passing a signal to each layer, and thereby obtains a predicted value for the input signal.
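The calculation described above can be sketched as follows. This is a minimal illustration, not code from the patent: the function names `perceptron` and `forward`, the choice of `tanh` as the activation function, and the sample values are all assumptions made for the example.

```python
import numpy as np

def perceptron(x, w, activation=np.tanh):
    """Sum of products of inputs and weights, followed by a non-linear activation."""
    s = np.dot(w, x)       # sum of products of the input signals and the weights
    return activation(s)   # signal value passed on to the next layer

def forward(x, layers, activation=np.tanh):
    """Propagate x from the input layer to the output layer.

    `layers` is a list of weight matrices; each row of a matrix is one perceptron.
    """
    for W in layers:
        x = activation(W @ x)
    return x

# Hypothetical sample values.
x = np.array([1.0, -2.0, 0.5])
w = np.array([0.5, 0.25, -1.0])
y = perceptron(x, w)       # weighted sum is 0.5 - 0.5 - 0.5 = -0.5, so y = tanh(-0.5)
```

Stacking several weight matrices in `layers` gives the layer-by-layer propagation that the text describes.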

A method is known that binarizes the parameters and signal values of a deep neural network to reduce memory consumption during calculation (see, for example, Non-Patent Document 1). A deep neural network that performs its calculations with binarized parameters and signal values in this way is called a binary network.

I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1, pp.4107-4115, 2016.
J. Xu, P. Wang, H. Yang, A. M. Lopez, Training a Binary Weight Object Detector by Knowledge Transfer for Autonomous Driving, arXiv preprint arXiv:1804.06332, 2018.

However, because the parameters and signals of a binary network are limited to binary values, noise entering the input layer may greatly change the predicted value obtained from the output layer. In other words, a binary network has the problem of low robustness. Therefore, an object of the present invention is to solve this problem and provide a highly robust binary network.

In order to solve the above-mentioned problem, the present invention comprises: a conversion unit that binarizes the weight values used in each layer of a deep neural network; and a calculation unit that, using an input value to the deep neural network in which the weight values are binarized and related information of the input value, outputs, by the information bottleneck method, the stochastic mapping obtained when the input value is treated as a random variable, as a latent variable used in the deep neural network when predicting the related information of the input value from the input value.

According to the present invention, it is possible to provide a highly robust binary network.

FIG. 1 is a diagram illustrating an outline of the learning of a binary network by the learning device.
FIG. 2 is a diagram showing a configuration example of the learning device.
FIG. 3 is a flowchart showing an example of a processing procedure of the learning device.
FIG. 4 is a diagram showing an example of a computer that executes the learning program.

Hereinafter, modes (embodiments) for carrying out the present invention will be described with reference to the drawings. First, the binary network to be learned by the learning device of the present embodiment will be described.

In forward propagation, the binary network computes the sum of products of each signal x^(l-1) input from layer (l-1) and the parameter w. The binary network then activates the result of this sum of products with the sign function sign to obtain the signal x^(l), and outputs this signal x^(l) to the next layer. When computing the sum of products, the binary network binarizes the parameter w with the sign function sign (see equation (1)).
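Equation (1) is not reproduced in this text, so the following is only a sketch of the forward step as the paragraph describes it, x^(l) = sign(sign(w) · x^(l-1)). The function names, the convention sign(0) = +1, and the sample values are assumptions made for illustration.

```python
import numpy as np

def sign(a):
    """Sign function mapping to {-1, +1}. Treating sign(0) as +1 is an assumption."""
    return np.where(a >= 0, 1.0, -1.0)

def binary_layer(x_prev, w):
    """One forward step of a binary layer: x^(l) = sign(sign(w) @ x^(l-1))."""
    w_b = sign(w)             # binarize the parameter w at the time of the sum of products
    return sign(w_b @ x_prev) # activate the sum of products with the sign function

# Hypothetical signals from layer (l-1) and real-valued parameters.
x_prev = np.array([1.0, -1.0, 1.0])
w = np.array([[0.3, -0.7, 0.1],
              [-0.2, 0.4, -0.9]])
x_next = binary_layer(x_prev, w)  # sums of products are 3 and -3, so x_next = [1, -1]
```

Because both the weights and the signals take only two values, a single flipped input can change the sum of products by 2, which is the sensitivity to input noise that the robustness discussion refers to.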

[Overview]
Next, an outline of how the learning device trains the binary network will be described with reference to FIG. 1. The binary networks A and B shown in FIG. 1 are assumed to be sub-networks included in the binary network to be learned. Binary network A uses the parameter θ to calculate z as a mapping of the input data x, and binary network B uses the parameter φ to calculate the predicted label y as a mapping of the input data z. Here, the probability distribution of binary network A is p_θ(z|x), and the probability distribution of binary network B is q_φ(y|z).

In this case, the learning device 10 first binarizes the parameters θ and φ. The learning device 10 then uses the information bottleneck method to obtain the stochastic mapping z of the input data x to the binary network. The probability distribution r_θ(z) of the mapping z obtained here is common to each correct label of the input data, even if the input data x to the binary network contains noise. In other words, even when the input data x differ, a common probability distribution r_θ(z) appears for each correct label of the input data. The learning device 10 can therefore obtain a highly robust binary network.

[Configuration]
Next, the configuration of the learning device will be described with reference to FIG. 2. The learning device 10 includes an input/output unit 11, a control unit 12, and a storage unit 13. The input/output unit 11 handles the input and output of various information. For example, the input/output unit 11 receives input of various data used for learning, such as the initial value of the parameter w used in the binary network to be learned by the control unit 12.

The control unit 12 controls the entire learning device 10. The control unit 12 includes a conversion unit 121 and a calculation unit 122. The conversion unit 121 binarizes the weight values used in each layer of the deep neural network (binary network). For example, the conversion unit 121 uses the sign function sign to binarize the weight values used in each layer of the deep neural network (binary network) to either +1 or -1.
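As a rough illustration of what the conversion unit 121 does and why binarization reduces memory consumption, the sketch below binarizes a weight matrix with a sign function and stores each weight as a single bit. The bit-packing step, the function names, the convention sign(0) = +1, and the random sample weights are assumptions for this example and are not described in the patent.

```python
import numpy as np

def binarize(w):
    """Convert real-valued weights to +1 or -1 (sign(0) = +1 is an assumption)."""
    return np.where(w >= 0, 1, -1).astype(np.int8)

def pack(w_b):
    """Store each binarized weight as one bit: 1 for +1, 0 for -1."""
    return np.packbits(w_b.ravel() > 0)

def unpack(bits, shape):
    """Recover the +1/-1 weight array of the given shape from packed bits."""
    n = int(np.prod(shape))
    flat = np.unpackbits(bits)[:n].astype(np.int8)
    return (2 * flat - 1).reshape(shape)

# Hypothetical initial weights for one layer.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)

w_b = binarize(w)
packed = pack(w_b)
print(w.nbytes, packed.nbytes)  # 128 bytes as float32 versus 4 bytes packed
```

In this sketch a float32 weight occupies 32 bits and a binarized weight occupies 1 bit, so the packed representation is 32 times smaller.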

The calculation unit 122 trains the binary network whose weight values have been binarized by the conversion unit 121, using the information bottleneck method. Using the input values to the deep neural network in which the weight values are binarized and the related information of those input values, the calculation unit 122 clusters one or more input values by the information bottleneck method so that the related information of the input values in each cluster is similar. The calculation unit 122 then outputs the stochastic mapping obtained when the input value in this clustering is treated as a random variable, as a latent variable used in the deep neural network when predicting the related information of the input value from the input value. Details of the calculation unit 122 will be described later.

The storage unit 13 stores the model of the binary network obtained through learning by the control unit 12. The model includes, for example, information such as the values of the weights (parameter w) used in each layer of the binary network, the latent variable (z), and the activation function.

[Processing procedure]
The processing procedure of the learning device 10 will be described with reference to FIG. 3. For example, the conversion unit 121 of the learning device 10 binarizes the weight values used in each layer of the deep neural network (binary network) (S1). The calculation unit 122 then calculates the latent variable using the information bottleneck method for the binary network whose weight values were binarized in S1 (S2).

[Details of calculation unit]
The calculation unit 122 will now be described in detail. For the binary network in which the weight values are binarized, the calculation unit 122 uses the input values to the binary network and the related information of those input values to perform clustering by the information bottleneck method so that the related information of the input values is similar. The related information is information related to the input value. For example, when the input value is a word, the related information of the input value is the topic of a document containing the word, or the like.

In this clustering, the calculation unit 122 outputs the stochastic mapping to the cluster variable, obtained when the input value is treated as a discrete random variable, as a latent variable used in the binary network when predicting the related information of the input value from the input value.

In general, clustering using the information bottleneck is performed by minimizing the value of equation (2), using the variable X to be clustered, the cluster variable Z of the variable X (a stochastic mapping of the variable X), and the related information Y of the variable X. In equation (2), I denotes mutual information. That is, the clustering is performed by finding a Z that makes the mutual information I(X;Z) between X and Z as small as possible and the mutual information I(Z;Y) between Z and Y as large as possible.
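Equation (2) is not reproduced in this text. As a sketch only, the example below evaluates the commonly used form of the information bottleneck objective, I(X;Z) - β I(Z;Y), for small discrete distributions. The weighting factor `beta`, its placement on the I(Z;Y) term, the function names, and the toy distributions are assumptions made for illustration.

```python
import numpy as np

def mutual_information(p_joint):
    """Mutual information I(A;B) in nats for a joint probability table p_joint[a, b]."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0  # terms with zero probability contribute nothing
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (p_a @ p_b)[mask])))

def ib_objective(p_xy, p_z_given_x, beta=1.0):
    """I(X;Z) - beta * I(Z;Y) for a stochastic mapping p(z|x), where Z depends on X only."""
    p_x = p_xy.sum(axis=1)             # marginal p(x)
    p_xz = p_x[:, None] * p_z_given_x  # joint p(x, z)
    p_zy = p_z_given_x.T @ p_xy        # joint p(z, y) = sum_x p(z|x) p(x, y)
    return mutual_information(p_xz) - beta * mutual_information(p_zy)

# Toy data: four equally likely inputs; inputs 0 and 1 have label 0, inputs 2 and 3 have label 1.
p_xy = np.array([[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]])

identity = np.eye(4)                                 # each input is its own cluster
by_label = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)  # one cluster per label

obj_identity = ib_objective(p_xy, identity, beta=2.0)
obj_by_label = ib_objective(p_xy, by_label, beta=2.0)
```

In this toy case both mappings keep all the information about Y, but the mapping that groups inputs by label carries less information about X, so it yields the smaller objective value. This matches the description of making I(X;Z) small while keeping I(Z;Y) large.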

When the binary network to be learned by the learning device 10 predicts the label value y of input data x from that input data x, the calculation unit 122 treats the input data x as a discrete random variable and the label value y as the related information of the input data x, and obtains the stochastic mapping z (latent variable z) of the input data x that minimizes the following equation (3).

Here, when r(z) is a variational approximation of the marginal distribution p(z), minimizing the above equation (3) is equivalent to minimizing the following equation (4).
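Equations (3) and (4) are not reproduced in this text. For orientation only, the following shows the standard variational bounds that connect an information bottleneck objective to a loss with a KL divergence term. The weighting factor β, the exact form of the loss, and the symbol H(Y) for the label entropy are assumptions, and the patent's own equations may differ in form.

```latex
% Upper bound on I(Z;X), using KL[p(z) || r(z)] >= 0:
I(Z;X) = \mathbb{E}_{p(x)}\!\left[ \mathrm{KL}\!\left( p_\theta(z \mid x) \,\|\, p(z) \right) \right]
       \le \mathbb{E}_{p(x)}\!\left[ \mathrm{KL}\!\left( p_\theta(z \mid x) \,\|\, r(z) \right) \right]

% Lower bound on I(Z;Y), using KL[p(y|z) || q_phi(y|z)] >= 0:
I(Z;Y) \ge \mathbb{E}_{p(x,y)}\, \mathbb{E}_{p_\theta(z \mid x)}\!\left[ \log q_\phi(y \mid z) \right] + H(Y)

% Resulting loss (assumed form; H(Y) is constant in theta and phi and is dropped):
\mathcal{L}(\theta, \phi) =
  \mathbb{E}_{p(x,y)}\, \mathbb{E}_{p_\theta(z \mid x)}\!\left[ -\log q_\phi(y \mid z) \right]
  + \beta\, \mathbb{E}_{p(x)}\!\left[ \mathrm{KL}\!\left( p_\theta(z \mid x) \,\|\, r(z) \right) \right]
```

Under these bounds, minimizing the loss minimizes an upper bound on the information bottleneck objective. The first term rewards accurate label prediction from z, and the second term limits how much information z keeps about x.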

Here, p_θ(z|x) is the probability distribution of z when x is given to the binary network having the parameter θ, and q_φ(y|z) is the probability distribution of y when z is given to the binary network having the parameter φ. p_θ(z|x) is obtained from the output values of the binary network having the parameter θ, and q_φ(y|z) is obtained from the output values of the binary network having the parameter φ. Further, r_θ(z) is the prior distribution of z (latent variable z) and follows a Gaussian distribution N(μ, σ) with mean μ and variance σ.

The calculation unit 122 trains the binary network while regularizing with the KL divergence term, as shown in equation (4). As a result, the model of the binary network becomes one that yields the feature z from the input data x, so a common feature z is easily obtained even when the input data x contains noise. Consequently, when the binary network outputs the predicted label y of the input data x from that input data x, for example, a highly robust predicted label y can be output.
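Training with a KL divergence regularizer, as described above, can be sketched as a loss function. Equation (4) is not reproduced in this text, so the form used here (cross-entropy plus a weighted KL term) is an assumption. It is also assumed that p_θ(z|x) is a diagonal Gaussian whose mean and standard deviation come from network A, that the prior r(z) is a standard normal, and that `sigma` denotes a standard deviation rather than a variance. The function names and the weighting factor `beta` are hypothetical.

```python
import numpy as np

def kl_diag_gaussians(mu_p, sigma_p, mu_r=0.0, sigma_r=1.0):
    """KL[N(mu_p, sigma_p^2) || N(mu_r, sigma_r^2)], summed over the latent dimensions."""
    var_ratio = (sigma_p / sigma_r) ** 2
    diff = ((mu_p - mu_r) / sigma_r) ** 2
    return 0.5 * np.sum(var_ratio + diff - 1.0 - np.log(var_ratio), axis=-1)

def cross_entropy(logits, labels):
    """-log q(y|z) per sample, where q is the softmax of the logits from network B."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    log_q = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_q[np.arange(len(labels)), labels]

def vib_loss(mu, sigma, logits, labels, beta=1e-3):
    """Mean over the batch of: prediction loss + beta * KL regularizer."""
    return float(np.mean(cross_entropy(logits, labels)
                         + beta * kl_diag_gaussians(mu, sigma)))

# Hypothetical batch of two samples with a three-dimensional latent variable.
mu = np.zeros((2, 3))       # means output by network A
sigma = np.ones((2, 3))     # standard deviations output by network A
logits = np.zeros((2, 2))   # outputs of network B for two classes
labels = np.array([0, 1])
loss = vib_loss(mu, sigma, logits, labels)  # KL term is 0 here, so loss = log 2
```

The KL term grows as p_θ(z|x) moves away from the prior, which pushes the latent variables of different inputs toward a shared distribution. This is the mechanism the text credits for obtaining a common feature z under input noise.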

[Program]
The learning device 10 can be implemented by installing a program that realizes the functions described in the above embodiment on a desired information processing device (computer). For example, an information processing device can be made to function as the learning device 10 by having it execute the above program, provided as packaged software or online software. The information processing devices referred to here include desktop and notebook personal computers, rack-mounted server computers, and the like. They also include smartphones, mobile communication terminals such as mobile phones and PHS (Personal Handyphone System) devices, and PDAs (Personal Digital Assistants). The learning device 10 may also be implemented on a cloud server.

An example of a computer that executes the above program (learning program) will be described with reference to FIG. 4. As shown in FIG. 4, the computer 1000 has, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. A mouse 1110 and a keyboard 1120, for example, are connected to the serial port interface 1050. A display 1130, for example, is connected to the video adapter 1060.

As shown in FIG. 4, the hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. The various data and information described in the above embodiment are stored in, for example, the hard disk drive 1090 or the memory 1010.

The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the hard disk drive 1090 into the RAM 1012 as needed, and executes each of the procedures described above.

The program module 1093 and the program data 1094 of the above learning program are not limited to being stored in the hard disk drive 1090. For example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network such as a LAN or a WAN (Wide Area Network), and read by the CPU 1020 via the network interface 1070.

10 Learning device
11 Input/output unit
12 Control unit
13 Storage unit
121 Conversion unit
122 Calculation unit

Claims

A learning device comprising:
a conversion unit that binarizes the weight values used in each layer of a deep neural network; and
a calculation unit that, using an input value to the deep neural network in which the weight values are binarized and related information of the input value, outputs, by the information bottleneck method, a stochastic mapping obtained when the input value is treated as a random variable, as a latent variable used in the deep neural network when predicting the related information of the input value from the input value.

The learning device according to claim 1, wherein the conversion unit converts the weight values used in each layer of the deep neural network into values of +1 or -1 by using a sign function.

A learning method executed by a learning device for a deep neural network, the method comprising:
a step of binarizing the weight values used in each layer of the deep neural network; and
a step of outputting, using an input value to the deep neural network in which the weight values are binarized and related information of the input value, by the information bottleneck method, a stochastic mapping obtained when the input value is treated as a random variable, as a latent variable used in the deep neural network when predicting the related information of the input value from the input value.

A learning program that causes a computer to execute:
a step of binarizing the weight values used in each layer of a deep neural network; and
a step of outputting, using an input value to the deep neural network in which the weight values are binarized and related information of the input value, by the information bottleneck method, a stochastic mapping obtained when the input value is treated as a random variable, as a latent variable used in the deep neural network when predicting the related information of the input value from the input value.