JP7296270B2

JP7296270B2 - Image feature extraction device and its program

Info

Publication number: JP7296270B2
Application number: JP2019139406A
Authority: JP
Inventors: 吉彦河合
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2019-07-30
Filing date: 2019-07-30
Publication date: 2023-06-22
Anticipated expiration: 2039-07-30
Also published as: JP2021022256A

Description

The present invention relates to an image feature extraction device that extracts features of an image using a convolutional neural network, and to a program therefor.

Conventionally, in the field of image recognition, such as object recognition and face recognition in images, techniques using a convolutional neural network (CNN) are known.
Such CNNs, as in VGG (Visual Geometry Group), improve recognition accuracy by making the network deeper and training it. FIG. 9 shows the basic structure of a CNN. As shown in FIG. 9, a conventional CNN is a network structure in which a plurality of convolutional layers (two in FIG. 9) are connected and the output H(x) is optimized with respect to the input x. In this case, the mapping F(x) to be learned is H(x) itself.
However, it is known that when the layers of a conventional CNN are simply made deeper, the gradient during training vanishes or diverges, so that training does not proceed correctly and recognition accuracy degrades.

Therefore, in recent years, a technique using a residual neural network (ResNet: Residual Network) has become known. FIG. 10 shows the basic structure of ResNet.
As shown in FIG. 10, ResNet is a network structure in which a plurality of convolutional layers (two in FIG. 10) are connected and the output H(x), obtained by adding their output to the input x, is optimized. In this case, the mapping F(x) to be learned is H(x) − x, the residual between the output H(x) and the input x.
In this way, by providing a path that passes the input x through unchanged, ResNet propagates information to the lower (subsequent) layers of the network, suppresses problems such as vanishing gradients during training, and makes multi-layer networks practical.

K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition", in Proc. CVPR, 2015.

As described with reference to FIG. 10, the basic structure of ResNet provides a path that passes the input x directly to the lower layers of the network without convolution by the convolutional layers.
When a convolutional neural network for image recognition or the like is built by stacking this basic ResNet structure, the mapping F(x) to be learned and the input x that bypasses the convolutional layers are added with the same weight in every layer.
However, the importance of the input x may differ depending on the position of the layer, and adding the two uniformly does not necessarily contribute to optimizing the network.
A further means of optimizing convolutional neural networks has therefore been sought.

The present invention has been made in view of this problem. Its object is to provide an image feature extraction device, and a program therefor, capable of extracting image features while applying a weight to the data that is passed to lower layers without going through the plurality of convolutional layers of a convolutional neural network.

To solve the above problem, an image feature extraction device according to the present invention extracts a feature amount of an image from data of the image, and comprises, in a multi-stage configuration, basic configuration units each having convolution means, scaling means, and addition means.

In this configuration, the convolution means of each basic configuration unit performs the operations of a plurality of convolutional layers on the input data, using kernels learned in advance as parameters of the convolutional neural network. The convolution means thereby extracts the features of the image by successive convolution.
The scaling means of the basic configuration unit multiplies the input data by a scaling coefficient learned in advance as a parameter of the convolutional neural network. The scaling means thereby applies a learned weight to the input data on the path that does not pass through the convolutional layers.
The addition means of the basic configuration unit then adds the result of the convolution means and the result of the scaling means.
Because the image feature extraction device has these basic configuration units in a multi-stage configuration and performs their operations in sequence, a weight learned in advance for each basic configuration unit can be applied to the data that is passed to lower layers without going through the plurality of convolutional layers of the convolutional neural network.

The present invention can also be realized as an image feature extraction program that causes a computer to function as the image feature extraction device.

The present invention provides the following advantageous effects.
According to the present invention, in a convolutional neural network, a weight can be applied to the data that is passed to lower layers without going through a plurality of convolutional layers.
The present invention can therefore extract image features with high accuracy, using a convolutional neural network trained so that the importance of the input data differs depending on the position of the layer.

FIG. 1 is a network diagram showing the basic structure of the CNN used in the image feature extraction device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of a basic configuration unit of the image feature extraction device according to the embodiment of the present invention.
FIG. 3 is a block diagram showing another configuration of the basic configuration unit of the image feature extraction device according to the embodiment of the present invention.
FIG. 4 is a network diagram showing the CNN model used by the image feature extraction device according to the embodiment of the present invention.
FIG. 5 is an overall configuration diagram showing the configuration of the image feature extraction device according to the embodiment of the present invention.
FIG. 6 is a flowchart showing the operation of the image feature extraction device according to the embodiment of the present invention.
FIG. 7 is a network diagram showing the CNN model used by a parameter learning device for learning the parameters of the image feature extraction device.
FIG. 8 is an overall configuration diagram showing the configuration of the parameter learning device for learning the parameters of the image feature extraction device.
FIG. 9 is a network diagram showing the basic structure of a conventional CNN.
FIG. 10 is a network diagram showing the basic structure of a conventional ResNet.

Embodiments of the present invention will now be described with reference to the drawings.
<Outline of basic structure of convolutional neural network>
With reference to FIG. 1, an outline of the basic structure N of the convolutional neural network (hereinafter, CNN) used in the image feature extraction device 1 (FIG. 5) according to the embodiment of the present invention is described.
As shown in FIG. 1, the basic structure N of the CNN used in the image feature extraction device 1 (FIG. 5) is a network model that optimizes the output H(x) obtained by adding two terms: the data F(x), produced by convolving the input x through a plurality of connected convolutional layers CL, CL (two in FIG. 1), and the data ax, produced by scaling the input x by a factor a in the scaling layer SL.
The basic structure N of the CNN is a network model trained in advance so that the mapping F(x) performed by the convolutional layers CL, CL equals H(x) − ax, the residual between the output H(x) and the data ax obtained by multiplying the input x by a.
In this way, the basic structure N of the CNN forms a weighted sum of the data convolved by the convolutional layers CL, CL and the data passed to the lower layers without going through the convolutional layers CL, CL.
The number of convolutional layers CL in the basic structure N of the CNN is not limited to two and may be three or more.
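For comparison, the three structures discussed so far differ only in how the input x reaches the adder. Using the symbols from the text (x is the input, F(x) the mapping learned by the convolutional layers, H(x) the output, a the learned scaling coefficient), they can be written side by side:

```latex
\begin{align*}
\text{Conventional CNN (FIG. 9):}\quad & H(x) = F(x) \\
\text{ResNet (FIG. 10):}\quad & H(x) = F(x) + x \\
\text{Basic structure } N \text{ (FIG. 1):}\quad & H(x) = F(x) + a\,x
\end{align*}
```

Setting a = 1 recovers the ResNet form, and a = 0 recovers the conventional form.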

<Basic configuration units of image feature extraction device>
Next, with reference to FIGS. 2 and 3, the basic configuration units that make up the image feature extraction device 1 (FIG. 5) according to the embodiment of the present invention are described.
FIG. 2 is a configuration diagram of the basic configuration unit 10 of the image feature extraction device 1, which embodies the basic structure N of the CNN described with FIG. 1 as a device configuration.
The basic configuration unit 10 performs convolution operations on input data, extracts features, and produces output data.

The input data is a three-dimensional matrix (feature matrix) of size W×H×C. The columns W and rows H of the matrix represent the width and height of the features, and the depth C of the matrix represents the number of channels (the number of feature types). The input data to the basic configuration unit 10 at the first stage of the image feature extraction device 1 is image data. In this case, the input data has a predetermined width of W pixels, a height of H pixels, and C channels (3 for RGB).
The input data and output data of the basic configuration unit 10 have the same matrix dimensions. Here, the dimensions of a matrix mean the number of elements along each of its width, height, and channel axes.
As shown in FIG. 2, the basic configuration unit 10 includes convolution means 11, scaling means 12, and addition means 13.

The convolution means 11 performs convolution operations on the input data using a plurality of kernels of a predetermined size that hold learned parameters (connection weight coefficients).
For example, as the operation of one convolutional layer, the convolution means 11 takes input data of size W (width) × H (height) × C (channels), uses C kernels of size 3×3 learned in advance, and pads one pixel (for example, zero padding) at both ends of the input data in the width and height directions. The convolution means 11 then performs the convolution while shifting the kernels with a stride of 1. The convolution means 11 thereby generates data with the same matrix dimensions (width, height, number of channels) as the input data.
The convolution means 11 performs this convolutional layer operation once for each of the predetermined plurality of convolutional layers CL of the CNN shown in FIG. 1.
The convolution means 11 outputs the resulting data to the addition means 13.
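The single-layer operation described above (3×3 kernels, one pixel of zero padding, stride 1) can be sketched in plain Python. This is a minimal illustration rather than the patented implementation: the function name `conv3x3_same`, the `[channel][row][col]` nested-list layout, and the omission of a bias term are assumptions made here for brevity.

```python
def conv3x3_same(x, kernels):
    """One 3x3 convolutional layer with zero padding 1 and stride 1.

    x       : input, nested lists indexed [channel][row][col]
    kernels : weights indexed [out_channel][in_channel][3][3]
    Returns nested lists indexed [out_channel][row][col]; height and
    width are the same as in x.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])

    def at(c, i, j):
        # Zero padding: positions outside the input contribute 0.
        return x[c][i][j] if 0 <= i < h and 0 <= j < w else 0.0

    out = []
    for k in kernels:  # one output channel per kernel
        plane = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                plane[i][j] = sum(
                    k[c][di][dj] * at(c, i + di - 1, j + dj - 1)
                    for c in range(c_in)
                    for di in range(3)
                    for dj in range(3)
                )
        out.append(plane)
    return out


def copy_kernel(src, c_in=2):
    """Hypothetical test kernel: passes input channel `src` through unchanged."""
    k = [[[0.0] * 3 for _ in range(3)] for _ in range(c_in)]
    k[src][1][1] = 1.0  # centre tap only
    return k


# Demo: 2 channels of 3x4 data and C = 2 kernels, so the dimensions are preserved.
x = [[[float(c * 100 + i * 10 + j) for j in range(4)] for i in range(3)]
     for c in range(2)]
y = conv3x3_same(x, [copy_kernel(0), copy_kernel(1)])
```

With C kernels the output has C channels, and padding 1 with stride 1 keeps the width and height, which is why the output dimensions equal the input dimensions.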

The scaling means 12 multiplies each element of the input data by a scaling coefficient, which is a parameter learned in advance. The scaling means 12 forms a bypass that skips the convolutional layer operations.
The scaling coefficient is a scalar value learned in advance as a parameter of the CNN. It represents the relative weight of the path through the convolution means 11 and the path through the scaling means 12.
The scaling means 12 multiplies every element of the matrix that is the input data of size W (width) × H (height) × C (channels) by the scaling coefficient.
The dimensions (width, height, number of channels) of the data computed by the scaling means 12 are the same as those of the input data.
The scaling means 12 outputs the resulting data to the addition means 13.

The addition means 13 adds the data computed by the convolution means 11 and the data computed by the scaling means 12.
Because the data from the convolution means 11 and the data from the scaling means 12 that are input to the addition means 13 have the same dimensions (width, height, number of channels), the addition means 13 adds the elements at the same positions in the two matrices.
The addition means 13 thereby generates output data, with the same dimensions as the input data, in which the features have been extracted.
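The scaling and addition steps of the basic configuration unit 10 can be sketched as follows. The function names and the `conv_path` placeholder (standing in for the convolution means 11) are illustrative assumptions, not names from the source.

```python
def scale(x, a):
    """Scaling means: multiply every element of a [C][H][W] nested list by scalar a."""
    return [[[a * v for v in row] for row in plane] for plane in x]


def add(p, q):
    """Addition means: element-wise sum of two arrays with identical dimensions."""
    def dims(t):
        return (len(t), len(t[0]), len(t[0][0]))

    if dims(p) != dims(q):
        raise ValueError(f"dimension mismatch: {dims(p)} vs {dims(q)}")
    return [[[u + v for u, v in zip(rp, rq)] for rp, rq in zip(pp, pq)]
            for pp, pq in zip(p, q)]


def basic_unit(x, conv_path, a):
    """H(x) = F(x) + a*x, where conv_path stands in for the convolution means."""
    return add(conv_path(x), scale(x, a))


x = [[[1.0, 2.0], [3.0, 4.0]]]      # 1 channel, 2x2
double = lambda t: scale(t, 2.0)    # placeholder F(x) = 2x, not a real convolution
h = basic_unit(x, double, 0.5)      # computes 2x + 0.5x
```

Here `h` is 2.5 times the input, element by element; the dimension check in `add` mirrors the requirement that both paths deliver data of the same width, height, and channel count.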

Next, with reference to FIG. 3, another configuration of the basic configuration unit of the image feature extraction device 1 (basic configuration unit 10B) is described.
Like the basic configuration unit 10 (FIG. 2), the basic configuration unit 10B performs convolution operations on input data, extracts features, and produces output data.
In the basic configuration unit 10 described with FIG. 2, the input data and output data are matrices of the same dimensions. In the basic configuration unit 10B, however, the input data and output data are matrices of different dimensions.

As shown in FIG. 3, the basic configuration unit 10B includes convolution means 11B, scaling means 12B, and addition means 13.
The addition means 13 is the same as that of the basic configuration unit 10 described with FIG. 2, so its description is omitted.

The convolution means 11B performs convolution operations on the input data using a plurality of kernels of a predetermined size that hold learned parameters (connection weight coefficients). The convolution means 11B generates data whose matrix dimensions differ from those of the input data.
For example, as the operation of one convolutional layer, the convolution means 11B takes input data of size W (width) × H (height) × C (channels), uses 2×C kernels of size 3×3 learned in advance, and pads one pixel (for example, zero padding) at both ends of the input data in the width and height directions. The convolution means 11B then performs the convolution while shifting the kernels with a stride of 2. The convolution means 11B thereby generates data in which the width and height of the input matrix are halved and the number of channels is doubled.
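The dimension change follows from the standard output-size relation for a convolution, floor((n + 2·pad − kernel) / stride) + 1 along each axis. The short sketch below applies it to both the stride-1 case (convolution means 11) and the stride-2 case (convolution means 11B); the helper names and the 224×224×64 input size are illustrative assumptions, not values from the source.

```python
def conv_out_size(n, kernel=3, pad=1, stride=1):
    """Output length along one axis: floor((n + 2*pad - kernel) / stride) + 1."""
    return (n + 2 * pad - kernel) // stride + 1


def layer_out_shape(w, h, num_kernels, stride):
    """(width, height, channels) after a 3x3, pad-1 layer with num_kernels kernels."""
    return (conv_out_size(w, stride=stride),
            conv_out_size(h, stride=stride),
            num_kernels)


W, H, C = 224, 224, 64  # illustrative sizes only
same = layer_out_shape(W, H, C, stride=1)        # as in convolution means 11
halved = layer_out_shape(W, H, 2 * C, stride=2)  # as in convolution means 11B
```

For even W and H the stride-2 result is exactly W/2 and H/2; for odd sizes the same formula rounds up.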

The convolution means 11B need not change the matrix dimensions in every one of the convolutional layer operations; it may change the dimensions only in the first convolutional layer operation and leave them unchanged in the subsequent ones.
The convolution means 11B performs the convolutional layer operation once for each of the predetermined plurality of convolutional layers CL of the CNN shown in FIG. 1.
The convolution means 11B outputs the resulting data to the addition means 13.

The scaling means 12B matches the dimensions of the input data matrix to those of the data output by the convolution means 11B, and multiplies each element of the matrix by a scaling coefficient, which is a parameter learned in advance.
The scaling means 12B linearly projects the input data matrix onto the dimensions of the data output by the convolution means 11B.

For example, suppose the convolution means 11B convolves input data of size W (width) × H (height) × C (channels) into a matrix of size (W/2)×(H/2)×(2×C).
In this case, the scaling means 12B reduces the W×H data of each channel to (W/2)×(H/2), and then linearly interpolates across the channels to generate a matrix with 2×C channels.
The scaling means 12B then multiplies every element of the resulting matrix of size (W/2) (width) × (H/2) (height) × (2×C) (channels) by the scaling coefficient.
The dimensions (width, height, number of channels) of the data computed by the scaling means 12B can thereby be made to match those of the data computed by the convolution means 11B.
The scaling means 12B outputs the resulting data to the addition means 13.
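The text does not specify how the spatial reduction or the channel interpolation is carried out, so the following is only one possible reading. It assumes 2×2 averaging for the reduction (with even W and H) and evenly spaced linear interpolation between neighbouring channels; both choices, and all function names, are assumptions made for illustration. Both steps are linear maps, consistent with the "linear projection" described above.

```python
def halve_spatial(plane):
    """Reduce an H x W plane to (H/2) x (W/2) by 2x2 averaging (assumed; H, W even)."""
    h, w = len(plane), len(plane[0])
    return [[(plane[i][j] + plane[i][j + 1]
              + plane[i + 1][j] + plane[i + 1][j + 1]) / 4.0
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]


def interpolate_channels(x, c_out):
    """Linearly interpolate along the channel axis to c_out channels (assumed method)."""
    c_in = len(x)
    out = []
    for k in range(c_out):
        # Position of output channel k on the input channel axis, in [0, c_in - 1].
        pos = k * (c_in - 1) / (c_out - 1) if c_out > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, c_in - 1)
        t = pos - lo
        out.append([[(1 - t) * u + t * v for u, v in zip(ru, rv)]
                    for ru, rv in zip(x[lo], x[hi])])
    return out


def scaling_means_12b(x, a):
    """Match dimensions to (W/2, H/2, 2C), then multiply every element by a."""
    reduced = [halve_spatial(p) for p in x]
    projected = interpolate_channels(reduced, 2 * len(x))
    return [[[a * v for v in row] for row in plane] for plane in projected]


# Demo: 2 channels of 2x2 data become 4 channels of 1x1 data.
x = [[[0.0, 0.0], [0.0, 0.0]], [[3.0, 3.0], [3.0, 3.0]]]
y = scaling_means_12b(x, 2.0)
```

In the demo the four output channels ramp linearly from the first input channel to the second and are then doubled by the scaling coefficient.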

As described above, the basic configuration units 10 and 10B can form a weighted sum of two inputs: the result of applying a plurality of convolutional layers to the input data in the convolution means 11 and 11B, and the input data that does not pass through the convolutional layers.

<Overview of CNN for image feature extraction device>
Next, with reference to FIG. 4, an overview of the CNN model M on which the image feature extraction device 1 (FIG. 5) operates is described.

The model M is a convolutional neural network (CNN) that takes the data of an image I as input and extracts a feature amount V of the image I.
For each predetermined group of convolutional layers CL (here, [CL_1, CL_2], [CL_3, CL_4], [CL_5, CL_6], and [CL_7, CL_8]), the model M has a scaling layer SL (SL_1 to SL_4) that bypasses those convolutional layers CL.
The model M adds the output of each group of convolutional layers CL and the output of the scaling layer SL that bypasses them, and passes the image features on to the subsequent convolutional layers CL in sequence.
The final output of the model M is a feature amount V of predetermined dimensions.
The model M is thus a network in which the weight of the path that performs no convolution varies with the depth of the network.
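The idea of a depth-dependent skip weight can be illustrated with a toy chain of four stages, each with its own scaling coefficient. This sketch works on a flat feature vector and ignores the dimension changes of the real model; the placeholder conv paths and the coefficient values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def run_model(x, stages):
    """Apply stages in order; each stage is (conv_path, a) and computes F(x) + a*x."""
    for conv_path, a in stages:
        fx = conv_path(x)
        x = [f + a * v for f, v in zip(fx, x)]
    return x


# Placeholder conv paths (stand-ins for the two-layer groups of convolutional layers).
halve = lambda v: [0.5 * e for e in v]
negate = lambda v: [-e for e in v]

# Hypothetical learned coefficients: the skip path is weighted differently per stage.
stages = [(halve, 1.0), (negate, 0.5), (halve, 0.25), (negate, 2.0)]
features = run_model([4.0, 8.0], stages)
```

A plain ResNet corresponds to fixing every coefficient at 1.0; the model described here instead learns one value per stage.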

The model M is only an example; the total number of convolutional layers CL and the number of convolutional layers CL bypassed by each scaling layer SL are not limited to this example. An activation function (such as ReLU: Rectified Linear Unit) may also be applied to the output data of each convolutional layer CL and to the data after addition.

<Configuration of image feature extraction device>
Next, with reference to FIG. 5, the configuration of the image feature extraction device 1 according to the embodiment of the present invention is described.
The image feature extraction device 1 extracts the feature amount V from the image I using the model M described with FIG. 4.
As shown in FIG. 5, the image feature extraction device 1 includes a plurality of basic configuration units 10 and 10B and parameter storage means 20.

The basic configuration units 10 and 10B perform convolution operations on input data, extract features, and produce output data; they are the same as those described with FIGS. 2 and 3.
Here, the image feature extraction device 1 is configured in multiple stages in the order of the basic configuration units 10_1, 10B_2, 10B_3, and 10B_4.

The basic configuration unit 10_1 includes convolution means 11_1 (11), scaling means 12_1 (12), and addition means 13_1 (13).
The convolution means 11_1 performs the operations of two convolutional layers CL_1 and CL_2 on the image I of size W (width) × H (height) × C (channels). The label "3×3 conv, 64" on the convolutional layers CL_1 and CL_2 indicates that the convolution is performed using 64 kernels of size 3×3.
The convolution means 11_1 thereby generates, from the image I, data with the same W (width) × H (height) and 64 channels.
The kernel parameters (connection weight coefficients) used in the operations of the convolutional layers CL_1 and CL_2 performed by the convolution means 11_1 are stored in advance in the parameter storage means 20 as learned parameters.
The convolution means 11_1 outputs the resulting data to the addition means 13_1.

The scaling means 12_1 multiplies each element of the matrix of the image I of size W (width) × H (height) × C (channels) by a scaling coefficient, which is a parameter learned in advance. The learned scaling coefficient is stored in advance in the parameter storage means 20.
The scaling means 12_1 outputs the resulting data to the addition means 13_1.

The addition means 13_1 adds the data computed by the convolution means 11_1 and the data computed by the scaling means 12_1.
The addition means 13_1 outputs the result of the addition (a feature matrix) to the subsequent basic configuration unit 10B_2.

The basic configuration unit 10B_2 includes convolution means 11B_2 (11B), scaling means 12B_2 (12B), and addition means 13_2 (13).
The convolution means 11B_2 performs the operations of two convolutional layers CL_3 and CL_4 on the feature matrix computed by the basic configuration unit 10_1. The label "3×3 conv, 128" on the convolutional layers CL_3 and CL_4 indicates that the convolution is performed using 128 kernels of size 3×3. The "/2" on the convolutional layer CL_3 indicates that the kernel is shifted with a stride of 2. Convolutional layers CL without "/2" use a stride of 1.
The convolution means 11B_2 thereby generates, from the feature matrix, data of size W/2 (width) × H/2 (height) with 128 channels.
The kernel parameters (connection weight coefficients) used in the operations of the convolutional layers CL_3 and CL_4 performed by the convolution means 11B_2 are stored in advance in the parameter storage means 20 as learned parameters.
The convolution means 11B_2 outputs the resulting data to the addition means 13_2.

The scaling means 12B_2 converts the feature matrix computed by the basic configuration unit 10_1 to the same dimensions as the matrix output by the convolution means 11B_2, and multiplies each element of the matrix by a scaling coefficient, which is a parameter learned in advance. The learned scaling coefficient is stored in advance in the parameter storage means 20.
The scaling means 12B_2 outputs the resulting data to the addition means 13_2.

The addition means 13_2 adds the data computed by the convolution means 11B_2 and the data computed by the scaling means 12B_2.
The addition means 13_2 outputs the result of the addition (a feature matrix) to the subsequent basic configuration unit 10B_3.

The basic configuration units 10B_3 and 10B_4 have the same configuration as the basic configuration unit 10B_2, differing only in kernel size, so their description is omitted.
The basic configuration unit 10B_4 at the final stage outputs its final result as the feature amount V of the image I.

The parameter storage means 20 stores in advance the structure of the model M described with FIG. 4 and the learned parameters. The parameter storage means 20 can be implemented with a general storage medium such as a semiconductor memory.

As described above, when extracting the feature amount V from the image I by convolution, the image feature extraction device 1 uses a model trained with a weight applied to the path that carries data to the lower layers without convolution. The image feature extraction device 1 can thereby reflect, through the weights, characteristics that depend on the depth of the network, and extract the feature amount of the image more accurately.
The image feature extraction device 1 can be operated by an image feature extraction program that causes a computer to function as each of the means described above.

<Operation of image feature extraction device>
Next, with reference to FIG. 6 (and FIGS. 2, 3, and 5 as appropriate for the configuration), the operation of the image feature extraction device 1 according to the embodiment of the present invention is described. It is assumed that the parameters learned in advance are stored in the parameter storage means 20. Which of the basic configuration unit 10 and the basic configuration unit 10B is used depends on the structure of the model M (FIG. 4). The description here therefore refers basically to the basic configuration unit 10 alone, and explains the difference only where the operation of the basic configuration units 10 and 10B differs.

In step S1, the first-stage basic configuration unit 10 receives the image I as input data.
In step S2, the convolution means 11 of the basic configuration unit 10 performs the convolution operations of the plurality of convolution layers, referring to the kernel connection weight coefficients stored as parameters in the parameter storage means 20. When the basic configuration unit is the basic configuration unit 10B, the dimensions of the input data matrix differ from those of the data matrix after the convolution operations.

In step S3, if the matrix dimensions differ before and after the convolution operations, that is, if the operation of step S2 was performed by the basic configuration unit 10B (Yes in step S3), then in step S4 the scaling means 12B applies a linear projection to the pre-convolution input data so that its dimensions match those of the post-convolution matrix.
If, on the other hand, the matrix dimensions are the same before and after the convolution operations, that is, if the operation of step S2 was performed by the basic configuration unit 10 (No in step S3), the scaling means 12 skips step S4 and proceeds to step S5.

In step S5, the scaling means 12 refers to the scaling coefficient stored as a parameter in the parameter storage means 20 and multiplies each element of the input data, or of the input data dimension-converted in step S4, by that scaling coefficient.

In step S6, the addition means 13 adds, element by element, the matrix obtained by the convolution operations in step S2 and the matrix multiplied by the scaling coefficient in step S5.

In step S7, if a further basic configuration unit 10 is connected downstream (Yes in step S7), the process returns to step S2, and the downstream basic configuration unit 10 performs the convolution operations on the addition result of step S6.
If no basic configuration unit 10 is connected downstream (No in step S7), the basic configuration unit 10 outputs the calculation result as the feature amount V in step S8.
Through the above operation, the image feature extraction device 1 can extract the feature amount from the image I.
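Steps S1 to S8 can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions that the description does not state: single-channel data, zero-padded convolution, no activation function, one scalar scaling coefficient per unit, and a linear projection for unit 10B modelled as strided subsampling times a single weight. The names `conv2d`, `basic_unit`, `extract_features`, and `proj_weight` are hypothetical.

```python
from typing import Dict, List

Matrix = List[List[float]]

def conv2d(x: Matrix, kernel: Matrix, stride: int = 1) -> Matrix:
    """Zero-padded 2-D convolution (cross-correlation form) on one channel."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(x), len(x[0])
    out: Matrix = []
    for i in range(0, h, stride):
        row = []
        for j in range(0, w, stride):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    r, c = i + di - kh // 2, j + dj - kw // 2
                    if 0 <= r < h and 0 <= c < w:  # outside the image counts as zero
                        acc += kernel[di][dj] * x[r][c]
            row.append(acc)
        out.append(row)
    return out

def basic_unit(x: Matrix, kernels: List[Matrix], scale: float,
               stride: int = 1, proj_weight: float = 1.0) -> Matrix:
    # Step S2: several convolution layers; only the first one uses the stride.
    y = x
    for n, k in enumerate(kernels):
        y = conv2d(y, k, stride if n == 0 else 1)
    # Steps S3/S4: if the dimensions changed (unit 10B), project the input
    # to the new size (assumed here: subsampling times one weight).
    skip = x
    if stride != 1:
        skip = [[proj_weight * v for v in row[::stride]] for row in x[::stride]]
    # Step S5: multiply every element by the scaling coefficient.
    skip = [[scale * v for v in row] for row in skip]
    # Step S6: element-wise addition of the two paths.
    return [[a + b for a, b in zip(ry, rs)] for ry, rs in zip(y, skip)]

def extract_features(image: Matrix, units: List[Dict]) -> Matrix:
    data = image                      # Step S1
    for unit in units:                # Step S7: continue while units remain
        data = basic_unit(data, **unit)
    return data                       # Step S8: feature amount V
```

A unit with `stride=1` plays the role of the basic configuration unit 10, and a unit with `stride=2` plays the role of the basic configuration unit 10B, halving the width and the height.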

The configuration and operation of the image feature extraction device 1 according to the embodiment of the present invention have been described above, but the present invention is not limited to this embodiment.
To simplify the explanation, the number of stages of the basic configuration unit 10 (10B) was set to 4 and the number of convolution layers in each basic configuration unit 10 (10B) to 2, but other numbers may be used. For example, the number of stages may be about 10 to 50, as used in typical neural networks, and each basic configuration unit 10 (10B) may contain three or more convolution layers.

Also, the basic configuration units 10 (10B) are physically connected here. However, a single basic configuration unit may instead perform the calculation repeatedly in accordance with the model M, producing the same result as the multi-stage arrangement.

Also, the basic configuration unit 10B here converts the matrix dimensions between its input data and output data (halving both the width and the height). However, the basic configuration unit 10B may instead be configured by connecting a pooling means (not shown), which performs a pooling layer operation (max pooling or the like) that halves the width and the height of the matrix, to a basic configuration unit 10.
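As an illustration of the pooling alternative, the following sketch shows a 2x2 max pooling that halves the width and the height of a single-channel matrix. The window size, the stride of 2, and the function name `max_pool_2x2` are assumptions; the description only says that the pooling halves each dimension.

```python
from typing import List

Matrix = List[List[float]]

def max_pool_2x2(x: Matrix) -> Matrix:
    """Keep the maximum of each non-overlapping 2x2 window.

    A trailing odd row or column is dropped (an assumption of this sketch).
    """
    return [
        [max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
         for j in range(0, len(x[0]) - 1, 2)]
        for i in range(0, len(x) - 1, 2)
    ]
```

The pooled matrix would then be passed to an ordinary basic configuration unit 10, whose input and output dimensions are equal, so no linear projection is needed on the scaling path.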

Also, when the image feature extraction device 1 extracts feature amounts for image classification, that is, for classifying the persons and objects appearing in an image, the number of output channels of the final-stage basic configuration unit 10 (10B) may be set to the number of classes. The image feature extraction device 1 can then be configured as an image classification device by connecting, to the final-stage basic configuration unit 10 (10B), a global average pooling means that performs the operation of a global average pooling (GAP) layer.
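The GAP arrangement can be sketched as follows: each output channel of the final-stage unit is averaged to a single score, giving one score per class. Choosing the class with the largest score is an assumption of this sketch (the description does not say how the scores are turned into a decision), and the names `global_average_pooling` and `predict_class` are hypothetical.

```python
from typing import List

FeatureMaps = List[List[List[float]]]  # indexed as [channel][row][column]

def global_average_pooling(feature_maps: FeatureMaps) -> List[float]:
    """Average each channel (a 2-D map) down to a single value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]

def predict_class(feature_maps: FeatureMaps) -> int:
    """Return the index of the channel with the largest average (assumed rule)."""
    scores = global_average_pooling(feature_maps)
    return max(range(len(scores)), key=scores.__getitem__)
```

Because the channel count equals the number of classes, the pooled vector already has one entry per class and no further learned layer is required.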

Also, although the scaling layer is added here to a ResNet base, the present invention can be applied to any CNN network structure. For example, in the Inception-ResNet described in the following reference, a scaling layer may be provided in the path that passes data to lower layers without convolution.
(Reference)
C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning", in Proc. CVPR, 2016.

<Parameter learning device>
Next, an example of the parameter learning device 2, which learns the CNN parameters used by the image feature extraction device 1, will be described with reference to FIGS. 7 and 8.
Using the CNN model (learning model) M2 shown in FIG. 7, the parameter learning device 2 learns the parameters of the model M2 from learning images LI, which are learning data prepared in advance, and correct data LC giving the correct recognition results for the learning images LI.
The model M2 is the model M (FIG. 4) with a fully connected layer FL added.

As the learning data, it is possible to use, for example, pairs consisting of learning images (face images) LI, in which each of a large number of persons (about 100,000) is photographed multiple times (about 100 images each), and correct data LC, which are the labels of the persons appearing in the learning images LI.
By learning the parameters of the model M2, the parameter learning device 2 learns the parameters of the model M that forms part of the model M2.

As shown in FIG. 8, the parameter learning device 2 includes a plurality of basic configuration units 10 and 10B, the parameter storage means 20, a full connection means 30, and an error calculation means 40.
Here, as in the image feature extraction device 1, the parameter learning device 2 arranges the basic configuration units in multiple stages in the order 10₁, 10B₂, 10B₃, 10B₄.

As in the image feature extraction device 1, the basic configuration units 10 and 10B refer to the parameters stored in the parameter storage means 20, perform the convolution operations on the input data, scale the input data on the path that bypasses the convolution, and add the two results.
In addition, the basic configuration units 10 and 10B of the parameter learning device 2 update the connection weight coefficients and the scaling coefficients by error backpropagation, based on the error received from the following stage of the network, and propagate the error to the preceding stage.
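The backpropagation through one basic configuration unit can be written out as follows. This is a sketch under assumptions not stated in the description: the scaling coefficient is taken to be a single scalar s per unit, F denotes the stacked convolution layers with weights W, P is the identity for unit 10 or the linear projection for unit 10B (the update of any projection weights is not shown), L is the error, and the update rule is plain gradient descent with a learning rate η.

```latex
\begin{align*}
% Forward pass of one basic configuration unit
y &= F(x; W) + s \, P(x) \\
% Error received from the following stage
\delta &= \frac{\partial L}{\partial y} \\
% Gradients for the scaling coefficient and the convolution weights
\frac{\partial L}{\partial s} &= \sum_i \delta_i \, P(x)_i \\
\frac{\partial L}{\partial W} &= \left( \frac{\partial F}{\partial W} \right)^{\top} \delta \\
% Error propagated to the preceding stage (both paths contribute)
\frac{\partial L}{\partial x} &= \left( \frac{\partial F}{\partial x} \right)^{\top} \delta
  + s \left( \frac{\partial P}{\partial x} \right)^{\top} \delta \\
% Assumed gradient-descent update
s &\leftarrow s - \eta \, \frac{\partial L}{\partial s}, \qquad
W \leftarrow W - \eta \, \frac{\partial L}{\partial W}
\end{align*}
```

With P equal to the identity, the second term of the propagated error reduces to s δ, so the scaling coefficient directly controls how much of the error flows back through the path that bypasses the convolution.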

The parameter storage means 20 stores the structure and parameters of the model M2 described with reference to FIG. 7. The parameters are initialized in advance, for example with random numbers, and are then updated by the basic configuration units 10 and 10B and the full connection means 30.

The full connection means 30 refers to the parameters stored in the parameter storage means 20 and applies one or more fully connected layers to the elements of the data that has passed in sequence through the convolutions of the basic configuration units 10₁, 10B₂, 10B₃, and 10B₄, generating a one-dimensional vector of a predetermined length. This result is the recognition result R obtained by recognizing the learning image LI (here, face recognition).
The full connection means 30 also updates the connection weight coefficients by error backpropagation, based on the error received from the error calculation means 40, and propagates the error to the preceding basic configuration unit 10.
A fully connected layer that matches the dimensions of the output to those of the correct data LC is added as the final stage of the full connection means 30.

The error calculation means 40 calculates the error between the output of the full connection means 30 (the recognition result R) and the correct data LC, and outputs the error to the full connection means 30. The error calculation means 40 keeps the basic configuration units 10 and 10B and the full connection means 30 operating for a predetermined number of iterations, or until the degree of change in the parameters stored in the parameter storage means 20 falls below a predetermined threshold.
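The two stopping conditions can be sketched as a small training loop. The measure of the "degree of change" is not defined in the description; this sketch assumes the largest absolute change of any parameter in one iteration. The function name `train_until_converged` and the `update` callback (standing in for one forward and backward pass) are hypothetical.

```python
from typing import Callable, List, Tuple

def train_until_converged(
    params: List[float],
    update: Callable[[List[float]], List[float]],
    max_iters: int,
    tol: float,
) -> Tuple[List[float], int]:
    """Repeat `update` until max_iters is reached or the parameters stop moving.

    Returns the final parameters and the number of iterations performed.
    """
    for iteration in range(1, max_iters + 1):
        new_params = update(params)
        # Assumed change measure: largest absolute change of any parameter.
        change = max(abs(a - b) for a, b in zip(new_params, params))
        params = new_params
        if change < tol:
            return params, iteration
    return params, max_iters
```

In practice `update` would run the basic configuration units and the full connection means forward, compute the error, and apply the backpropagation update to every stored parameter.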

As described above, the parameter learning device 2 can use the learning data to learn the CNN parameters used by the image feature extraction device 1.
The parameter learning device 2 can thereby learn the parameters of a model that extracts image features for recognizing a person's face.

Although images and the labels of the persons appearing in them are used as the learning data here, other learning data may be used depending on the target from which image features are to be extracted. For example, to learn parameters for identifying objects (for example, animals) in images, images and the corresponding object labels can be used as the learning data. In that case, the dimension of the fully connected layer added as the final stage of the full connection means 30 should match the number of object labels.

Also, although the full connection means 30 is provided here, when the number of output channels of the final-stage basic configuration unit 10, 10B is set to the number of object labels, a pooling means that performs the operation of a global average pooling (GAP) layer may be provided in place of the full connection means 30. This reduces the number of parameters to be learned.
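The saving can be made concrete by counting the parameters of the output head only. The sketch below assumes a single fully connected layer with a bias term applied to the flattened output of the final-stage unit; the feature-map sizes used in the usage note are purely hypothetical and are not taken from the description. The function names are also hypothetical.

```python
def fc_head_params(channels: int, height: int, width: int,
                   num_labels: int, bias: bool = True) -> int:
    """Learned parameters of one fully connected layer on a flattened C x H x W input."""
    inputs = channels * height * width
    return inputs * num_labels + (num_labels if bias else 0)

def gap_head_params() -> int:
    """Global average pooling only averages, so it has nothing to learn."""
    return 0
```

For a hypothetical final output of 512 channels of 7 x 7 values and 1,000 labels, `fc_head_params(512, 7, 7, 1000)` gives 25,089,000 learned parameters, whereas the GAP head adds none. This count ignores any change in the final-stage convolution kernels caused by setting the channel count to the number of labels.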

<Performance evaluation with the scaling layer>
The performance of Inception-ResNet V2 as described in the above reference (conventional method) was compared with that of a CNN in which a scaling layer is added to Inception-ResNet V2 (present method). Both methods were trained on the same learning data.

For the performance evaluation, the LFW (Labeled Faces in the Wild) dataset (http://vis-www.cs.umass.edu/lfw/), which is widely used in the field of face recognition, was used.
This dataset contains about 6,000 sets, each consisting of two face images and correct data indicating whether or not the persons in the two images are the same person.
Here, a feature amount was extracted from each of the two face images. If the distance between the feature amounts (Euclidean distance or the like) was at or below a predetermined threshold, the persons in the two images were judged to be the same; if it exceeded the threshold, they were judged to be different. These judgments were compared with the correct data, and the proportion judged correctly is defined as the recognition accuracy.
The recognition accuracy obtained with the conventional method and the present method is shown in the table below.
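The evaluation rule described above can be sketched as follows, assuming the Euclidean distance and feature amounts given as plain lists of numbers. The function names and the sample values in the usage note are hypothetical and are not taken from the LFW data.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float], bool]  # (features A, features B, same person?)

def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def judged_same(a: Sequence[float], b: Sequence[float], threshold: float) -> bool:
    """Same person if the feature distance is at or below the threshold."""
    return euclidean_distance(a, b) <= threshold

def recognition_accuracy(pairs: List[Pair], threshold: float) -> float:
    """Proportion of pairs whose judgment matches the correct data."""
    correct = sum(judged_same(a, b, threshold) == label for a, b, label in pairs)
    return correct / len(pairs)
```

For example, with a threshold of 2.0, a pair at distance 1.0 labelled as the same person and a pair at distance 5.0 labelled as different are both judged correctly, while a pair at distance 2.0 labelled as different is judged incorrectly, because the rule treats a distance equal to the threshold as a match.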

As shown in this table, the present method achieved higher face recognition accuracy than the conventional method.
Thus, by providing a scaling layer in ResNet, the present method improves recognition accuracy and extracts feature amounts from images more accurately.

1 Image feature extraction device
10, 10B Basic configuration unit
11, 11B Convolution means
12, 12B Scaling means
13 Addition means
20 Parameter storage means
30 Full connection means
40 Error calculation means
I Image
V Feature amount
M Model
CL Convolution layer
SL Scaling layer
FL Fully connected layer

Claims

1. An image feature extraction device for extracting a feature amount of an image from image data, comprising:
a multi-stage arrangement of basic configuration units, each including
convolution means for performing the operations of a plurality of convolution layers on input data, using pre-learned kernels as parameters of a convolutional neural network,
scaling means for multiplying the input data by a scaling coefficient learned in advance as a parameter of the convolutional neural network, and
addition means for adding the operation result of the convolution means and the operation result of the scaling means,
wherein the calculation result of the final-stage basic configuration unit is used as the feature amount.

2. The image feature extraction device according to claim 1, wherein, in at least one of the basic configuration units, the convolution means generates, through the operations of the plurality of convolution layers, a calculation result whose dimensions differ from those of the input data, and the scaling means converts the input data into data of those dimensions by linear projection and then multiplies the converted data by the scaling coefficient.

3. The image feature extraction device as described, wherein, when the feature amount is used as a feature amount for image classification, the number of channels of the data output from the final-stage basic configuration unit is set to the number of classes for image classification.

4. An image feature extraction program for causing a computer to function as the image feature extraction device according to any one of claims 1 to 3.