JP2021022368A

JP2021022368A - Image recognition device and training device using neural network

Info

Publication number: JP2021022368A
Application number: JP2020104577A
Authority: JP
Inventors: Hironobu Fujiyoshi (藤吉 弘亘); Takayoshi Yamashita (山下 隆義); Tsubasa Hirakawa (平川 翼)
Original assignee: Chubu University
Current assignee: Chubu University
Priority date: 2019-07-25
Filing date: 2020-06-17
Publication date: 2021-02-18
Anticipated expiration: 2040-06-17
Also published as: JP7542802B2; JP2024091853A

Abstract

To incorporate human knowledge into the recognition results of a neural network that outputs an attention map expressing the gaze area of an input image. SOLUTION: A processing unit 5 of an image recognition device 1 applies a modification corresponding to a person's modification operation to an attention map 53 generated by a neural network 10 to which an input image 51 has been input. The processing unit 5 then outputs a recognition result of the input image 51 that the neural network 10 generated based on the modified attention map 53 and the input image 51. SELECTED DRAWING: Figure 8

Description

The present invention relates to an image recognition device and a training device that use a neural network.

Conventionally, in image recognition techniques that use a neural network such as a CNN (Convolutional Neural Network), techniques are known for generating an attention map that expresses the gaze area used by the neural network at inference time (see, for example, Non-Patent Documents 1 and 2).

Non-Patent Document 1: Ramprasaath, R., S., Michael, C., Abhishek, D., Ramakrishna, V., Devi, P. and Dhruv, B.: Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization, International Conference on Computer Vision, pp. 618-626 (2017).

Non-Patent Document 2: Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. and Torralba, A.: Learning Deep Features for Discriminative Localization, Computer Vision and Pattern Recognition, pp. 2921-2929 (2016).

However, according to the inventors' study, when a neural network generates a recognition result and an attention map for an image, the recognition result and the gaze area in the attention map may not agree. For example, if the recognition result is "smile" but the gaze area is on the hair, the recognition result and the gaze area are inconsistent. Such an inconsistency can also lead to errors in the image recognition itself. At present, there is no means of correcting it.

In view of the above, an object of the present disclosure is to incorporate human knowledge into the recognition function or the learning of a neural network, in an image recognition technique using a neural network that outputs an attention map.

According to one aspect of the present disclosure, an image recognition device includes: an input unit (120) that inputs an image (51) to a neural network (10); a map correction unit (140) that, when the neural network has generated an attention map (53) expressing the gaze area of the input image, modifies the generated attention map in accordance with a person's modification operation; and an output unit (160) that, when the neural network (10) has generated a recognition result for the image based on the modified attention map and the image, outputs the generated recognition result.

In this way, because the attention map 53 is modified using human knowledge, the cognitive unit 14 emphasizes the area the person intended. As a result, image recognition in line with the person's intention becomes possible. The parameters of the neural network 10 are not changed in the process. That is, the recognition result the person intended can be obtained without retraining the neural network 10.

According to another aspect of the present disclosure, a training device includes: a reading unit (205) that reads out a neural network (10) trained to generate an attention map (53) expressing the gaze area of an input image (51) and a recognition result for the image; and a training unit (250) that retrains the neural network using a retraining data set. The retraining data set includes a plurality of teacher attention maps, which serve as the correct values of the attention maps for when a plurality of images are input to the neural network.

In this way, when a neural network that generates an attention map is retrained, teacher attention maps are used as the correct values of the attention map. Because the teacher attention maps are created based on human knowledge, this makes it possible to incorporate human knowledge into the learning of the neural network.

The parenthesized reference numerals attached to each component indicate one example of the correspondence between that component and the specific components described in the embodiments below.

FIG. 1 is a block diagram of the image recognition device.
FIG. 2 is a diagram showing the structure of the neural network.
FIG. 3 is a diagram showing the structure of the attention unit.
FIG. 4 is a flowchart of the process executed by the processing unit.
FIG. 5 is a diagram showing a display in which the input image and the attention map before modification are overlaid.
FIG. 6 is a diagram showing a display in which the input image and the attention map with the gaze area erased are overlaid.
FIG. 7 is a diagram showing a display in which the input image and the attention map with a gaze area added are overlaid.
FIG. 8 is a diagram showing the relationship between the modified attention map and the neural network.
FIG. 9 is a flowchart of the process executed by the processing unit in the second embodiment.
FIG. 10 is a diagram showing an outline of retraining.

(First Embodiment)
The first embodiment is described below. As shown in FIG. 1, the image recognition device 1 according to the present embodiment includes an operation device 2, a display device 3, a memory 4, and a processing unit 5.

The operation device 2 is a device that accepts a person's operation and outputs a signal corresponding to the accepted operation to the processing unit 5. The operation device 2 may be, for example, a mouse, a keyboard, or a touch panel. The display device 3 is a device that displays images to a person.

The memory 4 includes a RAM, which is a rewritable volatile storage medium; a ROM, which is a non-rewritable non-volatile storage medium; and a flash memory, which is a rewritable non-volatile storage medium. The RAM, ROM, and flash memory are non-transitory tangible storage media. The data of the trained neural network 10 is recorded in the flash memory in advance.

The processing unit 5 executes a program (not shown) stored in the ROM or the flash memory, using the RAM as a work area during execution, and thereby carries out the various processes described below.

The neural network 10 is now described. As shown in FIG. 3, the neural network 10 is a deep neural network that includes a feature extraction unit 11, an attention unit 12, a synthesis unit 13, and a cognitive unit 14.

When an input image 51 is input, the neural network 10 generates an attention map 53. The attention map 53 is data expressing the gaze area of the neural network 10 at inference time. In other words, the attention map 53 is data for visual explanation that shows which regions of the input image 51 the neural network 10 emphasizes during inference.

The neural network 10 also outputs a classification result for the input image 51 based on the input image 51 and the attention map 53. The classification result for the input image 51 is a set of likelihoods, one for each of a plurality of classes that are the recognition targets (for example, Dalmatian, crayfish, finch, and frog). The number of classes is denoted K.

The feature extraction unit 11 is a neural network having a plurality of layers. These layers include at least a plurality of convolution layers. They may also be components of a plurality of residual blocks, and may include a plurality of pooling layers and the like. The feature extraction unit 11 generates a feature map 52 by propagating the information of the input image 51 through these layers.

The feature map 52 consists of K maps of resolution h × w, one for each of the K classes. Here h and w are arbitrary integers. The feature map 52 therefore has K channels. The resolution of the feature map 52 may be the same as that of the input image 51 or lower.

The feature extraction unit 11 may consist of the part of a baseline model that starts at the input layer and ends before the first fully connected layer. The baseline model is chosen to be a model that has a plurality of convolution layers and generates likelihoods for the same set of classes as the neural network 10. For example, the VGGNet described in Non-Patent Document 3, the ResNet described in Non-Patent Document 4, or another CNN (Convolutional Neural Network) may be used as the baseline model.

The attention unit 12 generates the attention map 53 from the feature map 52 generated by the feature extraction unit 11. The attention unit 12 is a neural network having a plurality of layers. As shown in FIG. 3, these layers include a first part 12a, which has one or more convolution layers or one or more residual blocks, and a K × 1 × 1 convolution layer 12b downstream of the first part. Here, with L, a, and b being arbitrary natural numbers, an L × a × b convolution layer means a convolution layer that uses an a × b kernel in each of L channels.

The attention unit 12 further has two layers that branch off downstream of the convolution layer 12b: a K × 1 × 1 convolution layer 12c and a 1 × 1 × 1 convolution layer 12e. It also has a GAP (Global Average Pooling) layer 12d downstream of the convolution layer 12c.

The information of the feature map 52 input to the attention unit 12 propagates through the first part 12a, the convolution layer 12b, the convolution layer 12c, and the GAP layer 12d, and the output of the GAP layer 12d is fed to the Softmax function. This yields, as a classification result, likelihoods for the same set of classes as the neural network 10. A classification result is one kind of recognition result.

The attention map 53 is generated by propagating the information of the feature map 52 input to the attention unit 12 through the first part 12a, the convolution layer 12b, and the convolution layer 12e. Because the attention map 53 is generated via the convolution layer 12b rather than a fully connected layer, the gaze-area information reaches the attention map 53 while remaining spatially localized. In addition, passing through the 1 × 1 × 1 convolution layer 12e produces a one-channel attention map 53 as a weighted sum of the gaze areas for all classes. The kernel values of the convolution layer 12e may all be 1, or may be other values.
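The two branches downstream of the convolution layer 12b can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function names `conv1x1` and `attention_head`, the random weights, and the omission of biases, activations, and normalization are all assumptions made for illustration. A 1 × 1 convolution is treated as a per-pixel linear map over channels.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: x is (C_in, h, w), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_head(x_12b, w_12c, w_12e):
    """x_12b: output of layer 12b, shape (K, h, w) (hypothetical helper).

    Returns (class likelihoods of shape (K,), attention map of shape (h, w)).
    """
    # Classification path: K x 1 x 1 convolution 12c -> GAP 12d -> Softmax.
    scores = conv1x1(x_12b, w_12c).mean(axis=(1, 2))
    likelihoods = softmax(scores)
    # Attention path: 1 x 1 x 1 convolution 12e collapses the K class maps
    # into a single channel as a weighted sum.
    attention = conv1x1(x_12b, w_12e)[0]
    return likelihoods, attention

K, h, w = 4, 7, 7
rng = np.random.default_rng(0)
x = rng.standard_normal((K, h, w))      # stand-in for the 12b output
w_12c = rng.standard_normal((K, K))     # illustrative random weights
w_12e = np.ones((1, K))                 # all-ones kernel: plain sum over K maps
likelihoods, attention = attention_head(x, w_12c, w_12e)
```

With the all-ones kernel the attention map is simply the sum of the K class maps; other kernel values give a weighted sum.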

Each map in the feature map 52 has the same resolution as the attention map 53; the attention unit 12 is configured so that this holds. In the attention map 53, pixels that belong to the gaze area are given relatively high pixel values, and pixels that do not are given lower values than those in the gaze area. Each pixel value of the attention map 53 may be binary or may take one of 256 levels. The higher the value of a pixel, the higher the degree of attention at that pixel's position.

The synthesis unit 13 combines the feature map 52 and the attention map 53. Specifically, the h × w map in each of the K channels of the feature map 52 is multiplied by the attention map 53. The multiplication is performed pixel by pixel, between pixels at the same position. The combination may be multiplication as described above, addition, or an operation that combines addition and multiplication. The result is a composite map 54, which has the same number of channels and the same resolution as the feature map 52.
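The combination performed by the synthesis unit 13 can be sketched with NumPy broadcasting. The function name `combine` and the mode names are hypothetical, and the `"residual"` mode, (1 + attention) × feature, is only one possible reading of "a combination of addition and multiplication"; the text does not fix a particular formula.

```python
import numpy as np

def combine(feature_map, attention_map, mode="multiply"):
    """feature_map: (K, h, w); attention_map: (h, w). Returns (K, h, w)."""
    a = attention_map[np.newaxis, :, :]  # broadcast over the K channels
    if mode == "multiply":
        return feature_map * a
    if mode == "add":
        return feature_map + a
    if mode == "residual":
        # Assumed mix of addition and multiplication: (1 + a) * f.
        return feature_map * (1.0 + a)
    raise ValueError(f"unknown mode: {mode}")

feature_map = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
attention_map = np.array([[0.0, 0.5, 1.0],
                          [1.0, 0.5, 0.0]])
composite = combine(feature_map, attention_map)
```

In every mode the composite map keeps the channel count and resolution of the feature map, as the description requires.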

The cognitive unit 14 outputs the likelihood of each class based on the composite map 54. The cognitive unit 14 is a neural network having a plurality of layers. These layers include at least a plurality of convolution layers, and include one or both of a fully connected layer and a GAP layer. They may also be components of a plurality of residual blocks, and may include a plurality of pooling layers. The cognitive unit 14 propagates the information of the input composite map 54 through these layers and outputs the likelihood of each class as a classification result. This classification result is also a recognition result. The cognitive unit 14 may consist of the part of the baseline model described above that runs from immediately after the part used by the attention unit 12 to the output layer.

The functions described above as being performed by the neural network 10, the feature extraction unit 11, the attention unit 12, the synthesis unit 13, and the cognitive unit 14 are in practice realized by the processing unit 5 performing processing in accordance with the structure and parameters of the neural network 10.

The feature extraction unit 11, the attention unit 12, and the synthesis unit 13 are trained in advance by supervised learning with error backpropagation so that the functions described above are realized. Training uses the learning error (also called the loss function) L = Latt + Lper. Here, Latt is the learning error for the classification result output by the attention unit 12, and Lper is the learning error for the classification result output by the cognitive unit 14. Latt and Lper may each be calculated by applying the combination of the Softmax function and cross entropy to the respective classification result. The feature extraction unit 11 is trained by the gradients that pass back through both the attention unit 12 and the cognitive unit 14 during backpropagation.
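The loss L = Latt + Lper can be illustrated with a small NumPy sketch, assuming a single training sample with an integer class label. The function names and the example scores are hypothetical; only the sum of two Softmax cross-entropy terms comes from the description.

```python
import numpy as np

def softmax_cross_entropy(scores, label):
    """Cross entropy of softmax(scores) against an integer class label."""
    shifted = scores - scores.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def total_loss(att_scores, per_scores, label):
    l_att = softmax_cross_entropy(att_scores, label)  # attention unit 12
    l_per = softmax_cross_entropy(per_scores, label)  # cognitive unit 14
    return l_att + l_per

att_scores = np.array([2.0, 0.5, -1.0])  # illustrative pre-Softmax scores
per_scores = np.array([0.0, 0.0, 0.0])
loss = total_loss(att_scores, per_scores, label=0)
```

Because both terms depend on the shared feature map, minimizing their sum sends gradients from both branches back into the feature extraction unit.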

The image classification process that the processing unit 5 performs using the trained neural network 10 configured in this way is described below.

When a predetermined condition is satisfied, such as a person performing an execution start operation on the operation device 2, the processing unit 5 starts the process shown in FIG. 4, which is defined in a predetermined program recorded in the memory 4. In this process, the processing unit 5 first reads the neural network 10 from the memory 4 in step 110.

Next, in step 120, the processing unit 5 acquires the input image 51 and inputs it to the neural network 10. The input image 51 may be an image selected, for example by a person's operation on the operation device 2, from a plurality of images recorded in the memory 4 in advance, or may be an image received from another device via a communication network (not shown).

When the input image 51 is input to the neural network 10, as described above, the feature extraction unit 11 generates the feature map 52 and a classification result from the input image 51, and the attention unit 12 generates the attention map 53 from the feature map 52.

In step 130, which follows step 120, the processing unit 5 acquires the attention map 53 generated in this way. That is, the attention map 53 that the neural network 10 generated in the memory 4 is copied or moved to another area of the memory 4.

Next, in step 140, the processing unit 5 modifies the acquired attention map 53 (that is, the copy or the moved map) based on a person's modification operation on the operation device 2. The attention map 53 is thereby modified according to human knowledge.

Specifically, the processing unit 5 causes the display device 3 to display the attention map 53 before modification together with a pointer. The pointer is an image that moves within the display range of the attention map 53 on the display device 3 in response to the person's operation of the operation device 2. By performing a predetermined modification operation (for example, an erase operation or an add operation) on the operation device 2, the person modifies the values of the displayed attention map 53 in the range of positions that the pointer overlaps.

At this time, as shown in FIG. 5, the processing unit 5 may display the input image 51 on the display device 3 transparently overlaid on and aligned with the attention map 53, and reflect the modifications made by the modification operation in the attention map 53 in that state. If the input image 51 and the attention map 53 have different resolutions, the processing unit 5 first lowers the resolution of the input image 51 to match the attention map 53 and then overlays it transparently on the attention map 53.
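The resolution matching and transparent overlay can be sketched as follows. This is an illustration under stated assumptions, not the device's display code: the image is grayscale with values in [0, 1], its size is an integer multiple of the attention map size, downscaling is done by block averaging, and the names `downscale`, `overlay`, and `alpha` are hypothetical.

```python
import numpy as np

def downscale(image, out_h, out_w):
    """Block-average a (H, W) image down to (out_h, out_w)."""
    H, W = image.shape
    if H % out_h or W % out_w:
        raise ValueError("image size must be a multiple of the target size")
    blocks = image.reshape(out_h, H // out_h, out_w, W // out_w)
    return blocks.mean(axis=(1, 3))

def overlay(image, attention_map, alpha=0.5):
    """Blend a grayscale image in [0, 1] with an attention map in [0, 1]."""
    small = downscale(image, *attention_map.shape)
    return (1.0 - alpha) * small + alpha * attention_map

image = np.ones((8, 8))            # stand-in for the input image 51
attention_map = np.zeros((4, 4))   # stand-in for the attention map 53
attention_map[1:3, 1:3] = 1.0
blended = overlay(image, attention_map, alpha=0.5)
```

A real viewer would more likely color-map the attention values before blending; plain alpha blending is used here only to keep the sketch short.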

In FIG. 5, an input image 51 showing a Dalmatian holding a soccer ball in its mouth is transparently overlaid on the attention map 53.

In this attention map 53, the gaze area lies on the soccer ball. If the attention map 53 were input to the synthesis unit 13 as it is, and the composite map 54 obtained by combining it with the feature map 52 were input to the first layer of the cognitive unit 14, the classification result generated by the cognitive unit 14 would give the highest likelihood to the soccer ball. That is, the neural network 10 would recognize the input image 51 as an image of a soccer ball.

However, if the person using the image recognition device 1 wants the input image 51 to be recognized as an image of a Dalmatian, the gaze area in this attention map 53 should be the area containing the Dalmatian.

In such a case, the person uses the operation device 2 to modify the gaze area in the attention map 53. Specifically, the person first erases the existing gaze area. For example, while holding down a predetermined erase button on the operation device 2, the person moves the pointer so that it sweeps over the entire gaze area in the attention map 53. The processing unit 5 then lowers the pixel values of the attention map 53 in the area swept by the pointer while the erase button was held, setting them to values that do not constitute a gaze area, as shown in FIG. 6.

After that, the person uses the operation device 2 to set the area of the attention map 53 that should become the new gaze area. For example, while holding down a predetermined add button on the operation device 2, the person moves the pointer so that it sweeps over the entire area that should become the gaze area. The processing unit 5 then raises the pixel values of the attention map 53 in the area swept by the pointer while the add button was held, setting them to values that constitute a gaze area, as shown in FIG. 7. In the example of FIG. 7, the new gaze area specified by the person is the Dalmatian's face.
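The erase and add operations can be sketched as a brush applied to the attention map at each pointer position. The circular brush, its radius, the values 0 and 1 used for "not a gaze area" and "gaze area", and the function name `apply_brush` are all assumptions for illustration; the description only says that pixel values under the pointer are lowered or raised.

```python
import numpy as np

def apply_brush(attention_map, row, col, radius, mode):
    """Return a copy of attention_map with a circular brush stroke applied.

    mode="erase" lowers the covered pixels to 0 (not a gaze area);
    mode="add" raises them to 1 (gaze area).
    """
    h, w = attention_map.shape
    rows, cols = np.ogrid[:h, :w]
    covered = (rows - row) ** 2 + (cols - col) ** 2 <= radius ** 2
    value = {"erase": 0.0, "add": 1.0}[mode]
    edited = attention_map.copy()
    edited[covered] = value
    return edited

attention_map = np.zeros((6, 6))
attention_map[0:2, 0:2] = 1.0    # original gaze area (for example, the ball)
for r, c in [(0, 0), (1, 1)]:    # pointer positions while "erase" is held
    attention_map = apply_brush(attention_map, r, c, radius=1, mode="erase")
# One pointer position while "add" is held (for example, the dog's face).
attention_map = apply_brush(attention_map, 4, 4, radius=1, mode="add")
```

Working on a copy mirrors step 130, where the map is copied before it is edited, so the network's own output is left untouched.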

Because the input image 51 is displayed on the display device 3 overlaid on the attention map 53 in this way, a person who can judge which part of the input image 51 should be the gaze area can apply that knowledge efficiently and easily specify the gaze area in the attention map 53. Through the processing of step 140, the attention map 53 acquired in step 130 is modified in the memory 4.

Next, in step 150, the processing unit 5 inputs the attention map 53 modified in the immediately preceding step 140 to the synthesis unit 13. The synthesis unit 13 then combines the feature map 52 and the attention map 53 as described above to generate the composite map 54 and inputs it to the first layer of the cognitive unit 14. The cognitive unit 14, having received the composite map 54, generates a classification result based on it as described above. In this classification result, the Dalmatian has the highest likelihood. That is, the neural network 10 recognizes the input image 51 as an image of a Dalmatian.

In step 160, which follows step 150, the processing unit 5 acquires and outputs the classification result generated by the cognitive unit 14 in this way. The output destination may be another device reached via a communication network (not shown), the memory 4, or the display device 3.

In this way, because the attention map 53 is modified using human knowledge, the cognitive unit 14 gives higher weight to the area the person intended. As a result, image recognition in line with the person's intention becomes possible. In other words, the recognition result can be adjusted by using an attention map that has been manually modified based on human knowledge.

The parameters of the neural network 10 are not changed in this process. That is, the recognition result the person intended can be obtained without retraining the neural network 10.

For example, when a fundus image is input to the neural network 10 as the input image 51, a doctor can modify the gaze area of the attention map 53 using knowledge from their own experience, which makes classification with eye-disease grades as the classes more accurate. The function of the present embodiment is thus useful in, for example, medical image diagnosis.

As described above and shown in FIG. 8, the processing unit 5 of the image recognition device 1 modifies the attention map 53, which was generated by the neural network 10 from the input image 51, in accordance with a person's modification operation (step 140). The processing unit 5 then outputs the recognition result for the input image 51 that the neural network 10 generated based on the modified attention map 53 and the input image 51 (step 160).
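The overall flow of steps 120 to 160 can be sketched with toy stand-ins for the four units. Nothing here is a trained network: `extract`, `attend`, `perceive`, and `move_gaze_right` are hypothetical functions chosen so that the effect of editing the attention map is easy to see, and multiplication is assumed for the synthesis unit.

```python
import numpy as np

def recognize(image, extract, attend, perceive, edit=None):
    """Steps 120-160: run the network, optionally with a human-edited map."""
    feature_map = extract(image)             # feature extraction unit 11
    attention_map = attend(feature_map)      # attention unit 12 (step 130)
    if edit is not None:
        attention_map = edit(attention_map)  # human modification (step 140)
    composite = feature_map * attention_map  # synthesis unit 13 (step 150)
    return perceive(composite)               # cognitive unit 14 (step 160)

# Toy stand-ins: channel 0 responds to the left half of the image ("ball"),
# channel 1 to the right half ("dog").
def extract(image):
    left = np.zeros_like(image)
    left[:, :2] = image[:, :2]
    right = np.zeros_like(image)
    right[:, 2:] = image[:, 2:]
    return np.stack([left, right])

def attend(feature_map):
    mask = np.zeros(feature_map.shape[1:])
    mask[:, :2] = 1.0  # the toy network gazes at the left half
    return mask

def perceive(composite):
    scores = composite.mean(axis=(1, 2))  # GAP followed by Softmax
    e = np.exp(scores - scores.max())
    return e / e.sum()

def move_gaze_right(attention_map):
    edited = np.zeros_like(attention_map)
    edited[:, 2:] = 1.0
    return edited

image = np.ones((4, 4))
before = recognize(image, extract, attend, perceive)
after = recognize(image, extract, attend, perceive, edit=move_gaze_right)
```

The same functions, with unchanged parameters, produce a different top class once the attention map is edited, which is the point of the first embodiment.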

In addition, the path along which image information propagates to generate the attention map 53 and the path along which image information propagates to generate the recognition result are shared in one part (the feature extraction unit 11) and separate in the other parts (the attention unit 12 and the cognitive unit 14). The synthesis unit 13 inputs the composite map 54, which reflects the modified attention map 53, to the cognitive unit 14 side of the separated part. Because the point at which the composite map 54 based on the modified attention map 53 is input suits the structure of the neural network 10 in this way, the modified attention map 53 improves the recognition result to a greater degree.

In addition, the processing unit 5 reflects the modification corresponding to the person's correction operation in the attention map 53 while the input image 51 is transparently superimposed on the attention map 53 and displayed on the display device 3. A person can judge relatively easily which part of the input image 51 should be the gaze area by looking at the input image 51. Therefore, because the input image 51 is displayed on the display device 3 superimposed on the attention map 53, the person can use his or her knowledge visually and efficiently and can easily specify the gaze area in the attention map 53.
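One common way to realize a transparent superimposition is alpha blending. The sketch below assumes a grayscale image and an attention map of the same size, both with values in the range 0 to 1; the function name `overlay` and the parameter `alpha` are illustrative and do not appear in the embodiment.

```python
from typing import List

Map2D = List[List[float]]


def overlay(image: Map2D, attention_map: Map2D, alpha: float = 0.5) -> Map2D:
    """Blend the image over the attention map; alpha is the image opacity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return [
        [alpha * pixel + (1.0 - alpha) * attention
         for pixel, attention in zip(img_row, att_row)]
        for img_row, att_row in zip(image, attention_map)
    ]
```

With `alpha` near 1 the image dominates, and with `alpha` near 0 the attention map dominates, so the value can be chosen so that both remain visible to the person performing the correction.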

In the present embodiment, the processing unit 5 functions as an input unit by executing step 120, as a map correction unit by executing step 140, and as an output unit by executing step 160.

(Second Embodiment)
Next, the second embodiment will be described. In the present embodiment, learning parameters of the neural network 10, such as weights and biases, are corrected based on an attention map modified in accordance with a person's correction operation. That is, the neural network 10 is retrained based on the modified attention map.

The hardware configuration of the present embodiment is the same as that shown in FIG. 1 for the first embodiment. The configuration of the trained neural network 10 stored in the memory 4 is also the same as in the first embodiment. The image recognition device 1 of the present embodiment corresponds to a training device.

One difference between the present embodiment and the first embodiment is that the processing unit 5 does not execute the process of FIG. 4. Instead, it causes the neural network 10 to generate the classification result corresponding to the input image 51 without modifying the attention map 53.

That is, as in the first embodiment, the processing unit 5 first reads the neural network 10 from the memory 4, then acquires the input image 51, and inputs the input image 51 to the neural network 10.

Then, in the neural network 10, the feature extraction unit 11 and the attention unit 12 function as in the first embodiment, so that the attention unit 12 generates the attention map 53 and a classification result. This attention map 53 is input to the synthesis unit 13 without undergoing a person's correction operation, that is, without being modified. The synthesis unit 13 generates the composite map 54 by combining the feature map 52 with the unmodified attention map 53. The cognitive unit 14 generates a classification result based on this composite map 54, as in the first embodiment. The processing unit 5 acquires and outputs this classification result as in the first embodiment.

In addition to the above process of using the neural network 10 to obtain the classification result of the input image 51, the processing unit 5 executes the process shown in FIG. 9 in order to retrain the neural network 10. The neural network 10 is fine-tuned by this retraining.

The processing unit 5 starts the process of FIG. 9 when a person performs a predetermined retraining start operation on the operating device 2. In this process, the processing unit 5 uses a data set for retraining. The data set for retraining has a plurality of groups (for example, 10, 100, or 100,000), each consisting of a learning image and a teacher label.

The learning image is data that is input to the feature extraction unit 11 in the same way as the input image 51. The teacher label is data taken as the correct value of the classification results output from the attention unit 12 and the cognitive unit 14 when the learning image of the same group is input to the feature extraction unit 11.

The data set for retraining may be generated in advance and recorded in the non-volatile storage medium of the memory 4, or may be acquired from a data server via a communication network (not shown). The learning images and teacher labels of the data set for retraining may be the same as those of the training data set used in the initial training of the neural network 10, or may differ from them.

In the process of FIG. 9, the processing unit 5 first executes the loop of steps 210 and 220 once for each group included in the data set for retraining. In each iteration, the processing unit 5 first inputs, in step 210, the learning image of the target group to the feature extraction unit 11. Then, in step 220, it acquires the attention map 53 and the classification result generated by the attention unit 12 based on the input learning image, as well as the classification result generated by the cognitive unit 14 based on that learning image, and records them in the memory 4.

The way in which the neural network 10 outputs the attention map 53 and the two kinds of classification results based on the input learning image is the same as the method described above, with the input image 51 replaced by the learning image. One iteration of the loop ends after step 220.

When the loop has been executed as many times as there are groups, the processing unit 5 proceeds to step 230. At this point, for every group in the data set for retraining, the attention map 53 and the classification result generated by the attention unit 12 and the classification result generated by the cognitive unit 14 are recorded in the memory 4 in association with that group.
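The loop of steps 210 and 220 might be organized as in the sketch below. The `Record` type, the function `collect_outputs`, and the convention that the network is a callable returning a triple (attention map, attention-unit scores, cognitive-unit scores) are assumptions made for illustration; the embodiment only specifies which outputs are recorded for each group.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

Map2D = List[List[float]]


@dataclass
class Record:
    """Outputs stored for one group of the retraining data set."""
    group_id: int
    attention_map: Map2D
    attention_scores: List[float]   # classification result of the attention unit
    cognitive_scores: List[float]   # classification result of the cognitive unit


def collect_outputs(
    dataset: Sequence[Tuple[Any, List[float]]],
    network: Callable[[Any], Tuple[Map2D, List[float], List[float]]],
) -> List[Record]:
    """Run the network once per (learning image, teacher label) group."""
    records = []
    for group_id, (learning_image, _teacher_label) in enumerate(dataset):
        # Input the learning image and take the three outputs of the network.
        attention_map, att_scores, cog_scores = network(learning_image)
        # Keep the outputs in association with their group.
        records.append(Record(group_id, attention_map, att_scores, cog_scores))
    return records
```

Keeping the group index in each record preserves the association between a group and its outputs that the later steps rely on.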

In step 230, the processing unit 5 extracts, from the plurality of groups, the groups in which misrecognition occurred. A group is extracted as misrecognized when the class with the highest likelihood in the classification result output by the cognitive unit 14 does not match the class indicated by the teacher label (that is, the class with the highest likelihood in the teacher label). Alternatively, a group may be extracted as misrecognized when the class with the highest likelihood in the classification result output by the attention unit 12 does not match the class indicated by the teacher label. Alternatively, both kinds of groups may be extracted. In most cases, a plurality of groups are extracted.
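The extraction criterion of step 230 can be sketched as an argmax comparison. The function names and the `criterion` parameter are hypothetical; teacher labels are assumed to be likelihood vectors (for example, one-hot), so the class a label indicates is its highest-likelihood entry.

```python
from typing import List, Sequence


def argmax(scores: Sequence[float]) -> int:
    """Index of the class with the highest likelihood."""
    return max(range(len(scores)), key=lambda i: scores[i])


def extract_misrecognized(
    cognitive_scores: Sequence[Sequence[float]],
    attention_scores: Sequence[Sequence[float]],
    teacher_labels: Sequence[Sequence[float]],
    criterion: str = "cognitive",
) -> List[int]:
    """Return the indices of groups whose top class differs from the label."""
    if criterion not in ("cognitive", "attention", "either"):
        raise ValueError("unknown criterion: " + criterion)
    selected = []
    groups = zip(cognitive_scores, attention_scores, teacher_labels)
    for index, (cog, att, label) in enumerate(groups):
        target = argmax(label)
        cog_wrong = argmax(cog) != target
        att_wrong = argmax(att) != target
        if criterion == "cognitive":
            wrong = cog_wrong
        elif criterion == "attention":
            wrong = att_wrong
        else:  # "either": groups misrecognized by at least one of the two units
            wrong = cog_wrong or att_wrong
        if wrong:
            selected.append(index)
    return selected
```

The three values of `criterion` correspond to the three alternatives in the text: judging by the cognitive unit's result, judging by the attention unit's result, or extracting both kinds of groups.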

Next, in step 240, the processing unit 5 modifies, based on human knowledge, the attention map 53 recorded in the memory 4 for each of the groups extracted in the immediately preceding step 230. Specifically, it modifies the attention map 53 based on a person's correction operation on the operating device 2, by the same process as step 140 of FIG. 4. The processing unit 5 then stores the modified attention map 53 in the memory 4 as the teacher attention map belonging to that group.
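As a rough illustration of how a modified attention map could be produced and kept as a teacher attention map, the sketch below overwrites a rectangular region of a copy of the map. The rectangular edit, the function name `apply_correction`, and its parameters are assumptions; the actual form of the correction operation is defined by step 140 of FIG. 4, which is described elsewhere in this document.

```python
from typing import List

Map2D = List[List[float]]


def apply_correction(
    attention_map: Map2D,
    top: int,
    left: int,
    height: int,
    width: int,
    value: float,
) -> Map2D:
    """Return a copy of the map with one rectangle set to the given value."""
    rows = len(attention_map)
    cols = len(attention_map[0]) if rows else 0
    if top < 0 or left < 0 or top + height > rows or left + width > cols:
        raise ValueError("region lies outside the attention map")
    corrected = [row[:] for row in attention_map]  # leave the original intact
    for r in range(top, top + height):
        for c in range(left, left + width):
            corrected[r][c] = value
    return corrected
```

Setting `value` to 0.0 removes a gaze area and setting it to 1.0 adds one; because the original map is left untouched, the network's output and the teacher attention map can both be kept for the same group.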

The teacher attention map created in this way is data taken as the correct value of the attention map 53 output from the attention unit 12 when the learning image of the same group is input to the feature extraction unit 11. Through this process, the teacher attention map is added to the data set for retraining.

Next, in step 250, the processing unit 5 retrains the neural network 10 based on the two kinds of classification results and the attention map 53 acquired in the current execution of the process of FIG. 9, and on the data set for retraining. As described above, the data set for retraining includes the teacher attention maps and the teacher labels.

Specifically, as shown in FIG. 10, the learning parameters of the attention unit 12 and the cognitive unit 14, such as weights and biases, are updated by backpropagation, using as the learning error the quantity L = Latt + Lper + Lmap, which is the sum of the three learning errors Latt, Lper, and Lmap. FIG. 10 shows the output layer 14b of the cognitive unit 14 and the portion 14a of the cognitive unit 14 that precedes the output layer 14b. In the present embodiment, the learning parameters of the feature extraction unit 11, such as weights and biases, are not updated.
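The selective update described above, in which the attention unit 12 and the cognitive unit 14 are updated while the feature extraction unit 11 is left unchanged, can be illustrated with a toy parameter update. Plain gradient descent, the dictionary layout, and the names `sgd_step` and `trainable` are assumptions; the text specifies only that backpropagation is used and which units are updated.

```python
from typing import Dict, List, Set


def sgd_step(
    params: Dict[str, List[float]],
    grads: Dict[str, List[float]],
    trainable: Set[str],
    lr: float = 0.1,
) -> Dict[str, List[float]]:
    """Apply one gradient-descent step to the trainable units only."""
    updated = {}
    for unit, weights in params.items():
        if unit in trainable:
            # Updated unit: move each parameter against its gradient.
            updated[unit] = [w - lr * g for w, g in zip(weights, grads[unit])]
        else:
            # Frozen unit: parameters are copied unchanged.
            updated[unit] = list(weights)
    return updated
```

For the configuration in this embodiment, `trainable` would be `{"attention", "cognitive"}`, with the feature extraction parameters passed through untouched even when a gradient is available for them.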

Here, Latt is a quantity indicating the error between the classification result output by the attention unit 12 when the learning image 61 is input to the neural network 10 and the teacher label 60 belonging to the same group as the learning image 61.

Lper is a quantity indicating the error between the classification result output by the cognitive unit 14 when the learning image 61 is input to the neural network 10 and the teacher label 60 belonging to the same group as the learning image 61.

Lmap is a quantity indicating the error between the attention map 53 output by the attention unit 12 when the learning image 61 is input to the neural network 10 and the teacher attention map belonging to the same group as the learning image 61.

As the learning error Lmap, an L2-norm error may be adopted as in the following equation, or an error of another form may be adopted.
Lmap = γ × ||M' - M||₂
Here, M denotes the values of the attention map 53 output by the attention unit 12 when the learning image 61 is input to the neural network 10. M' denotes the values of the modified attention map corresponding to the same group as the learning image 61. By computing the error element by element between these two attention maps, the attention unit 12 is trained to output an attention map close to human knowledge.

Here, γ is a coefficient that scales the learning error Lmap. Lmap has a larger error value than Latt and Lper. Scaling it by γ therefore makes it possible to balance the magnitudes of the three learning errors Lmap, Latt, and Lper. After step 250, the process of FIG. 9 ends, and the retrained neural network 10 is recorded in the memory 4.
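The learning error L = Latt + Lper + Lmap can be sketched numerically as follows. The L2-norm form of Lmap follows the equation given in the text. The use of cross-entropy for Latt and Lper is an assumption (the text only says they indicate an error against the teacher label), as are the function names and the assumption that the classification results are probability vectors.

```python
import math
from typing import List, Sequence

Map2D = List[List[float]]


def cross_entropy(probs: Sequence[float], teacher: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Cross-entropy against a teacher label (assumed form of Latt and Lper)."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, teacher))


def map_loss(m: Map2D, m_teacher: Map2D, gamma: float) -> float:
    """Lmap = gamma * ||M' - M||_2 over all elements of the two maps."""
    squared = sum(
        (t - o) ** 2
        for out_row, teacher_row in zip(m, m_teacher)
        for o, t in zip(out_row, teacher_row)
    )
    return gamma * math.sqrt(squared)


def total_loss(att_probs: Sequence[float], per_probs: Sequence[float],
               teacher_label: Sequence[float], m: Map2D, m_teacher: Map2D,
               gamma: float) -> float:
    """L = Latt + Lper + Lmap."""
    l_att = cross_entropy(att_probs, teacher_label)
    l_per = cross_entropy(per_probs, teacher_label)
    return l_att + l_per + map_loss(m, m_teacher, gamma)
```

Because the norm is taken over every element of the map, its raw value grows with the map size, which is one reason a small γ can be useful for balancing Lmap against the two classification errors.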

Fine-tuning the neural network 10 based on attention maps modified based on human knowledge in this way improves the image recognition function of the neural network 10. That is, the accuracy of the recognition results generated by the cognitive unit 14 improves when the processing unit 5 inputs various input images 51 to the fine-tuned neural network 10.

As described above, the processing unit 5 retrains the neural network 10 using the data set for retraining (step 250). The data set for retraining includes a plurality of teacher attention maps.

Thus, when the neural network 10 that generates attention maps is retrained, teacher attention maps, which are taken as the correct values of the attention maps, are used. Because the teacher attention maps are created based on human knowledge, this makes it possible to incorporate human knowledge into the training of the neural network 10.

In addition, the processing unit 5 acquires a plurality of attention maps corresponding respectively to a plurality of learning images by inputting the learning images to the neural network 10 (steps 210 and 220). The processing unit 5 then modifies those attention maps in accordance with a person's correction operation to obtain the teacher attention maps (step 240).

In this way, a teacher attention map can be generated based on a correction operation that a person performs on an attention map generated by the neural network 10. Human knowledge can therefore be incorporated more directly into the training of the neural network 10. Moreover, the correction operation is simpler than creating a teacher attention map from scratch.

In addition, the plurality of teacher attention maps included in the data set for retraining correspond only to learning images that were misrecognized by the neural network 10 before the retraining. Using many teacher attention maps that correspond to misrecognized learning images in this way makes retraining more efficient, because an attention map generated from a misrecognized learning image is itself likely to contain many errors.

In addition, the neural network 10 generates the recognition result of the input image 51 based on the input image 51 and the attention map 53. In a neural network 10 in which not only the input image 51 but also the attention map 53 is fed back as information for image recognition, the recognition result of the input image 51 is strongly related to the attention map 53. In such a neural network 10, therefore, retraining with teacher attention maps contributes to a high degree to improving the recognition result of the input image 51.

In addition, the processing unit 5 retrains the attention unit 12 without retraining the feature extraction unit 11. Because the part of the neural network 10 that is strongly related to the generation of the attention map 53 is retrained in this way, efficient fine-tuning of the neural network 10 is achieved.

In the present embodiment, the processing unit 5 functions as a reading unit by executing step 205, as a training unit by executing step 250, as an acquisition unit by executing steps 210 and 220, and as a map correction unit by executing step 240.

(Other Embodiments)
The present invention is not limited to the embodiments described above and can be modified as appropriate. The above embodiments are not unrelated to one another and can be combined as appropriate unless the combination is clearly impossible. In each of the above embodiments, the elements constituting the embodiment are not necessarily essential, except where they are explicitly stated to be essential or are clearly considered essential in principle. In each of the above embodiments, where numerical values such as the number, value, amount, or range of constituent elements are mentioned, the invention is not limited to those specific numbers, except where they are explicitly stated to be essential or the invention is clearly limited to them in principle. Where a plurality of values are given as examples for a quantity, a value between those values may also be adopted, unless otherwise noted or clearly impossible in principle.

The present invention also permits the following modifications of the above embodiments, as well as modifications within an equivalent range. Each of the following modifications can be independently applied or not applied to the above embodiments. That is, any combination of the following modifications can be applied to the above embodiments.

(Modification 1)
The image recognition device 1 may have both the function of the first embodiment (that is, image recognition using an attention map modified based on human knowledge) and the function of the second embodiment (that is, retraining using an attention map modified based on human knowledge).

(Modification 2)
In the above embodiments, a classification result is given as an example of the recognition result output by the attention unit 12 and the cognitive unit 14. However, the recognition result output by the attention unit 12 and the cognitive unit 14 is not limited to a classification result and may be a regression result. That is, the image recognition performed by the neural network 10 may be classification or regression.

(Modification 3)
In the first embodiment, the neural network 10 has the feature extraction unit 11, the attention unit 12, the synthesis unit 13, and the cognitive unit 14. However, a neural network for realizing image recognition using an attention map modified based on human knowledge is not limited to such a configuration. That is, in any neural network that generates an attention map based on an input image and generates a recognition result of the image based on the image and the attention map, modifying the attention map can improve the image recognition function.

(Modification 4)
In the second embodiment, the neural network 10 has the feature extraction unit 11, the attention unit 12, the synthesis unit 13, and the cognitive unit 14. However, a neural network for realizing retraining using an attention map modified based on human knowledge is not limited to such a configuration. That is, in any neural network that generates an attention map and an image recognition result based on an input image, retraining with a modified attention map can improve the image recognition function. For example, a neural network such as CAM (Class Activation Mapping) described in Non-Patent Document 2 may be retrained using an attention map modified based on human knowledge.

(Modification 5)
In the first embodiment, the processing unit 5 reflects the modification corresponding to the person's correction operation in the attention map while the input image 51 is transparently superimposed on the attention map 53 and displayed on the display device 3. However, this is not always necessary. For example, the processing unit 5 may reflect the modification corresponding to the person's correction operation in the attention map while the input image 51 and the attention map 53 are displayed side by side on the display device 3 without overlapping. As another example, the processing unit 5 may reflect the modification corresponding to the person's correction operation in the attention map while the attention map 53 is displayed on the display device 3 and the input image 51 is not.

(Modification 6)
In the first and second embodiments, the attention map is modified by changing the values of only some of the pixels in the attention map generated by the attention unit 12 and leaving the values of the remaining pixels unchanged. In other words, the embodiments show a method of making changes to the attention map generated by the attention unit 12.

However, the method of modifying the attention map is not necessarily limited to this. For example, a new attention map may be created from scratch, separately from the attention map output by the attention unit 12 when the image is input to the neural network 10. In this case, the new attention map is input to the synthesis unit 13 in the first embodiment, and the new attention map serves as the teacher attention map in the second embodiment.

A new attention map can be created, for example, as follows. First, a person looks at the image input to the neural network 10 and decides the position and extent of the gaze area. The person may then operate a computer to create a new attention map that reflects the position and extent of the gaze area thus decided. This computer may be the image recognition device 1 or another device.

(Modification 7)
In the second embodiment, the teacher attention maps used for retraining are only those corresponding to learning images that were misrecognized by the neural network 10 before the retraining. However, the teacher attention maps used for retraining may include teacher attention maps corresponding to learning images that were correctly recognized by the neural network 10 before the retraining.

Even in that case, retraining can be made more efficient if the teacher attention maps corresponding to misrecognized learning images outnumber the teacher attention maps corresponding to correctly recognized learning images.

Alternatively, the teacher attention maps corresponding to misrecognized learning images may be fewer than the teacher attention maps corresponding to correctly recognized learning images.

(Modification 8)
In the above embodiment, when the neural network 10 is retrained, the feature extraction unit 11 is not retrained, and only the attention unit 12 and the cognitive unit 14 are retrained. The retraining of the neural network 10 is not limited to this form. For example, only the attention unit 12 may be retrained, without retraining the feature extraction unit 11 and the cognitive unit 14. As another example, only the feature extraction unit 11 may be retrained, without retraining the attention unit 12 and the cognitive unit 14. As yet another example, the feature extraction unit 11 and the cognitive unit 14 may be retrained, without retraining the attention unit 12.

A form in which the feature extraction unit 11, the attention unit 12, and the cognitive unit 14 are all retrained is also permitted. In this case, the retraining of the neural network 10 is not fine-tuning.

(Modification 9)
In the above embodiment, retraining is performed by backpropagation using the three learning errors Lmap, Latt, and Lper. However, it is not necessary to use all of Lmap, Latt, and Lper. For example, only Lmap may be used.

1 ... image recognition device, 2 ... operating device, 3 ... display device, 4 ... memory, 5 ... processing unit, 10 ... neural network, 11 ... feature extraction unit, 12 ... attention unit, 13 ... synthesis unit, 14 ... cognitive unit, 51 ... input image, 52 ... feature map, 53 ... attention map, 54 ... composite map, 60 ... teacher label, 61 ... learning image

Claims

An image recognition device comprising:
an input unit (120) that inputs an image (51) into a neural network (10);
a map correction unit (140) that, when the neural network generates an attention map (53) expressing a gaze area of the input image, modifies the generated attention map in accordance with a person's correction operation; and
an output unit (160) that, when the neural network (10) generates a recognition result of the image based on the modified attention map and the image, outputs the generated recognition result.

The image recognition device according to claim 1, wherein
the neural network includes a feature extraction unit (11), an attention unit (12), a synthesis unit (13), and a cognitive unit (14),
the feature extraction unit includes a plurality of convolution layers and generates a feature map (52) including features of the image by propagating information of the image through the plurality of convolution layers,
the attention unit generates the attention map based on the feature map,
the synthesis unit combines the feature map and the modified attention map to generate a composite map (54), and
the cognitive unit generates the recognition result based on the composite map.

The image recognition device according to claim 1 or 2, wherein the map correction unit reflects the modification corresponding to the correction operation in the attention map while the image is transparently superimposed on the attention map and displayed on a display device (3).

A training device comprising:
a reading unit (205) that reads out a neural network (10) trained to generate an attention map (53) expressing a gaze area of an input image (51) and a recognition result of the image; and
a training unit (250) that retrains the neural network using a data set for retraining,
wherein the data set for retraining includes a plurality of teacher attention maps, which are taken as the correct values of the attention maps when a plurality of images are input to the neural network.

The training device according to claim 4, further comprising:
an acquisition unit (210, 220) that acquires a plurality of attention maps corresponding respectively to a plurality of learning images by inputting the plurality of learning images into the neural network; and
a map correction unit (240) that modifies the plurality of attention maps in accordance with a person's correction operation to obtain the teacher attention maps.

The training device according to claim 4 or 5, wherein, among the plurality of teacher attention maps included in the data set for retraining, the teacher attention maps corresponding to learning images misrecognized by the neural network before the retraining outnumber the teacher attention maps corresponding to learning images correctly recognized by the neural network before the retraining.

The training device according to any one of claims 4 to 6, wherein the neural network generates a recognition result of the image based on the image and the attention map.

The training device according to any one of claims 4 to 7, wherein
the neural network includes a feature extraction unit (11), an attention unit (12), a synthesis unit (13), and a cognitive unit (14),
the feature extraction unit includes a plurality of convolution layers and generates a feature map (52) including features of the image by propagating information of the image through the plurality of convolution layers,
the attention unit generates the attention map based on the feature map,
the synthesis unit combines the feature map and the attention map to generate a composite map (54),
the cognitive unit generates the recognition result based on the composite map, and
the training unit retrains the attention unit without retraining the feature extraction unit.