JP2020008896A

JP2020008896A - Image identification apparatus, image identification method and program

Info

Publication number: JP2020008896A
Application number: JP2018126346A
Authority: JP
Inventors: Kazuhisa Matsunaga (松永 和久)
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2018-07-02
Filing date: 2018-07-02
Publication date: 2020-01-16
Anticipated expiration: 2038-07-02
Also published as: JP7135504B2

Abstract

To improve the accuracy of image identification by a convolutional neural network (CNN). SOLUTION: An image identification apparatus 100 includes: a CNN identifier 11 that identifies an input image and has an intermediate layer, which is a layer other than an input layer that receives the input image and an output layer that outputs an identification result of the input image; feature map acquisition means that acquires, in the intermediate layer, a feature map for identifying the input image; activation map generation means that generates an activation map visualizing an activation state of the intermediate layer; edit means that refers to the activation map generated by the activation map generation means and edits the feature map acquired by the feature map acquisition means; and identification means that identifies the input image using the feature map edited by the edit means. SELECTED DRAWING: Figure 1

Description

The present invention relates to an image identification device, an image identification method, and a program.

Because convolutional neural networks (CNNs) perform well in image analysis, devices that identify images using a CNN have been developed. Techniques for improving the accuracy of image identification by a CNN are also being developed. For example, Patent Literature 1 discloses a technique that can improve identification accuracy when the network is specialized for a specific identification target.

Patent Literature 1: JP 2014-203135 A

The technique disclosed in Patent Literature 1 improves identification accuracy when specialized for a specific identification target by making it possible to apply error backpropagation, the learning method for CNNs, even when an encoding process using a discontinuous function is executed. However, because such conventional techniques treat the internal processing of the CNN as a black box, there is room for further improvement in identification accuracy.

The present invention has been made to solve the above problem, and an object of the invention is to provide an image identification device, an image identification method, and a program that can improve the accuracy of image identification by a CNN compared with the related art.

In order to achieve the above object, an image identification device of the present invention comprises:
an identifier that has an intermediate layer, which is a layer other than an input layer to which an input image is input and an output layer from which an identification result of the input image is output, and that identifies the input image;
feature map acquisition means for acquiring, in the intermediate layer, a feature map for identifying the input image;
activation map generation means for generating an activation map that visualizes the activation state of the intermediate layer;
editing means for editing the feature map acquired by the feature map acquisition means, with reference to the activation map generated by the activation map generation means; and
identification means for identifying the input image using the feature map edited by the editing means.

According to the present invention, the accuracy of image identification by a CNN can be improved.

FIG. 1 is a diagram illustrating the functional configuration of the image identification device according to Embodiment 1 of the present invention.
FIG. 2 is a diagram outlining the processing of a convolutional neural network (CNN).
FIG. 3 is a diagram illustrating a specific example of the convolution processing and pooling processing of a CNN.
FIG. 4 is a diagram illustrating how the output of a CNN is calculated.
FIG. 5 is a diagram illustrating an activation map generation method using a simple average.
FIG. 6 is a diagram illustrating an activation map generation method using CAM.
FIG. 7 is a diagram illustrating an activation map generation method using Grad-CAM.
FIG. 8 is a diagram illustrating activation maps in a layer near the input side and a layer near the output side of a CNN.
FIG. 9 is a flowchart of the image identification processing according to Embodiment 1.
FIG. 10 is a diagram illustrating the image identification processing according to Embodiment 1 with a specific example.
FIG. 11 is a flowchart of the image identification processing according to Modification 1.
FIG. 12 is a diagram illustrating an example in which an input image and an activation map are displayed superimposed in the image identification processing according to Modification 1.

Hereinafter, an image identification device and the like according to embodiments of the present invention will be described with reference to the drawings. In the drawings, the same or corresponding parts are denoted by the same reference numerals.

(Embodiment 1)
The image identification device 100 according to Embodiment 1 of the present invention identifies an unknown image using a CNN identifier trained with learning images. When identifying an unknown image, the image identification device 100 can improve identification accuracy by deleting, in an intermediate layer of the CNN identifier, the information of areas estimated to be unnecessary for image identification. The image identification device 100 is described below.

As shown in FIG. 1, the image identification device 100 according to Embodiment 1 includes a control unit 10, a storage unit 20, an image input unit 31, an output unit 32, a communication unit 33, and an operation input unit 34.

The control unit 10 includes a CPU (Central Processing Unit) or the like and, by executing programs stored in the storage unit 20, realizes the functions of the units described later (the CNN identifier 11, the unnecessary area acquisition unit 12, and the unnecessary area deletion unit 13).

The storage unit 20 includes a ROM (Read Only Memory), a RAM (Random Access Memory), and the like, and stores the programs executed by the CPU of the control unit 10 and the data they require.

The image input unit 31 is a device for inputting image data for learning or (unknown) image data to be identified. The control unit 10 acquires image data via the image input unit 31. Any device can be used as the image input unit 31 as long as the control unit 10 can acquire image data through it. For example, when image data is stored in the storage unit 20 and the control unit 10 acquires it by reading the storage unit 20, the storage unit 20 also serves as the image input unit 31. When the control unit 10 acquires image data from an external server or the like via the communication unit 33, the communication unit 33 also serves as the image input unit 31.

The output unit 32 is a device through which the control unit 10 outputs the result of identifying an image input from the image input unit 31, an activation map described later, and the like. For example, the output unit 32 is a liquid crystal display or an organic EL (Electro-Luminescence) display. The image identification device 100 may include such a display as the output unit 32, or may include the output unit 32 as an interface for connecting an external display. When the output unit 32 is an interface, the image identification device 100 displays identification results and the like on an external display connected via the output unit 32. The output unit 32 functions as output means.

The communication unit 33 is a device (such as a network interface) for transmitting and receiving data to and from other external devices (for example, a server that stores a database of image data). The control unit 10 can acquire image data via the communication unit 33.

The operation input unit 34 is a device that receives a user's operation input to the image identification device 100, such as a keyboard, a mouse, or a touch panel. The image identification device 100 receives instructions and the like from the user via the operation input unit 34. The operation input unit 34 functions as operation input means.

Next, the functions of the control unit 10 will be described. The control unit 10 realizes the functions of the CNN identifier 11, the unnecessary area acquisition unit 12, and the unnecessary area deletion unit 13.

The CNN identifier 11 is an image identifier based on a convolutional neural network (CNN). When the control unit 10 executes a program that implements a CNN-based identifier, the control unit 10 also functions as the CNN identifier 11. The CNN identifier 11 has an input layer to which an input image is input via the image input unit 31, an output layer from which the identification result of the input image is output, and intermediate layers, which are the layers other than the input layer and the output layer. It outputs the result of identifying the input image from the output layer. An outline of image identification by a CNN is given later.

The unnecessary area acquisition unit 12 acquires, in an intermediate layer of the CNN of the CNN identifier 11, an unnecessary area, that is, an area estimated to be better left unused when identifying the input image. The unnecessary area acquisition unit 12 functions as unnecessary area acquisition means.

The unnecessary area deletion unit 13 deletes the information of the unnecessary area acquired by the unnecessary area acquisition unit 12 from the intermediate layer of the CNN of the CNN identifier 11. The unnecessary area deletion unit 13 functions as unnecessary area deletion means. Because the process of "acquiring an area to be deleted and deleting the information of that area" can be regarded as a kind of editing, the unnecessary area acquisition unit 12 and the unnecessary area deletion unit 13 together constitute the editing means.

The functional configuration of the image identification device 100 has been described above. Next, an outline of the CNN will be given. A CNN is a neural network that mimics the behavior of nerve cells in the human visual cortex, and its prototype is the neocognitron. Unlike a general feedforward neural network, a CNN includes not only fully connected layers but also convolution layers and pooling layers as intermediate layers, and the intermediate layers extract the features of the input image. In the output layer, the identification result of the input image is expressed probabilistically. A typical outline of N-class identification by a CNN is described with reference to FIG. 2 and other figures.

As shown in FIG. 2, N-class identification by a CNN applies convolution processing (scanning with filters) and pooling processing (scanning with windows) to the input image 111, computes feature maps of gradually decreasing size, and finally obtains the output 118. The layer that stores the input image 111 is also called the input layer, and the layer that stores the output 118 is also called the output layer. In the example shown in FIG. 2, the filters 121, 123, 124, and 125 for convolution processing and the windows 122 and 126 for pooling processing are scanned over the input image 111 with stride 2 both vertically and horizontally, so that feature maps of gradually decreasing size are computed and the output 118 is finally obtained. "Scanning with stride 2" means scanning while skipping every other pixel or feature map element.

A weighting coefficient is assigned to each element of the filters 121, 123, 124, and 125. Scanning a filter in the planar direction over an input image or feature map with the same number of channels outputs a scalar inner-product result at each point of interest, producing a new feature map. Applying multiple (n) filters yields that many (n channels of) feature maps. Each scan with stride 2 halves the size of the feature map both vertically and horizontally. As a result, feature extraction becomes more global in later stages, because the filter size grows relative to the feature map size.

When an input image or feature map is scanned with a filter larger than 1×1 and the filter area is not allowed to protrude beyond the input image or feature map, the points of interest must lie inside its edges, so the output feature map becomes smaller than the original input image or feature map. To avoid this, zero data or the like is padded around the outside of the input image or feature map by the amount required for the filter size (for example, three elements for a 7×7 filter). This allows the outermost points of the input image or feature map to be used as points of interest as well.
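The effect of padding and stride on the output size can be checked with a short calculation. The following is a minimal sketch: the function name `conv_output_size` and the floor-based formula are assumptions for illustration (a common convention in CNN libraries) and are not taken from the patent text.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Number of filter positions along one axis (common floor-based convention)."""
    return (size + 2 * padding - kernel) // stride + 1


# Without padding, a 7x7 filter shrinks a 224-element axis even at stride 1.
print(conv_output_size(224, kernel=7, stride=1, padding=0))  # 218
# Padding 3 elements on each side lets every element be a point of interest.
print(conv_output_size(224, kernel=7, stride=1, padding=3))  # 224
# With stride 2 the size is halved.
print(conv_output_size(224, kernel=7, stride=2, padding=3))  # 112
```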

The filter output value (a scalar) is usually passed through the activation function ReLU (Rectified Linear Unit: y = max(x, 0)), which sets negative values to 0. In recent years, however, it has become common to perform Batch Normalization (processing that normalizes the filter output values to a mean of 0 and a variance of 1) before applying ReLU. Batch Normalization makes it possible to correct bias in the activity and has effects such as accelerating learning. The value of each element of a feature map is also called its activity, and an increase in that value is called activation.
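The two operations described above can be sketched as follows. This is a simplified illustration, not the patent's implementation: the function names are hypothetical, and `normalize` only standardizes a single list of values to mean 0 and variance 1. Batch Normalization as used in practice computes its statistics per channel over a mini-batch and adds a learnable scale and shift, which are omitted here.

```python
import math


def relu(x: float) -> float:
    """ReLU: y = max(x, 0), so negative values become 0."""
    return max(x, 0.0)


def normalize(values: list[float], eps: float = 1e-5) -> list[float]:
    """Standardize values to mean 0 and variance 1 (simplified, no scale/shift)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


filter_outputs = [-2.0, 0.0, 1.0, 5.0]  # made-up filter output values
print([relu(v) for v in filter_outputs])  # [0.0, 0.0, 1.0, 5.0]

# Normalizing first, then applying ReLU, as described in the text.
print([relu(v) for v in normalize(filter_outputs)])
```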

Convolution processing with a filter and pooling processing with a window are explained with a specific example, with reference to FIG. 3. Here, the input image 110 is an 8×8×1-channel (monochrome) image, the filter 120 is a 3×3×1-channel filter, and the window 131 is a 3×3 window that returns the maximum value within its area. Scanning is performed with stride 2 both vertically and horizontally.

When the filter 120 is scanned with stride 2 over the input image 110 (with zeros padded around the outside of the input image), the feature map 130 is obtained. For example, take the top-left point of the input image 110 as the point of interest. The five (padded) points that protrude to the upper left, the point of interest itself, and the points to its right and below it all have the value 0, so they remain 0 after the filter 120 is applied. The point to the lower right of the point of interest has the value 1, but the lower-right element of the filter 120 is 0. Computing in row-major order (left to right along the top row, then left to right along each following row) gives 0×0+0×1+0×0+0×1+0×(−4)+0×1+0×0+0×1+1×0=0, so the filter output value is 0. The value of the top-left point of the feature map 130 is therefore 0.

Because scanning is performed with stride 2, the next point of interest is the third point from the left in the top row of the input image 110. When the filter 120 is applied at this point, the three (padded) points protruding above it, the point of interest itself, and the points to its left and right all have the value 0, so they remain 0 after the filter is applied. Computing in the same order as above gives 0×0+0×1+0×0+0×1+0×(−4)+0×1+1×0+1×1+1×0=1, so the filter output value is 1. The value of the second point from the left in the top row of the feature map 130 is therefore 1. Similarly, computing the value of the point in the second row from the top and second from the left of the feature map 130 in the same order gives 1×0+1×1+1×0+1×1+2×(−4)+2×1+1×0+2×1+3×0=−2. Applying the activation function ReLU to −2 gives 0, which is why the value is 0 in FIG. 3. (For clarity, Batch Normalization is not performed in FIG. 3.)
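The convolution walked through above can be reproduced with a few lines of code. The sketch below makes several assumptions: the function name `conv2d_relu` is hypothetical, the filter weights are read off the arithmetic in the text (0, 1, 0 / 1, −4, 1 / 0, 1, 0 in row-major order), and the 4×4 input is a made-up stand-in for the 8×8 image of FIG. 3, chosen only so that the three values worked out in the text (0, 1, and −2 clipped to 0) appear.

```python
def conv2d_relu(image, kernel, stride=2):
    """Single-channel convolution with zero padding, followed by ReLU."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2  # the point of interest sits at the filter centre

    def pixel(r, c):
        # Positions outside the image are treated as zero padding.
        return image[r][c] if 0 <= r < h and 0 <= c < w else 0

    out = []
    for r in range(0, h, stride):          # points of interest, every other row
        row = []
        for c in range(0, w, stride):      # and every other column
            total = sum(
                pixel(r - pad + i, c - pad + j) * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(max(total, 0))      # ReLU
        out.append(row)
    return out


# Hypothetical 4x4 input (not the actual image of FIG. 3).
image = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 2, 2],
    [0, 1, 2, 3],
]
# Filter weights as they appear in the worked arithmetic of the text.
kernel = [
    [0, 1, 0],
    [1, -4, 1],
    [0, 1, 0],
]
print(conv2d_relu(image, kernel))  # [[0, 1], [1, 0]]
```

The lower-right output is the −2 case from the text: the raw sum is 1+1−8+2+2=−2, and ReLU clips it to 0.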

Next, a specific example of pooling processing is described, also with reference to FIG. 3. When the Max Pooling window 131 is scanned with stride 2 over the input image 110, the feature map 132 is obtained. For example, take the top-left point of the input image 110 as the point of interest. The five (padded) points protruding to the upper left, the point of interest itself, and the points to its right and below it have the value 0, and the point to its lower right has the value 1. The maximum of these nine points is 1, so the window output value is 1, and the value of the top-left point of the feature map 132 is 1. The other points are obtained in the same way. Although not illustrated in FIG. 3, the Average Pooling window 126 in FIG. 2 outputs the average of the scalar values of the 49 points within its 7×7 window area.
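Max pooling with a 3×3 window and stride 2 can be sketched in the same style. The function name `max_pool` and the 4×4 input are assumptions for illustration; the input is a made-up stand-in for the image of FIG. 3, arranged so that the top-left window contains the values described in the text (eight zeros and a single 1).

```python
def max_pool(image, window=3, stride=2):
    """Max pooling over a zero-padded single-channel image."""
    h, w = len(image), len(image[0])
    pad = window // 2

    def pixel(r, c):
        # Positions outside the image are treated as zero padding.
        return image[r][c] if 0 <= r < h and 0 <= c < w else 0

    return [
        [
            max(
                pixel(r - pad + i, c - pad + j)
                for i in range(window)
                for j in range(window)
            )
            for c in range(0, w, stride)
        ]
        for r in range(0, h, stride)
    ]


# Hypothetical 4x4 input (not the actual image of FIG. 3).
image = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 2, 2],
    [0, 1, 2, 3],
]
print(max_pool(image))  # [[1, 1], [1, 3]]
```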

Returning to FIG. 2, the final intermediate layer of the CNN (feature map 117) and the output layer (output 118) are connected by the fully connected connection 127, and weighted addition is performed as in an ordinary neural network. Because it is connected to the output layer by the fully connected connection 127, the final intermediate layer of the CNN is also called a fully connected layer. In this example, N-class identification is performed, so the output 118 has N elements (or units), and the magnitude of each element's value expresses the magnitude of the inferred probability.

In a CNN, the weighting coefficients assigned to the connections of the fully connected connection 127 and the weighting coefficients of the filters 121, 123, 124, and 125 can be obtained using learning data prepared in advance. Specifically, learning data is input as an input image, the forward propagation described later is performed, the difference (error) between the output result and the correct answer (the correct identification result for the input learning data) is computed, and the weighting coefficients are updated in the direction that reduces the error using error backpropagation. Repeating this operation while lowering the learning rate (the update amount of the weighting coefficients in error backpropagation) makes the values of the weighting coefficients converge.
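The update loop described above (compute the error, move the weight in the direction that reduces it, and repeat while lowering the learning rate) can be illustrated on a toy problem. The sketch below is deliberately not backpropagation through a CNN: it fits a single weight `w` so that `w * x` approximates a target, using a squared error. The function name, the decay factor, and the sample data are all assumptions for illustration.

```python
def train_weight(samples, w=0.0, lr=0.1, decay=0.9, epochs=50):
    """Gradient descent on one weight with a decaying learning rate."""
    for _ in range(epochs):
        for x, target in samples:
            error = w * x - target      # output minus correct answer
            grad = 2 * error * x        # derivative of error**2 with respect to w
            w -= lr * grad              # update in the error-reducing direction
        lr *= decay                     # lower the learning rate each pass
    return w


# Made-up learning data generated by the rule target = 2 * x.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
print(train_weight(samples))  # converges close to 2.0
```

In a real CNN the same kind of update is applied to every filter and connection weight, with the gradients obtained by propagating the output error backwards through the layers.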

After the weighting coefficients of the CNN have been learned from the learning data, forward-propagating unknown image data as the input image yields a final output that serves as the inference value for identifying the input image.

A specific example of forward propagation is described with reference to FIG. 2. In the example of FIG. 2, the input image 111 is a square of 224×224 pixels, and each pixel consists of three channels, RGB (Red, Green, Blue). The value of each channel of each pixel is obtained by taking the usual unsigned 8-bit integer pixel value (0 to 255) and subtracting the average value of the image database, which converts it to a signed representation centered on 0.
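The preprocessing described above amounts to a per-channel subtraction. In the minimal sketch below, the function name `center_pixel` is hypothetical and the mean values are made-up placeholders; the patent does not state the actual averages of its image database.

```python
def center_pixel(rgb, dataset_mean):
    """Subtract the per-channel dataset mean from an 8-bit RGB pixel."""
    return tuple(value - mean for value, mean in zip(rgb, dataset_mean))


# Placeholder per-channel means (assumed values, for illustration only).
DATASET_MEAN = (124, 116, 104)

print(center_pixel((255, 116, 0), DATASET_MEAN))  # (131, 0, -104)
```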

Scanning the input image 111 with the filter 121, whose size is 7×7×3 channels, with stride 2 in both the vertical and horizontal directions yields the feature map 112. In the explanation of FIG. 3 above, both the input image 110 and the filter 120 had one channel. Here both the input image 111 and the filter 121 have three channels, so the operation explained for FIG. 3 is performed on each channel, and the value of each point of the feature map 112 is obtained by applying the activation function ReLU to the sum of the three resulting scalar values. Because 64 filters 121 are provided, 64 feature maps 112 (channels) are obtained. The feature map 112 is the result of scanning the 224×224-pixel input image 111 with stride 2 both vertically and horizontally, so it is half the size of the input image 111 in each direction (112×112).

Scanning each channel of the feature map 112 with the 3×3 Max Pooling window 122 with stride 2 in both the vertical and horizontal directions yields the feature map 113. As explained for FIG. 3, the Max Pooling window 122 outputs the maximum value within its 3×3 area, so it absorbs small positional fluctuations in the input image. The feature map 113 is the result of scanning the 112×112 feature map 112 with stride 2 both vertically and horizontally, so it is half the size of the feature map 112 in each direction (56×56). Because the Max Pooling window 122 is scanned over each channel of the feature map 112 and its output becomes the data of the same channel of the feature map 113, the number of channels of the feature map 113 is the same as that of the feature map 112, namely 64.

Scanning the feature map 113 with the filter 123, whose size is 3×3×64 channels, with stride 2 in both the vertical and horizontal directions yields the feature map 114 (in the same way as the feature map 112 above). Because 128 filters 123 are provided, 128 feature maps 114 (channels) are obtained. The feature map 114 is the result of scanning the 56×56 feature map 113 with stride 2 both vertically and horizontally, so it is half the size of the feature map 113 in each direction (28×28).

Scanning the feature map 114 with the filter 124, whose size is 3×3×128 channels, with stride 2 in both the vertical and horizontal directions yields the feature map 115 (in the same way as the feature maps 112 and 114 above). Because 256 filters 124 are provided, 256 feature maps 115 (channels) are obtained. The feature map 115 is the result of scanning the 28×28 feature map 114 with stride 2 both vertically and horizontally, so it is half the size of the feature map 114 in each direction (14×14).

Scanning the feature map 115 with the filter 125, whose size is 3×3×256 channels, with stride 2 in both the vertical and horizontal directions yields the feature map 116 (in the same way as the feature maps 112, 114, and 115 above). Because 512 filters 125 are provided, 512 feature maps 116 (channels) are obtained. The feature map 116 is the result of scanning the 14×14 feature map 115 with stride 2 both vertically and horizontally, so it is half the size of the feature map 115 in each direction (7×7). Reducing the size of the feature maps in this way widens the relative area covered by a filter during convolution, so that more global and abstract features can be captured. Increasing the channel size (the number of filters) makes it possible to handle the diversification of features that arises from combinations of local features.
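The sizes and channel counts quoted for feature maps 112 to 116 can be traced with a short loop. The sketch below takes the filter sizes and filter counts from the text; the function name `trace_shapes`, the floor-based size formula, and the assumption that each stage pads by half the kernel size are illustrative choices, not statements from the patent.

```python
def trace_shapes(size=224, channels=3):
    """Trace (name, spatial size, channels) through the stride-2 stages of FIG. 2."""
    stages = [
        ("feature map 112 (filter 121, 7x7)", 7, 64),
        ("feature map 113 (window 122, 3x3)", 3, None),  # pooling keeps channels
        ("feature map 114 (filter 123, 3x3)", 3, 128),
        ("feature map 115 (filter 124, 3x3)", 3, 256),
        ("feature map 116 (filter 125, 3x3)", 3, 512),
    ]
    shapes = [("input image 111", size, channels)]
    for name, kernel, out_channels in stages:
        padding = kernel // 2
        size = (size + 2 * padding - kernel) // 2 + 1  # stride 2 halves the size
        if out_channels is not None:
            channels = out_channels                    # one channel per filter
        shapes.append((name, size, channels))
    return shapes


for name, size, channels in trace_shapes():
    print(f"{name}: {size}x{size}x{channels}")
```

Under these assumptions the loop reproduces the sequence in the text: 224 → 112 → 56 → 28 → 14 → 7 for the spatial size, and 3 → 64 → 64 → 128 → 256 → 512 for the channel count.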

Scanning each channel of the feature map 116 with the 7×7 Average Pooling window 126 yields the feature map 117. The Average Pooling window 126 outputs the average value within its 7×7 area, and because each channel of the feature map 116 is 7×7 in size, the feature map 117 consists of the average of all elements of each channel. The feature map 117 can therefore be treated as a 512-dimensional vector (feature vector) made up of 512 scalar values, one per channel. Because this operation takes the average over the whole of each channel, it is called Global Average Pooling.
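Global Average Pooling reduces each channel to a single number. A minimal sketch follows, assuming a feature map stored as a list of channels, each a 2-D list; the function name and the tiny 2×2 example are hypothetical, whereas the patent's feature map 116 has 512 channels of size 7×7.

```python
def global_average_pooling(feature_map):
    """Return one average per channel; feature_map is a list of 2-D lists."""
    vector = []
    for channel in feature_map:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        vector.append(total / count)
    return vector


# Two made-up 2x2 channels stand in for the 512 channels of 7x7 in the text.
feature_map = [
    [[1, 2], [3, 4]],
    [[0, 0], [0, 8]],
]
print(global_average_pooling(feature_map))  # [2.5, 2.0]
```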

特徴マップ１１７を、出力１１８と全結合接続１２７で結び、各結合に与えられた重み係数に従って出力１１８のＮ個の素子の値が得られる。より詳細には、図４に示すように、特徴マップ１１７（特徴ベクトル）の各要素をｈ_１〜ｈ_５１２で表すと、出力１１８の各要素に対応する値Ａ_ｉを各要素ｈ_ｊに重み係数ｗ_ｉ，ｊを掛けて総和を取ることによって求め、それをＳｏｆｔｍａｘ処理（図４のｙ_ｎの算出に使っているＳｏｆｔｍａｘ関数を適用する処理。出力１１８の各素子の出力値範囲を［０，１］に正規化し、確率的表現にする。ここで、ｅｘｐ関数は、負値を含めた値を０からの単調増加の値に変換する働きをしている。）によって正規化した値が出力１１８のＮ個の値ｙ_１〜ｙ_Ｎとなる。正規化されたＮ個の各値ｙ_１〜ｙ_Ｎが該当クラスの推論値（確率的表現）となる。簡単には、ｙ_１〜ｙ_Ｎの中で最大の値を持つ素子がｙ_iであるとすると、入力画像はクラスｉとして識別されたことになる。 The feature map 117 is connected to the output 118 by the fully connected connection 127, and the values of the N elements of the output 118 are obtained according to the weighting coefficients assigned to the individual connections. More specifically, as shown in FIG. 4, when the elements of the feature map 117 (feature vector) are denoted h_1 to h_512, the value A_i corresponding to each element of the output 118 is obtained by multiplying each element h_j by the weighting coefficient w_i,j and summing the products. Normalizing these values by Softmax processing yields the N values y_1 to y_N of the output 118. (Softmax processing applies the Softmax function used to calculate y_n in FIG. 4; it normalizes the output range of each element of the output 118 to [0, 1] to give a probabilistic representation. Here, the exp function serves to convert values, including negative ones, into values that increase monotonically from 0.) Each of the N normalized values y_1 to y_N is the inferred value (probabilistic representation) for the corresponding class. Put simply, if the element with the largest value among y_1 to y_N is y_i, the input image has been identified as class i.
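
The weighted sum and Softmax normalization described above can be sketched as follows. This is an illustration under assumptions: the weight matrix shape (N, 512), the absence of a bias term, the class count N = 3, and all names are hypothetical, and subtracting the maximum before `exp` is a common numerical-stability step that the text does not mention.

```python
import numpy as np

def softmax(a: np.ndarray) -> np.ndarray:
    """Normalize scores to [0, 1] so that they sum to 1."""
    e = np.exp(a - a.max())  # max subtraction: stability only, result unchanged
    return e / e.sum()

def classify(h: np.ndarray, w: np.ndarray):
    """h: feature vector (512,); w: weights (N, 512). Returns (y, class index)."""
    a = w @ h                    # A_i = sum_j w_i,j * h_j
    y = softmax(a)               # y_1 .. y_N
    return y, int(np.argmax(y))  # index is 0-based, unlike the text's 1-based i

rng = np.random.default_rng(0)
h = rng.random(512)
w = rng.standard_normal((3, 512))  # N = 3 classes, chosen arbitrarily
y, predicted = classify(h, w)
```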

以上、ＣＮＮによるＮクラス識別の典型的な処理概要について説明した。以上で説明したＣＮＮの各中間層（各特徴マップ）の状態を解析することによって、ＣＮＮによる識別精度を向上させる方法を検討することができると考えられる。しかし、各特徴マップは多数のチャネルから構成されるため、一見して状態を把握することは困難である。そこで、特徴マップの多数のチャネルをまとめたものを視覚化することによって、ＣＮＮの内部状態を解釈しやすくする試みが行われている。 The outline of the typical process of the N class identification by the CNN has been described above. By analyzing the state of each intermediate layer (each feature map) of the CNN described above, it is considered that a method of improving the identification accuracy by the CNN can be examined. However, since each feature map is composed of a large number of channels, it is difficult to grasp the state at a glance. Therefore, an attempt has been made to make it easy to interpret the internal state of the CNN by visualizing a large number of channels in the feature map.

例えば、特徴マップの［ｘ，ｙ］要素の値（スカラ値）が０なら［ｘ，ｙ］に対応する位置の画素の色を黒にし、［ｘ，ｙ］要素の値が大きくなるにつれて［ｘ，ｙ］に対応する位置の画素の色を黒→紫→青→水色→緑→黄緑→黄色→橙色→赤等のように変化させて表示させることにより、特徴マップの活性化状態（どの部分の要素の値がどの程度大きくなっているか）を可視化することができる。 For example, if the value (scalar value) of the [x, y] element of the feature map is 0, the pixel at the position corresponding to [x, y] is colored black, and as the value of the [x, y] element increases, the color of that pixel is changed in the order black → purple → blue → light blue → green → yellow-green → yellow → orange → red. Displaying the feature map in this way visualizes its activation state (which elements have large values, and how large they are).

ただし、以下に説明する図５以降の図面では特徴マップの活性化状態を白黒で表すこととする。これらの図面では、特徴マップの［ｘ，ｙ］要素の値（スカラ値）が０なら［ｘ，ｙ］に対応する位置の四角の色を白にし、［ｘ，ｙ］要素の値が大きくなるにつれて［ｘ，ｙ］に対応する位置の四角の色をより黒っぽく見えるような網掛けで示すことにより、特徴マップの活性化状態を可視化している。 However, in FIG. 5 and the subsequent drawings described below, the activation state of the feature map is represented in black and white. In these drawings, if the value (scalar value) of the [x, y] element of the feature map is 0, the square at the position corresponding to [x, y] is white, and as the value of the [x, y] element increases, that square is drawn with hatching that looks progressively darker. The activation state of the feature map is visualized in this way.

このように、特徴マップの活性化状態を可視化したものを、ここでは活性化マップと呼ぶことにする。また、特徴マップで要素（スカラ値）が大きい値になっている領域を、活性化した領域（活性化領域）と呼ぶことにする。活性化マップの生成方法には種々の方法があるが、ここでは代表的な方法を説明する。 The visualization of the activation state of the feature map is referred to as an activation map here. Further, an area where the element (scalar value) has a large value in the feature map is referred to as an activated area (activated area). There are various methods for generating the activation map. A typical method will be described here.

１つ目の方法は、図５に示すように、特徴マップをチャネル方向に単純に足し合わせたものを活性化マップとする方法である（足し合わせた後にチャネル数で割って、単純平均を取ってもよい）。図５は、猫とウサギが写っている入力画像１１１がＣＮＮ識別器１１に入力された場合に、７×７のサイズの特徴マップ１１６を可視化する例を示している。図５では、特徴マップ１１６を５１２チャネル分足し合わせて、活性化マップ１４０を生成している。図５で、特徴マップ１１６−１は、特徴マップ１１６の１番目のチャネル、特徴マップ１１６−５１２は、特徴マップ１１６の５１２番目のチャネルを示している。なお、図５では、特徴マップ１１６を可視化して活性化マップ１４０を生成しているが、他の特徴マップ（例えば特徴マップ１１２，１１４等）も、同様の方法で活性化マップを生成することができる。この手法では、全クラスの反応が合算された活性化マップが得られるので、活性化状態がどのクラスの反応によるものなのかの区別はつかない。 The first method, shown in FIG. 5, uses as the activation map the simple sum of the feature map along the channel direction (the sum may also be divided by the number of channels to give a simple average). FIG. 5 shows an example of visualizing the 7 × 7 feature map 116 when an input image 111 showing a cat and a rabbit is input to the CNN identifier 11. In FIG. 5, the activation map 140 is generated by summing the feature map 116 over its 512 channels. In FIG. 5, the feature map 116-1 denotes the first channel of the feature map 116, and the feature map 116-512 denotes its 512th channel. Although FIG. 5 generates the activation map 140 by visualizing the feature map 116, activation maps can be generated in the same way for other feature maps (for example, the feature maps 112 and 114). Because this method yields an activation map in which the responses of all classes are summed, it is not possible to tell which class's response is responsible for a given activation.
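
The first method amounts to a sum (or mean) over the channel axis. A minimal sketch, assuming a (channels, height, width) array layout and hypothetical names:

```python
import numpy as np

def channel_sum_activation_map(feature_map: np.ndarray, average: bool = False) -> np.ndarray:
    """Collapse a (channels, H, W) feature map into one (H, W) activation map."""
    total = feature_map.sum(axis=0)
    # Optionally divide by the channel count to get the simple average.
    return total / feature_map.shape[0] if average else total

rng = np.random.default_rng(0)
fm116 = rng.random((512, 7, 7))            # stand-in for feature map 116
map140 = channel_sum_activation_map(fm116) # stand-in for activation map 140
map140_avg = channel_sum_activation_map(fm116, average=True)
```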

２つ目の方法は、図６に示すように、特徴マップの各チャネルを、活性化対象クラス（ｉ）の出力（ｙ_ｉ）を算出する際の全結合接続１２７の重み係数（ｗ_ｉ，ｊ）で重み付けして加算したものを、そのクラスｉの活性化マップとする方法である。この方法は、ＣＡＭ（ＣｌａｓｓＡｃｔｉｖａｔｉｏｎＭａｐｐｉｎｇ）と呼ばれている。図６は、猫とウサギが写っている入力画像１１１がＣＮＮ識別器１１に入力された場合に、猫に対応するクラスを対象クラスとして、７×７のサイズの特徴マップ１１６を可視化する例を示している。 The second method, shown in FIG. 6, weights each channel of the feature map by the weighting coefficient (w_i,j) of the fully connected connection 127 that is used to calculate the output (y_i) of the activation target class (i), and uses the sum of the weighted channels as the activation map for class i. This method is called CAM (Class Activation Mapping). FIG. 6 shows an example of visualizing the 7 × 7 feature map 116, with the class corresponding to the cat as the target class, when an input image 111 showing a cat and a rabbit is input to the CNN identifier 11.

図６では、全結合接続１２７のクラスｉの出力を得る重み係数ｗ_ｉ，ｊ（ｊはチャネル番号）を、特徴マップ１１６のｊ番目のチャネルに乗算し、これを５１２チャネル分足し合わせて、クラスｉの活性化マップ１４１を生成している。図６で、特徴マップ１１６−１は、特徴マップ１１６の１番目のチャネル、特徴マップ１１６−５１２は、特徴マップ１１６の５１２番目のチャネルを示している。この手法では、クラス別に活性化マップが得られるが、計算原理上、出力側の特徴マップで、かつその特徴マップがＧｌｏｂａｌＡｖｅｒａｇｅＰｏｏｌｉｎｇで１次元特徴ベクトルとなり、さらにそのベクトルが出力に全結合接続される、という構成においてのみ適用が可能となる。 In FIG. 6, the j-th channel of the feature map 116 is multiplied by the weighting coefficient w_i,j (where j is the channel number) that produces the class i output of the fully connected connection 127, and the products are summed over the 512 channels to generate the activation map 141 for class i. In FIG. 6, the feature map 116-1 denotes the first channel of the feature map 116, and the feature map 116-512 denotes its 512th channel. Although this method yields an activation map for each class, its computational principle means that it can be applied only to the feature map on the output side, and only in a configuration in which that feature map is reduced to a one-dimensional feature vector by Global Average Pooling and that vector is then fully connected to the output.
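
CAM as described for FIG. 6 is a weighted sum of channels using one row of the fully connected weights. The sketch below assumes a weight matrix of shape (N, 512) with no bias term; the function and variable names are hypothetical.

```python
import numpy as np

def class_activation_map(feature_map: np.ndarray, fc_weights: np.ndarray,
                         class_index: int) -> np.ndarray:
    """feature_map: (K, H, W); fc_weights: (N, K). Returns the (H, W) map for one class."""
    # Sum over channels j of w_i,j * (channel j of the feature map).
    return np.tensordot(fc_weights[class_index], feature_map, axes=1)

rng = np.random.default_rng(0)
fm116 = rng.random((512, 7, 7))
w127 = rng.standard_normal((3, 512))      # N = 3 classes, chosen arbitrarily
map141 = class_activation_map(fm116, w127, class_index=1)
```

Under these assumptions the spatial mean of the map equals the pre-Softmax value A_i for that class, which illustrates why the method depends on the Global Average Pooling plus fully connected configuration.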

３つ目の方法は、図７に示すように、活性化対象クラス出力ｙ_ｃのみを逆伝搬させ、勾配の大きさを対象クラスへの出力の寄与度と見做すことによって求めた、特徴マップの各チャネルの重み係数（α_ｃ，ｋ）で重み付けして加算したものを、クラスｃの活性化マップとする方法である。この方法は、Ｇｒａｄ−ＣＡＭ（Ｇｒａｄｉｅｎｔ−ｗｅｉｇｈｔｅｄＣｌａｓｓＡｃｔｉｖａｔｉｏｎＭａｐｐｉｎｇ）と呼ばれている。図７は、猫とウサギが写っている入力画像１１１がＣＮＮ識別器１１に入力された場合に、猫に対応するクラスを対象クラスとして、７×７のサイズの特徴マップ１１６を可視化する例を示している。特徴マップ１１６は５１２チャネルあるので、ｋ番目のチャネルに対応する特徴マップ１１６をＭ_ｋで表すこととする。そして、Ｍ_ｋは７×７のサイズなので、Ｍ_ｋの［ｉ，ｊ］の要素（ｉ及びｊは１以上７以下の整数）をＭ_ｋ［ｉ，ｊ］で表すこととする。 The third method, shown in FIG. 7, back-propagates only the activation target class output y_c and regards the magnitude of the gradient as the degree of contribution to the output for the target class. The weighting coefficient (α_c,k) of each channel of the feature map is determined in this way, and the channels are weighted by these coefficients and summed to give the activation map for class c. This method is called Grad-CAM (Gradient-weighted Class Activation Mapping). FIG. 7 shows an example of visualizing the 7 × 7 feature map 116, with the class corresponding to the cat as the target class, when an input image 111 showing a cat and a rabbit is input to the CNN identifier 11. Since the feature map 116 has 512 channels, the feature map 116 corresponding to the k-th channel is denoted M_k. Since M_k has a size of 7 × 7, the [i, j] element of M_k (where i and j are integers from 1 to 7) is denoted M_k[i, j].

Ｇｒａｄ−ＣＡＭでは、まず入力画像１１１をＣＮＮ識別器で普通に推論する。そして得られた出力１１８のうち、活性化対象クラス（ｃ）の出力（ｙ_ｃ）のみを１、他（ｙ_ｎ（ｎ≠ｃ））を０にして、勾配（∂ｙ_ｃ／∂Ｍ_ｋ［ｉ，ｊ］）を逆伝播計算する。そして、各チャネルで勾配の平均を取り、重み係数α_ｃ，ｋとする。なお、図７のα_ｃ，ｋの式中のＺは、特徴マップ１１６の各チャネルの要素数であり、ここではＺ＝７×７＝４９となる。そして、特徴マップの値Ｍ_ｋに、重み係数α_ｃ，ｋを掛けて足し合わせることにより、対象クラスｋの活性化マップ１４２を生成している。この手法では、ＣＡＭと異なり、特徴マップ１１６のみならず、他の特徴マップ（例えば特徴マップ１１２，１１４等）も、同様の方法で活性化マップを生成することができる。 In Grad-CAM, the input image 111 is first inferred by the CNN identifier in the usual way. Then, of the resulting output 118, only the output (y_c) of the activation target class (c) is set to 1 and the others (y_n (n ≠ c)) are set to 0, and the gradient (∂y_c/∂M_k[i, j]) is computed by backpropagation. The gradients are averaged within each channel to give the weighting coefficient α_c,k. Z in the expression for α_c,k in FIG. 7 is the number of elements in each channel of the feature map 116; here Z = 7 × 7 = 49. The activation map 142 for the target class c is then generated by multiplying the feature map values M_k by the weighting coefficients α_c,k and summing them over the channels. Unlike CAM, this method can generate activation maps in the same way not only for the feature map 116 but also for other feature maps (for example, the feature maps 112 and 114).
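
The weighting step of Grad-CAM can be sketched as follows, assuming the gradients ∂y_c/∂M_k[i, j] have already been obtained by backpropagation in some framework and are passed in as an array with the same shape as the feature map. The function name and the synthetic gradient used in the example are hypothetical.

```python
import numpy as np

def grad_cam_map(feature_map: np.ndarray, grads: np.ndarray) -> np.ndarray:
    """feature_map, grads: (K, H, W). Returns the (H, W) map for the target class."""
    # alpha_c,k = (1/Z) * sum over i, j of the gradient in channel k
    alpha = grads.mean(axis=(1, 2))
    # Weighted sum over channels k of alpha_c,k * M_k
    return np.tensordot(alpha, feature_map, axes=1)

rng = np.random.default_rng(0)
fm116 = rng.random((512, 7, 7))
w_c = rng.standard_normal(512)
# Synthetic gradient for illustration: constant w_c[k] / Z within channel k,
# which is what a Global Average Pooling + linear layer gives for the
# pre-Softmax score.
grads = np.broadcast_to((w_c / 49)[:, None, None], (512, 7, 7))
map142 = grad_cam_map(fm116, grads)
```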

上述のＧｒａｄ−ＣＡＭでは、各チャネルで特徴マップ内の勾配の平均を取ることによって各クラスに対するチャネルの寄与を明瞭にすることができる。しかし、クラスによる違いを明瞭にする必要がないのであれば、勾配の平均を取らずに正値に限ってＣＮＮを入力層まで逆伝播させて画素単位の寄与を可視化することもできる。この方法は、ＧｕｉｄｅｄＢａｃｋｐｒｏｐａｇａｔｉｏｎと呼ばれる（図７の出力１１８からＧｕｉｄｅｄＢａｃｋｐｒｏｐａｇａｔｉｏｎの画像１４３に向かう矢印で示される）。 In the above-described Grad-CAM, the contribution of the channel to each class can be made clear by averaging the gradient in the feature map for each channel. However, if it is not necessary to clarify the difference between the classes, it is also possible to visualize the contribution in pixel units by backpropagating the CNN to the input layer only for positive values without taking the average of the gradient. This method is referred to as Guided Backpropagation (indicated by the arrow from the output 118 in FIG. 7 to the Guided Backpropagation image 143).

ＧｕｉｄｅｄＢａｃｋｐｒｏｐａｇａｔｉｏｎでは、入力画像と同様の水準の解像度が得られる反面、クラスによる違いが明瞭ではない。そこで、ＧｕｉｄｅｄＢａｃｋｐｒｏｐａｇａｔｉｏｎの結果にＧｒａｄ−ＣＡＭの出力を重ねることで特徴マップを可視化する方法もある。この方法はＧｕｉｄｅｄＧｒａｄ−ＣＡＭと呼ばれ、クラス毎の特徴箇所を明瞭に区別すると同時に、高い解像度で特徴箇所を可視化できる。図７では、ＧｕｉｄｅｄＢａｃｋｐｒｏｐａｇａｔｉｏｎの画像１４３とＧｒａｄ−ＣＡＭによる活性化マップ１４２とを合成して、ＧｕｉｄｅｄＧｒａｄ−ＣＡＭによる活性化マップ１４４が得られる様子を示している。 In Guided Backpropagation, the same level of resolution as the input image can be obtained, but the difference between the classes is not clear. Therefore, there is a method of visualizing the feature map by superimposing the output of the Grad-CAM on the result of Guided Backpropagation. This method is called Guided Grad-CAM, and it is possible to clearly distinguish feature points for each class and to visualize feature points with high resolution. FIG. 7 shows a state in which an image 143 of Guided Backpropagation is combined with an activation map 142 based on Grad-CAM to obtain an activation map 144 based on Guided Grad-CAM.

以上、活性化マップ生成方法を説明した。各中間層で活性化マップを生成すると、図８に示すように、ＣＮＮの入力側に近い層（特徴マップ１１２を可視化した活性化マップ１４５）ではエッジ抽出が行われ、ＣＮＮの出力側に近づくにつれて（特徴マップ１１４を可視化した活性化マップ１４６や、特徴マップ１１６を可視化した活性化マップ１４７）、より複雑な大きな領域の特徴が抽出されることがわかる。 The activation map generation method has been described above. When the activation map is generated in each intermediate layer, as shown in FIG. 8, in a layer near the input side of the CNN (an activation map 145 in which the feature map 112 is visualized), edge extraction is performed, and the layer approaches the output side of the CNN. (The activation map 146 in which the feature map 114 is visualized and the activation map 147 in which the feature map 116 is visualized), it can be understood that the feature of a more complicated large area is extracted.

図８の各活性化マップ（特徴マップ１１２，１１４，１１６をそれぞれ可視化した活性化マップ１４５，１４６，１４７）は、ウサギ及び猫が写っている入力画像１１１に対して、Ｇｒａｄ−ＣＡＭを用いて、猫を識別するクラスを活性化対象クラスとした時の各層の活性化領域を示している（ただし、特徴マップは出力側に近づくにつれてサイズが小さくなっていくので、図８の各活性化マップ１４５，１４６，１４７は、それぞれのサイズを入力画像サイズにリサイズして表示している）。そうすると、出力側に最も近い層（特徴マップ１１６を可視化した活性化マップ１４７）では、入力画像１１１における、識別するクラス（猫）が写っている位置（左下）に対応する部分が活性化している（特徴マップのその部分が正の大きい値になっている）ことが確認できる。 Each of the activation maps (activation maps 145, 146, and 147 obtained by visualizing the feature maps 112, 114, and 116) of FIG. 8 uses the Grad-CAM for the input image 111 in which a rabbit and a cat are captured. 8 shows an activation area of each layer when a class for identifying a cat is set as an activation target class. (However, since the size of the feature map decreases as approaching the output side, each activation map in FIG. 145, 146, and 147 resize and display each size to the input image size). Then, in the layer closest to the output side (the activation map 147 in which the feature map 116 is visualized), the portion corresponding to the position (lower left) where the class (cat) to be identified is shown in the input image 111 is activated. (That portion of the feature map has a large positive value).
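
The resizing mentioned above can be illustrated with nearest-neighbor enlargement. This is only a sketch: the text does not say which interpolation is used, and the input size of 224 × 224 (scale factor 32) is an assumption made for the example.

```python
import numpy as np

def upsample_nearest(act_map: np.ndarray, scale: int) -> np.ndarray:
    """Enlarge an (H, W) map by an integer factor, repeating each element."""
    return np.kron(act_map, np.ones((scale, scale)))

rng = np.random.default_rng(0)
small = rng.random((7, 7))          # stand-in for a 7 x 7 activation map
big = upsample_nearest(small, 32)   # 224 x 224 under the assumed input size
```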

ＣＮＮでは、小領域のフィルタ処理を重ねていくので、平面上の距離が離れた特徴については、後段にならない限り、統合して評価がされない。逆に言うと、中間層で抽出された特徴の局所性は、一定度維持される。そして、ＣＮＮでは、識別に関与しない特徴（負のフィルタ出力）を、活性化関数ＲｅＬＵで非線形に積極的に切り捨てながら、切り捨てられずに残った（識別に関与する）特徴を局所的に統合していく操作が繰り返し行われる。そのため、後段の活性化領域を抽出することにより、入力画像の中におけるＣＮＮ識別器の大局的注目領域を知ることができる。 In a CNN, filtering over small areas is applied repeatedly, so features that are far apart in the image plane are not integrated and evaluated together until a later stage. Put another way, the locality of the features extracted in the intermediate layers is preserved to a certain degree. A CNN also repeatedly discards features that do not contribute to identification (negative filter outputs) non-linearly through the ReLU activation function, while locally integrating the features that remain (those that do contribute to identification). Extracting the activation area of a later stage therefore reveals the global attention area of the CNN identifier within the input image.

次に、画像識別装置１００が行う画像識別処理の内容について、図９を参照して説明する。画像識別処理は、操作入力部３４を介して、ユーザにより、画像識別装置１００に対して画像識別処理の開始が指示されると開始される。 Next, the content of the image identification processing performed by the image identification device 100 will be described with reference to FIG. The image identification process is started when the user instructs the image identification device 100 to start the image identification process via the operation input unit 34.

まず、画像識別装置１００の制御部１０は、大量の学習用画像データにより、ＣＮＮ識別器１１を学習させる（ステップＳ１０１）。ステップＳ１０１は、画像識別処理を開始する前に、予め行っておいてもよい。 First, the control unit 10 of the image identification device 100 causes the CNN identifier 11 to learn from a large amount of learning image data (step S101). Step S101 may be performed before starting the image identification processing.

次に、制御部１０は、画像入力部３１を介して未知の画像をＣＮＮ識別器１１に入力する（ステップＳ１０２）。ステップＳ１０２は画像入力ステップとも呼ばれる。 Next, the control unit 10 inputs an unknown image to the CNN identifier 11 via the image input unit 31 (step S102). Step S102 is also called an image input step.

そして、制御部１０は、ＣＮＮ識別器１１の中間層（例えば、図２における特徴マップ１１６）の活性化領域を抽出して活性化マップを得る（ステップＳ１０３）。ステップＳ１０３は、特徴マップを取得する特徴マップ取得ステップと、取得した特徴マップから活性化マップを生成する活性化マップ生成ステップとを含む。制御部１０は、特徴マップ取得ステップでは特徴マップ取得手段として機能し、活性化マップ生成ステップでは活性化マップ生成手段として機能する。 Then, the control unit 10 extracts the activation area of an intermediate layer (for example, the feature map 116 in FIG. 2) of the CNN identifier 11 to obtain an activation map (step S103). Step S103 includes a feature map acquisition step of acquiring a feature map and an activation map generation step of generating an activation map from the acquired feature map. The control unit 10 functions as feature map acquisition means in the feature map acquisition step and as activation map generation means in the activation map generation step.

ステップＳ１０３では、制御部１０は、例えば、上述したように、特徴マップをチャネル方向に単純平均した活性化マップを算出する。このステップでは、活性化マップ（例えば、図８の活性化マップ１４７に示されるような画像）を入力画像とともにユーザに提示（出力部３２に表示）してもよい。活性化領域抽出手法として、特徴マップのチャネル方向の単純平均ではなく、ＣＡＭ又はＧｒａｄ−ＣＡＭを用いると、クラス毎の特徴マップ（活性化マップ）が得られるが、この場合は、クラス毎の特徴マップ（活性化マップ）を全クラスで平均したものをユーザに提示すればよい。 In step S103, the control unit 10 calculates, for example, an activation map obtained by simply averaging the feature map in the channel direction, as described above. In this step, the activation map (for example, an image such as the activation map 147 in FIG. 8) may be presented to the user together with the input image (displayed on the output unit 32). If CAM or Grad-CAM is used as the activation area extraction method instead of a simple average in the channel direction, a feature map (activation map) is obtained for each class; in that case, the average of the per-class feature maps (activation maps) over all classes may be presented to the user.

次に、制御部１０は、活性化マップ中に、画像識別に使用しない方が良いと推定される不要領域があるか否かを判定する（ステップＳ１０４）。このステップでは制御部１０は、ユーザがステップＳ１０３で提示された活性化マップを確認して得た不要領域の有無の情報を操作入力部３４を介してユーザから取得することによって判定してもよいし、活性化マップの外縁付近（例えば、上下左右ともに、端から１番目と２番目の領域）が活性化されている（スカラ値が大きい値（例えば正の値）になっている）か否かによって不要領域があるか否かを判定してもよい。さらには、またユーザが提供した不要領域情報等を元に作成した機械学習による識別器によって判定してもよい。この場合、この機械学習による識別器は自動編集手段として機能する。 Next, the control unit 10 determines whether or not there is an unnecessary area in the activation map which is estimated not to be used for image identification (step S104). In this step, the control unit 10 may determine by obtaining from the user via the operation input unit 34 information on the presence / absence of an unnecessary area obtained by the user confirming the activation map presented in step S103. Then, whether the vicinity of the outer edge of the activation map (for example, the first and second regions from the end in all directions) is activated (the scalar value is a large value (for example, a positive value)) It may be determined whether or not there is an unnecessary area. Further, the determination may be made by a machine learning-based classifier created based on unnecessary area information or the like provided by the user. In this case, the classifier based on the machine learning functions as an automatic editing unit.
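
The outer-edge criterion in step S104 can be sketched as a border mask combined with a threshold. The border width of 2 and the "positive value" test follow the examples given in the text; the function names and the threshold parameter are hypothetical.

```python
import numpy as np

def border_mask(height: int, width: int, border: int = 2) -> np.ndarray:
    """True for cells within `border` cells of any edge, False in the interior."""
    mask = np.ones((height, width), dtype=bool)
    mask[border:height - border, border:width - border] = False
    return mask

def find_unnecessary_area(act_map: np.ndarray, border: int = 2,
                          threshold: float = 0.0) -> np.ndarray:
    """Boolean (H, W) array marking activated cells near the outer edge."""
    return border_mask(*act_map.shape, border) & (act_map > threshold)

act = np.zeros((7, 7))
act[0, 3] = 0.8   # activated cell on the outer edge
act[3, 3] = 0.9   # activated cell in the center
area = find_unnecessary_area(act)
has_unnecessary_area = bool(area.any())  # corresponds to the Yes/No of step S104
```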

活性化マップ中に、画像識別に使用しない方が良いと推定される不要領域がないなら（ステップＳ１０４；Ｎｏ）、制御部１０は、そのままＣＮＮ識別器１１で画像識別を行い（ステップＳ１０７）、画像識別処理を終了する。 If the activation map contains no unnecessary area that is estimated to be better left unused for image identification (step S104; No), the control unit 10 performs image identification with the CNN identifier 11 as is (step S107) and ends the image identification processing.

活性化マップ中に、画像識別に使用しない方が良いと推定される不要領域があるなら（ステップＳ１０４；Ｙｅｓ）、不要領域取得部１２は、その不要領域（の位置、サイズ、形状等）を取得する（ステップＳ１０５）。ステップＳ１０４でユーザから不要領域の有無の情報を取得している場合は、不要領域取得部１２は、操作入力部３４を介してユーザから不要領域を取得する。ステップＳ１０４で、活性化マップの外縁付近が活性化されていることに基づいて不要領域があると判定した場合は、不要領域取得部１２は、その活性化された外縁付近を不要領域として取得する。ステップＳ１０５は、不要領域取得ステップとも呼ばれる。 If the activation map contains an unnecessary area that is estimated to be better left unused for image identification (step S104; Yes), the unnecessary area acquisition unit 12 acquires that unnecessary area (its position, size, shape, and so on) (step S105). If information on the presence or absence of an unnecessary area was obtained from the user in step S104, the unnecessary area acquisition unit 12 acquires the unnecessary area from the user via the operation input unit 34. If it was determined in step S104 that an unnecessary area exists because the vicinity of the outer edge of the activation map is activated, the unnecessary area acquisition unit 12 acquires the activated vicinity of the outer edge as the unnecessary area. Step S105 is also called an unnecessary area acquisition step.

次に、不要領域削除部１３は、特徴マップの各チャネルにおいて、不要領域取得部１２が取得した不要領域に該当する領域内の要素の値を０にする（ステップＳ１０６）。ステップＳ１０６は、不要領域削除ステップとも呼ばれる。また、「不要領域の有無を判定して、不要領域を取得し、不要領域に該当する領域内の要素の値を０にする」という処理は、編集処理の一種と考えられるため、ステップＳ１０４からステップＳ１０６までの処理を編集ステップとも呼ぶ。また、制御部１０は、この編集ステップにおいて編集手段として機能する。 Next, the unnecessary area deletion unit 13 sets the value of an element in the area corresponding to the unnecessary area acquired by the unnecessary area acquisition unit 12 to 0 in each channel of the feature map (Step S106). Step S106 is also called an unnecessary area deletion step. The process of “determining the presence / absence of an unnecessary area, acquiring the unnecessary area, and setting the value of an element in the area corresponding to the unnecessary area to 0” is considered as a kind of editing processing. The processing up to step S106 is also called an editing step. The control unit 10 also functions as an editing unit in this editing step.
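
Step S106 can be sketched as applying one spatial mask to every channel. The boolean-mask representation of the unnecessary area and the names below are assumptions for illustration; the text specifies only that the elements inside the area are set to 0 in each channel.

```python
import numpy as np

def delete_unnecessary_area(feature_map: np.ndarray, area: np.ndarray) -> np.ndarray:
    """feature_map: (K, H, W); area: boolean (H, W). Returns a copy with the area zeroed."""
    # The (H, W) mask is broadcast across all K channels.
    return np.where(area[None, :, :], 0.0, feature_map)

fm116 = np.ones((512, 7, 7))
area = np.zeros((7, 7), dtype=bool)
area[:2, :] = True                        # e.g. the top two rows are unnecessary
edited = delete_unnecessary_area(fm116, area)
fv117 = edited.mean(axis=(1, 2))          # Global Average Pooling on the edited map
```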

そして、ＣＮＮ識別器１１は、不要領域削除部１３が不要領域を削除した特徴マップを用いて画像識別を行い（ステップＳ１０７）、画像識別処理を終了する。ステップＳ１０７は、識別ステップとも呼ばれる。ステップＳ１０７では、制御部１０は識別手段として機能する。 Then, the CNN discriminator 11 performs image identification using the feature map from which the unnecessary area deletion unit 13 has deleted the unnecessary area (step S107), and ends the image identification processing. Step S107 is also called an identification step. In step S107, the control unit 10 functions as an identification unit.

具体的な処理の内容について、図１０を参照して説明する。例として、入力画像１１１に本来の識別対象１１１１以外に物差し１１１２が写り込んでいるとする。この場合、ステップＳ１０３で抽出した活性化マップ１５１には、本来の識別対象１１１１に対応する必要領域１５１１以外に、画像識別に使用しない方が良いと推定される不要領域１５１２が存在することが確認できる。 The specific processing will be described with reference to FIG. 10. As an example, assume that a ruler 1112 appears in the input image 111 in addition to the original identification target 1111. In this case, it can be confirmed that the activation map 151 extracted in step S103 contains, in addition to the necessary area 1511 corresponding to the original identification target 1111, an unnecessary area 1512 that is estimated to be better left unused for image identification.

そこで、ステップＳ１０４での判定はＹｅｓとなり、ステップＳ１０５で不要領域１５１２の位置やサイズが取得される。そして、ステップＳ１０６では、特徴マップ１１６の各チャネル１５２において、不要領域に該当する領域１５２１の値が削除されて０になる。全てのチャネル１５２について、不要領域に該当する領域１５２１の値が０にされた後に、その不要領域削除後の特徴マップ１１６から特徴マップ１１７が算出され、最終的に出力１１８が得られる。すると、ここで得られた出力１１８は、本来の識別対象１１１１以外（物差し１１１２等）の影響を受けていない識別結果になると考えられる。 Therefore, the determination in step S104 is Yes, and the position and size of the unnecessary area 1512 are acquired in step S105. Then, in step S106, in each channel 152 of the feature map 116, the value of the area 1521 corresponding to the unnecessary area is deleted and becomes zero. After the value of the area 1521 corresponding to the unnecessary area is set to 0 for all the channels 152, the characteristic map 117 is calculated from the characteristic map 116 after the unnecessary area is deleted, and the output 118 is finally obtained. Then, the output 118 obtained here is considered to be an identification result that is not affected by other than the original identification target 1111 (such as a ruler 1112).

したがって、画像識別装置１００は、画像識別に不要と推定される領域の情報を削除することによって、本来の識別対象以外のものからの影響を防ぐことができるので、ＣＮＮによる画像の識別精度を向上させることができる。 Therefore, by deleting the information in areas estimated to be unnecessary for image identification, the image identification device 100 can prevent influence from anything other than the original identification target, and can thus improve the accuracy of image identification by the CNN.

（変形例１） (Modification 1)
上述の実施形態１では、画像識別処理（図９）における中間層の活性化領域の抽出（ステップＳ１０３）において、活性化マップを入力画像とともにユーザに提示してもよいとした。しかし、単に活性化マップと入力画像とが提示されただけではユーザは不要領域の有無の判定がしづらい場合も考えられる。そこで、入力画像と活性化マップとを半透明で重ねて表示することによって、不要領域の有無の判定をしやすくする変形例１について説明する。 In the first embodiment described above, the activation map may be presented to the user together with the input image when the activation area of the intermediate layer is extracted (step S103) in the image identification processing (FIG. 9). However, the user may find it difficult to judge whether an unnecessary area exists when the activation map and the input image are merely presented side by side. Modification 1, which makes this judgment easier by displaying the input image and the activation map translucently superimposed, is therefore described below.

変形例１の画像識別処理は、図１１に示すように、実施形態１の画像識別処理（図９）のステップＳ１０３とステップＳ１０４の間に、ステップＳ１１１とステップＳ１１２を追加した処理内容になっている。 As shown in FIG. 11, the image identification processing of Modification 1 adds steps S111 and S112 between steps S103 and S104 of the image identification processing of the first embodiment (FIG. 9).

ステップＳ１１１では、制御部１０は、図１２に示すように、入力画像１１１とステップＳ１０３で得た活性化マップ１５１とを半透明で重ね合わせた画像１５３を出力部３２に表示する。半透明での重ね方は、入力画像１１１を半透明にして活性化マップ１５１の上に重ねてもよいし、活性化マップ１５１を半透明にして入力画像１１１の上に重ねてもよい。 In step S111, the control unit 10 displays, on the output unit 32, an image 153 in which the input image 111 and the activation map 151 obtained in step S103 are translucently superimposed, as shown in FIG. As for the method of translucent superimposition, the input image 111 may be translucent and superimposed on the activation map 151, or the activation map 151 may be translucent and superimposed on the input image 111.
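
The translucent superimposition in step S111 can be sketched as alpha blending. This assumes an RGB input image with values in [0, 1], an activation map already resized to the image size and scaled to [0, 1], and a blend ratio of 0.5; none of these details are specified in the text, and the names are hypothetical.

```python
import numpy as np

def overlay_translucent(image: np.ndarray, act_map: np.ndarray,
                        alpha: float = 0.5) -> np.ndarray:
    """image: (H, W, 3) in [0, 1]; act_map: (H, W) in [0, 1]. Returns the blended image."""
    heat = act_map[:, :, None]  # broadcast the map over the three color channels
    return (1.0 - alpha) * image + alpha * heat

image = np.zeros((4, 4, 3))    # toy stand-in for input image 111
act_map = np.ones((4, 4))      # toy stand-in for activation map 151
image153 = overlay_translucent(image, act_map)
```

Swapping which layer is treated as translucent corresponds to changing `alpha`; with a ratio of 0.5 the two orders give the same result.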

ステップＳ１１２では、制御部１０は、操作入力部３４を介してユーザから不要領域の選択を受け付ける。例えば、ディスプレイに表示された画像１５３の上で、ユーザがマウスで不要領域に対応する網掛けの四角をクリックすることによって、不要領域が選択されるようにしてもよい。 In step S112, the control unit 10 receives selection of an unnecessary area from the user via the operation input unit 34. For example, on the image 153 displayed on the display, the user may select an unnecessary area by clicking a shaded square corresponding to the unnecessary area with a mouse.

ステップＳ１１１及びステップＳ１１２以外の処理は、上述した実施形態１の画像識別処理と同じなので、説明を省略する。以上のように、変形例１では、活性化マップが入力画像の上にオーバーラップされて表示されるので、ユーザは画像識別に不要と推定される不要領域を容易に選択することができる。 The processing other than steps S111 and S112 is the same as the image identification processing of the first embodiment described above, and thus the description is omitted. As described above, in the first modification, the activation map is displayed so as to overlap the input image, so that the user can easily select an unnecessary area estimated to be unnecessary for image identification.

（変形例２） (Modification 2)
上述の実施形態では、画像識別処理（図９）における不要領域の有無の判定（ステップＳ１０４）や不要領域の取得（ステップＳ１０５）において、ユーザから不要領域の情報を取得したり、活性化マップの外縁付近を不要領域として取得したりしていた。しかし、これに限られない。例えば、ＣＮＮ識別器１１とは別に制御部１０が画像認識を行うプログラムを実行することによって、不要領域の情報を取得してもよい。このような実施形態を変形例２とする。変形例２では、画像認識を行うプログラムを制御部１０が実行することにより、制御部１０は画像認識手段としても機能する。 In the embodiment described above, when determining whether an unnecessary area exists (step S104) and acquiring the unnecessary area (step S105) in the image identification processing (FIG. 9), information on the unnecessary area is obtained from the user, or the vicinity of the outer edge of the activation map is acquired as the unnecessary area. However, the processing is not limited to this. For example, the information on the unnecessary area may be acquired by having the control unit 10 execute an image recognition program that is separate from the CNN identifier 11. Such an embodiment is referred to as Modification 2. In Modification 2, the control unit 10 executes the image recognition program and thereby also functions as image recognition means.

例えば、患部を撮影した画像を入力すると皮膚疾患を識別するＣＮＮ識別器１１により画像識別を行う場合、皮膚疾患以外の画像（例えば物差し、髪の毛等）を認識するプログラムを制御部１０が実行することにより、皮膚疾患以外の画像が写っている領域を不要領域として取得し、特徴マップの各チャネルにおいて、その不要領域に対応する要素の値を０にするようにしてもよい。このようにすることによって、変形例２に係る画像識別装置は、ユーザによる不要領域の選択操作無しで識別精度を向上させることができる。 For example, when image identification is performed by a CNN identifier 11 that identifies a skin disease from an input image of an affected part, the control unit 10 may execute a program that recognizes things other than the skin disease (for example, a ruler or hair), acquire the areas in which such things appear as unnecessary areas, and set the values of the elements corresponding to those unnecessary areas to 0 in each channel of the feature map. In this way, the image identification device according to Modification 2 can improve identification accuracy without the user having to select an unnecessary area.

（変形例３） (Modification 3)
上述の実施形態１では、画像識別処理（図９）における中間層の活性化領域の抽出（ステップＳ１０３）において、特定のクラスを特別扱いすることはせずに、全チャネル又は全クラスの平均をとった活性化マップを生成した。しかしこれに限られない。活性化マップの生成にＣＡＭ又はＧｒａｄ−ＣＡＭを用いると、クラス毎の活性化マップが得られるので、これを全クラスで平均化するのではなく、個別のクラス毎の活性化マップをユーザに提示してもよい。このような実施形態を変形例３とする。 In the first embodiment described above, when the activation area of the intermediate layer is extracted (step S103) in the image identification processing (FIG. 9), no particular class is treated specially, and an activation map averaged over all channels or all classes is generated. However, the processing is not limited to this. When CAM or Grad-CAM is used to generate the activation map, an activation map is obtained for each class, so instead of averaging these over all classes, the activation map of each individual class may be presented to the user. Such an embodiment is referred to as Modification 3.

例えば、Ｎ個のクラスのうちｍ個のクラスだけを識別対象とすればよい場合は、その識別対象とするｍ個のクラスに対応するＣＡＭ又はＧｒａｄ−ＣＡＭによる活性化マップｍ枚と、入力画像とを、出力部３２を介してディスプレイ上に並べてユーザに提示してもよい。さらに、これらの活性化マップｍ枚のうちのｓ枚（ｓ＝１〜ｍ、ユーザが適宜選択可能とする）と、入力画像とを、半透明で重ね合わせた画像を表示することによって、不要領域の有無の判定や領域選択をさらにやりやすくしてもよい。 For example, when only m of the N classes need to be identified, the m activation maps produced by CAM or Grad-CAM for those m classes may be presented to the user side by side with the input image on the display via the output unit 32. In addition, s of these m activation maps (s = 1 to m, selectable by the user as appropriate) may be displayed translucently superimposed on the input image, making it even easier to judge whether an unnecessary area exists and to select an area.

（変形例４） (Modification 4)
さらに、ＣＡＭ又はＧｒａｄ−ＣＡＭを用いて得られるクラス毎の活性化マップをそのクラスの活性化状態とし、識別対象とするｍ個のクラスの活性化状態（クラス毎の活性化マップ）ｍ枚を平均化したものを活性化マップとして用いてもよい。このような実施形態を変形例４とする。 Furthermore, the activation map obtained for each class using CAM or Grad-CAM may be regarded as the activation state of that class, and the average of the m activation states (per-class activation maps) of the m classes to be identified may be used as the activation map. Such an embodiment is referred to as Modification 4.

例えば、皮膚疾患の患者の患部の画像を識別する場合、患者の疾患が特定のもの（例えば疾患Ａ、疾患Ｂ、疾患Ｃの３つの疾患のうちの何れか）であることが確定している場合が考えられる。この場合、ＣＮＮ識別器１１が識別するＮ個のクラスのうち、一部のクラス（疾患Ａ、疾患Ｂ、疾患Ｃのいずれか）だけを識別すればよいことになる。 For example, when identifying an image of an affected part of a patient with a skin disease, it is determined that the patient's disease is a specific disease (for example, any one of the three diseases of disease A, disease B, and disease C). The case is conceivable. In this case, among the N classes identified by the CNN identifier 11, only a part of the classes (any of the disease A, the disease B, and the disease C) needs to be identified.

そこで、変形例４では、このような場合には、画像識別処理（図９）における中間層の活性化領域の抽出（ステップＳ１０３）において、ＣＡＭ又はＧｒａｄ−ＣＡＭで生成したクラス毎の活性化マップ（活性化状態）のうち、今回の識別対象のクラスの活性化マップ（活性化状態）のみを平均化して、新たな活性化マップを得るようにする。識別対象のクラスのみを平均化した活性化マップは、識別対象外のクラスの活性化マップ（活性化状態）の情報を含まないので、ステップＳ１０４における不要領域の有無の判定及びステップＳ１０５における不要領域の取得を、より精度高く行うことができる。 Therefore, in Modification 4, in such a case, the activation map for each class generated by the CAM or Grad-CAM in the extraction of the activation region of the intermediate layer in the image identification processing (FIG. 9) (step S103). Of the (activation states), only the activation map (activation state) of the class to be identified this time is averaged to obtain a new activation map. Since the activation map obtained by averaging only the classes to be identified does not include the information of the activation map (activation state) of the class not to be identified, the determination of the presence / absence of the unnecessary region in step S104 and the unnecessary region in step S105 Can be obtained with higher accuracy.
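
The averaging over only the classes to be identified can be sketched as follows, assuming the per-class activation maps from CAM or Grad-CAM are stacked into an array of shape (N, H, W). The names and the choice of three target classes out of ten are hypothetical.

```python
import numpy as np

def average_target_class_maps(class_maps: np.ndarray, target_classes) -> np.ndarray:
    """class_maps: (N, H, W) per-class activation maps. Averages only the target classes."""
    return class_maps[list(target_classes)].mean(axis=0)

rng = np.random.default_rng(0)
class_maps = rng.random((10, 7, 7))   # N = 10 classes, chosen arbitrarily
targets = [0, 3, 7]                   # e.g. disease A, disease B, disease C
new_map = average_target_class_maps(class_maps, targets)
```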

Since this raises the accuracy of selecting unnecessary areas, which can act as noise during identification, Modification 4 can further improve the accuracy of image identification by the CNN identifier 11.
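The averaging described for Modification 4 can be illustrated with a minimal sketch. It assumes the per-class activation maps have already been computed (for example by CAM or Grad-CAM) and are held as small 2D lists of equal size; the function name `average_target_maps`, the class labels, and the numeric values are hypothetical and do not come from the specification.

```python
from typing import Dict, List, Sequence

Map2D = List[List[float]]


def average_target_maps(class_maps: Dict[str, Map2D],
                        target_classes: Sequence[str]) -> Map2D:
    """Average the activation maps of the classes to be identified only.

    Maps of classes outside target_classes are ignored, so they do not
    contribute to the resulting activation map.
    """
    if not target_classes:
        raise ValueError("at least one target class is required")
    maps = [class_maps[name] for name in target_classes]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(m[r][c] for m in maps) / len(maps) for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical 2x2 per-class activation maps for N = 4 classes.
class_maps = {
    "disease_A": [[0.75, 0.0], [0.25, 0.0]],
    "disease_B": [[0.50, 0.0], [0.25, 0.0]],
    "disease_C": [[0.25, 0.0], [0.25, 0.0]],
    "disease_D": [[0.00, 1.0], [0.00, 1.0]],  # not a target this time
}

# Only the m = 3 target classes are averaged.
averaged = average_target_maps(
    class_maps, ["disease_A", "disease_B", "disease_C"]
)
```

In this toy example the strong right-hand activation of the non-target class does not appear in `averaged`, which is the property the paragraph above relies on when unnecessary areas are selected.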

In Embodiment 1 described above, an unnecessary area is acquired from the activation map in step S105 of the image identification processing (FIG. 9); conversely, a necessary area estimated to be necessary for image identification may be acquired instead. In that case, in step S106 of the image identification processing, the values of the elements of the feature map outside the necessary area are set to 0. The same applies to Modification 1: in step S112 of the image identification processing (FIG. 11), the user may be asked to select a necessary area estimated to be necessary for image identification. In that case, if there is an area other than the necessary area (step S104; Yes), the necessary area is acquired in step S105, and the values of the elements of the feature map outside the necessary area are set to 0 in step S106.
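The necessary-area variant can be sketched as follows. This is an illustration under stated assumptions, not the procedure of the specification: the necessary area is derived here by simple thresholding of the activation map, the activation map is assumed to have the same height and width as each channel of the feature map, and the names `necessary_mask` and `keep_necessary_area` are hypothetical.

```python
from typing import List

Map2D = List[List[float]]
Mask2D = List[List[bool]]
FeatureMap = List[Map2D]  # channels x rows x columns


def necessary_mask(activation_map: Map2D, threshold: float) -> Mask2D:
    """Mark a position as necessary when its activation reaches the threshold.

    Thresholding is an assumption made for this sketch; the area could
    equally be selected by a user.
    """
    return [[value >= threshold for value in row] for row in activation_map]


def keep_necessary_area(feature_map: FeatureMap, mask: Mask2D) -> FeatureMap:
    """Return a copy of the feature map with elements outside the mask set to 0."""
    return [
        [
            [value if mask[r][c] else 0.0 for c, value in enumerate(row)]
            for r, row in enumerate(channel)
        ]
        for channel in feature_map
    ]


# Hypothetical 2x2 activation map and a 2-channel feature map of the same size.
activation_map = [[0.9, 0.1], [0.7, 0.2]]
feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]

mask = necessary_mask(activation_map, threshold=0.5)
edited = keep_necessary_area(feature_map, mask)
```

The same mask is applied to every channel, so each channel keeps its values only at the positions judged necessary. The input feature map is left unmodified, which makes it easy to compare identification results before and after editing.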

In the embodiment and modifications described above, the control unit 10 also functions as the CNN identifier 11 by executing a program that implements a CNN-based identifier, but this is not a limitation. The image identification device 100 may include, separately from the control unit 10, a device (for example, a dedicated IC (Integrated Circuit)) that implements the function of the CNN identifier 11.

Modifications 2 and 4 above were described using skin diseases as an example, but the present invention is not limited to the field of dermatology and can be applied widely in the field of image identification. For example, it can also be applied to the identification of flowers, the identification of micrographs of bacteria, and the like.

The embodiment and modifications described above can be combined as appropriate. For example, combining Modification 1 and Modification 4 makes it possible to display an activation map averaged over only the specific classes (those to be identified) superimposed on the input image, so that the user can select unnecessary areas more accurately and more easily.
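One way to picture the superimposed display is a simple alpha blend of the input image and the activation map. The sketch below assumes a grayscale image with values in [0, 1], an activation map of lower resolution that is enlarged by nearest-neighbour sampling, and a blend weight `alpha`; all of these choices, and the function names, are illustrative assumptions rather than details taken from the specification.

```python
from typing import List

Map2D = List[List[float]]


def upsample_nearest(act_map: Map2D, height: int, width: int) -> Map2D:
    """Enlarge the activation map to height x width by nearest-neighbour sampling."""
    src_h, src_w = len(act_map), len(act_map[0])
    return [
        [act_map[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]


def overlay(image: Map2D, act_map: Map2D, alpha: float = 0.5) -> Map2D:
    """Blend the enlarged activation map onto the image with weight alpha."""
    height, width = len(image), len(image[0])
    heat = upsample_nearest(act_map, height, width)
    return [
        [(1.0 - alpha) * image[r][c] + alpha * heat[r][c] for c in range(width)]
        for r in range(height)
    ]


# Hypothetical 4x4 black input image and a 2x2 averaged activation map.
image = [[0.0] * 4 for _ in range(4)]
act_map = [[1.0, 0.0], [0.0, 1.0]]
blended = overlay(image, act_map, alpha=0.5)
```

Each cell of the 2x2 map covers a 2x2 block of the 4x4 image, so strongly activated regions appear brighter in `blended`. A real display would typically map the activation values to a colour scale before blending.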

Each function of the image identification device 100 can also be implemented by a computer such as an ordinary PC (Personal Computer). Specifically, the above embodiment was described on the assumption that the program for the image identification processing performed by the image identification device 100 is stored in advance in the ROM of the storage unit 20. However, the program may instead be stored and distributed on a computer-readable recording medium such as a flexible disk, a CD-ROM (Compact Disc Read Only Memory), a DVD (Digital Versatile Disc), an MO (Magneto-Optical Disc), a memory card, or a USB (Universal Serial Bus) memory, and a computer capable of implementing each of the above functions may be configured by reading the program into the computer and installing it.

The preferred embodiments of the present invention have been described above, but the present invention is not limited to these specific embodiments, and the present invention includes the inventions described in the claims and their equivalents. The inventions described in the claims as originally filed in the present application are appended below.

(Appendix 1)
An image identification device comprising:
an identifier that has an intermediate layer, which is a layer other than an input layer to which an input image is input and an output layer from which an identification result of the input image is output, and that identifies the input image;
feature map acquisition means for acquiring, in the intermediate layer, a feature map for identifying the input image;
activation map generation means for generating an activation map that visualizes an activation state of the intermediate layer;
editing means for editing the feature map acquired by the feature map acquisition means, with reference to the activation map generated by the activation map generation means; and
identification means for identifying the input image using the feature map edited by the editing means.

(Appendix 2)
The image identification device according to Appendix 1, wherein the activation map generation means generates the activation map using CAM or Grad-CAM for each class identified by the identifier.

(Appendix 3)
The image identification device according to Appendix 1 or 2, further comprising output means for displaying an image in which the input image and the activation map are superimposed when the editing means edits the feature map.

(Appendix 4)
The image identification device according to any one of Appendices 1 to 3, further comprising:
unnecessary area acquisition means for acquiring, in the intermediate layer, an unnecessary area estimated to be better left unused when identifying the input image; and
unnecessary area deletion means for deleting, in the intermediate layer, information of the unnecessary area acquired by the unnecessary area acquisition means.

(Appendix 5)
The image identification device according to Appendix 4, further comprising automatic editing means implemented by an identifier that has learned in advance, by machine learning, to identify the unnecessary area,
wherein the editing means edits the feature map using the automatic editing means.

(Appendix 6)
An image identification device comprising:
an identifier that has an intermediate layer, which is a layer other than an input layer to which an input image is input and an output layer from which an identification result of the input image is output, and that identifies the input image;
unnecessary area acquisition means for acquiring, in the intermediate layer, an unnecessary area estimated to be better left unused when identifying the input image; and
unnecessary area deletion means for deleting, in the intermediate layer, information of the unnecessary area acquired by the unnecessary area acquisition means.

(Appendix 7)
The image identification device according to any one of Appendices 4 to 6, wherein
the unnecessary area acquisition means acquires the unnecessary area in the intermediate layer immediately before the fully connected layer connected to the output layer, and
the unnecessary area deletion means deletes, in the intermediate layer immediately before the fully connected layer connected to the output layer, the information of the unnecessary area acquired by the unnecessary area acquisition means.

(Appendix 8)
The image identification device according to any one of Appendices 4 to 7, wherein the unnecessary area acquisition means generates an activation map that visualizes an activation state of the intermediate layer and acquires the unnecessary area from the activation map.

(Appendix 9)
The image identification device according to Appendix 8, wherein the unnecessary area acquisition means generates the activation map using CAM or Grad-CAM for each class identified by the identifier and acquires the unnecessary area from the activation map.

(Appendix 10)
The image identification device according to Appendix 9, wherein the unnecessary area acquisition means acquires the activation state for each class using CAM or Grad-CAM, generates the activation map using only the activation states of the classes to be identified for the input image, and acquires the unnecessary area from the activation map.

(Appendix 11)
The image identification device according to any one of Appendices 8 to 10, further comprising:
operation input means for receiving an operation input from a user; and
output means for displaying the activation map,
wherein the unnecessary area acquisition means acquires, as the unnecessary area, an area selected through the operation input means after the activation map is displayed on the output means.

(Appendix 12)
The image identification device according to Appendix 11, wherein the unnecessary area acquisition means acquires, as the unnecessary area, an area selected through the operation input means after an image in which the input image and the activation map are superimposed is displayed on the output means.

(Appendix 13)
The image identification device according to any one of Appendices 8 to 10, wherein the unnecessary area acquisition means acquires an outer edge area of the activation map as the unnecessary area.

(Appendix 14)
The image identification device according to any one of Appendices 8 to 10, further comprising image recognition means for recognizing an image other than the image identified by the identifier,
wherein the unnecessary area acquisition means acquires, as the unnecessary area, the area of the image recognized by the image recognition means.

(Appendix 15)
The image identification device according to any one of Appendices 4 to 14, wherein
the input image is an image of the affected part of a skin disease, and
the unnecessary area acquisition means acquires, in the intermediate layer of the identifier, an area other than the affected part as the unnecessary area.

(Appendix 16)
An image identification method for identifying an input image with an identifier that has an intermediate layer, which is a layer other than an input layer to which the input image is input and an output layer from which an identification result of the input image is output, the method including:
a feature map acquisition step of acquiring, in the intermediate layer, a feature map for identifying the input image;
an activation map generation step of generating an activation map that visualizes an activation state of the intermediate layer;
an editing step of editing the feature map acquired in the feature map acquisition step, with reference to the activation map generated in the activation map generation step; and
an identification step of identifying the input image using the feature map edited in the editing step.

(Appendix 17)
A program for causing a computer of an image identification device, which includes an identifier that has an intermediate layer that is a layer other than an input layer to which an input image is input and an output layer from which an identification result of the input image is output, to execute:
a feature map acquisition step of acquiring, in the intermediate layer, a feature map for identifying the input image;
an activation map generation step of generating an activation map that visualizes an activation state of the intermediate layer;
an editing step of editing the feature map acquired in the feature map acquisition step, with reference to the activation map generated in the activation map generation step; and
an identification step of identifying the input image using the feature map edited in the editing step.

DESCRIPTION OF SYMBOLS 10 ... control unit, 11 ... CNN identifier, 12 ... unnecessary area acquisition unit, 13 ... unnecessary area deletion unit, 20 ... storage unit, 31 ... image input unit, 32 ... output unit, 33 ... communication unit, 34 ... operation input unit, 100 ... image identification device, 110, 111 ... input image, 112, 113, 114, 115, 116, 117, 130, 132 ... feature map, 118 ... output, 120, 121, 123, 124, 125 ... filter, 122, 126, 131 ... window, 127 ... fully connected connection, 140, 141, 142, 144, 145, 146, 147, 151 ... activation map, 143, 153 ... image, 152 ... channel, 1111 ... identification target, 1112 ... ruler, 1511 ... necessary area, 1512 ... unnecessary area, 1521 ... area

Claims

An image identification device comprising:
an identifier that has an intermediate layer, which is a layer other than an input layer to which an input image is input and an output layer from which an identification result of the input image is output, and that identifies the input image;
feature map acquisition means for acquiring, in the intermediate layer, a feature map for identifying the input image;
activation map generation means for generating an activation map that visualizes an activation state of the intermediate layer;
editing means for editing the feature map acquired by the feature map acquisition means, with reference to the activation map generated by the activation map generation means; and
identification means for identifying the input image using the feature map edited by the editing means.

The image identification device according to claim 1, wherein the activation map generation means generates the activation map using CAM or Grad-CAM for each class identified by the identifier.

The image identification device according to claim 1, further comprising output means for displaying an image in which the input image and the activation map are superimposed when the editing means edits the feature map.

The image identification device according to claim 1, further comprising:
unnecessary area acquisition means for acquiring, in the intermediate layer, an unnecessary area estimated to be better left unused when identifying the input image; and
unnecessary area deletion means for deleting, in the intermediate layer, information of the unnecessary area acquired by the unnecessary area acquisition means.

The image identification device according to claim 4, further comprising automatic editing means implemented by an identifier that has learned in advance, by machine learning, to identify the unnecessary area,
wherein the editing means edits the feature map using the automatic editing means.

An image identification device comprising:
an identifier that has an intermediate layer, which is a layer other than an input layer to which an input image is input and an output layer from which an identification result of the input image is output, and that identifies the input image;
unnecessary area acquisition means for acquiring, in the intermediate layer, an unnecessary area estimated to be better left unused when identifying the input image; and
unnecessary area deletion means for deleting, in the intermediate layer, information of the unnecessary area acquired by the unnecessary area acquisition means.

The image identification device according to claim 4, wherein
the unnecessary area acquisition means acquires the unnecessary area in the intermediate layer immediately before the fully connected layer connected to the output layer, and
the unnecessary area deletion means deletes, in the intermediate layer immediately before the fully connected layer connected to the output layer, the information of the unnecessary area acquired by the unnecessary area acquisition means.

The image identification device according to claim 4, wherein the unnecessary area acquisition means generates an activation map that visualizes an activation state of the intermediate layer and acquires the unnecessary area from the activation map.

The image identification device according to claim 8, wherein the unnecessary area acquisition means generates the activation map using CAM or Grad-CAM for each class identified by the identifier and acquires the unnecessary area from the activation map.

The image identification device according to claim 9, wherein the unnecessary area acquisition means acquires the activation state for each class using CAM or Grad-CAM, generates the activation map using only the activation states of the classes to be identified for the input image, and acquires the unnecessary area from the activation map.

The image identification device according to claim 8, further comprising:
operation input means for receiving an operation input from a user; and
output means for displaying the activation map,
wherein the unnecessary area acquisition means acquires, as the unnecessary area, an area selected through the operation input means after the activation map is displayed on the output means.

The image identification device according to claim 11, wherein the unnecessary area acquisition means acquires, as the unnecessary area, an area selected through the operation input means after an image in which the input image and the activation map are superimposed is displayed on the output means.

The image identification device according to claim 8, wherein the unnecessary area acquisition means acquires an outer edge area of the activation map as the unnecessary area.

The image identification device according to claim 8, further comprising image recognition means for recognizing an image other than the image identified by the identifier,
wherein the unnecessary area acquisition means acquires, as the unnecessary area, the area of the image recognized by the image recognition means.

The image identification device according to any one of claims 4 to 14, wherein
the input image is an image of the affected part of a skin disease, and
the unnecessary area acquisition means acquires, in the intermediate layer of the identifier, an area other than the affected part as the unnecessary area.

An image identification method for identifying an input image with an identifier that has an intermediate layer, which is a layer other than an input layer to which the input image is input and an output layer from which an identification result of the input image is output, the method including:
a feature map acquisition step of acquiring, in the intermediate layer, a feature map for identifying the input image;
an activation map generation step of generating an activation map that visualizes an activation state of the intermediate layer;
an editing step of editing the feature map acquired in the feature map acquisition step, with reference to the activation map generated in the activation map generation step; and
an identification step of identifying the input image using the feature map edited in the editing step.

A program for causing a computer of an image identification device, which includes an identifier that has an intermediate layer that is a layer other than an input layer to which an input image is input and an output layer from which an identification result of the input image is output, to execute:
a feature map acquisition step of acquiring, in the intermediate layer, a feature map for identifying the input image;
an activation map generation step of generating an activation map that visualizes an activation state of the intermediate layer;
an editing step of editing the feature map acquired in the feature map acquisition step, with reference to the activation map generated in the activation map generation step; and
an identification step of identifying the input image using the feature map edited in the editing step.