JP7439953B2 - Learning device, processing device, learning method, processing method and program

Info

Publication number: JP7439953B2
Application number: JP2022561800A
Authority: JP
Inventors: 一郁児島; 真宏谷; 圭佑池田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2020-11-13
Filing date: 2020-11-13
Publication date: 2024-02-28
Anticipated expiration: 2040-11-13
Also published as: WO2022102075A1; JPWO2022102075A1

Description

The present invention relates to a learning device, a processing device, a learning method, a processing method, and a program.

Techniques related to the present invention are disclosed in Non-Patent Document 1 and Patent Documents 1 and 2.

Non-Patent Document 1 discloses a technique (R2D2: Repeatable and Reliable Detector and Descriptor) that generates a feature map, a repeatability map, and a reliability map from an image and, based on these, extracts with high accuracy the characteristic parts (key points) of the appearance of a subject included in the image. Patent Document 1 discloses a technique for extracting corresponding points between two images with high accuracy. Patent Document 2 discloses a technique for calculating feature values for each pixel and analyzing an image.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2008-220617
Patent Document 2: International Publication No. 2020/100630

Non-Patent Document 1: Jerome Revaud and three others, "R2D2: Repeatable and Reliable Detector and Descriptor", [online], [retrieved October 23, 2020], Internet <URL: https://papers.nips.cc/paper/9407-r2d2-reliable-and-repeatable-detector-and-descriptor.pdf>

Using the technique described in Non-Patent Document 1 makes it possible to extract key points of the appearance of a subject included in an image with high accuracy. However, the key points extracted by that technique are vulnerable to changes in lighting conditions. For example, a key point (pixel) may be distinguishable from its surrounding pixels in an image captured under one lighting condition (e.g., daytime) but not in an image captured under a different lighting condition (e.g., nighttime). Likewise, even when images include the same subject, the key points extracted from each image may differ if the lighting conditions at the time of capture differ (e.g., one image captured in the daytime and the other at night). None of the prior art documents discloses this problem or a means for solving it.

An object of the present invention is to enable the extraction of key points that are robust to changes in lighting conditions.

According to the present invention, there is provided a learning device comprising:
storage means for storing parameters of a learning model that generates, based on an input image, a feature map indicating feature values for each pixel and a first weighting map indicating, for each pixel, a weighting value used in the process of determining pixels to be key points;
acquisition means for acquiring a combination of a plurality of learning images that differ from one another in lighting conditions and include the same subject; and
learning means for adjusting the parameters of the learning model based on a loss function defined using a plurality of parameters, each calculated based on a respective one of the plurality of feature maps generated from the respective learning images, and a parameter calculated based on the first weighting map generated from the plurality of learning images.

Further, according to the present invention, there is provided a learning method in which a computer:
stores parameters of a learning model that generates, based on an input image, a feature map indicating feature values for each pixel and a first weighting map indicating, for each pixel, a weighting value used in the process of determining pixels to be key points;
acquires a combination of a plurality of learning images that differ from one another in lighting conditions and include the same subject; and
adjusts the parameters of the learning model based on a loss function defined using a plurality of parameters, each calculated based on a respective one of the plurality of feature maps generated from the respective learning images, and a parameter calculated based on the first weighting map generated from the plurality of learning images.

Further, according to the present invention, there is provided a program that causes a computer to function as:
storage means for storing parameters of a learning model that generates, based on an input image, a feature map indicating feature values for each pixel and a first weighting map indicating, for each pixel, a weighting value used in the process of determining pixels to be key points;
acquisition means for acquiring a combination of a plurality of learning images that differ from one another in lighting conditions and include the same subject; and
learning means for adjusting the parameters of the learning model based on a loss function defined using a plurality of parameters, each calculated based on a respective one of the plurality of feature maps generated from the respective learning images, and a parameter calculated based on the first weighting map generated from the plurality of learning images.

Further, according to the present invention, there is provided a processing device that determines key points of an input image using the learning model generated by the learning device.

Further, according to the present invention, there is provided a processing method in which a computer determines key points of an input image using the learning model generated by the learning device.

Further, according to the present invention, there is provided a program that causes a computer to function as means for determining key points of an input image using the learning model generated by the learning device.

According to the present invention, key points that are robust to changes in lighting conditions can be extracted.

FIG. 1 is a diagram for explaining the learning model of the present embodiment.
FIG. 2 is an example of a hardware configuration diagram of the learning device and the processing device of the present embodiment.
FIG. 3 is an example of a functional block diagram of the learning device of the present embodiment.
FIG. 4 is an example of a functional block diagram of the processing device of the present embodiment.
FIG. 5 is a diagram for explaining a usage example of the processing device of the present embodiment.
FIG. 6 is a diagram for explaining another usage example of the processing device of the present embodiment.

Embodiments of the present invention are described below with reference to the drawings. In all the drawings, similar components are denoted by the same reference numerals, and their descriptions are omitted as appropriate.

<First Embodiment>
"Overview"
First, an overview of the learning device of the present embodiment is given with reference to FIG. 1. The learning device of the present embodiment executes distinctive learning based on distinctive learning data and adjusts the parameters of the illustrated learning model (e.g., a CNN: Convolutional Neural Network).

The illustrated learning model has at least a function of generating, based on an input image ("H×W image" in the figure), a feature map indicating feature values for each pixel ("H'×W'×C feature group" in the figure) and a first weighting map indicating a weighting value for each pixel ("Saliency map" in the figure).

The pixels to be key points are determined based on the feature map and the first weighting map. Specifically, an evaluation value is calculated for each pixel based on the feature map and the first weighting map, and the pixels to be key points are then determined based on the evaluation values. The illustrated learning model may itself include the function of determining the key points, or that function may be provided by other processing means that is physically and/or logically separate from the illustrated learning model.

As described in detail later, the learning device of the present embodiment executes distinctive learning based on distinctive learning data and adjusts the parameters of the illustrated learning model. As a result, the model comes to generate a feature map and a first weighting map in which pixels that are robust to changes in lighting conditions receive relatively high evaluation values. That is, the evaluation value becomes relatively high for pixels that can be distinguished from their surrounding pixels not only when captured under one specific lighting condition but also when captured under various lighting conditions. Consequently, pixels that are robust to changes in lighting conditions are more likely to be determined as key points.

"Configuration"
Next, the configuration of the learning device is described, beginning with an example of its hardware configuration. Each functional unit of the learning device is realized by any combination of hardware and software, centered on the CPU (Central Processing Unit) of any computer, a memory, a program loaded into the memory, a storage unit such as a hard disk that stores the program (it can store not only programs stored before the device is shipped but also programs obtained from storage media such as CDs (Compact Discs) or downloaded from servers on the Internet), and a network connection interface. Those skilled in the art will understand that there are various modifications of the method and device for realizing this.

FIG. 2 is a block diagram illustrating the hardware configuration of the learning device. As shown in FIG. 2, the learning device includes a processor 1A, a memory 2A, an input/output interface 3A, a peripheral circuit 4A, and a bus 5A. The peripheral circuit 4A includes various modules. The learning device need not include the peripheral circuit 4A. The learning device may be composed of a plurality of physically and/or logically separate devices, or of a single physically and/or logically integrated device. When the learning device is composed of a plurality of physically and/or logically separate devices, each of those devices can have the above hardware configuration.

The bus 5A is a data transmission path through which the processor 1A, the memory 2A, the peripheral circuit 4A, and the input/output interface 3A exchange data with one another. The processor 1A is an arithmetic processing device such as a CPU or a GPU (Graphics Processing Unit). The memory 2A is a memory such as a RAM (Random Access Memory) or a ROM (Read Only Memory). The input/output interface 3A includes an interface for acquiring information from an input device, an external device, an external server, an external sensor, a camera, and the like, and an interface for outputting information to an output device, an external device, an external server, and the like. Examples of the input device include a keyboard, a mouse, a microphone, a physical button, and a touch panel. Examples of the output device include a display, a speaker, a printer, and a mailer. The processor 1A can issue commands to each module and perform computations based on their results.

Next, the functional configuration of the learning device is described. As shown in FIG. 3, the learning device 10 includes a storage unit 11, an acquisition unit 12, and a learning unit 13.

The storage unit 11 stores the parameters of a learning model that generates both a feature map and a first weighting map based on an input image (hereinafter sometimes simply called the "learning model"). That is, the storage unit 11 stores the parameters of the learning model shown in FIG. 1.

The feature map indicates feature values for each pixel. The feature of each pixel is represented by C values (C is 2 or more). When the input image has H×W pixels, the feature map is expressed as H'×W'×C, as shown in FIG. 1. The type of feature and the means for generating the feature map are not particularly limited, and any conventional technique can be employed.

When the learning unit 13, described later, adjusts the parameters of the learning model, the model comes to generate a feature map in which pixels that are robust to changes in lighting conditions receive relatively high evaluation values.

The first weighting map is generated, for example, from the feature map. Alternatively, the first weighting map may be generated using the output of an intermediate layer of the network. The first weighting map is generated, for example, by convolving the feature map. When the feature map is H'×W'×C, the first weighting map is H'×W'×1; that is, each pixel of the first weighting map has a single value. This single value per pixel is the weighting value referenced in the process of determining the pixels to be key points. The means for convolving the H'×W'×C feature map into an H'×W'×1 map is not particularly limited, and any conventional technique can be employed.
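As a rough illustration of collapsing an H'×W'×C feature map into an H'×W'×1 weighting map, the following NumPy sketch applies a 1×1 convolution, i.e. a per-pixel dot product over the channel axis. The function name `saliency_from_features`, the random weights, and the sigmoid squashing are assumptions made for this example only; the text leaves the convolution method open.

```python
import numpy as np

def saliency_from_features(features, weights, bias=0.0):
    """Collapse an (H', W', C) feature map into an (H', W', 1) weighting map.

    A 1x1 convolution is a per-pixel dot product over the channel axis.
    The sigmoid that maps each value into (0, 1) is an assumption of this
    sketch, not something the text requires.
    """
    logits = features @ weights + bias           # shape (H', W')
    saliency = 1.0 / (1.0 + np.exp(-logits))     # sigmoid
    return saliency[..., np.newaxis]             # shape (H', W', 1)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 5, 8))   # hypothetical H'=4, W'=5, C=8
weights = rng.normal(size=8)            # one weight per channel
saliency = saliency_from_features(features, weights)
print(saliency.shape)  # (4, 5, 1)
```

In a trained model the channel weights would be learned together with the rest of the network rather than drawn at random.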

When the learning means described later adjusts the parameters of the learning model, the model comes to generate a first weighting map in which pixels that are robust to changes in lighting conditions receive relatively high evaluation values.

The acquisition unit 12 acquires distinctive learning data. Specifically, the acquisition unit 12 acquires, as learning data, a combination of a plurality of learning images that differ from one another in lighting conditions and include the same subject. Each combination may contain two learning images, or three or more.

Lighting conditions that differ from one another means that the state of the lighting at the time of capture differs. That is, at least one of the state of natural light (sunlight, moonlight, etc.) and the state of artificially provided light (lamps, candles, camera flash, etc.) at the time of capture differs. For efficient learning, it is preferable to use learning images whose lighting conditions differ sufficiently. For example, the learning images may differ in the time of day at which they were captured (e.g., one captured in the daytime and the other at night), in whether they were captured outdoors or indoors, in the weather at the time of capture (e.g., one captured in sunny weather and the other in rain), in the state (ON/OFF, intensity, etc.) of lights (lights installed on buildings, lights prepared for the shoot, camera flash, etc.), or in other respects.

The learning unit 13 adjusts the parameters of the learning model stored in the storage unit 11 based on a distinctive loss function. Learning based on this loss function causes the model to generate a feature map and a first weighting map in which pixels that are robust to changes in lighting conditions receive relatively high evaluation values. In other words, the loss function is designed so that such a feature map and first weighting map are generated.

A pixel that is robust to changes in lighting conditions is a pixel that can be distinguished from its surrounding pixels not only when captured under one specific lighting condition but also when captured under various lighting conditions. A pixel that can be distinguished from its surrounding pixels is a pixel whose feature values (as indicated by the feature map) differ sufficiently from those of the surrounding pixels.

The loss function is now described in detail. The loss function of the present embodiment is expressed, for example, by the following equation (1). The loss function may be modified as appropriate as long as similar results are obtained.

This loss function is an example for the case where a combination of two learning images is acquired as learning data. The parameters are defined as follows.

I denotes the first learning image.
I' denotes the second learning image.
i and j denote the coordinate values of a pixel in the first learning image.
p_ij denotes a pixel within the area of the first learning image in which the subject is present. It may denote every pixel in that area, or a subset of pixels picked by any means. The present embodiment uses information indicating the area of an image in which the subject is present. This information may be input to the learning device 10 from outside, or the learning device 10 may analyze the learning image to identify the area in which the subject is present.
S(p_ij) denotes the state value of pixel p_ij of the first learning image. The state value is described later.
U(i,j) denotes the pixel of the second learning image that corresponds to pixel (i,j) of the first learning image. "Corresponding pixels" are pixels that show the same part of the same subject. The present embodiment uses information indicating the corresponding pixels of the plurality of learning images. This information may be input to the learning device 10 from outside, or the learning device 10 may analyze the learning images to identify their corresponding pixels.
p'_{U(i,j)} denotes the pixel of the second learning image that corresponds to pixel p_ij of the first learning image.
S(p'_{U(i,j)}) denotes the state value of pixel p'_{U(i,j)} of the second learning image.
C_ij denotes the weighting value of pixel (i,j) of the first weighting map generated based on the first learning image. As another example, C_ij may denote a statistic (mean, maximum, minimum, etc.) of the weighting value of pixel (i,j) of the first weighting map generated based on the first learning image and the weighting value of pixel U(i,j) of the second weighting map generated based on the second learning image.
P denotes the patch group centered on pixel p_ij. A patch group is a set of pixels consisting of the pixel of interest and its surrounding pixels. Which pixels, in terms of their positional relationship to the pixel of interest, count as "surrounding pixels" can be decided freely based on required performance and the like.
|P| denotes the number of pixels included in the patch group.
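The alternative definition of C_ij, a statistic over the weighting value at (i,j) in the map from the first image and the weighting value at U(i,j) in the map from the second image, can be sketched as follows. The representation of U as an integer array `corr` of shape (H', W', 2), the function name `combined_weight`, and the toy values are all assumptions of this example; the text does not specify how the correspondence information is stored.

```python
import numpy as np

def combined_weight(w1, w2, corr, i, j, stat="mean"):
    """Return C_ij for pixel (i, j) of the first learning image.

    w1, w2 : (H', W') weighting maps generated from the first and second images.
    corr   : (H', W', 2) integer array; corr[i, j] holds the coordinates
             U(i, j) of the corresponding pixel in the second image
             (hypothetical storage format).
    stat   : statistic used to combine the two weighting values.
    """
    u_i, u_j = corr[i, j]
    pair = np.array([w1[i, j], w2[u_i, u_j]])
    reducers = {"mean": np.mean, "max": np.max, "min": np.min}
    return float(reducers[stat](pair))

w1 = np.array([[0.2, 0.9], [0.4, 0.1]])
w2 = np.array([[0.6, 0.3], [0.8, 0.5]])
# Toy correspondence: the second image is the first one flipped horizontally.
corr = np.array([[[0, 1], [0, 0]],
                 [[1, 1], [1, 0]]])
print(combined_weight(w1, w2, corr, 0, 1, "max"))  # 0.9
```

Choosing the minimum is the most conservative option in this sketch, since a pixel then scores highly only if both images weight it highly.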

The state value of each pixel is now described. The state value S(p_ij) of pixel p_ij of the first learning image is expressed by the following equation (2). The state value of pixel p'_{U(i,j)} of the second learning image is obtained by a similar equation.

F_ij denotes the set of feature values (C values, where C is 2 or more) of pixel p_ij.
var(F_ij) denotes the unbiased variance of the C values of F_ij.
m and n denote the coordinate values of a pixel, other than p_ij, included in the patch group centered on p_ij.
F_mn denotes the set of feature values (C values, where C is 2 or more) of a pixel, other than p_ij, included in the patch group centered on p_ij.
var(F_mn) denotes the unbiased variance of the C values of F_mn.
|p| denotes the number of pixels in the patch group minus 1.
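The quantities defined above can be computed directly with NumPy. In the sketch below, `channel_variance` and `patch_neighbors` follow the definitions of var(F) and of the patch pixels other than p_ij. The function `state_value`, by contrast, is only an assumed way of combining them (the mean absolute gap between var(F_ij) and each var(F_mn)), because equation (2) itself is not reproduced in this text. The square patch with a `radius` parameter is likewise an assumption.

```python
import numpy as np

def channel_variance(features):
    """Unbiased variance of the C feature values at each pixel: (H', W', C) -> (H', W')."""
    return np.var(features, axis=-1, ddof=1)

def patch_neighbors(i, j, shape, radius=1):
    """Coordinates (m, n) in the square patch centered on (i, j), excluding (i, j)."""
    h, w = shape
    return [(m, n)
            for m in range(max(0, i - radius), min(h, i + radius + 1))
            for n in range(max(0, j - radius), min(w, j + radius + 1))
            if (m, n) != (i, j)]

def state_value(features, i, j, radius=1):
    """ASSUMED form of S(p_ij): mean absolute gap between var(F_ij) and var(F_mn).

    This is not the patent's equation (2); it only combines the quantities
    that the text defines, to show how they fit together.
    """
    var = channel_variance(features)
    neighbors = patch_neighbors(i, j, var.shape, radius)
    gaps = [abs(var[i, j] - var[m, n]) for m, n in neighbors]
    return sum(gaps) / len(neighbors)   # len(neighbors) plays the role of |p|

# Toy feature map: every pixel is flat except the center one.
features = np.zeros((3, 3, 4))
features[1, 1] = [0.0, 2.0, 0.0, 2.0]
print(len(patch_neighbors(1, 1, (3, 3))))  # 8
print(state_value(features, 1, 1))
```

With this assumed form, a pixel whose feature variance stands out from its neighbors gets a large state value, which matches the text's notion of a pixel that can be distinguished from its surroundings.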

In this way, the loss function is defined using the state value of each pixel calculated based on the feature maps (S(p_ij), S(p'_{U(i,j)})) and the weighting value C_ij of each pixel indicated by the first weighting map. The state value of each pixel is calculated based on the feature values of that pixel and the feature values of a plurality of pixels surrounding it. Specifically, when the feature of each pixel is represented by C values (C is 2 or more), the state value of each pixel is calculated based on the unbiased variance of that pixel's C values and the unbiased variance of the C values of each of the surrounding pixels.

An example of a method for calculating the evaluation value based on the feature map and the first weighting map, and of a method for determining key points based on the evaluation value, is now described. In the present embodiment, the weighting value of each descriptor (each pixel) indicated by the first weighting map serves as the evaluation value, and pixels with large evaluation values are determined as key points. For example, pixels whose evaluation value is at or above a reference value may be determined as key points, a predetermined number of pixels may be selected in descending order of evaluation value, or key points may be determined by some other criterion.
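The two selection criteria named above, a reference-value threshold and a fixed number of highest-scoring pixels, can be sketched in NumPy as follows. The function names and the toy score array are illustrative assumptions; `scores` stands in for the per-pixel evaluation values taken from the first weighting map.

```python
import numpy as np

def keypoints_by_threshold(scores, threshold):
    """Pixels whose evaluation value is at least `threshold`, as (row, col) pairs."""
    rows, cols = np.nonzero(scores >= threshold)
    return list(zip(rows.tolist(), cols.tolist()))

def keypoints_top_k(scores, k):
    """The k pixels with the largest evaluation values, highest first."""
    flat = np.argsort(scores, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, scores.shape)
    return list(zip(rows.tolist(), cols.tolist()))

scores = np.array([[0.1, 0.8, 0.3],
                   [0.7, 0.2, 0.9]])
print(keypoints_by_threshold(scores, 0.7))  # [(0, 1), (1, 0), (1, 2)]
print(keypoints_top_k(scores, 2))           # [(1, 2), (0, 1)]
```

The threshold variant returns a variable number of key points per image, while the top-k variant fixes the count, which can be convenient when a downstream matcher expects a bounded number of points.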

As described above, learning based on the distinctive loss function causes the model to generate a feature map and a first weighting map in which pixels that are robust to changes in lighting conditions receive relatively high evaluation values. Calculating the evaluation values and determining the key points by the above method therefore results in pixels that are robust to changes in lighting conditions being determined as key points.

As described above, the learning device of the present embodiment executes distinctive learning based on distinctive learning data (learning based on a distinctive loss function) and adjusts the parameters of the learning model. As a result, the model comes to generate a feature map and a first weighting map in which pixels that are robust to changes in lighting conditions receive relatively high evaluation values. That is, the evaluation value becomes relatively high for pixels that can be distinguished from their surrounding pixels not only when captured under one specific lighting condition but also when captured under various lighting conditions. Consequently, pixels that are robust to changes in lighting conditions are more likely to be determined as key points.

<Second Embodiment>
"Overview"
The processing device of the present embodiment uses the learning model whose parameters have been adjusted by the learning device 10 described in the first embodiment to generate a feature map and a first weighting map from an input image, and determines key points based on them.

"Configuration"
First, an example of the hardware configuration of the processing device is described. Each functional unit of the processing device is realized by any combination of hardware and software, centered on the CPU of any computer, a memory, a program loaded into the memory, a storage unit such as a hard disk that stores the program (it can store not only programs stored before the device is shipped but also programs obtained from storage media such as CDs or downloaded from servers on the Internet), and a network connection interface. Those skilled in the art will understand that there are various modifications of the method and device for realizing this.

図２は、処理装置のハードウエア構成を例示するブロック図である。図２に示すように、処理装置は、プロセッサ１Ａ、メモリ２Ａ、入出力インターフェイス３Ａ、周辺回路４Ａ、バス５Ａを有する。周辺回路４Ａには、様々なモジュールが含まれる。処理装置は周辺回路４Ａを有さなくてもよい。なお、処理装置は物理的及び／又は論理的に分かれた複数の装置で構成されてもよいし、物理的及び／又は論理的に一体となった１つの装置で構成されてもよい。処理装置が物理的及び／又は論理的に分かれた複数の装置で構成される場合、複数の装置各々が上記ハードウエア構成を備えることができる。なお、図２の各要素の説明は第１の実施形態で行ったので、ここでは省略する。 FIG. 2 is a block diagram illustrating the hardware configuration of the processing device. As shown in FIG. 2, the processing device includes a processor 1A, a memory 2A, an input/output interface 3A, a peripheral circuit 4A, and a bus 5A. The peripheral circuit 4A includes various modules. The processing device does not need to have the peripheral circuit 4A. Note that the processing device may be composed of a plurality of physically and/or logically separated devices, or may be composed of one physically and/or logically integrated device. When the processing device is composed of a plurality of physically and/or logically separated devices, each of the plurality of devices can be provided with the above hardware configuration. Note that each element in FIG. 2 has been described in the first embodiment, and will therefore be omitted here.

Next, the functional configuration of the processing device will be described. As shown in FIG. 4, the processing device 20 includes a storage unit 21, an input unit 22, an estimation unit 23, and an output unit 24.

記憶部２１は、第１の実施形態で説明した学習装置１０によりパラメータを調整された学習モデルを記憶する。具体的には、記憶部２１は、学習モデルのパラメータ等、学習モデルの実行に必要な情報（データ）を記憶する。 The storage unit 21 stores a learning model whose parameters have been adjusted by the learning device 10 described in the first embodiment. Specifically, the storage unit 21 stores information (data) necessary for executing the learning model, such as parameters of the learning model.

入力部２２は、入力画像の入力を受付ける。 The input unit 22 receives input of an input image.

推定部２３は、記憶部２１に記憶されている学習モデルに入力画像を入力し、その推定結果を得る。 The estimation unit 23 inputs the input image to the learning model stored in the storage unit 21 and obtains the estimation result.

When the learning model is configured to generate and output a feature map and a first weighting map based on the input image, the estimation unit 23 calculates an evaluation value for each pixel based on the output feature map and first weighting map, and determines the pixels to be used as key points based on those evaluation values. The method for calculating the evaluation value and the method for determining the pixels to be used as key points are as described in the first embodiment.

On the other hand, when the learning model is configured to generate a feature map and a first weighting map based on the input image, determine key points based on them, and output the key points, the estimation unit 23 acquires, as the estimation result, the information indicating the key points (information indicating the pixels determined as key points) output from the learning model.
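The step from per-pixel evaluation values to key points can be sketched as follows. The actual evaluation-value calculation and selection rule are those of the first embodiment and are not reproduced here; this sketch simply assumes that the evaluation values are already available as a two-dimensional grid and that the highest-scoring pixels are taken. The function name `select_keypoints` and the parameter `top_k` are hypothetical.

```python
def select_keypoints(score_map, top_k):
    """Return the (row, col) coordinates of the top_k highest-scoring pixels.

    score_map is a list of rows, each a list of per-pixel evaluation values.
    Taking the top_k pixels is an assumption made for illustration only.
    """
    scored = [
        (score, (i, j))
        for i, row in enumerate(score_map)
        for j, score in enumerate(row)
    ]
    # Sort by descending score; ties are broken by coordinate order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [coord for _, coord in scored[:top_k]]


# Illustrative 2 x 3 grid of evaluation values.
scores = [
    [0.1, 0.9, 0.2],
    [0.4, 0.3, 0.8],
]
keypoints = select_keypoints(scores, top_k=2)
```

In this sketch the list `keypoints` plays the role of the information indicating the pixels determined as key points, which the output unit 24 would output.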

出力部２４は、キーポイントして決定されたピクセルを示す情報を出力する。 The output unit 24 outputs information indicating the pixels determined as key points.

次に、図５を用いて、処理装置２０の利用例を説明する。当該例では、クエリ画像に含まれる被写体に類似する被写体を含む画像をデータベースの中から検索し、検索した画像を出力する処理に、処理装置２０が利用される。 Next, a usage example of the processing device 20 will be described using FIG. 5. In this example, the processing device 20 is used to search a database for images that include a subject similar to the subject included in the query image, and to output the searched images.

図示するように処理装置２０にクエリ画像が入力されると、処理装置２０は上述した処理を実行し、クエリ画像からキーポイントを決定する。そして、キーポイントとして決定したピクセルを示す情報を出力する。出力された情報は、類似画像検索装置に入力される。類似画像検索装置は、キーポイントとして決定されたピクセルの特徴量と類似する特徴量を含む画像をデータベースの中から検索し、検索した画像を出力する。 As illustrated, when a query image is input to the processing device 20, the processing device 20 executes the above-described processing and determines key points from the query image. Then, information indicating the pixels determined as key points is output. The output information is input to a similar image search device. The similar image search device searches a database for an image that includes a feature amount similar to the feature amount of a pixel determined as a key point, and outputs the searched image.
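The search performed by the similar image search device can be sketched as follows. The text does not specify the similarity measure or the ranking rule, so this sketch assumes cosine similarity between key-point feature amounts and ranks each database image by the average best match over the query key points. All names (`cosine_similarity`, `image_score`, `search_similar`) and the sample data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def image_score(query_descriptors, image_descriptors):
    """Average, over query key points, of the best similarity found in the image."""
    if not query_descriptors or not image_descriptors:
        return 0.0
    best = [
        max(cosine_similarity(q, d) for d in image_descriptors)
        for q in query_descriptors
    ]
    return sum(best) / len(best)


def search_similar(query_descriptors, database):
    """Return database image ids ordered from most to least similar."""
    return sorted(
        database,
        key=lambda image_id: image_score(query_descriptors, database[image_id]),
        reverse=True,
    )


# Illustrative database: image id -> feature amounts of its key points.
database = {
    "building_a_night": [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]],
    "building_b_day": [[0.0, 0.1, 0.9], [0.2, 0.0, 0.7]],
}
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
ranking = search_similar(query, database)
```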

As described above, the processing device 20 determines and outputs, as key points, pixels that are robust to changes in illumination conditions. Performing an image search using the feature amounts of such key points makes it possible to search with high accuracy for images that include a subject similar in appearance to the subject in the query image, regardless of whether the illumination conditions under which those images were captured are equivalent to those under which the query image was captured. For example, an image search based on a query image that includes building A photographed in the daytime can retrieve with high accuracy not only images of building A photographed in the daytime but also images of building A photographed at night.

図６を用いて、処理装置２０の他の利用例を説明する。当該例では、クエリ画像に含まれる被写体に類似する被写体を含む画像をデータベースの中から検索し、検索した画像に紐付けられていた位置情報を出力する処理に、処理装置２０が利用される。 Another usage example of the processing device 20 will be explained using FIG. 6. In this example, the processing device 20 is used to search a database for an image that includes a subject similar to the subject included in the query image, and to output position information associated with the searched image.

図示するように処理装置２０にクエリ画像が入力されると、処理装置２０は上述した処理を実行し、クエリ画像からキーポイントを決定する。そして、キーポイントとして決定したピクセルを示す情報を出力する。出力された情報は、類似画像検索装置に入力される。類似画像検索装置は、キーポイントとして決定されたピクセルの特徴量と類似する特徴量を含む画像をデータベースの中から検索する。なお、データベースにおいては、各画像を撮影した位置を示す位置情報が各画像に紐付けて記憶されている。類似画像検索装置は、検索した画像に紐付けられた位置情報を出力する。 As illustrated, when a query image is input to the processing device 20, the processing device 20 executes the above-described processing and determines key points from the query image. Then, information indicating the pixels determined as key points is output. The output information is input to a similar image search device. The similar image search device searches a database for an image that includes a feature amount similar to the feature amount of the pixel determined as a key point. Note that in the database, position information indicating the position where each image was photographed is stored in association with each image. The similar image search device outputs location information linked to the searched images.
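The position lookup described above can be sketched as follows, assuming that each database entry stores the feature amounts of its key points together with the position at which the image was captured. The matching score (a sum of best dot products), the function name `best_match_position`, and the sample coordinates are assumptions made for illustration, not values from the source.

```python
def best_match_position(query_descriptors, database):
    """Return the position linked to the database image that best matches the query."""

    def score(descriptors):
        # Sum, over query descriptors, of the largest dot product
        # with any descriptor stored for the image.
        return sum(
            max(sum(q * d for q, d in zip(qv, dv)) for dv in descriptors)
            for qv in query_descriptors
        )

    best = max(database, key=lambda entry: score(entry["descriptors"]))
    return best["position"]


# Illustrative entries; the positions are made-up (latitude, longitude) pairs.
database = [
    {"descriptors": [[0.9, 0.1], [0.2, 0.8]], "position": (35.6812, 139.7671)},
    {"descriptors": [[0.1, 0.1], [0.0, 0.2]], "position": (34.7025, 135.4959)},
]
position = best_match_position([[1.0, 0.0], [0.0, 1.0]], database)
```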

以上、本実施形態の処理装置２０によれば、照明条件の変化に頑健なピクセルをキーポイントとして決定し、出力することができる。このようなキーポイントを利用することで、照明条件の変化に頑健な画像検索が実現される。 As described above, according to the processing device 20 of this embodiment, pixels that are robust to changes in illumination conditions can be determined as key points and output. By using such key points, image retrieval that is robust to changes in lighting conditions can be realized.

<Third embodiment>
The processing device 20 of this embodiment generates, based on the input image, at least one of a repeatability map and a reliability map in addition to the feature map and the first weighting map. The feature map and the first weighting map are generated by the method described in the second embodiment. The repeatability map and the reliability map are generated by the method disclosed in Non-Patent Document 1.

Both the repeatability map and the reliability map indicate a weighting value for each pixel. When the input image has H×W pixels and the feature map is expressed as H′×W′×C, both the repeatability map and the reliability map are expressed as H′×W′×1. That is, in the repeatability map and the reliability map, each pixel has a single value.

The processing device 20 then calculates an evaluation value for each pixel using at least one of the repeatability map and the reliability map in addition to the feature map and the first weighting map, and determines key points based on the calculated evaluation values.

For example, the evaluation value may be calculated for each pixel by multiplying together the weighting value indicated by the first weighting map, the weighting value indicated by the repeatability map, and the weighting value indicated by the reliability map.
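The multiplication of the maps can be sketched as follows, assuming that each map is given as an H′×W′ grid with one value per pixel. The function name `multiply_maps` and the sample values are hypothetical.

```python
def multiply_maps(*maps):
    """Element-wise product of equally sized H' x W' maps (lists of rows)."""
    height, width = len(maps[0]), len(maps[0][0])
    for m in maps:
        if len(m) != height or any(len(row) != width for row in m):
            raise ValueError("all maps must have the same H' x W' shape")
    result = [[1.0] * width for _ in range(height)]
    for m in maps:
        for i in range(height):
            for j in range(width):
                result[i][j] *= m[i][j]
    return result


# Illustrative 2 x 2 maps, one weighting value per pixel.
first_weighting = [[0.5, 1.0], [0.2, 0.8]]
repeatability = [[1.0, 0.5], [0.5, 1.0]]
reliability = [[0.5, 0.5], [1.0, 0.5]]
evaluation = multiply_maps(first_weighting, repeatability, reliability)
```

With a product, a pixel that scores near zero in any one map receives a low evaluation value regardless of its values in the other maps.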

Alternatively, the evaluation value may be calculated for each pixel by adding together the weighting value indicated by the first weighting map, the weighting value indicated by the repeatability map, and the weighting value indicated by the reliability map. In this case, the sum of (weighting value indicated by the first weighting map)×α, (weighting value indicated by the repeatability map)×β, and (weighting value indicated by the reliability map)×γ may be used as the evaluation value, where α, β, and γ are the weights assigned to the respective maps. With this approach, pixels whose evaluation value, calculated by evaluating the three maps together, is high are extracted as key points. As another example, key points may be limited to pixels that, in addition to having a high evaluation value calculated from the three maps, score high in every individual map. For example, the scores of each map may be filtered using a predetermined condition (greater than a threshold). Specifically, when the repeatability map is denoted by A, the reliability map by B, and the first weighting map by C, the pixel group M that satisfies the condition is defined as M = {(i, j) ∈ R^2 | A_ij > thr1, B_ij > thr2, C_ij > thr3}. The evaluation value described above may then be calculated for the pixel group M, and key points may be extracted from the pixel group M based on the evaluation values.
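The weighted sum combined with the threshold filter can be sketched as follows. The symbols A, B, C, α, β, γ, thr1, thr2, and thr3 follow the text; the function name `weighted_sum_scores` and all numeric values are assumptions chosen for illustration.

```python
def weighted_sum_scores(c_map, a_map, b_map, alpha, beta, gamma, thr1, thr2, thr3):
    """Return {(i, j): evaluation value} for the pixels in group M.

    a_map: repeatability map A, b_map: reliability map B,
    c_map: first weighting map C (each an H' x W' list of rows).
    """
    scores = {}
    for i, row in enumerate(c_map):
        for j, c in enumerate(row):
            a, b = a_map[i][j], b_map[i][j]
            # Pixel group M: A_ij > thr1, B_ij > thr2, C_ij > thr3.
            if a > thr1 and b > thr2 and c > thr3:
                scores[(i, j)] = alpha * c + beta * a + gamma * b
    return scores


# Illustrative 2 x 2 maps and parameters.
c_map = [[0.9, 0.2], [0.7, 0.8]]
a_map = [[0.8, 0.9], [0.3, 0.6]]
b_map = [[0.7, 0.9], [0.9, 0.6]]
scores = weighted_sum_scores(
    c_map, a_map, b_map,
    alpha=0.5, beta=0.25, gamma=0.25,
    thr1=0.5, thr2=0.5, thr3=0.5,
)
best_pixel = max(scores, key=scores.get)
```

In this example, pixel (0, 1) is excluded because its first weighting value does not exceed thr3, and pixel (1, 0) is excluded because its repeatability value does not exceed thr1, even though each scores high in the other maps.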

Alternatively, in the above calculation methods, the evaluation value may be calculated using only one of the weighting value indicated by the repeatability map and the weighting value indicated by the reliability map, rather than both.

算出した評価値に基づきキーポイントを決定する処理は、第２の実施形態で説明した通りである。 The process of determining key points based on the calculated evaluation values is as described in the second embodiment.

処理装置２０のその他の構成は、第２の実施形態と同様である。 The other configuration of the processing device 20 is the same as that of the second embodiment.

The processing device 20 of this embodiment achieves the same effects as the second embodiment. In addition, the processing device 20 of this embodiment can determine key points using at least one of the repeatability map and the reliability map in addition to the feature map and the first weighting map. As a result, pixels that are robust not only to changes in illumination conditions but also to changes in shooting angle and the like can be determined as key points and output. Using such key points realizes image retrieval that is robust to changes in illumination conditions as well as to changes in shooting angle and the like.

In this specification, "acquisition" includes at least one of the following: "the own device going to fetch data stored in another device or a storage medium (active acquisition)" based on user input or a program instruction, for example, requesting or querying another device and receiving the data, or accessing another device or a storage medium and reading the data; "inputting into the own device data output from another device (passive acquisition)" based on user input or a program instruction, for example, receiving data that is distributed (or transmitted, push-notified, etc.), or selecting and acquiring data from among received data or information; and "generating new data by editing data (converting it into text, sorting the data, extracting part of the data, changing the file format, etc.) and acquiring the new data".

Part or all of the above embodiments may also be described as in the following supplementary notes, but are not limited to the following.
1. A learning device comprising:
storage means for storing parameters of a learning model that generates, based on an input image, a feature map indicating a feature amount for each pixel and a first weighting map indicating, for each pixel, a weighting value used in processing for determining pixels to be key points;
acquisition means for acquiring a combination of a plurality of learning images that differ from one another in illumination conditions and include the same subject; and
learning means for adjusting the parameters of the learning model based on a loss function defined using a plurality of parameters calculated based on each of the plurality of feature maps generated from each of the plurality of learning images and a parameter calculated based on the first weighting map generated from the plurality of learning images.
2. The learning device according to 1, wherein the learning means adjusts, based on the loss function, the parameters of the learning model that generates both the feature map and the first weighting map based on the input image.
3. The learning device according to 1 or 2, wherein the loss function is defined using a state value of each pixel calculated based on the feature map and the weighting value of each pixel indicated by the first weighting map, and
the state value of each pixel is calculated based on the feature amount of that pixel and the feature amounts of a plurality of pixels surrounding that pixel.
4. The learning device according to 3, wherein the feature amount of each pixel is represented by C values (C is 2 or more), and
the state value of each pixel is calculated based on the unbiased variance of the C values of that pixel and the unbiased variance of the C values of each of a plurality of pixels surrounding that pixel.
5. The learning device according to any one of 1 to 4, wherein the loss function is expressed by the following formula.

(I denotes the first learning image. I′ denotes the second learning image. i and j denote the coordinate values of a pixel in the first learning image. p_ij denotes a pixel within the area of the first learning image in which the subject is present. S(p_ij) denotes the state value of pixel p_ij of the first learning image. U(i, j) denotes the pixel of the second learning image corresponding to pixel (i, j) of the first learning image. p′_{U(i, j)} denotes the pixel of the second learning image corresponding to pixel p_ij of the first learning image. S(p′_{U(i, j)}) denotes the state value of pixel p′_{U(i, j)} of the second learning image. C_ij denotes the weighting value of pixel (i, j) of the first weighting map generated based on the first learning image. P denotes the patch group focused on p_ij. |P| denotes the number of pixels included in the patch group. F_ij denotes the feature amount (C values, where C is 2 or more) of pixel p_ij. var(F_ij) denotes the unbiased variance of the C values of F_ij. m and n denote the coordinate values of the pixels, other than p_ij, included in the patch group focused on p_ij. F_mn denotes the feature amount (C values, where C is 2 or more) of a pixel, other than p_ij, included in the patch group focused on p_ij. var(F_mn) denotes the unbiased variance of the C values of F_mn. |p| denotes the count of the patch group minus 1.)
6. A learning method in which a computer:
stores parameters of a learning model that generates, based on an input image, a feature map indicating a feature amount for each pixel and a first weighting map indicating, for each pixel, a weighting value used in processing for determining pixels to be key points;
acquires a combination of a plurality of learning images that differ from one another in illumination conditions and include the same subject; and
adjusts the parameters of the learning model based on a loss function defined using a plurality of parameters calculated based on each of the plurality of feature maps generated from each of the plurality of learning images and a parameter calculated based on the first weighting map generated from the plurality of learning images.
7. A program that causes a computer to function as:
storage means for storing parameters of a learning model that generates, based on an input image, a feature map indicating a feature amount for each pixel and a first weighting map indicating, for each pixel, a weighting value used in processing for determining pixels to be key points;
acquisition means for acquiring a combination of a plurality of learning images that differ from one another in illumination conditions and include the same subject; and
learning means for adjusting the parameters of the learning model based on a loss function defined using a plurality of parameters calculated based on each of the plurality of feature maps generated from each of the plurality of learning images and a parameter calculated based on the first weighting map generated from the plurality of learning images.
8. A processing device that determines key points of an input image using a learning model generated by the learning device according to any one of 1 to 5.
9. A processing method in which a computer determines key points of an input image using a learning model generated by the learning device according to any one of 1 to 5.
10. A program that causes a computer to function as means for determining key points of an input image using a learning model generated by the learning device according to any one of 1 to 5.
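The unbiased variance referred to in note 4 can be sketched as follows. The sketch covers only the per-pixel unbiased variance of the C feature values and the collection of the variances of the surrounding pixels; how these quantities are combined into the state value is given by the formula of note 5 and is not reproduced here. The square patch with a `radius` parameter and the function names `variance_map` and `patch_variances` are assumptions made for illustration.

```python
from statistics import variance


def variance_map(feature_map):
    """Unbiased variance of the C feature values at each pixel of an H' x W' x C map."""
    # statistics.variance computes the sample variance (divisor C - 1).
    return [[variance(pixel) for pixel in row] for row in feature_map]


def patch_variances(var_map, i, j, radius=1):
    """Variances of the pixels around (i, j), excluding (i, j) itself.

    The patch is assumed to be a square window clipped at the map border.
    """
    height, width = len(var_map), len(var_map[0])
    return [
        var_map[m][n]
        for m in range(max(0, i - radius), min(height, i + radius + 1))
        for n in range(max(0, j - radius), min(width, j + radius + 1))
        if (m, n) != (i, j)
    ]


# Illustrative 2 x 2 feature map with C = 3 values per pixel.
feature_map = [
    [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]],
    [[0.0, 4.0, 8.0], [1.0, 1.0, 4.0]],
]
var_map = variance_map(feature_map)
neighbours = patch_variances(var_map, 0, 0)
```

Because the unbiased variance divides by C − 1, it is defined only when C is 2 or more, which matches the condition on C stated in note 4.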

10 learning device
11 storage unit
12 acquisition unit
13 learning unit
20 processing device
21 storage unit
22 input unit
23 estimation unit
24 output unit
1A processor
2A memory
3A input/output I/F
4A peripheral circuit
5A bus

Claims

1. A learning device comprising:
storage means for storing parameters of a learning model that generates, based on an input image, a feature map indicating a feature amount for each pixel and a first weighting map indicating, for each pixel, a weighting value used in processing for determining pixels to be key points;
acquisition means for acquiring a combination of a plurality of learning images that differ from one another in illumination conditions and include the same subject; and
learning means for adjusting the parameters of the learning model based on a loss function defined using a plurality of parameters calculated based on each of the plurality of feature maps generated from each of the plurality of learning images and a parameter calculated based on the first weighting map generated from the plurality of learning images.

2. The learning device according to claim 1, wherein the learning means adjusts, based on the loss function, the parameters of the learning model that generates both the feature map and the first weighting map based on the input image.

3. The learning device according to claim 1 or 2, wherein the loss function is defined using a state value of each pixel calculated based on the feature map and the weighting value of each pixel indicated by the first weighting map, and
the state value of each pixel is calculated based on the feature amount of that pixel and the feature amounts of a plurality of pixels surrounding that pixel.

4. The learning device according to claim 3, wherein the feature amount of each pixel is represented by C values (C is 2 or more), and
the state value of each pixel is calculated based on the unbiased variance of the C values of that pixel and the unbiased variance of the C values of each of a plurality of pixels surrounding that pixel.

5. The learning device according to any one of claims 1 to 4, wherein the loss function is expressed by the following formula.

(I denotes the first learning image. I′ denotes the second learning image. i and j denote the coordinate values of a pixel in the first learning image. p_ij denotes a pixel within the area of the first learning image in which the subject is present. S(p_ij) denotes the state value of pixel p_ij of the first learning image. U(i, j) denotes the pixel of the second learning image corresponding to pixel (i, j) of the first learning image. p′_{U(i, j)} denotes the pixel of the second learning image corresponding to pixel p_ij of the first learning image. S(p′_{U(i, j)}) denotes the state value of pixel p′_{U(i, j)} of the second learning image. C_ij denotes the weighting value of pixel (i, j) of the first weighting map generated based on the first learning image. P denotes the patch group focused on p_ij. |P| denotes the number of pixels included in the patch group. F_ij denotes the feature amount (C values, where C is 2 or more) of pixel p_ij. var(F_ij) denotes the unbiased variance of the C values of F_ij. m and n denote the coordinate values of the pixels, other than p_ij, included in the patch group focused on p_ij. F_mn denotes the feature amount (C values, where C is 2 or more) of a pixel, other than p_ij, included in the patch group focused on p_ij. var(F_mn) denotes the unbiased variance of the C values of F_mn. |p| denotes the count of the patch group minus 1.)

6. A learning method in which a computer:
stores parameters of a learning model that generates, based on an input image, a feature map indicating a feature amount for each pixel and a first weighting map indicating, for each pixel, a weighting value used in processing for determining pixels to be key points;
acquires a combination of a plurality of learning images that differ from one another in illumination conditions and include the same subject; and
adjusts the parameters of the learning model based on a loss function defined using a plurality of parameters calculated based on each of the plurality of feature maps generated from each of the plurality of learning images and a parameter calculated based on the first weighting map generated from the plurality of learning images.

7. A program that causes a computer to function as:
storage means for storing parameters of a learning model that generates, based on an input image, a feature map indicating a feature amount for each pixel and a first weighting map indicating, for each pixel, a weighting value used in processing for determining pixels to be key points;
acquisition means for acquiring a combination of a plurality of learning images that differ from one another in illumination conditions and include the same subject; and
learning means for adjusting the parameters of the learning model based on a loss function defined using a plurality of parameters calculated based on each of the plurality of feature maps generated from each of the plurality of learning images and a parameter calculated based on the first weighting map generated from the plurality of learning images.

8. A processing device that determines key points of an input image using a learning model generated by the learning device according to any one of claims 1 to 5.

9. A processing method in which a computer determines key points of an input image using a learning model generated by the learning device according to any one of claims 1 to 5.

10. A program that causes a computer to function as means for determining key points of an input image using a learning model generated by the learning device according to any one of claims 1 to 5.