JP6758250B2

JP6758250B2 - Local feature expression learning device and method

Info

Publication number: JP6758250B2
Application number: JP2017101178A
Authority: JP
Inventors: 周平田良島; 隆行黒住; 杵渕　哲也; 哲也杵渕
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-05-22
Filing date: 2017-05-22
Publication date: 2020-09-23
Anticipated expiration: 2037-05-22
Also published as: JP2018195270A

Description

The present invention relates to a local feature expression learning device and method, and more particularly to a local feature expression learning device and method for learning a local feature expression from an image data set including a plurality of images.

Local feature expression technology, which describes arbitrary small regions extracted from an image (hereinafter referred to as "patches") as features, is a basic pattern recognition technology with a great many applications, such as image and video retrieval, stereo matching, and image editing. It falls broadly into two types: methods in which the local feature expression algorithm is designed by hand, as typified by the SIFT and SURF descriptors, and methods in which the feature expression is acquired by machine learning from a very large data set.

Non-Patent Document 1 discloses a method of using a neural network to learn, from a large-scale labeled patch data set, a feature expression such that the L2 distance (norm) between patch pairs capturing the same physical region (same label) becomes small, while the L2 distance between patch pairs capturing different physical regions (different labels) becomes sufficiently large.

Non-Patent Document 2, meanwhile, discloses a method that takes as its minimum input unit a triplet consisting of two patches belonging to the same label and one patch with a different label, and uses a neural network to learn a feature expression such that the L2 distance between the same-label patches is smaller than the L2 distance between the patches with different labels.

Non-Patent Document 3 discloses a method of learning a local feature expression from a data set to which only image-level labels are assigned. Specifically, it defines a similarity computed between the sets of local features extracted from the two images of an image pair, and proposes training a neural network such that the similarity between image pairs with the same label is higher than the similarity between images with different labels.

[Non-Patent Document 1] E. Simo-Serra et al., Discriminative Learning of Deep Convolutional Feature Point Descriptors, in ICCV, 2015. Internet <URL: http://cvlabwww.epfl.ch/~trulls/pdf/iccv-2015-deepdesc.pdf>
[Non-Patent Document 2] V. Balntas et al., Learning local feature descriptors with triplets and shallow convolutional neural networks, in BMVC, 2016. Internet <URL: http://www.bmva.org/bmvc/2016/papers/paper119/paper119.pdf>
[Non-Patent Document 3] Nenad Markus et al., Learning Local Descriptors by Optimizing the Keypoint-Correspondence Criterion, in ICPR, 2016.

However, the methods of both Non-Patent Document 1 and Non-Patent Document 2 assume the existence of training data with correct labels assigned at the patch level, so applying them directly requires manually labeling an enormous number of patches.

Non-Patent Document 3 introduces an image-level similarity and thereby learns a local feature expression without patch-level labels. However, this similarity is defined solely on a brute-force comparison of local features and takes no account of the spatial relationships among them. The spatial relationships among local features are an important clue for finding patches that are extracted from the same physical region but differ in appearance, for example because of geometric or optical factors. Using such patches as training data is considered essential for acquiring a local feature expression that is robust to geometric and optical deformations.

In summary, known techniques for acquiring a local feature expression robust to geometric and optical deformations incur an enormous manual cost for labeling the patches used in training. Known techniques that do not require patch-level annotation, on the other hand, include no mechanism for taking the spatial relationships among local features into account, and therefore cannot make full use of the information latent in the training data.

The present invention has been made in view of the above problems, and its object is to provide a local feature expression learning device and method capable of learning a local feature descriptor that is robust to geometric and optical deformations without using patch-level annotations.

To achieve the above object, a first aspect of the present disclosure is a local feature expression learning device that learns a local feature expression from an image data set including a plurality of images, the device comprising: a patch extraction unit that extracts a plurality of patches from each image included in the input image data set; a same region patch pair extraction unit that determines the geometric identity of arbitrary image pairs included in the image data set and, from image pairs determined to be identical, uses geometric deformation parameters to extract pairs of the patches that capture the same physical region; a learning data construction unit that constructs training data for learning the local feature expression, using the set of patches made up of the plurality of patches extracted by the patch extraction unit and the pairs of patches extracted by the same region patch pair extraction unit; a local feature expression learning unit that learns the local feature expression using the training data constructed by the learning data construction unit; and an end determination unit that, until a predetermined criterion is satisfied, causes the processing from the same region patch pair extraction unit through the local feature expression learning unit to be repeated using the local feature expression learned by the local feature expression learning unit.

A second aspect of the present disclosure is the local feature expression learning device of the first aspect, wherein the same region patch pair extraction unit comprises: a local feature description unit that describes each of the plurality of patches extracted by the patch extraction unit using the local feature expression; a deformation estimation unit that, for image pairs included in the image data set, determines the geometric identity of each image pair and estimates the geometric deformation parameters, based on the local features of the patches described by the local feature description unit; and a same region determination unit that, for image pairs determined to be identical by the deformation estimation unit, uses the deformation parameters to extract the pairs of patches that capture the same physical region.

A third aspect of the present disclosure is the local feature expression learning device of the first or second aspect, wherein the local feature expression learning unit uses the training data to train a neural network that maps patches into a feature space in which pairs of patches determined by the same region patch pair extraction unit to capture the same physical region have high similarity, and pairs of patches determined to capture different physical regions have low similarity.

A fourth aspect of the present disclosure is the local feature expression learning device of any one of the first to third aspects, wherein the learning data construction unit uses the identity determination results of the same region patch pair extraction unit when adopting, as training data, pairs of patches determined not to capture the same physical region.

A fifth aspect of the present disclosure is the local feature expression learning device of any one of the first to third aspects, wherein the learning data construction unit uses the image-level label information attached to the input image data set when adopting, as training data, pairs of patches determined not to capture the same physical region.

To achieve the above object, a sixth aspect of the present disclosure is a local feature expression learning method in a local feature expression learning device that learns a local feature expression from an image data set including a plurality of images, the method comprising: a step in which a patch extraction unit extracts a plurality of patches from each image included in the input image data set; a step in which a same region patch pair extraction unit determines the geometric identity of arbitrary image pairs included in the image data set and, from image pairs determined to be identical, uses geometric deformation parameters to extract pairs of the patches that capture the same physical region; a step in which a learning data construction unit constructs training data for learning the local feature expression, using the set of patches made up of the plurality of patches extracted by the patch extraction unit and the pairs of patches extracted by the same region patch pair extraction unit; a step in which a local feature expression learning unit learns the local feature expression using the training data constructed by the learning data construction unit; and a step in which an end determination unit, until a predetermined criterion is satisfied, causes the processing from the same region patch pair extraction unit through the local feature expression learning unit to be repeated using the local feature expression learned by the local feature expression learning unit.

According to the present disclosure, a local feature descriptor that is robust to geometric and optical deformations can be learned without using patch-level annotations.

A configuration diagram showing an example of the image feature device in the embodiment.
A diagram for explaining an example of the patch extraction method in the patch extraction unit of the embodiment.
A diagram for explaining another example of the patch extraction method in the patch extraction unit of the embodiment.
A configuration diagram showing an example of the configuration of the same region patch pair extraction unit of the embodiment.
A diagram for explaining the operation of the local feature description unit, the deformation estimation unit, and the same region determination unit of the same region patch pair extraction unit of the embodiment.
A diagram showing an example of a Siamese network.
A diagram showing another example of a Siamese network.
A diagram showing an example of a Triplet network.
A configuration diagram showing a specific configuration example of a Siamese network.
A configuration diagram showing a specific configuration example of a Triplet network.
A flowchart showing an example of the local feature expression learning processing routine in the local feature expression learning device of the present embodiment.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. The embodiment does not limit the present invention.

<Overview of the embodiment of the present disclosure>

In the present disclosure, the same region patch pair extraction unit determines geometric identity for arbitrary image pairs in the image data set and, using the deformation parameters between images determined to be identical, extracts pairs of patches considered to capture the same physical region (hereinafter simply the "same region"). The local feature expression learning unit then uses a neural network to learn a local feature expression (local feature descriptor) under which the patch pairs extracted by the same region patch pair extraction unit are similar to each other (or have a small distance). Furthermore, the local feature descriptor learned by the local feature expression learning unit is used to update the same region patch pairs extracted by the same region patch pair extraction unit, and the resulting training data is used to update the neural network in the local feature expression learning unit. This cycle is repeated until a predetermined condition is satisfied.
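The alternating procedure described above can be sketched as a simple loop. This is a minimal illustration only: every callable passed in (`extract_patches`, `extract_same_region_pairs`, `build_training_data`, `train`, `is_done`) and the `max_rounds` cap are hypothetical placeholders standing in for the units of the device, not names taken from the patent.

```python
def learn_descriptor(dataset, descriptor, extract_patches,
                     extract_same_region_pairs, build_training_data,
                     train, is_done, max_rounds=10):
    """Alternate between mining same-region pairs and retraining the descriptor."""
    # Patches are extracted once; only the pairs change between rounds.
    patches = extract_patches(dataset)
    for round_idx in range(max_rounds):
        # Mine same-region patch pairs with the current descriptor.
        pairs = extract_same_region_pairs(dataset, patches, descriptor)
        # Turn patches and pairs into training samples (pairs or triplets).
        data = build_training_data(patches, pairs)
        # Update the descriptor (e.g. a neural network) on the new samples.
        descriptor = train(descriptor, data)
        # Stop once the predetermined criterion is met (assumed callable).
        if is_done(round_idx, descriptor):
            break
    return descriptor
```

The point of the loop is that a better descriptor yields better correspondences, which in turn yield harder and more varied positive pairs for the next round.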

In the present disclosure, patch pairs (capturing the same region) are extracted using the results of determining identity between images. This makes it possible to learn a local feature expression without patch-level annotations, so that, unlike Non-Patent Documents 1 and 2, no patch-level annotations need to be prepared in advance. Moreover, patch pairs are extracted using the deformation parameters obtained from the identity determination between images, rather than by simply comparing patch similarity under a known local feature expression. The training data can therefore include patch pairs that a known local feature expression cannot extract, for example pairs that capture the same region but in which one image is strongly affected by at least one of geometric and optical deformation. As a result, compared with, for example, the technique disclosed in Non-Patent Document 3, a local feature descriptor that is more robust to such geometric and optical deformations can be learned.

<Configuration of the local feature expression learning device of the embodiment of the present disclosure>

The local feature expression learning device of the present embodiment takes as input an arbitrary image data set containing a plurality of arbitrary images and outputs a local feature descriptor. FIG. 1 shows a configuration diagram of an example of the image feature device of the present embodiment.

As an example, FIG. 1 shows the case where the input to the local feature expression learning device 20 is an image data set 10 in which a plurality of images 12 are labeled with a plurality of categories (categories 1 to 3 are illustrated in FIG. 1).

The image data set 10 is input to the patch extraction unit 30. The patch extraction unit 30 extracts, from each of the images 12 included in the image data set 10, the patches from which local features are to be extracted. Any patch extraction method may be used. For example, as shown in FIG. 2, a plurality of patches 102 can be extracted by dividing the image 100 into a grid. Alternatively, as shown in FIG. 3, a plurality of patches 112 can be extracted from the image 110 using a known local feature detector such as those described in Non-Patent Document 1 and Non-Patent Document 2.

The patch extraction unit 30 outputs the set of extracted patches.
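The grid-based variant of patch extraction can be sketched as follows. This is an illustrative sketch under assumptions: non-overlapping square patches, a NumPy array image, and the patch center used as the representative coordinate. The function name `extract_grid_patches` and the parameter `patch_size` are hypothetical.

```python
import numpy as np

def extract_grid_patches(image, patch_size):
    """Split an image into non-overlapping square patches on a regular grid.

    Returns the patches and the (x, y) center of each patch, which can
    serve as the patch's representative coordinate later on.
    """
    height, width = image.shape[:2]
    if height < patch_size or width < patch_size:
        raise ValueError("image is smaller than one patch")
    patches, centers = [], []
    # Border pixels that do not fill a whole patch are dropped.
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patches.append(image[top:top + patch_size, left:left + patch_size])
            centers.append((left + patch_size / 2.0, top + patch_size / 2.0))
    return np.stack(patches), np.array(centers)
```

A detector-based extractor would differ only in how the patch locations are chosen; the output (patches plus representative coordinates) would have the same shape.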

The set of patches is input from the patch extraction unit 30 to the same region patch pair extraction unit 32. Using the input set of patches, the same region patch pair extraction unit 32 determines geometric identity for arbitrary image pairs with the same label, and uses the resulting deformation parameters to extract patch pairs that capture the same region (same region patch pairs).

As shown in FIG. 4 as an example, the same region patch pair extraction unit 32 of the present embodiment includes a local feature description unit 40, a deformation estimation unit 42, and a same region determination unit 44. The same region patch pair extraction unit 32 shown in FIG. 4 is configured to extract same region patch pairs on the assumption that the geometric deformation parameter between images is a homography.

The set of patches is input from the patch extraction unit 30 to the local feature description unit 40. The local feature description unit 40 describes each patch using a local feature descriptor and associates patches between images based on at least one of the similarity and the distance between the resulting local features. As the local feature descriptor, for example, the one obtained up to the immediately preceding processing in the local feature description unit 40, or those obtained in Non-Patent Document 1, Reference 1, and Reference 2, can be used. The local feature description unit 40 outputs information representing the correspondences between local features.

[Reference 1] D. G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, in IJCV, 2004.
[Reference 2] K. Mikolajczyk et al., A Comparison of Affine Region Detectors, in IJCV, 2005.

The information representing the correspondences between local features is input from the local feature description unit 40 to the deformation estimation unit 42. Because the correspondences obtained by the local feature description unit 40 may include outliers, the deformation estimation unit 42 removes them with a robust estimation method such as RANSAC and estimates the geometric deformation between the images.

For example, in the example shown in FIG. 5, the local feature description unit 40 obtains, as shown in (A), a homography transformation matrix H as the geometric deformation parameter between image 130 and image 132 (the specific matrix appears as an equation image in the original and is not reproduced here). The deformation estimation unit 42 outputs the obtained deformation parameters.
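Robust homography estimation from putative correspondences can be sketched as below. This is a minimal NumPy illustration of plain RANSAC with a direct linear transform (DLT), not the patent's implementation. The function names, the iteration count `n_iter`, and the pixel tolerance `inlier_tol` are all assumptions chosen for the example, and coordinate normalization is omitted for brevity.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform (DLT) from at least 4 correspondences."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)  # null-space vector = homography up to scale
    with np.errstate(all="ignore"):
        return h / h[2, 2]

def project(h, pts):
    """Apply homography h to an (N, 2) array of points."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ h.T
    with np.errstate(all="ignore"):
        return out[:, :2] / out[:, 2:3]

def ransac_homography(src, dst, n_iter=500, inlier_tol=3.0, seed=0):
    """Estimate a homography while rejecting outlier correspondences."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        # Fit a candidate model to a random minimal sample of 4 pairs.
        idx = rng.choice(len(src), size=4, replace=False)
        h = fit_homography(src[idx], dst[idx])
        if not np.all(np.isfinite(h)):
            continue  # degenerate sample
        # Count correspondences whose reprojection error is small.
        err = np.linalg.norm(project(h, src) - dst, axis=1)
        inliers = err < inlier_tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 4:
        return None, best_inliers  # images judged not geometrically identical
    # Refit on all inliers of the best candidate.
    return fit_homography(src[best_inliers], dst[best_inliers]), best_inliers
```

In such a scheme the number of inliers could also serve as the identity test for the image pair: too few inliers suggests the two images do not show the same scene.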

The deformation parameters are input from the deformation estimation unit 42 to the same region determination unit 44. Using the deformation parameters, the same region determination unit 44 projects the representative coordinates of each patch in one image onto the other image. If the patch in the other image whose coordinates are closest lies at a distance smaller than a predetermined threshold, the two patches are determined to be a same region patch pair 138.
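The projection-and-threshold test described above can be sketched as follows, assuming the deformation parameter is a 3×3 homography and each patch is represented by one (x, y) coordinate. The function name `same_region_pairs` and the parameter `max_dist` are hypothetical.

```python
import numpy as np

def same_region_pairs(h, centers_a, centers_b, max_dist):
    """Pair patches of image A with patches of image B via a homography.

    centers_a, centers_b: (N, 2) and (M, 2) arrays of representative
    patch coordinates. Returns a list of (index_in_a, index_in_b).
    """
    # Project the representative coordinates of A into image B.
    pts = np.hstack([centers_a, np.ones((len(centers_a), 1))]) @ h.T
    projected = pts[:, :2] / pts[:, 2:3]
    # Distance from every projected point to every patch center in B.
    dists = np.linalg.norm(projected[:, None, :] - centers_b[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    # Keep the nearest patch only if it is closer than the threshold.
    return [(i, int(j)) for i, j in enumerate(nearest) if dists[i, j] < max_dist]
```

Because the pairing depends only on geometry, two patches can be paired even when their current descriptors are far apart, which is what lets strongly deformed positives enter the training data.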

The deformation estimation unit 42 makes it possible to extract patch pairs that have been strongly affected by at least one of geometric and optical deformation and that could not be extracted by the simple comparison of local features in the local feature description unit 40, for example the patch pair 134 and the patch pair 136 shown in (B) of FIG. 5. The deformation estimation unit 42 can also remove outliers, that is, patch pairs that merely look similar but were extracted from different physical regions, and extract patch pairs that capture the same region.

As the deformation parameter between images, an affine transformation matrix or a thin plate spatial transformer may be used instead of the homography transformation matrix H, and a camera pose matrix may be used when the extrinsic parameters of the camera that captured each image 12 are also known. In addition, when extracting from one image the counterpart of a patch in the other image, the patches obtained by the patch extraction unit 30 need not be used. Instead, the transformation matrix may be applied to the patch itself in one image, and the patch pair may be obtained by extracting the resulting region.

The same region determination unit 44 of the same region patch pair extraction unit 32 outputs the extracted same region patch pairs together with the set of patches input from the patch extraction unit 30.

The same region patch pairs and the set of patches output from the same region patch pair extraction unit 32 are input to the learning data construction unit 34. Using them, the learning data construction unit 34 constructs the data actually used for training in the subsequent local feature expression learning unit 36. How the data is constructed depends on the type of error function of the neural network used in the local feature expression learning unit 36.

When the error function of the neural network used in the local feature expression learning unit 36 takes patch pairs as input, the set of same region patch pairs is used as the positive pairs, and patch pairs that do not capture the same region are used as the negative pairs. For both positive and negative pairs, all patch pairs obtained from the training data and input from the same region patch pair extraction unit 32 may be used as training data, or a predetermined number of pairs sampled from them may be used. The obtained patch pairs may also be subjected to simple transformations, so-called data augmentation, to increase the amount of data.

When the neural network used in the local feature expression learning unit 36 takes patch triplets as input, the set of same region patch pairs is used as the positive patch pairs, and each triplet is formed from a positive patch pair and a patch extracted from a physical region different from theirs. In this case as well, all triplets obtainable from the training data may be used as training data, or a predetermined number of triplets sampled from them may be used.

Whether patch pairs or triplets are used as training data, the label information of each image may be used, when available, to extract negative patch pairs from different physical regions. In the geometric identity verification performed by the same region patch pair extraction unit 32, images with the same label may nevertheless fail to be determined identical. If negative patches are then drawn by random sampling, patches that actually come from the same region will, in very rare cases, be treated as negatives. Using the image label information avoids this problem. When the input image data set 10 is not labeled, the identity of image pairs cannot be determined strictly, so negative pairs can be extracted, for example, from image pairs with very few corresponding points.
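Label-aware triplet construction can be sketched as follows. This is an illustrative sketch, assuming patches are referred to by index and each patch records which image it came from. The function name `build_triplets` and the arguments `patch_image` and `image_label` are hypothetical.

```python
import random

def build_triplets(positive_pairs, patch_image, image_label, seed=0):
    """Form (anchor, positive, negative) triplets of patch indices.

    positive_pairs: list of (i, j) same-region patch index pairs.
    patch_image[p]: index of the image that patch p was extracted from.
    image_label[k]: image-level label of image k.
    """
    rng = random.Random(seed)
    triplets = []
    for anchor, positive in positive_pairs:
        anchor_label = image_label[patch_image[anchor]]
        # Draw negatives only from images with a different label, so a
        # patch of the same physical region is never used as a negative.
        candidates = [p for p in range(len(patch_image))
                      if image_label[patch_image[p]] != anchor_label]
        if candidates:
            triplets.append((anchor, positive, rng.choice(candidates)))
    return triplets
```

For pair-based losses, the same candidate filter would yield negative pairs `(anchor, negative)` instead of triplets.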

The learning data construction unit 34 outputs the constructed training data.

The training data is input from the learning data construction unit 34 to the local feature expression learning unit 36. Using the input training data, the local feature expression learning unit 36 learns a local feature descriptor that maps each patch into a feature space in which patches extracted from the same physical region have high similarity (small distance) and patches extracted from different physical regions have low similarity. The local feature descriptor learned here is a neural network that takes a pair or triplet of patches as its minimum input unit and is trained so that the positive patches obtained by the learning data construction unit 34 become similar to each other and the negative patches do not.

In the present embodiment, the neural network is not particularly limited as long as it satisfies the above, and any network can be used. For example, the Siamese network 150 shown in FIG. 6, the Siamese network 152 shown in FIG. 7, or the Triplet network 156 shown in FIG. 8 can be used.

The Siamese networks 150 and 152 shown in FIGS. 6 and 7 are neural networks in which an error function is defined from the distance or similarity between the intermediate representations obtained by inputting each patch of a patch pair into a forward propagation network. As an example, as shown in FIG. 6, the error function (i) is defined for the Siamese network 150, while the error function (ii) is defined for the Siamese network 152.

The weights W of the forward propagation network may or may not be shared between the two patches of a pair. As for the distance or similarity, a parameter-free measure such as the L2 distance or cosine similarity may be used, or a layer that learns the metric itself may be stacked on top of the forward propagation network.
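For reference, the two parameter-free measures named above can be written as follows. This is a minimal sketch that assumes the intermediate representations are plain lists of floats; the function names are chosen for this example only.

```python
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Note the opposite orientation of the two measures: a positive pair should give a small L2 distance but a large cosine similarity.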

As the error function (i), for example, the contrastive loss function expressed by equation (1) below can be used.

... (1)
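Equation (1) itself is not reproduced in this text. As an illustration only, one widely used form of the contrastive loss penalizes the squared distance of a positive pair and, for a negative pair, the squared shortfall of the distance below a margin. The exact form, any scaling factor, and the `margin` parameter below are assumptions and may differ from equation (1).

```python
def contrastive_loss(distance, is_positive, margin=1.0):
    """A common contrastive loss for one patch pair (assumed form).

    distance: non-negative distance between the two intermediate representations.
    is_positive: True if both patches capture the same physical region.
    margin: distance beyond which a negative pair incurs no loss (assumed parameter).
    """
    if is_positive:
        # Positive pairs are pulled together.
        return distance ** 2
    # Negative pairs are pushed apart until they are at least `margin` away.
    return max(0.0, margin - distance) ** 2
```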

As the error function (ii), for example, the binary cross entropy function expressed by equation (2) below can be used.

... (2)
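Equation (2) is likewise not reproduced in this text. The sketch below shows the standard binary cross entropy for a single pair, under the assumption that the network outputs a probability `p` that the pair is positive and that the target `y` is 1 for positive and 0 for negative pairs. The clipping constant `eps` is an implementation detail added here for numerical safety.

```python
import math


def binary_cross_entropy(p, y, eps=1e-12):
    """Standard binary cross entropy for one pair (assumed form).

    p: predicted probability that the pair captures the same region.
    y: target, 1 for a positive pair and 0 for a negative pair.
    """
    p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
```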

FIG. 9 shows a specific configuration example of the Siamese network 150 as a concrete example of a Siamese network actually used. The Siamese network is not limited to this configuration, and any form of forward propagation network can be adopted.

In the Siamese network 150 shown in FIG. 9, filter sizes are written as (h × w × c). Conv denotes a convolutional layer, MaxPooling a pooling layer, ReLu an activation function, Flatten a flattening layer, and FC a fully connected layer. FIG. 9 shows, as an example, the case where the size of the input patch is (32 × 32 × 1).

On the other hand, the Triplet network 156 shown in FIG. 8 is a neural network in which an error function is defined on the intermediate feature representations obtained by inputting each patch of a triplet into a forward propagation network, such that the similarity between the positive patches becomes higher than the similarity between the negative patches (in terms of distance, such that the distance between the positive patches becomes smaller than the distance between the negative patches). As an example, as shown in FIG. 8, the error function (iii) is defined for the Triplet network 156.

As the error function (iii), for example, the ratio loss function or the margin ranking loss function expressed by equation (3) below can be used.

... (3)
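Equation (3) is not reproduced in this text. A common form of the margin ranking loss for a triplet, given here as an assumption rather than as the patent's exact equation, is zero once the negative distance exceeds the positive distance by at least a margin. The parameter name `margin` is hypothetical.

```python
def margin_ranking_loss(dist_positive, dist_negative, margin=1.0):
    """A common margin ranking (triplet) loss (assumed form).

    dist_positive: distance between the anchor and the positive patch.
    dist_negative: distance between the anchor and the negative patch.
    """
    # Loss vanishes when dist_negative >= dist_positive + margin.
    return max(0.0, margin + dist_positive - dist_negative)
```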

FIG. 10 shows a specific configuration example of the Triplet network 156 actually used. The Triplet network 156 is not limited to this configuration, and any form of forward propagation network can be adopted.

The optimization method used to minimize the error function of the neural network described above is arbitrary; for example, SGD, momentum SGD, RMSProp, AdaGrad disclosed in Reference 3, Adadelta disclosed in Reference 4, or Adam disclosed in Reference 5 can be used. The learning rate and weight decay settings are also arbitrary. Of the neural network learned here, any intermediate representation output by the forward propagation network for a single patch can be used as the local feature descriptor.

[Reference 3] J. Duchi et al., Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, in JMLR, 2011.
[Reference 4] M. D. Zeiler, ADADELTA: An Adaptive Learning Rate Method, in CoRR, 2012.
[Reference 5] D. P. Kingma et al., ADAM: A Method for Stochastic Optimization, in Proc. ICLR, 2015.
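As one concrete instance of the optimizers listed above, a single momentum SGD step with weight decay can be sketched as follows. The text leaves the optimizer, learning rate, and weight decay open, so the formulation (weight decay added to the gradient) and the default values here are assumptions chosen for illustration.

```python
def momentum_sgd_step(weights, grads, velocity, lr=0.01, momentum=0.9, weight_decay=0.0):
    """One momentum SGD update over flat lists of floats (illustrative sketch).

    Returns the updated (weights, velocity) without modifying the inputs.
    """
    new_velocity = [
        momentum * v - lr * (g + weight_decay * w)  # L2 weight decay folded into the gradient
        for w, g, v in zip(weights, grads, velocity)
    ]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity
```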

The local feature expression learning unit 36 outputs the learned local feature descriptor.

The local feature descriptor is input to the end determination unit 38 from the local feature expression learning unit 36. When the processing from the same region patch pair extraction unit 32 to the local feature expression learning unit 36 is to be repeated, the end determination unit 38 outputs the local feature descriptor to the same region patch pair extraction unit 32. In this case, by using the local feature descriptor input from the end determination unit 38, the same region patch pair extraction unit 32 can extract more patch pairs. The learning data construction unit 34 then rebuilds the learning data from the extracted patch pairs, and the local feature expression learning unit 36 uses that learning data to update the local feature descriptor itself.

The end determination unit 38 determines whether to end the above iterative processing. The criterion for ending is arbitrary; for example, the iteration may be cut off after a predetermined number of repetitions, or it may be ended when the patch pair set obtained by the same region patch pair extraction unit 32 no longer changes from the patch pair set obtained in the immediately preceding iteration.
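The two example criteria above can be combined in a small predicate. This is a sketch only: the function name `should_stop`, its parameters, and the representation of a patch pair as a hashable tuple are assumptions made for this example.

```python
def should_stop(iteration, max_iterations, current_pairs, previous_pairs):
    """Return True when the iterative processing should end (illustrative sketch).

    iteration: number of iterations completed so far.
    current_pairs / previous_pairs: iterables of hashable patch pairs;
    previous_pairs is None on the first iteration.
    """
    # Criterion 1: a predetermined number of repetitions has been reached.
    if iteration >= max_iterations:
        return True
    # Criterion 2: the extracted patch pair set no longer changes.
    return previous_pairs is not None and set(current_pairs) == set(previous_pairs)
```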

The local feature expression learning device 20 of the present embodiment can be configured as a computer including a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) that stores a local feature expression learning program and various data. By executing the local feature expression learning program, the CPU of the present embodiment functions as each of the patch extraction unit 30, the same region patch pair extraction unit 32, the learning data construction unit 34, the local feature expression learning unit 36, and the end determination unit 38 of the local feature expression learning device 20.

<Operation of the local feature expression learning device 20 of the present embodiment>

Next, the operation of the local feature expression learning device 20 of the present embodiment will be described.

When the image data set 10 is input to the patch extraction unit 30, the local feature expression learning device 20 executes the local feature expression learning processing routine, an example of which is shown in FIG. 11.

In step S100 shown in FIG. 11, the patch extraction unit 30 extracts, as described above, the patches targeted for local feature extraction from each of the images 12 included in the image data set 10, and outputs the set of extracted patches.

In the next step S102, as described above, the same region patch pair extraction unit 32 determines, using the input set of patches, the geometric identity of any image pair with the same label, and extracts and outputs same region patch pairs by using the homography obtained as the resulting deformation parameter.

In the next step S104, as described above, the learning data construction unit 34 uses the input same region patch pairs and the set of patches to construct and output learning data suited to the type of error function of the neural network used by the subsequent local feature expression learning unit 36.

In the next step S106, as described above, the local feature expression learning unit 36 uses the input learning data to learn a local feature descriptor that maps each patch into a feature space in which patches extracted from the same physical region have high similarity and patches extracted from different physical regions have low similarity, and outputs the learned local feature descriptor (local feature expression).

In the next step S108, as described above, the end determination unit 38 determines whether or not to repeat the processing based on a predetermined criterion. In the present embodiment, as an example, the processing is repeated when the predetermined criterion is not satisfied; in that case the determination in step S108 is negative, the process returns to step S102, and the above steps are repeated. In this case, the end determination unit 38 outputs the input local feature descriptor to the same region patch pair extraction unit 32.

On the other hand, when the predetermined criterion is satisfied, the determination in step S108 is affirmative, and the local feature expression learning processing routine ends. In this case, the end determination unit 38 outputs the input local feature descriptor (local feature expression) to the outside of the local feature expression learning device 20.
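The control flow of steps S100 to S108 can be summarized in the following sketch. The four callables stand in for the units 30, 32, 34, and 36; their names and signatures, the `max_iterations` default, and the use of `None` to represent whatever descriptor is used on the first pass are all assumptions made for this example, not details taken from the embodiment.

```python
def run_learning_routine(images, extract_patches, extract_pairs,
                         build_data, train, max_iterations=10):
    """Illustrative control flow for steps S100 to S108 (hypothetical signatures)."""
    patches = extract_patches(images)                # S100: patch extraction
    descriptor = None                                # placeholder for the first-pass descriptor
    previous_pairs = None
    for _ in range(max_iterations):                  # criterion: fixed repetition count
        pairs = extract_pairs(patches, descriptor)   # S102: same region patch pairs
        data = build_data(pairs, patches)            # S104: learning data
        descriptor = train(data)                     # S106: learn the descriptor
        current_pairs = set(pairs)
        if current_pairs == previous_pairs:          # S108: pair set unchanged, so end
            break
        previous_pairs = current_pairs               # otherwise return to S102
    return descriptor
```

Passing the units in as callables keeps the sketch independent of any particular feature extractor or network library.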

As described above, the local feature expression learning device 20 of the present embodiment learns a local feature expression from an image data set 10 including a plurality of images 12, and includes: the patch extraction unit 30, which extracts a plurality of patches from each image 12 included in the input image data set 10; the same region patch pair extraction unit 32, which determines the geometric identity of any image pair included in the image data set 10 and, from the image pairs determined to be identical, extracts pairs of patches that capture the same physical region by using geometric deformation parameters; the learning data construction unit 34, which constructs learning data for learning the local feature expression by using the set of patches extracted by the patch extraction unit 30 and the patch pairs extracted by the same region patch pair extraction unit 32; the local feature expression learning unit 36, which learns the local feature expression by using the learning data constructed by the learning data construction unit 34; and the end determination unit 38, which causes the processing from the same region patch pair extraction unit 32 to the local feature expression learning unit 36 to be repeated, using the local feature expression learned by the local feature expression learning unit 36, until a predetermined criterion is satisfied.

Therefore, according to the local feature expression learning device 20 of the present embodiment, a local feature descriptor that is robust to geometric and optical deformation can be learned without using patch-level annotation. In addition, because the device labels the patches automatically, the effort of labeling a large number of patches by hand is eliminated.

In the above embodiment, the case has been described, as an example, in which the image data set 10 input to the local feature expression learning device 20 is one in which labels of a plurality of categories are assigned to the plurality of images 12; however, the image data set 10 is not limited to this. For example, it may include a plurality of unlabeled images 12.

It goes without saying that the present embodiment is only an example, that the specific configuration is not limited to the present embodiment and includes designs and the like within a range that does not depart from the gist of the present invention, and that it can be changed according to circumstances.

20 Local feature expression learning device
30 Patch extraction unit
32 Same region patch pair extraction unit
34 Learning data construction unit
36 Local feature expression learning unit
38 End determination unit
40 Local feature description unit
42 Deformation estimation unit
44 Same region determination unit

Claims

A local feature expression learning device that learns a local feature expression from an image data set including a plurality of images, the device comprising:
a patch extraction unit that extracts a plurality of patches from each image included in the input image data set;
a same region patch pair extraction unit that determines the geometric identity of any image pair included in the image data set and, from the image pairs determined to be identical, extracts pairs of the patches that capture the same physical region by using geometric deformation parameters;
a learning data construction unit that constructs learning data for learning a local feature expression by using a set of patches composed of the plurality of patches extracted by the patch extraction unit and the patch pairs extracted by the same region patch pair extraction unit;
a local feature expression learning unit that learns a local feature expression by using the learning data constructed by the learning data construction unit; and
an end determination unit that causes the processing from the same region patch pair extraction unit to the local feature expression learning unit to be repeated, using the local feature expression learned by the local feature expression learning unit, until a predetermined criterion is satisfied.

The local feature expression learning device according to claim 1, wherein the same region patch pair extraction unit includes:
a local feature description unit that expresses each of the plurality of patches extracted by the patch extraction unit as a feature by using the local feature expression;
a deformation estimation unit that, for an image pair included in the image data set, determines the geometric identity of the image pair and estimates the geometric deformation parameters based on the local features of the patches expressed by the local feature description unit; and
a same region determination unit that, for an image pair determined to be identical by the deformation estimation unit, extracts pairs of the patches that capture the same physical region by using the deformation parameters.

The local feature expression learning device according to claim 1 or 2, wherein the local feature expression learning unit uses the learning data to learn a neural network that maps patches into a feature space in which pairs of the patches determined by the same region patch pair extraction unit to capture the same physical region have high similarity and pairs of the patches determined to capture different physical regions have low similarity.

The local feature expression learning device according to any one of claims 1 to 3, wherein the learning data construction unit uses the identity determination result of the same region patch pair extraction unit when using, as learning data, pairs of the patches determined not to capture the same physical region.

The local feature expression learning device according to any one of claims 1 to 3, wherein the learning data construction unit uses the label information assigned to each image of the input image data set when using, as learning data, pairs of the patches determined not to capture the same physical region.

A local feature expression learning method in a local feature expression learning device that learns a local feature expression from an image data set including a plurality of images, the method comprising:
a step in which a patch extraction unit extracts a plurality of patches from each image included in the input image data set;
a step in which a same region patch pair extraction unit determines the geometric identity of any image pair included in the image data set and, from the image pairs determined to be identical, extracts pairs of the patches that capture the same physical region by using geometric deformation parameters;
a step in which a learning data construction unit constructs learning data for learning a local feature expression by using a set of patches composed of the plurality of patches extracted by the patch extraction unit and the patch pairs extracted by the same region patch pair extraction unit;
a step in which a local feature expression learning unit learns a local feature expression by using the learning data constructed by the learning data construction unit; and
a step in which an end determination unit causes the processing from the same region patch pair extraction unit to the local feature expression learning unit to be repeated, using the local feature expression learned by the local feature expression learning unit, until a predetermined criterion is satisfied.