JP2019028700A - Verification device, method, and program

Info

Publication number: JP2019028700A
Application number: JP2017147068A
Authority: JP
Inventors: Takeshi Irie; Kunio Kashino; Kaoru Hiramatsu; Takayuki Kurozumi; Kiyoharu Aizawa
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2017-07-28
Filing date: 2017-07-28
Publication date: 2019-02-21
Anticipated expiration: 2037-07-28
Also published as: JP6793925B2

Abstract

To provide a verification device, method, and program capable of highly accurate verification of a wider variety of objects. SOLUTION: A feature extraction unit 120 applies a convolutional neural network to each of two images and obtains the output of a convolution layer for each partial region. A correspondence candidate calculation unit 130 obtains the cosine similarity of the convolution layer outputs for each combination of a partial region of one image and a partial region of the other image. A combination of a partial region A of the one image and a partial region B of the other image is selected as a correspondence candidate when the cosine similarity is higher than a threshold value, the partial region of the other image with the maximum cosine similarity to partial region A is partial region B, and the partial region of the one image with the maximum cosine similarity to partial region B is partial region A. A verification unit 140 determines the suitability of each correspondence candidate based on the position coordinates of its partial regions and outputs the candidate as a correspondence when it is appropriate. SELECTED DRAWING: Figure 1

Description

The present invention relates to a verification apparatus, method, and program, and more particularly to a verification apparatus, method, and program for verifying the correspondence between two images.

Image recognition technology has made remarkable progress. Conventionally, its use centered on applications in which the recognition target and environment are limited, such as face and fingerprint authentication and factory automation. Recently, with the spread of small imaging devices such as smartphones, industrial demand has also grown for the recognition of freely shot images, in which a general user photographs an arbitrary object in an arbitrary place or environment. Expectations are particularly high for O2O services that link products in the real world and on the web, for information guidance and navigation services that recognize landmarks in the real environment and provide information about them, and for robot agents.

Image recognition technology for such new applications can take several forms, but a typical one is recognition based on image retrieval. That is, a database of images of the objects to be recognized (called reference images) is built in advance, and an object present in a captured query image is identified by searching the reference database for reference images similar to the query image.

To achieve this, it is not sufficient to simply retrieve images that look similar; the system must be able to accurately retrieve images showing the same object. Even the same object does not appear at the same position, orientation (angle of the partial region), and size in every image; it is usually photographed from a different viewpoint in each image. In particular, for images shot freely by general users, it is in most cases nearly impossible to know in advance from which viewpoint an object was photographed, and the appearance of the object in the image often changes greatly. Therefore, simply measuring the similarity between whole images and searching on that basis cannot achieve the desired image recognition.

In view of this problem, verification techniques have been invented and disclosed that verify whether the same object is present regardless of the shooting viewpoint, and thereby enable effective retrieval.

Non-Patent Document 1 discloses a verification method based on Scale Invariant Feature Transform (SIFT) features and the generalized Hough transform. First, the luminance values of each image are analyzed to extract a large number of partial regions with significant luminance changes, and the luminance change in each partial region is expressed as a feature vector that is invariant to scale and rotation (the SIFT feature). Next, for the partial regions contained in two different images, the Euclidean distance between SIFT features is measured, and pairs of partial regions across the two images with a small distance are taken as correspondence candidates. Then, on the assumption that, for partial regions obtained from the same object, the changes in position, orientation, and size between corresponding partial regions are consistent regardless of the shooting viewpoint, the "deviation" in position, orientation, and size between the partial regions of each correspondence candidate is computed. If a histogram of these deviations is built, the corresponding partial regions obtained from the same object are expected, under this consistency assumption, to concentrate in a very small number of bins. Accordingly, only the correspondence candidates falling in high-frequency bins are regarded as truly valid correspondences, and all others are discarded as invalid. As a result, images with a large number of valid correspondences are retrieved as images containing the same object.
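The histogram-voting idea described above can be sketched as follows. This is a deliberately simplified illustration, not the method of Non-Patent Document 1: it votes only on the positional offset between matched partial regions (ignoring orientation and size), and the function name `filter_by_offset_voting` and the bin width `bin_size` are assumptions introduced here.

```python
from collections import Counter

import numpy as np


def filter_by_offset_voting(pts_a, pts_b, bin_size=20.0):
    """Return indices of candidate matches whose offset falls in the most-voted bin.

    pts_a, pts_b: (N, 2) positions of the matched partial regions in each image.
    bin_size: hypothetical histogram bin width in pixels (not given in the text).
    """
    pts_a = np.asarray(pts_a, dtype=float)
    pts_b = np.asarray(pts_b, dtype=float)
    offsets = pts_b - pts_a                              # positional "deviation"
    bins = np.floor(offsets / bin_size).astype(int)      # quantize into 2-D bins
    keys = [tuple(b) for b in bins.tolist()]
    best_key, _ = Counter(keys).most_common(1)[0]        # bin with the most votes
    return [i for i, k in enumerate(keys) if k == best_key]


pts_a = [[0, 0], [10, 10], [50, 5], [30, 30]]
pts_b = [[100, 50], [110, 60], [150, 55], [0, 0]]
kept = filter_by_offset_voting(pts_a, pts_b)
```

In this toy input, the first three candidates share the same offset and the fourth does not, so only the first three survive the vote.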

Patent Document 1 discloses a technique that improves on Non-Patent Document 1. As in Non-Patent Document 1, correspondence candidates between partial regions are obtained from SIFT features, and the deviations in their position, orientation, and size are computed to judge whether each correspondence is appropriate; however, three-dimensional rotation angles are taken into account when the deviation is evaluated. As a result, finer verification than with the technique of Non-Patent Document 1 becomes possible.

The technique disclosed in Non-Patent Document 2 is the same up to the point where correspondence candidates between partial regions of different images are obtained using SIFT features. It then discards invalid candidates by regarding as valid only those candidates for which, when the candidates are viewed as a set, the positional deviations of the partial regions are constrained by a specific linear transformation.

[Patent Document 1] JP-A-2015-95156

[Non-Patent Document 1] D.G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints", International Journal of Computer Vision, pp. 91-110, 2004.

[Non-Patent Document 2] J. Philbin, O. Chum, M. Isard, Josef Sivic and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching, pp. 1470-1477, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.

Broadly speaking, the existing inventions first obtain correspondence candidates based on the distance between SIFT features, and then verify the suitability of those candidates by analyzing the positional deviations of the partial regions.

However, verification based on SIFT features has the problem that it cannot verify diverse objects with high accuracy. Because the prior-art verification presupposes correspondence candidates between feature vectors, its accuracy depends on the expressive power of the SIFT feature. SIFT features (and similar so-called local features) characteristically describe significant luminance changes. For objects in which significant luminance changes are likely to occur, for example objects with an easily distinguishable pattern (texture), very accurate verification is possible. For objects without a characteristic pattern, or with many flat areas, accurate correspondences cannot be obtained, and as a result highly accurate verification could not be achieved.

In other words, no verification technique has so far been invented that can verify the presence or absence of the same object with high accuracy across diverse objects.

The present invention has been made to solve the above problem, and its object is to provide a verification apparatus, method, and program that enable highly accurate verification of a wider variety of objects.

To achieve the above object, a verification apparatus according to the present invention verifies the correspondence between a first image and a second image, and includes a feature extraction unit, a correspondence candidate calculation unit, and a verification unit. The feature extraction unit applies a convolutional neural network including at least one convolution layer to each of the first image and the second image, and obtains the output of the convolution layer for each partial region of the image. The correspondence candidate calculation unit, based on the convolution layer outputs obtained for each partial region of the first image and the second image, obtains the cosine similarity of the convolution layer outputs for each combination of a partial region of the first image and a partial region of the second image. It selects a combination of a partial region A of the first image and a partial region B of the second image as a correspondence candidate when the cosine similarity of the combination is higher than a predetermined threshold, the partial region of the second image with the maximum cosine similarity to partial region A is partial region B, and the partial region of the first image with the maximum cosine similarity to partial region B is partial region A. The verification unit determines the suitability of each correspondence candidate based on the position coordinates, within the images, of its partial regions, and outputs the correspondence candidate as a correspondence when it is appropriate.

A verification method according to the present invention is a verification method in a verification apparatus that verifies the correspondence between a first image and a second image. A feature extraction unit applies a convolutional neural network including at least one convolution layer to each of the first image and the second image, and obtains the output of the convolution layer for each partial region of the image. A correspondence candidate calculation unit, based on the convolution layer outputs obtained for each partial region of the first image and the second image, obtains the cosine similarity of the convolution layer outputs for each combination of a partial region of the first image and a partial region of the second image. It selects a combination of a partial region A of the first image and a partial region B of the second image as a correspondence candidate when the cosine similarity of the combination is higher than a predetermined threshold, the partial region of the second image with the maximum cosine similarity to partial region A is partial region B, and the partial region of the first image with the maximum cosine similarity to partial region B is partial region A. A verification unit determines the suitability of each correspondence candidate based on the position coordinates, within the images, of its partial regions, and outputs the correspondence candidate as a correspondence when it is appropriate.

A program according to the present invention is a program for causing a computer to function as each unit of the verification apparatus according to the above invention.

According to the verification apparatus, method, and program of the present invention, a convolutional neural network is applied to each of the first image and the second image to obtain the output of the convolution layer for each partial region, and the cosine similarity of the convolution layer outputs is obtained for each combination of a partial region of the first image and a partial region of the second image. A combination of a partial region A of the first image and a partial region B of the second image is selected as a correspondence candidate when the cosine similarity is higher than a predetermined threshold, the partial region of the second image with the maximum cosine similarity to partial region A is partial region B, and the partial region of the first image with the maximum cosine similarity to partial region B is partial region A. The suitability of each correspondence candidate is then determined based on the position coordinates, within the images, of its partial regions, and the candidate is output as a correspondence when it is appropriate. This yields the effect of enabling highly accurate verification of a wider variety of objects.

Block diagram showing the configuration of the verification apparatus according to the embodiment of the present invention.
Flowchart showing the verification processing routine in the verification apparatus according to the embodiment of the present invention.
Diagram showing the output of a convolution layer.
Diagram showing the geometric relationship between the geometric information of partial regions of two images.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Principle according to the embodiment of the present invention>
First, the principle underlying the embodiment of the present invention is described.

In the embodiment of the present invention, the features of each partial region in an image are expressed using the output of a convolution layer of a convolutional neural network. A convolutional neural network is normally composed of a plurality of complex convolution filters, which gives it high expressive power. As a result, accurate correspondence candidates can be obtained even for objects without significant luminance changes, which conventional local features such as SIFT features could not handle.

Furthermore, in the embodiment of the present invention, correspondence candidates are obtained by computing, between the two images, the cosine similarity of the convolution layer outputs for each partial region (that is, the responses of the one or more convolution filters that make up the convolution layer). A combination of partial regions is taken as a correspondence candidate when this similarity is equal to or greater than a threshold and, when the two images are compared, each region is the other's maximum-cosine-similarity partner. Compared with various other distance measures, cosine similarity measures the similarity of convolutional neural network filter responses more accurately, and pairs whose similarity exceeds a certain value (for example, 0.5 or more) in particular give accurate correspondence candidates. Moreover, partial regions that show the same object (or the same part of it) should be each other's maximum-cosine-similarity partner, so selecting only pairs that satisfy this condition yields more reliable correspondence candidates. Finally, in the embodiment of the present invention, the geometric consistency of the correspondence candidates is judged to obtain the final correspondences, which makes highly accurate verification possible.
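The candidate selection rule described above (cosine similarity over a threshold plus mutual best match) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `mutual_cosine_candidates`, the array layout (one row per partial region), and the strict `>` comparison are choices made here; the default threshold 0.5 is the example value mentioned in the text.

```python
import numpy as np


def mutual_cosine_candidates(feats_a, feats_b, threshold=0.5):
    """Select (i, j) pairs that are mutual best matches with similarity > threshold.

    feats_a: (Na, C) convolution layer outputs for the partial regions of image 1.
    feats_b: (Nb, C) convolution layer outputs for the partial regions of image 2.
    """
    feats_a = np.asarray(feats_a, dtype=float)
    feats_b = np.asarray(feats_b, dtype=float)
    eps = 1e-12  # guards against division by zero for all-zero outputs
    a = feats_a / np.maximum(np.linalg.norm(feats_a, axis=1, keepdims=True), eps)
    b = feats_b / np.maximum(np.linalg.norm(feats_b, axis=1, keepdims=True), eps)
    sim = a @ b.T                        # (Na, Nb) cosine similarity matrix
    best_b_for_a = sim.argmax(axis=1)    # best region of image 2 for each region A
    best_a_for_b = sim.argmax(axis=0)    # best region of image 1 for each region B
    candidates = []
    for i, j in enumerate(best_b_for_a):
        if best_a_for_b[j] == i and sim[i, j] > threshold:
            candidates.append((i, int(j)))
    return candidates


feats_a = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
feats_b = [[0, 2, 0], [3, 0, 0], [1, 1, 1]]
pairs = mutual_cosine_candidates(feats_a, feats_b)
```

Because the vectors are normalized before the dot product, the norm of each convolution layer output plays no role in the match; only the direction of the filter-response vector matters.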

Note that the embodiment of the present invention is not realized by a simple combination of a general convolutional neural network and a conventional verification method. First, in a conventional convolutional neural network, as described for example in Reference 1, Reference 2, and Reference 3, several convolution layers, each consisting of one or more convolution filters, are applied to the input image to obtain the convolution layer output, and that output is then aggregated in the spatial direction and expressed as a fixed-dimensional vector. The spatial aggregation is performed, for example, by taking a linear combination of all elements of the convolution layer output in Reference 1, by averaging over the spatial direction in Reference 2, and by taking the maximum or the sum over specific spatial regions in Reference 3.

[Reference 1] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton: ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Neural Information Processing Systems, pp. 1106-1114, 2012.

[Reference 2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun: Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

[Reference 3] Giorgos Tolias, Ronan Sicre, Herve Jegou: Particular Object Retrieval with Integral Max-Pooling of CNN Activations, arXiv:1511.05879, 2015.

Such spatial aggregation keeps the size of the final representation (the dimension of the vector) small, but it sacrifices the spatial expressiveness of the convolution layer output, so the resulting representation is not suitable for the requirements of the present technique. In the embodiment of the present invention, therefore, the aggregation of the convolution layer output is eliminated and the output is used as it is. Such a procedure is not disclosed in any of References 1 to 3.
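The difference between aggregating the convolution layer output and using it as it is can be illustrated with a small NumPy sketch. The (H, W, C) array layout, the function names, and the use of grid indices as positions are assumptions made for this illustration; the text does not specify how the output is stored.

```python
import numpy as np


def local_descriptors(feature_map):
    """Keep one C-dimensional vector per spatial location (no aggregation).

    feature_map: (H, W, C) convolution layer output (layout assumed here).
    Returns descriptors of shape (H*W, C) and grid positions (x, y) of shape (H*W, 2).
    """
    fmap = np.asarray(feature_map, dtype=float)
    h, w, c = fmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    positions = np.stack([xs.ravel(), ys.ravel()], axis=1)
    descriptors = fmap.reshape(h * w, c)   # row k corresponds to positions[k]
    return descriptors, positions


def global_average_pool(feature_map):
    """Spatial averaging in the style of Reference 2: one C-dimensional vector per image."""
    return np.asarray(feature_map, dtype=float).mean(axis=(0, 1))


fmap = np.arange(2 * 3 * 4).reshape(2, 3, 4)
desc, pos = local_descriptors(fmap)
pooled = global_average_pool(fmap)
```

The pooled vector discards where each response occurred, whereas the per-location descriptors keep a position for every vector, which is what the later position-based verification relies on.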

In addition, the convolution layer output differs in nature from SIFT features in that it is extracted at a uniform size and spacing within the image and is a high-dimensional, sparse, non-negative vector. Achieving accurate verification therefore requires a procedure suited to the characteristics of the convolution layer output. In the embodiment of the present invention, considering that the convolution layer output is a sparse high-dimensional vector and that the norm of that vector is not useful for obtaining correspondence candidates, correspondences between convolution layer outputs are obtained using cosine similarity, and only pairs whose similarity exceeds a certain value and that are each other's maximum-similarity partner between the two images are kept as correspondence candidates. This procedure yields accurate correspondence candidates.

Furthermore, convolution layer outputs extracted at a uniform size and spacing give a homogeneous and exhaustive representation of the whole image, but outputs are also obtained from non-distinctive partial regions (regions that are not important for telling objects apart, such as sky or road in the background), which often causes false correspondences. In the embodiment of the present invention, therefore, representative convolution layer outputs computed from such non-distinctive partial regions are obtained in advance, and partial regions that correspond to them are removed beforehand, which suppresses false correspondences. This idea is not described in any of References 1 to 3 or in the prior art documents.
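The pre-filtering of non-distinctive partial regions might be sketched as follows. This section does not state how a partial region is judged to "correspond" to a representative output, so the sketch assumes a cosine similarity test; the function name `drop_non_distinctive`, the argument `background_protos`, and the threshold value 0.8 are all hypothetical.

```python
import numpy as np


def drop_non_distinctive(descriptors, background_protos, bg_threshold=0.8):
    """Return indices of partial regions that do NOT resemble any background prototype.

    descriptors: (N, C) convolution layer outputs of one image.
    background_protos: (K, C) representative outputs computed in advance from
        non-distinctive regions such as sky or road.
    bg_threshold: hypothetical cosine similarity above which a region is removed.
    """
    d = np.asarray(descriptors, dtype=float)
    p = np.asarray(background_protos, dtype=float)
    eps = 1e-12  # guards against division by zero
    dn = d / np.maximum(np.linalg.norm(d, axis=1, keepdims=True), eps)
    pn = p / np.maximum(np.linalg.norm(p, axis=1, keepdims=True), eps)
    max_sim = (dn @ pn.T).max(axis=1)    # closest background prototype per region
    return np.flatnonzero(max_sim < bg_threshold)


keep = drop_non_distinctive([[1, 0], [0, 1], [1, 1]], [[2, 0]])
```

Only the regions whose indices are returned would then be passed on to correspondence candidate selection.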

As described above, the embodiment of the present invention enables highly accurate verification of diverse objects.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

<<Overall configuration>>
FIG. 1 is a block diagram showing an example of the configuration of the verification apparatus 100 according to the embodiment of the present invention. The verification apparatus 100 shown in FIG. 1 includes an input unit 110, a feature extraction unit 120, a correspondence candidate calculation unit 130, a verification unit 140, and an output unit 150.

The verification apparatus 100 is connected to the reference database 160 through communication means via the input unit 110 and exchanges information with it, so that arbitrary image information can be registered in the database and image information can be read from it.

The image information referred to here includes the image itself (the image file), the geometric information of the partial regions of the image, and the feature vectors; items relating to the same image are associated with one another. In particular, the reference database 160 contains image information on the reference images that are searched against a query image. A partial region of an image may be defined in any way as long as it is a part of the image, but preferably the partial regions defined by the convolution filters of a convolution layer of the convolutional neural network are used (details are described later).

The geometric information of a partial region includes the position and size of the partial region, although in this example of the embodiment it suffices that at least the position is included.

The reference database 160 can be configured, for example, with a file system implemented on an ordinary general-purpose computer. Each reference image file is given an identifier that uniquely identifies it (for example, a serial-number ID or a unique file name), and files describing the partial regions defined for the image and its feature vectors are also stored in association with the identifier of that image. Alternatively, the database may be implemented with an RDBMS (Relational Database Management System) or the like. Metadata may also be included, for example items describing the content of the image (title, summary text, keywords, and so on) and items relating to the image format (data size of the image, thumbnail size, and so on), but these are not essential to the practice of the present invention.
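The association between an image identifier, the image file, the geometric information of its partial regions, and its feature vectors can be sketched as a simple in-memory store. The class names `ImageInfo` and `ReferenceStore`, the field names, and the example path are hypothetical and stand in for whatever file system or RDBMS layout is actually used.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ImageInfo:
    """Image information for one reference image, keyed by a unique identifier."""
    image_id: str                                   # e.g. serial-number ID or unique file name
    image_path: str                                 # location of the image file
    positions: List[Tuple[float, float]] = field(default_factory=list)  # partial region positions
    features: List[List[float]] = field(default_factory=list)           # one feature vector per region


class ReferenceStore:
    """Minimal in-memory stand-in for the reference database."""

    def __init__(self) -> None:
        self._records: Dict[str, ImageInfo] = {}

    def register(self, info: ImageInfo) -> None:
        if info.image_id in self._records:
            raise KeyError(f"identifier already registered: {info.image_id}")
        self._records[info.image_id] = info

    def read(self, image_id: str) -> ImageInfo:
        return self._records[image_id]


store = ReferenceStore()
store.register(ImageInfo("0001", "ref/0001.jpg", [(0.0, 0.0)], [[1.0, 0.0]]))
record = store.read("0001")
```

Keeping positions and features in parallel lists mirrors the requirement that the items relating to the same image remain associated with one another.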

The reference database 160 may be located inside or outside the verification apparatus 100, and any known communication means may be used. In the present embodiment, the database is assumed to be external and to be connected so as to communicate over the Internet using TCP/IP.

Each unit of the verification apparatus 100 and the reference database 160 may be configured with a computer, server, or the like equipped with an arithmetic processing device, a storage device, and so on, and the processing of each unit may be executed by a program. This program is stored in a storage device of the verification apparatus 100 or the reference database 160, and it can also be recorded on a recording medium such as a magnetic disk, an optical disk, or a semiconductor memory, or provided over a network. Of course, none of the components need be realized on a single computer or server; they may be distributed over a plurality of computers connected by a network.

Note that the image information itself need not be stored in the reference database 160; for example, it may be input directly from the outside via the input unit 110 as needed. Such a configuration suits uses such as object search, in which the necessary processing is performed on the reference images in advance and their image information is stored in the reference database 160, whereas the query image 170 is accepted from the outside at query time and its image information is obtained then. As a concrete example, in the configuration of the verification apparatus 100 shown in FIG. 1, image information on one or more reference images is stored in the reference database 160 in advance, and, as described above, the database is connected to the verification apparatus 100 in a form that allows mutual reading and registration. In addition, the apparatus is configured to accept from the outside a query image 170 input as a query.

In the following, as an example of an embodiment of the present invention, verification of the correspondence between two images is described, with object search in mind, for the case of verifying the correspondence between one of the reference images registered in the reference database 160 and one query image input as the query image 170. To verify correspondences for multiple pairs of reference and query images, the processing described below is simply repeated once for each pair to be verified. The reference image and the query image are examples of the first image and the second image.

<< Processing units >>
Each processing unit of the verification apparatus 100 in the present embodiment is described below.

The input unit 110 is an interface that accepts external input to the verification apparatus 100. It reads reference images or feature files from the reference database 160, accepts images such as the query image 170 from the outside, and passes them to the processing units.

The feature extraction unit 120 receives a reference image or a query image via the input unit 110, applies a convolutional neural network to the image, and extracts a feature vector for each partial region. The extracted partial regions and feature vectors are passed to the correspondence candidate calculation unit 130 or the output unit 150.

The correspondence candidate calculation unit 130 operates on the image information (at least the feature vector for each partial region) of two images (for example, a reference image and a query image) received from the input unit 110 or the feature extraction unit 120. For each combination of a partial region of one image and a partial region of the other image, it computes the cosine similarity of the feature vectors and, based on this cosine similarity, determines the correspondence candidates between the two images.

For each pair of geometric information of the partial regions that became correspondence candidates between the two images, the verification unit 140 obtains the geometric relationship, determines on that basis whether the correspondence candidate is appropriate, and, if it is, outputs a determination result that treats the candidate as a correspondence.

<< Processing overview >>
Next, the processing of the verification apparatus 100 in the present embodiment is described. The processing in the embodiment of the present invention is broadly divided into offline processing and online processing. The former needs to be performed at least once for each reference image stored in the reference database 160; the latter is performed when a search is actually carried out, triggered by the input of a query image. Each is described in turn below.

The offline processing is simple: a convolutional neural network is applied to each reference image registered in the reference database 160 to extract a feature vector for each partial region, and the result is stored in the reference database 160.

The online processing has more steps than the offline processing, so it is described with reference to FIG. 2.

First, in step S201, when a query image is received, a convolutional neural network is applied to the query image to extract a feature vector for each partial region.

Next, in step S202, the cosine similarity between the feature vector of each partial region of the query image and the feature vector of each partial region of the reference image stored in the reference database 160 is computed, and, based on this cosine similarity, the correspondence candidates, each of which is a combination of a partial region of the query image and a partial region of the reference image, are obtained.

Next, in step S203, for each correspondence candidate, the geometric relationship is obtained for the combination of the geometric information of the reference-side partial region and the geometric information of the query-side partial region that form the candidate. Based on the geometric relationships obtained for the correspondence candidates, it is determined whether each candidate is appropriate, and the determination result, in which each appropriate candidate is judged to be a correspondence, is output by the output unit 150 as the authentication result 180.

Through the above processing, the correspondence between partial regions of the input query image and of the reference image can be verified.

<< Details of each process >>
An example of the details of each process in the present embodiment is described below.

[Feature extraction processing]
First, a method for extracting partial regions and feature vectors from an input image is described.

In the embodiment of the present invention, a convolutional neural network is applied to the image to extract the feature vector of each partial region. Many known variations of convolutional neural networks have been proposed, and any of them may be used as long as it includes at least one convolutional layer (a neural network layer defined by a set of convolution filters having the same size and skip width, i.e., stride); for example, those described in Reference 1 or Reference 2 may be used.

Whatever convolutional neural network is used, the embodiment of the present invention takes the output of a convolutional layer as the final output. Many convolutional neural networks, including the one described in Reference 1, contain pooling layers and fully connected layers after multiple convolutional layers, so that the final output of the network is the output of a fully connected layer. Such spatial aggregation keeps the size of the final representation (the vector dimension) small, but it sacrifices the spatial representation capability that the convolutional layer output possesses. Therefore, in the embodiment of the present invention, the aggregation of the convolutional layer output is omitted and the output is used as is. More specifically, the output of one of the convolutional layers constituting the network is obtained and used as the feature vectors. Preferably, the output of a convolutional layer close to the final output layer of the network is used.

FIG. 3 schematically illustrates the output of a convolutional layer. The output of a convolutional layer is usually expressed as a third-order tensor with height (h), width (w), and depth (d), that is, a three-dimensional array with h × w × d elements. Viewed another way, this amounts to dividing the input image into h × w partial regions (h vertically by w horizontally) and outputting a d-dimensional vector for each partial region. In the embodiment of the present invention, precisely these vectors are obtained as the feature vectors for the partial regions.

Strictly speaking, which partial region of the original input image corresponds to each output element position of the convolutional layer (each cell of the grid in FIG. 3) is determined by the configuration of the convolutional layers of the network. In the embodiment of the present invention, as an approximation, the original input image may simply be divided into an h × w grid and the output element at the position corresponding to each partial region taken.
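The approximate mapping from an output element to a partial region of the input image can be sketched as follows. This is a minimal illustration assuming an even h × w division of the image; the function names, the integer rounding, and the use of the region center as its position coordinates are choices made for this example and are not specified in the description.

```python
from typing import Tuple


def region_box(row: int, col: int, h: int, w: int,
               img_height: int, img_width: int) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right) pixel bounds of grid cell (row, col).

    The image is divided evenly into h rows and w columns; bottom and right
    are exclusive bounds.
    """
    if not (0 <= row < h and 0 <= col < w):
        raise IndexError("cell lies outside the h x w grid")
    top = row * img_height // h
    bottom = (row + 1) * img_height // h
    left = col * img_width // w
    right = (col + 1) * img_width // w
    return top, left, bottom, right


def region_center(row: int, col: int, h: int, w: int,
                  img_height: int, img_width: int) -> Tuple[float, float]:
    """Return the (x, y) center of grid cell (row, col) in pixel coordinates."""
    top, left, bottom, right = region_box(row, col, h, w, img_height, img_width)
    return (left + right) / 2.0, (top + bottom) / 2.0
```

For a 224 × 224 image and a 7 × 7 output grid, for instance, each cell covers a 32 × 32 pixel block. Centers computed this way could serve as the position coordinates used in the later verification processing.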

Preferably, each d-dimensional feature vector is L2-normalized. This is advantageous because the cosine similarity computed later then becomes equivalent to an inner product.
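The equivalence between cosine similarity and the inner product of L2-normalized vectors can be checked with a short sketch. The helper names and the handling of a zero vector (returned unchanged as zeros) are assumptions for this example.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale v to unit L2 norm; a (near-)zero vector is returned as zeros."""
    norm = math.sqrt(dot(v, v))
    if norm < eps:
        return [0.0 for _ in v]
    return [x / norm for x in v]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed directly from unnormalized vectors."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot(a, b) / (na * nb)
```

With normalization done once at extraction time, every later similarity evaluation reduces to a single call to `dot`.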

In this way, a feature vector can be obtained for each partial region of the input image.

[Correspondence candidate calculation processing]
Next, the processing for obtaining correspondence candidates between the partial regions defined in two different images is described. In the example embodiment of the present invention, this processing is used to determine the corresponding partial regions between the query image and each reference image.

Let Q_i denote a partial region extracted from the query image (that is, an output element of the convolutional layer described above) and R_j a partial region extracted from the reference image. Taking the partial regions Q_i and R_j as an example, the processing for deciding whether they form a correspondence candidate is described below.

Each partial region is associated with a feature vector, extracted by the convolutional neural network, that represents that partial region. Let q denote the feature vector describing the partial region Q_i and r the feature vector describing R_j. The cosine similarity sim(Q_i, R_j) between the partial regions is then obtained by the following equation.

$$\mathrm{sim}(Q_i, R_j) = \frac{q^{\top} r}{\|q\| \, \|r\|} \qquad \cdots (1)$$

Here, ||q|| denotes the L2 norm of q. If each feature vector has been L2-normalized in the feature extraction processing, then ||q|| = ||r|| = 1, so Equation (1) above is equivalent to the inner product:

$$\mathrm{sim}(Q_i, R_j) = q^{\top} r$$

Conventionally, verification using SIFT features or the like has often obtained correspondence candidates using the Euclidean distance ratio (for example, Non-Patent Document 1). However, the output of a convolutional layer is often a very sparse, high-dimensional vector compared with SIFT features, and in many cases it is not useful to include the vector norm in the evaluation when obtaining correspondence candidates. Therefore, in the embodiment of the present invention, cosine similarity is used in order to obtain a more accurate similarity, and a combination of partial regions is taken as a correspondence candidate when it satisfies both of the following two conditions.

<<< Condition 1: High similarity >>>
The cosine similarity sim(Q_i, R_j) takes values from -1 to 1, and a larger value indicates that the feature vectors are closer. Since the purpose of the embodiment is to find combinations of partial regions whose feature vectors are close, only combinations with high values need be considered. From this viewpoint, the condition is that sim(Q_i, R_j) is at or above a certain threshold. A suitable threshold is, for example, 0.5.

<<< Condition 2: Bidirectionality >>>
Suppose that, for a partial region R_j of the reference image, the closest (most similar) partial region of the query image is Q_i. The condition is that, conversely, for the partial region Q_i of the query image, the closest partial region of the reference image is also R_j.

By performing the above calculation for all combinations of partial regions between the query image and each reference image, the partial regions that form correspondence candidates can be obtained.
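Conditions 1 and 2 together amount to a thresholded mutual nearest-neighbor search. The following is a minimal sketch, assuming the feature vectors have already been L2-normalized so that the inner product equals the cosine similarity. The function names and the brute-force similarity matrix are choices made for this example; the comparison uses "at or above the threshold" as in Condition 1, whereas the claims word it as "higher than" the threshold.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda k: values[k])


def correspondence_candidates(
    query_vecs: Sequence[Sequence[float]],
    ref_vecs: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> List[Tuple[int, int, float]]:
    """Return (i, j, similarity) for every pair satisfying both conditions.

    Assumes all vectors are L2-normalized.
    """
    if not query_vecs or not ref_vecs:
        return []
    # sim[i][j] is the cosine similarity between Q_i and R_j.
    sim = [[dot(q, r) for r in ref_vecs] for q in query_vecs]
    best_ref = [argmax(row) for row in sim]
    best_query = [argmax([sim[i][j] for i in range(len(query_vecs))])
                  for j in range(len(ref_vecs))]
    candidates = []
    for i, j in enumerate(best_ref):
        # Condition 2 (bidirectionality) and Condition 1 (high similarity).
        if best_query[j] == i and sim[i][j] >= threshold:
            candidates.append((i, j, sim[i][j]))
    return candidates
```

Because each query region can be the mutual nearest neighbor of at most one reference region, the result contains at most min(number of query regions, number of reference regions) pairs.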

When features are extracted by a convolutional neural network, partial regions are usually extracted uniformly over the whole image. As a result, in the subsequent correspondence candidate calculation, uncharacteristic regions that contain no particular object, such as sky, may be matched with one another and adversely affect recognition accuracy. To prevent such undesirable correspondences, "tabu partial regions" may be defined.

The tabu partial regions are composed of feature vectors extracted in advance from regions that should not become correspondence candidates. Suppose, for example, that feature vectors that tend to appear in sky are to be excluded from the correspondence candidates. In that case, the feature vectors of partial regions extracted in advance from sky are stored as tabu partial regions. If the number of extracted tabu partial regions becomes very large, representative feature vectors may be selected as needed using a clustering method (k-means, etc.), and only the selected representative feature vectors stored as tabu partial regions.

Then, when partial regions and feature vectors are extracted from a reference image or a query image, the cosine similarity of the feature vectors is computed for each combination of an extracted partial region and a tabu partial region, and if this cosine similarity satisfies Conditions 1 and 2 above, the correspondence candidates that include that partial region are removed.

That is, correspondence candidates including a partial region C are removed when, for the combination of the partial region C of the reference image or query image and a tabu partial region D, the cosine similarity is higher than a predetermined threshold, the tabu partial region with the maximum cosine similarity to the partial region C is the tabu partial region D, and the partial region of the reference image or query image with the maximum cosine similarity to the tabu partial region D is the partial region C.
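The tabu filtering can be sketched by applying the same two conditions between an image's partial regions and the stored tabu vectors, and then dropping any candidate that touches a matched region. This is a minimal illustration assuming L2-normalized vectors; the function names and the representation of candidates as (query index, reference index) pairs are assumptions for this example.

```python
from typing import List, Sequence, Set, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def tabu_matched(region_vecs: Sequence[Sequence[float]],
                 tabu_vecs: Sequence[Sequence[float]],
                 threshold: float) -> Set[int]:
    """Indices of regions that mutually best-match a tabu vector above threshold."""
    if not region_vecs or not tabu_vecs:
        return set()
    n, m = len(region_vecs), len(tabu_vecs)
    sim = [[dot(c, d) for d in tabu_vecs] for c in region_vecs]
    best_tabu = [max(range(m), key=lambda j, i=i: sim[i][j]) for i in range(n)]
    best_region = [max(range(n), key=lambda i, j=j: sim[i][j]) for j in range(m)]
    return {i for i, j in enumerate(best_tabu)
            if best_region[j] == i and sim[i][j] >= threshold}


def remove_tabu(candidates: Sequence[Tuple[int, int]],
                query_vecs: Sequence[Sequence[float]],
                ref_vecs: Sequence[Sequence[float]],
                tabu_vecs: Sequence[Sequence[float]],
                threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Drop candidates whose query-side or reference-side region is tabu."""
    tabu_q = tabu_matched(query_vecs, tabu_vecs, threshold)
    tabu_r = tabu_matched(ref_vecs, tabu_vecs, threshold)
    return [(i, j) for i, j in candidates if i not in tabu_q and j not in tabu_r]
```

Note that, under the bidirectional condition as stated, each tabu vector removes at most one region per image, so a larger set of tabu vectors (or of representatives) covers more of a uniform background.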

When selecting tabu partial regions, it is even better if they can be expressed as a category, like the sky in the example above. The reason is that regions belonging to a given category can be detected automatically by a known image recognition method called semantic segmentation, such as that of Reference 4, which reduces the labor of selecting tabu partial regions by hand.

[Reference 4] Jonathan Long, Evan Shelhamer, Trevor Darrell: Fully Convolutional Networks for Semantic Segmentation. In Proc. Conference on Computer Vision Pattern Recognition, pp. 3431-3440, 2015.

[Verification processing]
Next, the suitability of the obtained correspondence candidates is determined based on the geometric information of the partial regions. That is, among the correspondence candidates obtained by the correspondence candidate calculation unit 130, those that are combinations of partial regions considered not to be valid correspondences (that is, not correspondences between partial regions extracted from the object) are deleted.

Suppose that the query image and the reference image contain the same object. If the object has roughly the same shape in both, the object in the query image and the object in the reference image have merely been captured from different viewpoints, and under realistic assumptions this viewpoint change affects the appearance of the partial regions in a consistent way. In other words, if the partial regions forming a correspondence candidate really are a combination of partial regions on the same object, then the geometric relationship (the displacement) between the geometric information of the query-side partial region and that of the reference-side partial region will be consistent with that of the other appropriate correspondence candidates. Therefore, only correspondence candidates whose displacement is consistent are regarded as valid correspondences, and the others are rejected.

This is explained with reference to FIG. 4, which shows two images, image A and image B, containing the same object. Two kinds of patterns enclosed by broken lines, 41A, 41B, 42A, and 42B, are defined as partial regions, and combinations of partial regions denoted by the same number (for example, 41A and 41B) are assumed to have been determined to be correspondence candidates. The goal is to judge as valid correspondences only the combinations of partial regions that lie on the same object (in this case, 41A with 41B and 42A with 42B).

As FIG. 4 shows, the positions of 41A and 41B and of 42A and 42B, which lie on the same object, change in the same way, depending only on the viewpoint, so their relative positional relationships are approximately the same. Therefore, the suitability of a correspondence candidate can be determined by checking whether the relative positional relationship of its combination of partial regions is consistent with the relative positional relationships of the other combinations of partial regions that are appropriate correspondences.

Accordingly, the embodiment of the present invention realizes this using a method for verifying geometric relationships. For example, it is known that, under suitable conditions, the geometric relationship between partial regions on the same object is constrained by a linear transformation. Known, effective methods exist for finding such a linear transformation together with the correspondence candidates whose geometric relationship follows it, such as the RANSAC algorithm described in Reference 5 and the LO-RANSAC algorithm described in Reference 6, and these may be used.
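A basic RANSAC-style check over the position coordinates of the correspondence candidates might look like the following. This is only a sketch under stated assumptions: it fits a four-parameter similarity transform (rotation, uniform scale, translation), which is a simplification chosen to keep the example short and is not the specific transformation model of the description; the tolerance, iteration count, seed, and function name are likewise assumptions. Points are treated as complex numbers so that the transform is z' = a z + b.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def ransac_similarity(query_pts: Sequence[Point], ref_pts: Sequence[Point],
                      tol: float = 3.0, iterations: int = 200,
                      seed: int = 0) -> List[int]:
    """Return indices of candidates consistent with the best transform found.

    query_pts[k] and ref_pts[k] are the (x, y) positions of the two partial
    regions of correspondence candidate k.
    """
    n = len(query_pts)
    if n != len(ref_pts):
        raise ValueError("point lists must have the same length")
    if n < 2:
        return []
    z = [complex(x, y) for x, y in query_pts]
    w = [complex(x, y) for x, y in ref_pts]
    rng = random.Random(seed)
    best: List[int] = []
    for _ in range(iterations):
        i, j = rng.sample(range(n), 2)
        if z[i] == z[j]:
            continue  # degenerate sample: cannot determine the transform
        # Similarity transform determined exactly by two correspondences.
        a = (w[i] - w[j]) / (z[i] - z[j])
        b = w[i] - a * z[i]
        inliers = [k for k in range(n) if abs(a * z[k] + b - w[k]) <= tol]
        if len(inliers) > len(best):
            best = inliers
    return best
```

Candidates returned as inliers would be output as correspondences and the rest rejected. An affine or projective model, or the local optimization step of LO-RANSAC, could replace the two-point model without changing the overall structure.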

[Reference 5] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Comm. ACM, vol. 24, no. 6, pp. 381-395, 1981.

[Reference 6] O. Chum, J. Matas, and S. Obdrzalek, "Enhancing RANSAC by generalized model optimization," Proceedings of Asian Conference on Computer Vision, pp. 812-817, 2004.

Through the above procedure, the suitability of each correspondence candidate between the query image and the reference image is determined, a correspondence candidate is output as a correspondence when it is appropriate, and it can thereby be determined whether the two images contain the same object.

As described above, the verification apparatus according to the embodiment of the present invention applies a convolutional neural network to each of the query image and the reference image and obtains the output of a convolutional layer for each partial region. For each combination of a partial region of the query image and a partial region of the reference image, it computes the cosine similarity of the convolutional layer outputs. It selects the combination of a partial region A of the query image and a partial region B of the reference image as a correspondence candidate when the cosine similarity is higher than a predetermined threshold, the partial region of the reference image with the maximum cosine similarity to the partial region A is the partial region B, and the partial region of the query image with the maximum cosine similarity to the partial region B is the partial region A. It then determines the suitability of each correspondence candidate based on the position coordinates of its partial regions in the images and outputs the candidate as a correspondence when it is appropriate. This enables highly accurate verification for a wider variety of objects.

Furthermore, by using a convolutional neural network, which has higher representational capability than conventional feature vectors, and verifying the geometric consistency of the convolutional layer outputs, highly accurate verification becomes possible for a wider variety of objects.

An example of the configuration of the verification apparatus in an example embodiment of the present invention has been described in detail above. The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

DESCRIPTION OF SYMBOLS
100 Verification apparatus
110 Input unit
120 Feature extraction unit
130 Correspondence candidate calculation unit
140 Verification unit
150 Output unit
160 Reference database
170 Query image

Claims

A verification apparatus for verifying correspondence between a first image and a second image, comprising:
a feature extraction unit that applies a convolutional neural network including at least one convolutional layer to each of the first image and the second image, and obtains an output of the convolutional layer for each partial region of the image;
a correspondence candidate calculation unit that, based on the output of the convolutional layer obtained for each partial region of each of the first image and the second image, obtains the cosine similarity of the outputs of the convolutional layer for each combination of a partial region of the first image and a partial region of the second image,
and that selects a combination of a partial region A of the first image and a partial region B of the second image as a correspondence candidate when the cosine similarity for the combination is higher than a predetermined threshold, the partial region of the second image having the maximum cosine similarity with respect to the partial region A coincides with the partial region B, and the partial region of the first image having the maximum cosine similarity with respect to the partial region B coincides with the partial region A; and
a verification unit that determines the suitability of each correspondence candidate based on the position coordinates, in the images, of the partial regions of the correspondence candidate, and outputs the correspondence candidate as a correspondence when it is appropriate.

The verification apparatus according to claim 1, wherein the correspondence candidate calculation unit stores, as tabu partial regions, a set of partial regions that are inappropriate as correspondences together with the corresponding convolutional layer outputs,
further obtains, based on the output of the convolutional layer obtained for each partial region of each of the first image and the second image, the cosine similarity of the outputs of the convolutional layer for each combination of a partial region of the first image and a tabu partial region and for each combination of a partial region of the second image and a tabu partial region,
and excludes combinations including a partial region C from the correspondence candidates when, for a combination of the partial region C of the first image or the second image and a tabu partial region D, the cosine similarity is higher than a predetermined threshold, the tabu partial region having the maximum cosine similarity with respect to the partial region C coincides with the tabu partial region D, and the partial region of the first image or the second image having the maximum cosine similarity with respect to the tabu partial region D coincides with the partial region C.

The verification apparatus according to claim 1, wherein the verification unit determines the suitability of each correspondence candidate, based on the position coordinates in the images of the partial regions of the correspondence candidate, according to its consistency with the relative positional relationships of the combinations of partial regions of the other correspondence candidates.

A verification method in a verification device for verifying correspondence between a first image and a second image,
A feature extraction unit applies a convolutional neural network including at least one convolutional layer to each of the first image and the second image, and obtains an output of the convolutional layer for each partial area of the image,
A correspondence candidate calculation unit determines, based on the output of the convolutional layer obtained for each partial area of each of the first image and the second image, the cosine similarity of the convolutional layer outputs for each combination of a partial area of the first image and a partial area of the second image,
and, for each combination of a partial area A of the first image and a partial area B of the second image, selects the combination as a correspondence candidate when the cosine similarity is higher than a predetermined threshold, the partial area of the second image having the maximum cosine similarity with respect to the partial area A coincides with the partial area B, and the partial area of the first image having the maximum cosine similarity with respect to the partial area B coincides with the partial area A,
A verification unit determines the suitability of each correspondence candidate based on the position coordinates, in the images, of the partial areas of that correspondence candidate, and outputs the correspondence candidate as a correspondence when it is appropriate. A verification method characterized by the above.
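The first step of the method, obtaining a convolutional layer output per partial area, can be pictured as slicing a feature map into one vector per spatial cell. The sketch below assumes the layer output is already available as a nested list indexed `[channel][y][x]` and that each spatial cell stands for one partial area; both the layout and the name `regions_from_feature_map` are assumptions for illustration, and no particular network is implied.

```python
def regions_from_feature_map(feature_map):
    """Split a channels x height x width feature map into one descriptor per spatial cell.

    Returns (positions, feats): positions[n] is the (x, y) cell coordinate and
    feats[n] is the channel vector of the convolutional layer output at that cell.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    positions, feats = [], []
    for y in range(height):
        for x in range(width):
            positions.append((x, y))
            feats.append([feature_map[c][y][x] for c in range(channels)])
    return positions, feats
```

The resulting descriptors are what the cosine similarity is computed over, and the positions are what the verification step compares between the two images.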

The verification method according to claim 4, wherein, in selecting the correspondence candidates, the correspondence candidate calculation unit stores, as tabu partial areas, a set of partial areas that are inappropriate as correspondences together with the corresponding convolutional layer outputs,
further obtains, based on the output of the convolutional layer obtained for each partial area of each of the first image and the second image, the cosine similarity of the convolutional layer outputs for each combination of a partial area of the first image and a tabu partial area and for each combination of a partial area of the second image and a tabu partial area,
and, for a combination of a partial area C of the first image or the second image and a tabu partial area D, excludes every combination including the partial area C from the correspondence candidates when the cosine similarity is higher than a predetermined threshold, the tabu partial area having the maximum cosine similarity with respect to the partial area C coincides with the tabu partial area D, and the partial area of the first image or the second image having the maximum cosine similarity with respect to the tabu partial area D coincides with the partial area C.

A program for causing a computer to function as each unit of the verification apparatus according to any one of claims 1 to 3.