JP2015133101A

JP2015133101A - Method for constructing descriptor for image of scene

Info

Publication number: JP2015133101A
Application number: JP2014249654A
Authority: JP
Inventors: Shantanu Rane; Rohit Naini
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2014-01-10
Filing date: 2014-12-10
Publication date: 2015-07-23
Also published as: US20150199573A1; DE102015200260A1

Abstract

PROBLEM TO BE SOLVED: To provide a global descriptor for matching Manhattan scenes that can be used for viewpoint-invariant object matching. SOLUTION: A descriptor is associated with a vanishing point 101 in an image by first quantizing an angular region around the vanishing point into a preset number of angular quantization bins, where the centroid of each angular quantization bin indicates the direction of that bin. For each angular quantization bin, a sum is determined of the magnitudes of the pixel gradients for those pixels in the image whose gradient direction is aligned with the direction of the bin. These steps are performed in a processor 150.

Description

The present invention relates generally to computer vision, and more particularly to global descriptors for matching Manhattan scenes, which can be used for viewpoint-invariant object matching.

Viewpoint-invariant object matching is difficult because of image distortions caused by factors such as rotation, translation, illumination, cropping, and occlusion. Visual scene understanding is a well-known problem in computer vision. In particular, identifying objects in a 3D scene from their projection onto a two-dimensional (2D) image plane poses formidable challenges.

It is known that the human visual cortex relies heavily on the presence of edges at physical object boundaries to identify individual objects within the field of view. Using cues from edges, texture, and color, the brain can typically visualize and understand a three-dimensional (3D) scene regardless of viewpoint. In contrast, modern computers lack a high-level processing architecture such as the visual cortex, so low-level viewpoint invariance must be explicitly built into scene descriptors.

Methods for scene understanding fall into two broad classes. One class relies on local keypoints that can be detected reliably regardless of rotation, translation, and other viewpoint changes. Descriptors are then constructed for those keypoints to capture the local structure of gradients, texture, color, and other information that remains invariant to viewpoint changes. The scale-invariant feature transform (SIFT) and speeded-up robust features (SURF) are two examples of keypoint-based descriptors.

The other class of methods captures features at a global scope. Accuracy is obtained by local averaging and by using other statistical properties of the color and gradient distributions. This global approach is used in the histogram of oriented gradients (HOG) and GIST descriptors.

Local and global approaches have complementary characteristics. Local descriptors are accurate and discriminative for their corresponding keypoints, but they lack global structural cues about larger objects; such cues can be inferred only after establishing correspondences among several local descriptors associated with those keypoints. Global descriptors tend to capture aggregate statistics of the image, but they do not contain the specific geometric or structural cues that are often relevant to scene understanding.

Many man-made scenes satisfy the Manhattan world assumption, in which lines are oriented along three principal orthogonal directions. A very important aspect of Manhattan geometry is that all parallel lines along a dominant direction intersect at a vanishing point in the 2D image plane. In scenes where three orthogonal directions may not be present, the lines may satisfy a single dominant direction, e.g., vertical or horizontal, or may include multiple dominant non-orthogonal directions, e.g., objects of furniture in a room.

Embodiments of the present invention provide a global descriptor for Manhattan scenes. Manhattan scenes typically have dominant directional orientations in three orthogonal directions. Thus, all parallel edges in 3D that lie along a dominant direction intersect at the corresponding vanishing point (VP) in the 2D image plane. All scene edges retain their relative spatial locations and strengths as seen from the VP. The global descriptor is based on the spatial locations and intensities of the image edges in the Manhattan scene around the vanishing points. The method uses 8 kilobits per descriptor and up to three descriptors per image (one per VP), and thus provides efficient storage and data transfer for matching compared to local keypoint descriptors such as SIFT.

The method constructs the global descriptor by strictly maintaining the angular ordering of parallel lines across images as those lines intersect at the vanishing point. The relative lengths and relative angles (orientations or directions) of the parallel lines meeting at the vanishing point are approximately the same.

The compact global image descriptor for Manhattan scenes captures the relative locations and strengths of edges along the vanishing directions. To construct the descriptor, an edge map is determined for each vanishing point. The edge map encodes the edge strength over the range of angles or directions measured with respect to the vanishing point.

For object matching, descriptors from two scenes are compared across multiple candidate scales and displacements. Matching performance is improved by comparing the edge shapes at the local maxima of a scale-displacement plot in the form of a histogram.

FIG. 1 is an image of a Manhattan scene containing two vanishing points, for which global descriptors are constructed according to embodiments of the invention.
FIG. 2 is a schematic of the various angles subtended at a vanishing point location with respect to a horizontal reference line, and of the angular quantization bins, according to embodiments of the invention.
FIG. 3 is a schematic of binned pixel intensities of an edge map according to embodiments of the invention.
FIG. 4 shows plotted edge strengths in the angle bins for two different views of a building according to embodiments of the invention.
FIG. 5 is a flow diagram of a method for constructing a global descriptor according to embodiments of the invention.
FIG. 6 is a schematic of an affine transformation between two images according to embodiments of the invention.
FIG. 7 is a histogram of edge strengths on a scale-displacement plot according to embodiments of the invention.
FIG. 8 is a flow diagram of a method for matching objects using the global descriptors according to embodiments of the invention.
FIG. 9 illustrates a metric for measuring the quality of the matching according to embodiments of the invention.

Embodiments of the present invention provide a global descriptor 250 for a Manhattan scene 100. A Manhattan scene typically has dominant directional orientations in three orthogonal directions, and all parallel edges in 3D that lie along one dominant direction intersect at a corresponding vanishing point (VP 101) in the 2D image plane. Note that the Manhattan scene can be indoors or outdoors and can include any number of objects.

The descriptor 250 is constructed (500) from an image 120 acquired by a camera 110. The descriptors can then be used for object matching 800 or other related computer vision applications. The construction and matching can be performed in a processor 150 connected to memory and input/output interfaces by buses, as known in the art.

Vanishing Point Based Image Descriptor
The descriptor is based on the following observations about multiple images 120 (views) of the same object. First, parallel lines in the actual 3D scene strictly maintain their angular ordering across the 2D images (up to an inversion) as they intersect at the vanishing point. Second, the relative lengths and relative angles of the parallel lines meeting at the vanishing point are approximately the same. These observations suggest that the relative locations and strengths of edges oriented along the vanishing directions can be used to construct a descriptor. The steps involved in constructing (500) the descriptors 250 and in using these descriptors for matching are described below.

Seeding the Descriptor at Each Vanishing Point
A vanishing point is defined as the intersection of the projections of lines 102 that are parallel in the 3D scene for which the 2D image 100 is available. A VP can be regarded as the 2D projection of a 3D point at infinity in the direction given by the parallel lines in the 3D scene.

In general, there are many vanishing points, corresponding to the multiple scene directions determined by parallel lines. However, many man-made structures, e.g., urban landscapes, have a regular cuboid geometry. Therefore, typically three vanishing points result from the image projection, two of which are shown in FIG. 1.

VPs have been used in computer vision for image rectification, camera calibration, and related problems. Identifying VPs is straightforward when the parallel lines in the underlying 3D scene are labeled, but becomes harder when no labeling is available. Methods for determining vanishing points include agglomerative clustering of edges, 1D Hough transforms, multi-level random sample consensus (RANSAC) based approaches, and expectation maximization (EM) to assign edges to VPs.

As shown in FIG. 2, the VP locations 200 can be denoted by

$$v_j = (v_{j,x},\, v_{j,y}), \quad 1 \le j \le m,$$

where, typically, m ≤ 3 for a Manhattan scene. Further, let θ_j(x, y) be the angle subtended at VP v_j with respect to the horizontal reference line 201. Thus,

$$\theta_j(x, y) = \tan^{-1}\!\left(\frac{y - v_{j,y}}{x - v_{j,x}}\right).$$

The descriptor 250 is constructed by encoding the relative locations and strengths of the edges converging at each VP. Thus, the descriptor can be regarded as a function D: Θ → R⁺ whose domain contains the angular orientations of the edges converging at the VP and whose range contains a measure of the strengths of these edges, in the correct order. A descriptor is determined for each VP according to the method 500 described below.

Encoding Edge Locations
Line detection procedures often produce broken and cropped lines, miss important edges, and produce spurious lines. Therefore, to obtain accuracy, we work directly with the intensities of edge pixels rather than with lines fitted to image edges, as shown in FIG. 3. The representation of edge strength as a function of the angular location of the edges around the vanishing point is called the edge map 300. Specifically, when the gradient indicates that a pixel is oriented according to the vanishing point for which the descriptor is constructed, the intensities of the pixels in the angle bin 202 are stored and summed individually, as shown in FIG. 2. To do this, as shown in FIG. 5, a gradient g(x, y), which is a 2D vector, is first determined (510) for every pixel in the image.

The direction ψ_g(x, y) 511 of the gradient of the pixel at location (x, y) in the image is the direction along which there is a large intensity change. The magnitude |g(x, y)| 512 of the gradient is the intensity difference at that pixel along the gradient direction.
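By way of illustration only, the gradient direction and magnitude of a pixel can be sketched as follows. The source does not specify the gradient operator; this sketch assumes central differences on a grayscale image stored as a list of rows, and the name `pixel_gradient` is hypothetical.

```python
import math

def pixel_gradient(img, x, y):
    """Return (direction, magnitude) of the intensity gradient at (x, y).

    `img[y][x]` is the intensity. Neighbour indices are clamped at the
    image border, so border pixels get a halved one-sided difference.
    """
    h, w = len(img), len(img[0])
    gx = (img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]) / 2.0
    gy = (img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]) / 2.0
    return math.atan2(gy, gx), math.hypot(gx, gy)

# Horizontal intensity ramp: the gradient points along +x.
ramp = [[0.0, 1.0, 2.0] for _ in range(3)]
print(pixel_gradient(ramp, 1, 1))
```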

Next, the pixel set P_j for the j-th vanishing point is determined (520) as follows:

[equation not reproduced in the source text]

where τ is a threshold selected based on the amount by which the gradient direction is misaligned with the direction of the VP. Once this set P_j is determined, the underlying edge locations are encoded as follows.
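By way of illustration only, membership in the pixel set can be sketched as a thresholded angle test. The defining equation is not reproduced in this text, so the test below rests on a stated assumption: a pixel belongs to the set when the edge through it (taken to be perpendicular to its gradient) points toward the VP to within τ. The names `angle_diff` and `in_pixel_set` are hypothetical.

```python
import math

def angle_diff(a, b):
    """Smallest absolute difference between two undirected line
    angles (angles taken modulo pi); the result is in [0, pi/2]."""
    d = abs(a - b) % math.pi
    return min(d, math.pi - d)

def in_pixel_set(psi_g, theta_j, tau):
    """True if a pixel with gradient direction `psi_g` is consistent
    with the VP direction `theta_j` to within threshold `tau`.

    Assumption (not from the source): the edge direction is the
    gradient direction rotated by pi/2, and it is compared with theta_j.
    """
    edge_dir = psi_g + math.pi / 2
    return angle_diff(edge_dir, theta_j) <= tau

# A vertical gradient implies a horizontal edge, which is consistent
# with a VP lying along the horizontal direction.
print(in_pixel_set(math.pi / 2, 0.0, 0.1))
```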

The pixel angles (directions) are quantized into a preset number (K) of uniform angle bins 202 centered (203) at φ_k, 1 ≤ k ≤ K, within the angular range [θ_min, θ_max] 204 spanning the image, such that

[equation not reproduced in the source text]

so that the centroid of each angular quantization bin indicates the direction of that bin, i.e., the pixel angle.
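By way of illustration only, uniform angular quantization can be sketched as follows. The sketch uses 0-based bin indices, whereas the text indexes bins from 1 to K; the names `bin_centers` and `bin_index` are hypothetical.

```python
def bin_centers(theta_min, theta_max, K):
    """Centers phi_k of K uniform bins covering [theta_min, theta_max]."""
    width = (theta_max - theta_min) / K
    return [theta_min + (k + 0.5) * width for k in range(K)]

def bin_index(theta, theta_min, theta_max, K):
    """0-based index of the uniform bin containing `theta`,
    or None if `theta` is outside [theta_min, theta_max]."""
    if not (theta_min <= theta <= theta_max):
        return None
    width = (theta_max - theta_min) / K
    # The upper boundary theta_max is assigned to the last bin.
    return min(int((theta - theta_min) / width), K - 1)

print(bin_centers(0.0, 1.0, 4))
print(bin_index(0.3, 0.0, 1.0, 4))
```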

Encoding Edge Strengths
Studies of the human visual system suggest that the relative prominence of edges plays a role in visualizing distinctive object patterns. The prominence of an image edge is a function of the length and thickness of the edge and of the lateral variation (intensity and fall-off characteristics) in the direction perpendicular to the edge.

There are several ways of constructing an edge strength metric. For example, if an edge detector is used to construct the descriptor for a particular VP, the strength can be a function of the length of the edge and the cumulative pixel-wise gradient along the edge. However, as described above, using an edge detector is not always accurate. Therefore, a method based on clustering or quantization of the pixel-wise gradients is preferred. This process is described in detail below.

When the pixel set P_j is uniformly quantized into the angle bins 202, one way to encode edge strength is to determine the sum of the gradient magnitudes |g(x, y)| 512 within each angular quantization bin. To do this, as shown in FIG. 2, consider a line segment 203 passing through the center of every angular quantization bin, with endpoints (r_{k,min} cos φ_k, r_{k,min} sin φ_k) and (r_{k,max} cos φ_k, r_{k,max} sin φ_k).

In this case, the descriptor 250 is the following summation:

[equation not reproduced in the source text]

where φ_k, 1 ≤ k ≤ K_j, denotes the angular orientation or direction associated with the quantization bins for the j-th VP, and r can vary over its range at half-pixel resolution.

For accuracy, bilinear interpolation is used to obtain the pixel gradients at subpixel locations. The construction 500 of the descriptor D(k) 250 is performed at subpixel resolution. Examples of descriptors obtained as above, by determining the edge strength in each angle bin, are shown in FIG. 4 for two different views of the same (building) object 401. The corresponding graphs show the normalized intensity sums as a function of the bin index.

Construction Method
FIG. 5 summarizes the basic steps of the construction method. For each pixel in the image 120, the direction 511 and magnitude 512 of the gradient are determined. Next, the sets 521 of gradients whose directions are aligned with the vanishing points are determined. There can be up to three vanishing points. Then, the gradient magnitudes are summed individually for each set and encoded (530) as edge strengths to obtain the descriptor 250 for each vanishing point.
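By way of illustration only, the construction steps can be combined into the following simplified sketch for a single VP. It is not the procedure of the embodiments: instead of sampling gradients along each bin's center line at half-pixel resolution with bilinear interpolation, it accumulates whole-pixel gradient magnitudes per bin. It further assumes central-difference gradients, the perpendicular-edge alignment test, and a VP located to the left of the image so that the angles do not wrap around ±π. The name `build_descriptor` is hypothetical.

```python
import math

def build_descriptor(img, vp, K, tau):
    """Sum gradient magnitudes into K uniform angular bins around `vp`.

    `img[y][x]` is the intensity, `vp` is (x, y), `tau` is the alignment
    threshold in radians. Returns a list D of K edge strengths.
    """
    h, w = len(img), len(img[0])
    vx, vy = vp
    # Angle subtended at the VP by every pixel, and the range it spans.
    theta = [[math.atan2(y - vy, x - vx) for x in range(w)] for y in range(h)]
    t_min = min(min(row) for row in theta)
    t_max = max(max(row) for row in theta)
    width = (t_max - t_min) / K or 1.0  # guard against a zero-width range
    D = [0.0] * K
    for y in range(h):
        for x in range(w):
            # Central-difference gradient with clamped borders.
            gx = (img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]) / 2.0
            gy = (img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]) / 2.0
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            # Assumed alignment test: edge direction (gradient + pi/2)
            # within tau of the direction toward the VP, modulo pi.
            d = abs(math.atan2(gy, gx) + math.pi / 2 - theta[y][x]) % math.pi
            if min(d, math.pi - d) <= tau:
                k = min(int((theta[y][x] - t_min) / width), K - 1)
                D[k] += mag
    return D

# A horizontal step edge converging on a VP far to the left.
img = [[0.0] * 8 for _ in range(4)] + [[1.0] * 8 for _ in range(4)]
print(build_descriptor(img, (-100.0, 3.5), 4, 0.05))
```

In this toy image only the two rows adjacent to the step edge have a nonzero gradient, so the edge strength concentrates in the two central bins.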

Projective Transformation
Our motivation for constructing (500) the global descriptor 250 is to perform matching 800 of objects in images acquired from different viewpoints. Because each image is a 2D projection of the same real-world scene, a geometric relationship usually exists between corresponding keypoints or edges in a pair of images. For example, a homography relationship exists between images of a planar facade of a structure. Our observations suggest that an affine correspondence exists between the descriptors D(k) 250 determined for images of the same object.

The following explains that this observation has a theoretical justification. In particular, it is shown that the transformation of the angles between image lines (edges), which are used in the binning step while constructing (500) the descriptor, is approximately affine.

As shown in FIG. 6, consider two images (views) of the same scene, each consisting of a "pencil" of lines passing through the vanishing point. The vanishing point of the first view is assumed to be located at the origin. Using homogeneous representation, the x-axis and the y-axis are given by e_x = (0 1 0)^T and e_y = (1 0 0)^T, where T is the transpose operator. Using these vectors, any line l_λ of the pencil is expressed as

$$l_\lambda = e_x + \lambda\, e_y,$$

where λ ∈ R.

Without loss of generality, assume that the inter-angle under consideration is the angle between the x-axis and l_λ. Note that θ_λ = tan⁻¹(−λ). Our goal is to show that the angle between the x-axis and l_λ undergoes an approximately affine transformation from one image to the other. To show this, denote the 3×3 homography between the two views by the matrix H. In general, under the homography, the vanishing point is no longer at the origin of the second view, and H e_x is no longer along the x-axis. As shown in FIG. 6, we therefore choose a transformation, given by another 3×3 matrix T, that translates the vanishing point back to the origin and rotates H e_x back to the x-axis.

Denote the transform of l_λ under TH by l_γ, and the angle between l_γ and the x-axis by θ_γ. In this case,

[equation not reproduced in the source text]

where

[equation not reproduced in the source text]

and (a_1, a_2, b_1, b_2) are transformation parameters derived from the elements of T and H. Under the assumption that the vanishing point is far from the image, so that θ_max − θ_min is small, the Taylor series approximation tan⁻¹(α) ≈ α can be used, where α is a small angle (expressed in radians). Therefore,

[equation not reproduced in the source text]

Under the small inter-angle assumption, the second-order term θ_γ θ_λ is negligibly small. Ignoring this cross term, the transformation from θ_λ to θ_γ is approximately affine.
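By way of illustration only, the approximately affine behavior of the angles can be checked numerically. The sketch below does not reproduce the derivation above; it maps a pencil of lines through the origin with an arbitrary, hypothetical homography `H`, measures each mapped line's angle relative to the mapped x-axis, fits a straight line (scale and displacement) by least squares, and reports the worst deviation from that fit. All names and the example matrix are assumptions.

```python
import math

def apply_h(H, p):
    """Apply a 3x3 point homography H to the 2D point p."""
    x, y = p
    X = H[0][0] * x + H[0][1] * y + H[0][2]
    Y = H[1][0] * x + H[1][1] * y + H[1][2]
    W = H[2][0] * x + H[2][1] * y + H[2][2]
    return X / W, Y / W

def mapped_angle(H, theta):
    """Angle of the image of the line through the origin at angle
    `theta`, measured relative to the image of the x-axis."""
    ox, oy = apply_h(H, (0.0, 0.0))
    px, py = apply_h(H, (math.cos(theta), math.sin(theta)))
    ax, ay = apply_h(H, (1.0, 0.0))
    ref = math.atan2(ay - oy, ax - ox)
    return math.atan2(py - oy, px - ox) - ref

def affine_fit_residual(H, thetas):
    """Max deviation of the mapped angles from their best affine fit."""
    ys = [mapped_angle(H, t) for t in thetas]
    n = len(thetas)
    mt, my = sum(thetas) / n, sum(ys) / n
    s = (sum((t - mt) * (y - my) for t, y in zip(thetas, ys))
         / sum((t - mt) ** 2 for t in thetas))
    d = my - s * mt
    return max(abs(y - (s * t + d)) for t, y in zip(thetas, ys))

H = [[1.0, 0.2, 3.0], [0.1, 0.9, -2.0], [0.001, 0.002, 1.0]]
narrow = [-0.05 + 0.01 * i for i in range(11)]  # small angular range
wide = [-1.0 + 0.2 * i for i in range(11)]      # large angular range
print(affine_fit_residual(H, narrow), affine_fit_residual(H, wide))
```

The residual for the narrow range is far smaller than for the wide range, which is consistent with the assumption that the approximation holds when the vanishing point is far from the image.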

Descriptor Matching
An object in a Manhattan scene can have up to three VPs and therefore up to three descriptors. Hence, matching objects seen from two viewpoints without prior orientation information involves up to nine pairwise matching operations. As described above, the angular edge locations undergo an approximately affine transformation with a change in viewpoint. We therefore propose to invert this transformation before comparing the relative shapes of the edge strengths in the pair of descriptors being matched. This inversion step is performed for several candidate scales and displacements, i.e., several candidate affine transformations. From these candidates, the dominant affine transformation (scale-displacement) pairs can be selected. The method 800 is used to compare the descriptors as described below.

Correspondence Mapping for Edges
To determine the approximate affine transformation that maps descriptors between viewpoints, we use the fact that, under the correct correspondence, pairs of coplanar edges yield affine parameters approximately equal to a scale-displacement pair (s, d). Therefore, a Hough-transform-style voting procedure over edge pairs in the (s, d) space results in a local maximum at the true scale s* and displacement d*.

Multiple local maxima occur when the object has multiple planes supported by the VP direction axis. For accuracy and efficiency, prominent edges are identified based on their edge strengths: pixels on edges with a strength greater than a specified percentile threshold are selected. Furthermore, to remain accurate under edge occlusion, only edges within a close angular range are paired to cast votes, e.g., each prominent edge is paired with its C nearest edges.
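By way of illustration only, the selection of prominent edges and their pairing can be sketched as follows. The sketch assumes that the percentile threshold is applied to the per-bin edge strengths of a descriptor and uses a nearest-rank percentile; neither detail is specified in the text, and the names `salient_peaks` and `peak_pairs` are hypothetical.

```python
def salient_peaks(D, percentile):
    """Indices of bins whose strength exceeds the given percentile
    (0 to 100, nearest-rank) of all bin strengths in D."""
    ranked = sorted(D)
    idx = min(len(ranked) - 1, int(len(ranked) * percentile / 100.0))
    thresh = ranked[idx]
    return [k for k, v in enumerate(D) if v > thresh]

def peak_pairs(peaks, C):
    """Pair each peak with its C nearest peaks (by bin index).
    Returns sorted, de-duplicated pairs (k, k2) with k < k2."""
    pairs = set()
    for k in peaks:
        others = sorted((p for p in peaks if p != k), key=lambda p: abs(p - k))
        for p in others[:C]:
            pairs.add((min(k, p), max(k, p)))
    return sorted(pairs)

D = [0.0, 5.0, 0.0, 1.0, 0.0, 9.0, 0.0, 3.0]
peaks = salient_peaks(D, 50)
print(peaks, peak_pairs(peaks, 1))
```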

A descriptor D_1(k), 1 ≤ k ≤ K, can produce a set of N_1 peak pairs (k_i, k'_i), 1 ≤ i ≤ N_1. Similarly, D_2(m) produces a set of N_2 peak pairs (m_j, m'_j), 1 ≤ j ≤ N_2. These identified peak pairs are cross-mapped between the two sets, and votes for the (s, d) histogram are generated using

$$s = \frac{m'_j - m_j}{k'_i - k_i} \quad \text{and} \quad d = m_j - s\,k_i.$$

To allow for angular inversion, i.e., top/bottom and left/right rotations about the VP, additional votes are generated by reversing the order of the peaks in one of the two sets.
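By way of illustration only, the vote generation can be sketched as follows. The sketch assumes the affine model m = s·k + d between bin indices, so that a pair (k, k') mapped onto a pair (m, m') gives s = (m' − m)/(k' − k) and d = m − s·k. Angular inversion is modeled by swapping the order within each pair of the second set, which is one possible reading of the text. The name `affine_votes` is hypothetical.

```python
def affine_votes(pairs1, pairs2):
    """Cross-map peak pairs from two descriptors and return the list
    of (s, d) votes, including votes for the inverted ordering."""
    votes = []
    for k, k2 in pairs1:
        if k2 == k:
            continue  # degenerate pair, scale undefined
        for m, m2 in pairs2:
            # Forward ordering, then reversed ordering (inversion).
            for a, b in ((m, m2), (m2, m)):
                s = (b - a) / (k2 - k)
                d = a - s * k
                votes.append((s, d))
    return votes

# Peaks at bins 2 and 6 map to bins 5 and 13: scale 2, displacement 1.
print(affine_votes([(2, 6)], [(5, 13)]))
```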

As shown in FIG. 7, a coarse histogram 700 of the (s, d) votes can now be used to locate the local maxima (s*, d*). The histogram identifies the scales and displacements at which the two VP-based descriptors have the best match. The local maxima provide the relationship between the edges in the two views of the object. If a local maximum contains too few votes, a mismatch is declared for that (s*, d*) pair. If none of the local maxima contains enough votes, the descriptors do not represent the same object.
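By way of illustration only, locating a maximum in a coarse (s, d) histogram can be sketched as follows. For brevity the sketch returns only the single most-voted cell rather than every local maximum, and the cell sizes `s_step` and `d_step` and the vote threshold `min_votes` are hypothetical parameters.

```python
from collections import Counter

def dominant_affine(votes, s_step, d_step, min_votes):
    """Bin (s, d) votes into a coarse grid and return (s*, d*, count)
    for the most-voted cell, or None if it has fewer than `min_votes`."""
    hist = Counter((round(s / s_step), round(d / d_step)) for s, d in votes)
    if not hist:
        return None
    (si, di), count = hist.most_common(1)[0]
    if count < min_votes:
        return None  # too few votes: declare a mismatch
    return si * s_step, di * d_step, count

votes = [(2.0, 1.0), (2.1, 1.2), (1.9, 0.9), (-2.0, 17.0)]
print(dominant_affine(votes, 0.5, 1.0, 2))
```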

Thus, each descriptor is modified so that the scales and displacements of the descriptors are identical. Next, the difference between the shapes of the peaks in the first descriptor and the shapes of the corresponding peaks in the second descriptor is determined; when this difference is less than a threshold, a match between the two images can be indicated.

Matching Method
FIG. 8 summarizes the basic steps of the matching method 800. For images 801 and 802, the respective descriptors 811 and 812 are constructed (500) as described above. Peaks 821 and 822 are identified (820), and votes for the histogram 700 are generated (830). These peaks identify the scales and displacements at which the two VP-based descriptors have the best match.

It should also be noted that the descriptors can be used as a query to a database of images to retrieve images of similar scenes.

Shape Matching at Corresponding Edges
At each local maximum (s*, d*), the matching process can be refined by using the local shapes of the edge strength plots, e.g., the plots in FIG. 4, of the two descriptors being compared. In essence, after compensating for the scaling factor s* and the displacement d*, what remains is to compare the shapes of the edge strength plots in the neighborhood of the edge pairs that voted for (s*, d*). There are several ways to do this. One embodiment is described below.
a) As shown in FIG. 9, perform the following steps for each prominent peak to construct a metric for measuring the quality of the match.
b) Consider a region in the angular neighborhood of a peak in the first descriptor.
c) Determine the cumulative edge strength vector in this neighborhood, and normalize this vector so that the sum of all edge strengths is 1.
d) Repeat this process for each matching prominent peak in the second descriptor.
e) For each pair of matching peaks, one taken from each descriptor, determine the absolute distance between the normalized cumulative edge strength vectors.
f) The absolute distances obtained in step (e) are averaged over all matching peak pairs, possibly generated from multiple bins, and compared with a threshold.
g) If the average distance between the normalized cumulative edge strength vectors is less than the threshold, a match between the two descriptors is declared.
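By way of illustration only, steps (b) through (g) can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the angular neighborhood is a window of `radius` bins on either side of the peak, the edge strengths in the window are normalized to sum to 1 before accumulation, and the "absolute distance" is the sum of absolute differences. The descriptors are assumed to have already been compensated for scale and displacement. All names are hypothetical.

```python
def normalized_cumulative(D, peak, radius):
    """Cumulative sum of the edge strengths in the window around
    `peak`, with the window normalized so its strengths sum to 1."""
    lo, hi = max(0, peak - radius), min(len(D), peak + radius + 1)
    window = D[lo:hi]
    total = sum(window)
    if total == 0:
        return [0.0] * len(window)
    acc, out = 0.0, []
    for v in window:
        acc += v / total
        out.append(acc)
    return out

def shape_distance(D1, D2, matches, radius):
    """Average absolute distance between the normalized cumulative
    vectors over all matched peak pairs (p1 in D1, p2 in D2).
    `matches` must be non-empty."""
    dists = []
    for p1, p2 in matches:
        c1 = normalized_cumulative(D1, p1, radius)
        c2 = normalized_cumulative(D2, p2, radius)
        # zip truncates to the shorter window near descriptor borders.
        dists.append(sum(abs(a - b) for a, b in zip(c1, c2)))
    return sum(dists) / len(dists)

def is_match(D1, D2, matches, radius, threshold):
    """Declare a match when the average distance is below `threshold`."""
    return shape_distance(D1, D2, matches, radius) < threshold

D1 = [0.0, 1.0, 2.0, 1.0, 0.0]
D2 = [0.0, 2.0, 4.0, 2.0, 0.0]  # same shape, different overall strength
print(shape_distance(D1, D2, [(2, 2)], 1))
```

Because each window is normalized, two peaks with the same shape but different overall strength yield a zero distance in this sketch.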

Claims

A method for constructing a descriptor for an image of a scene, wherein the descriptor is associated with a vanishing point in the image, the method comprising the steps of:
quantizing an angular region around the vanishing point into a preset number of angular quantization bins, wherein a centroid of each angular quantization bin indicates a direction of the angular quantization bin; and
determining, for each angular quantization bin, a sum of magnitudes of pixel gradients for pixels in the image at which a direction of the pixel gradient is aligned with the direction of the angular quantization bin,
wherein the steps are performed in a processor.

The method of claim 1, wherein the scene is a Manhattan scene satisfying the Manhattan world assumption.

The method of claim 1, wherein the angular quantization bins are uniform.

The method of claim 1, wherein the angular quantization bins are determined by clustering the directions of the pixel gradients, and the directions are measured with respect to the location of the vanishing point.

The method of claim 1, wherein the pixel gradients are determined individually for each pixel.

The method of claim 1, wherein the pixel gradients are determined by performing edge detection on the image to determine edge strengths, and determining the pixel gradients of only those pixels, identified as peaks, whose edge strengths are greater than a specified percentile threshold.

The method of claim 1, wherein the gradients are determined at subpixel locations.

The method of claim 1, further comprising:
comparing a first descriptor and a second descriptor constructed from two images acquired from different viewpoints of the scene.

The method of claim 8, further comprising:
constructing a metric that measures a quality of the matching.

The method of claim 8, further comprising:
identifying, from the descriptor of each image, pixels having edge strengths greater than a specified percentile threshold as peaks;
generating a scale-displacement plot such that pairs of peaks selected from the first descriptor, cross-mapped according to given scale and displacement values, correspond to pairs of peaks selected from the second descriptor;
identifying one or more local maxima in the scale-displacement plot; and
comparing the two descriptors using the scale and displacement values at each local maximum.

The method of claim 10, wherein the comparing further comprises:
modifying each descriptor such that the scale and the displacement of the descriptors are identical;
determining a difference between the peaks in the first descriptor and the peaks in the second descriptor; and
declaring a match between the two images when the difference is less than a threshold.

The method of claim 11, wherein determining the difference further comprises:
calculating, for corresponding peaks in the first descriptor and the second descriptor, a cumulative edge strength in an angular neighborhood of the peak;
normalizing the cumulative edge strength such that the sum of the edge strengths in the angular neighborhood of the peak is 1; and
calculating a distance between the normalized cumulative edge strength of the first descriptor and the normalized cumulative edge strength of the second descriptor.

The method of claim 1, further comprising:
retrieving similar images from a database of images based on the descriptor.

The method of claim 1, wherein a pixel set of the vanishing point is

[equation not reproduced in the source text]

where ψ_g(x, y) is the direction of the gradient of the pixel at location (x, y) in the image, θ_j(x, y) is the angle subtended at the vanishing point with respect to a horizontal reference line, and τ is a threshold selected based on the amount by which the direction is misaligned with the direction of the vanishing point.

The method of claim 1, further comprising:
quantizing the directions into a preset number (K) of bins centered at φ_k, 1 ≤ k ≤ K, within an angular range [θ_min, θ_max], such that [equation not reproduced in the source text].

The method of claim 15, wherein the descriptor is

[equation not reproduced in the source text]

where φ_k, 1 ≤ k ≤ K_j, denotes the direction of the bin, and r varies over its range at half-pixel resolution.