JP2001134593A

JP2001134593A - Method and device for neighborhood data retrieval and storage medium stored with neighborhood data retrieving program

Info

Publication number: JP2001134593A
Application number: JP31593299A
Authority: JP
Inventors: Tadashiro Yoshida; Hiroki Akama; Fumikazu Konishi
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1999-11-05
Filing date: 1999-11-05
Publication date: 2001-05-18

Abstract

PROBLEM TO BE SOLVED: To provide a method and a device for neighborhood data retrieval, and a storage medium storing a neighborhood data retrieving program, that retrieve neighborhood data efficiently by reducing random access to a disk and by removing the need to always reach the bottom layer in order to find the closest vector. SOLUTION: Feature variable vectors present in a multidimensional space are managed with unique identifiers. The multidimensional space is divided hierarchically into partial spaces, and a representative vector of each partial space is extracted from the group of feature variable vectors present in that partial space. The extracted representative vectors are managed for each layer with unique identifiers, and the distances between the representative vectors and the feature variable vector of a retrieval key are calculated. According to the calculated distances, feature variable vectors similar to that of the retrieval key are output as the retrieval result, drawn from the subset of feature variable vectors in the multidimensional space that has been narrowed down with the representative vectors.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a neighborhood data search method and apparatus and to a storage medium storing a neighborhood data search program. In particular, it relates to a method, an apparatus, and a stored program that, given a search key, quickly return the k most similar data items in descending order of similarity.

[0002]

2. Description of the Related Art

For example, in "Image Content Retrieval Technology Based on Surface Features such as Colors and Shapes" (Transactions of the Information Processing Society of Japan: Database, Vol. 40, No. SIG3 (TOD1), pp. 171-184), the color and shape features extracted from each image are expressed as vectors in a multidimensional space (multidimensional vectors) and stored in a database as the feature vectors of the image data. At search time, the distance between the feature vector extracted from the given search key image and the feature vector of each image stored in the database is calculated, and images at a short distance are treated as similar images. Image data retrieval is realized in this way.

[0003] In such a system, the distance between the feature vector representing the search key and the feature vector representing each stored data item generally serves as the similarity measure. Because this distance calculation is computationally expensive, many techniques for reducing it have been proposed. One of them is the method of "Similar feature retrieval method and apparatus and similar feature retrieval program" disclosed in Japanese Patent Application No. H11-229459. In this method, each feature vector in the multidimensional space is used in advance as a search key, and its distance to every other feature vector in the space is calculated. For each feature vector, a fixed number of the feature vectors that are similar when that vector is the search key are held as a pre-search result list. At search time, starting from a feature vector similar to the given search key (a quasi-neighbor vector), the search result is built by recalculating the distances between the feature vectors in the pre-search result list and the feature vector of the search key. Applying this process in a chain corrects the search result. Accordingly, if the starting point is the feature vector closest to the search key (the nearest neighbor vector), no correction is needed. If the starting point is far from the nearest neighbor vector, however, chained correction may fail to reach the nearest neighbor vector. The feature vector chosen as the starting point is therefore important.

[0004] One way to obtain the feature vector that serves as the starting point is the tree-like indexing method proposed in D. A. White and R. Jain: "Similarity Indexing: Algorithms and Performance", In Proceedings of the SPIE: Storage and Retrieval for Image and Video Databases IV, 1996, San Jose, CA, Vol. 2670, pp. 62-75. For the group of feature vectors existing in a multidimensional space, this method measures the variance of the data along each dimension axis, selects the axis with the largest variance, and splits the data in two at approximately the median of the spread of all data on that axis. By applying this process successively to the resulting subspaces, the multidimensional space is organized hierarchically as an n-ary tree.
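The variance-based split described above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation from the cited paper: the function names `split_by_max_variance` and `build_tree`, the `leaf_size` parameter, and the dictionary layout of the tree are assumptions, and the exact median is used where the text says "approximately the median".

```python
from statistics import median, pvariance


def split_by_max_variance(points):
    """Split points in two at the median of the axis with the largest variance."""
    dims = len(points[0])
    # Variance of the data along each dimension axis.
    variances = [pvariance([p[d] for p in points]) for d in range(dims)]
    axis = max(range(dims), key=lambda d: variances[d])
    cut = median(p[axis] for p in points)
    left = [p for p in points if p[axis] < cut]
    right = [p for p in points if p[axis] >= cut]
    return axis, cut, left, right


def build_tree(points, leaf_size=2):
    """Apply the split recursively until a subspace holds at most leaf_size points."""
    if len(points) <= leaf_size:
        return {"points": points}
    axis, cut, left, right = split_by_max_variance(points)
    if not left or not right:
        # All values on the chosen axis are equal, so no further split is possible.
        return {"points": points}
    return {
        "axis": axis,
        "cut": cut,
        "left": build_tree(left, leaf_size),
        "right": build_tree(right, leaf_size),
    }
```

With two branches per node the result is a binary tree; an n-ary variant would cut the chosen axis at more than one position.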

[0005]

PROBLEM TO BE SOLVED BY THE INVENTION

However, in such a tree-like indexing method, the search descends toward lower levels using, as the similarity measure, the distance between the feature vector of the search key and each subspace (multidimensional rectangle) of the hierarchically organized multidimensional space. At each level, there is no guarantee that the nearest neighbor vector lies in the multidimensional rectangle closest to the search key. In addition, the multidimensional rectangles at intermediate levels hold only the boundary information that delimits a subspace of the multidimensional space.
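A small worked example may make this problem concrete. The Python sketch below assumes Euclidean distance and axis-aligned bounding rectangles; the function names and the sample coordinates are hypothetical and are not taken from the source. The query point falls inside the rectangle of one group, yet its nearest vector belongs to the other group.

```python
import math


def rect_distance(q, lo, hi):
    """Minimum Euclidean distance from point q to the axis-aligned rectangle [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))


def bounding_rect(points):
    """Return the (lower corner, upper corner) of the rectangle enclosing the points."""
    dims = range(len(points[0]))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    return lo, hi


# Two hypothetical subspaces and a query point.
group_a = [(0.0, 0.0), (4.0, 10.0)]
group_b = [(5.0, 4.0), (9.0, 6.0)]
query = (3.5, 4.0)

dist_a = rect_distance(query, *bounding_rect(group_a))  # query lies inside rectangle A
dist_b = rect_distance(query, *bounding_rect(group_b))  # rectangle B is farther away
nearest = min(group_a + group_b, key=lambda p: math.dist(query, p))
```

Here rectangle A is the closest rectangle, but `nearest` is a member of `group_b`, so descending only into the closest rectangle would miss the nearest neighbor vector.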

[0006] Consequently, it is not sufficient, at each level, simply to descend into the level directly below the multidimensional rectangle at the shortest distance, judged by the distance between the feature vector of the search key and the multidimensional rectangles at that level. The hierarchy must be traversed as a whole while keeping as candidates every multidimensional rectangle that may contain the nearest neighbor vector, and in a high-dimensional space in particular the number of candidate rectangles becomes large. Moreover, because the multidimensional rectangles at intermediate levels hold no feature vector information, finding the nearest feature vector always requires descending to the lowest level.

[0007] In addition, such a tree-like index is generally placed on a disk, and when the number of data items is large, access to the disk becomes random. The present invention has been made in view of the above points. Its object is to provide a neighborhood data search method and apparatus, and a storage medium storing a neighborhood data search program, that reduce random disk access, remove the need to always descend to the lowest level in order to find the nearest neighbor vector, and search neighborhood data efficiently.

[0008]

MEANS FOR SOLVING THE PROBLEM

FIG. 1 is a diagram for explaining the principle of the present invention. The present invention (claim 1) is a neighborhood data search method in which the features of each data item are expressed as a feature vector in a multidimensional space and the top k search results are returned in ascending order of distance from the feature vector expressed by the search key. In this method, the feature vectors existing in the multidimensional space are managed with unique identifiers (step 1); the multidimensional space is hierarchically divided into subspaces (step 2); a representative vector representing each subspace is extracted from the group of feature vectors existing in that subspace (step 3); the extracted representative vectors of the subspaces are managed with unique identifiers for each level (step 4); the distances between the representative vectors and the feature vector of the search key are calculated (step 5); and, on the basis of the calculated distances, feature vectors similar to the feature vector of the search key are output as the search result, taken from the subset of feature vectors in the multidimensional space that has been narrowed down by the representative vectors (step 6).

[0009] The present invention (claim 2) finds the center point of all the points in the subspace that are indicated by the feature vectors existing in the subspace under search, and takes the point closest to that center point as the representative vector of the subspace. The present invention (claim 3) uses the center of gravity instead of the center point. The present invention (claim 4), for the feature vectors from which the center point or center of gravity is calculated and from which the representative vector is chosen, uses only the representative vectors extracted at the level directly below the subspace instead of all the feature vectors existing in the subspace.
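The two extraction rules can be illustrated with a short Python sketch. The source does not define "center point" precisely; the sketch assumes it is the midpoint of the bounding box of the points, and takes the center of gravity to be the arithmetic mean. The function names and the sample vectors are hypothetical.

```python
import math


def center_point(points):
    """Midpoint of the bounding box of the points (an assumed reading of 'center point')."""
    dims = range(len(points[0]))
    return tuple((min(p[d] for p in points) + max(p[d] for p in points)) / 2 for d in dims)


def centroid(points):
    """Arithmetic mean of the points, used here as the center of gravity."""
    dims = range(len(points[0]))
    return tuple(sum(p[d] for p in points) / len(points) for d in dims)


def representative(vectors, anchor=center_point):
    """Return the (identifier, vector) pair whose vector lies closest to the anchor point."""
    target = anchor([vec for _, vec in vectors])
    return min(vectors, key=lambda item: math.dist(item[1], target))


# Hypothetical contents of one subspace: (identifier, feature vector) pairs.
subspace = [
    ("a", (0.0, 0.0)),
    ("b", (1.0, 1.0)),
    ("c", (2.0, 0.0)),
    ("d", (10.0, 1.0)),
    ("e", (6.0, 1.0)),
]

rep_by_center = representative(subspace)                     # closest to the box midpoint
rep_by_centroid = representative(subspace, anchor=centroid)  # closest to the mean
```

For this data the two rules pick different vectors, because the outlying vector `d` pulls the bounding-box midpoint further than it pulls the mean. Passing the representatives of the child subspaces to `representative` in place of all vectors would give a variant in the spirit of claim 4.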

[0010] The present invention (claim 5) manages the plurality of representative vectors at each level as a representative vector management list for that level. When a required time for the search response is given, it determines which level's representative vector management list to use, on the basis of the number of representative vectors contained in each list and the required time for the search response.
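The source states only that the choice depends on the list sizes and the required response time; it does not give a selection rule. One possible rule, sketched below in Python under stated assumptions, is to pick the deepest level whose list can be scanned within the budget, assuming a fixed cost per distance calculation. The function name, the parameters, and the cost model are all hypothetical.

```python
def choose_level(list_sizes, time_budget_us, cost_per_vector_us):
    """Pick the deepest level whose representative list can be scanned within the budget.

    list_sizes[i] is the number of representative vectors held at level i + 1.
    Times are integer microseconds (an assumed unit) to avoid floating-point surprises.
    """
    affordable = [
        level
        for level, size in enumerate(list_sizes, start=1)
        if size * cost_per_vector_us <= time_budget_us
    ]
    # Fall back to the coarsest level when even that one exceeds the budget.
    return max(affordable) if affordable else 1
```

For example, with lists of 2, 4, 8, and 16 representatives and an assumed cost of 10 microseconds per vector, a budget of 50 microseconds selects level 2, since scanning level 3 would already take 80 microseconds.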

[0011] FIG. 2 is a diagram showing the principle configuration of the present invention. The present invention (claim 6) is a neighborhood data search apparatus in which the features of each data item are expressed as a feature vector in a multidimensional space and the top k search results are returned in ascending order of distance from the feature vector expressed by the search key. The apparatus comprises: feature vector management means 11 for managing the feature vectors existing in the multidimensional space with unique identifiers; space division means 12 for hierarchically dividing the multidimensional space into subspaces; representative vector extraction means 13 for extracting, from the group of feature vectors existing in each subspace divided by the space division means 12, a representative vector representing that subspace; representative vector management means 14 for managing the extracted representative vectors of the subspaces with unique identifiers for each level; distance calculation means 15 for calculating the distances between the representative vectors managed by the representative vector management means 14 and the feature vector of the search key; and search result management means 16 for performing, on the basis of the distances calculated by the distance calculation means 15, a neighborhood search for feature vectors similar to the feature vector of the search key within the subset of feature vectors in the multidimensional space narrowed down by the representative vectors, and returning them as the search result.

[0012] The present invention (claim 7) is a storage medium storing a neighborhood data search program in which the features of each data item are expressed as a feature vector in a multidimensional space and the top k search results are returned in ascending order of distance from the feature vector expressed by the search key. The program comprises: a feature vector management process for managing the feature vectors existing in the multidimensional space with unique identifiers; a space division process for hierarchically dividing the multidimensional space into subspaces; a representative vector extraction process for extracting, from the group of feature vectors existing in each subspace divided by the space division process, a representative vector representing that subspace; a representative vector management process for managing the representative vectors of the subspaces extracted by the representative vector extraction process with unique identifiers for each level; a distance calculation process for calculating the distances between the representative vectors managed by the representative vector management process and the feature vector of the search key; and a search result management process for performing, on the basis of the distances calculated by the distance calculation process, a neighborhood search for feature vectors similar to the feature vector of the search key within the subset of feature vectors in the multidimensional space narrowed down by the representative vectors, and returning them as the search result.

[0013] In the present invention (claim 8), the representative vector extraction process includes a process that finds the center point of all the points in the subspace that are indicated by the feature vectors existing in the subspace under search, and takes the point closest to that center point as the representative vector of the subspace. In the present invention (claim 9), the representative vector extraction process includes a process that uses the center of gravity instead of the center point.

[0014] In the present invention (claim 10), the representative vector extraction process, for the feature vectors from which the center point or center of gravity is calculated and from which the representative vector is chosen, uses only the representative vectors extracted at the level directly below the subspace instead of all the feature vectors existing in the subspace.

[0015] The present invention (claim 11) comprises a process for managing the plurality of representative vectors at each level as a representative vector management list for that level, and a process for determining, when a required time for the search response is given, which level's representative vector management list to use, on the basis of the number of representative vectors contained in each list and the required time for the search response.

[0016] In the above, the feature vector management means manages the feature vectors existing in the multidimensional space with unique identifiers. All data, including items located at the same point, can therefore be uniquely identified. The space division means builds hierarchical subspaces (multidimensional rectangles) from the multidimensional space containing all the feature vectors, on the basis of the number of divisions at each level, the distances between feature vectors, and the like. A plurality of feature vectors that are reasonably close to one another can therefore be managed by a single multidimensional rectangle. Furthermore, because the multidimensional rectangles are built hierarchically, sets of representative vectors (representative vector groups) can be built at several granularities.

[0017] The representative vector extraction means extracts, from all the feature vectors existing in a subspace, the feature vector that represents that subspace (the representative vector). This makes it possible to select the feature vectors that are subject to the search. The representative vector management means manages the representative vectors of the subspaces at each level, extracted by the representative vector extraction means, with identifiers that are unique within each level, and stores them in a contiguous area. The feature vectors targeted at search time can therefore be uniquely identified, and the plurality of feature vectors to be searched can be read from the storage area in a single batch.
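Contiguous storage with batch reading can be sketched in Python using fixed-size binary records. The record layout below (a 32-bit unsigned integer identifier followed by 64-bit floats, little-endian) and the function names are assumptions made for illustration; the source does not specify a storage format.

```python
import struct


def pack_list(reps, dims):
    """Serialize (identifier, vector) pairs as fixed-size records in one contiguous buffer."""
    record = struct.Struct(f"<I{dims}d")
    return b"".join(record.pack(ident, *vec) for ident, vec in reps)


def unpack_list(buffer, dims):
    """Decode every record from a buffer obtained with one sequential read."""
    record = struct.Struct(f"<I{dims}d")
    return [(fields[0], fields[1:]) for fields in record.iter_unpack(buffer)]
```

Writing the bytes returned by `pack_list` to one file per level would let a search load the whole list for that level with a single sequential read, rather than following pointers between tree nodes.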

[0018] The distance calculation means calculates the distance between the feature vector of the search key and each feature vector managed by the representative vector management means. This yields the reference value that serves as the similarity of the search results. The search result management means uses the distance calculation means to calculate, exhaustively, the distance between the feature vector of the search key and each feature vector read in sequence from the representative vector management means, then sorts the feature vectors by the resulting distances and returns the requested number of them.
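The exhaustive scan and sort described above can be sketched in Python as follows. Euclidean distance is an assumption, since the source does not name a metric, and the function name and sample data are hypothetical.

```python
import heapq
import math


def quasi_neighbor_search(query, reps, k):
    """Return the k (distance, identifier) pairs closest to the query, nearest first."""
    # One distance calculation per representative vector, in list order.
    scored = ((math.dist(query, vec), ident) for ident, vec in reps)
    return heapq.nsmallest(k, scored)


# Hypothetical representative vector management list for one level.
level_list = [
    ("b", (1.0, 1.0)),
    ("d", (4.0, 5.0)),
    ("g", (6.0, 1.0)),
    ("k", (9.0, 9.0)),
]

top2 = quasi_neighbor_search((5.0, 1.0), level_list, k=2)
```

`heapq.nsmallest` keeps only the k best candidates while the list is scanned, which corresponds to maintaining a sorted result list during the distance calculation.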

[0019] Accordingly, the representative vectors, which form a subset extracted from all the feature vectors in the multidimensional space without distorting their distribution, can be stored and managed in a contiguous area. By successively calculating the distances between these representative vectors and the feature vector of the search key, the neighborhood search against the feature vector of the search key, which is the object of the present invention, becomes possible.

[0020]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention is a kind of neighborhood search. However, because the search targets are narrowed down by representative vectors, accuracy is not necessarily guaranteed, including for the nearest neighbor. In the following description, such a search is therefore called a quasi-neighbor search, and the apparatus described below is called a quasi-neighbor data search apparatus.

[0021] FIG. 3 shows the configuration of the quasi-neighbor data search apparatus of the present invention. The apparatus shown in the figure comprises a feature vector management unit 11, a space division unit 12, a representative vector extraction unit 13, a representative vector management unit 14, a distance calculation unit 15, and a search result management unit 16. The feature vector management unit 11 manages the feature vectors existing in the multidimensional space with unique identifiers.

[0022] The space division unit 12 builds hierarchical subspaces (multidimensional rectangles) from the multidimensional space containing all the feature vectors, on the basis of the number of divisions at each level, the distances between feature vectors, and the like. The representative vector extraction unit 13 extracts, from all the feature vectors existing in a subspace, the feature vector that represents that subspace (the representative vector).

[0023] The representative vector management unit 14 manages the representative vectors of the subspaces at each level, extracted by the representative vector extraction unit 13, with identifiers that are unique within each level, and stores them in a contiguous area. The distance calculation unit 15 calculates the distance between the feature vector of the search key and each feature vector managed by the representative vector management unit.

[0024] The search result management unit 16 uses the distance calculation unit 15 to calculate, exhaustively, the distance between the feature vector of the search key and each feature vector read in sequence from the representative vector management unit 14, then sorts the feature vectors by the resulting distances and returns the requested number of them. Next, the operation of the quasi-neighbor data search apparatus with the above configuration is described.

[0025] FIG. 4 is a flowchart of the operation of the quasi-neighbor data search apparatus of the present invention.

Step 101) The feature vector management unit 11 manages the feature vectors with unique identifiers in advance. A search key is input to the quasi-neighbor data search apparatus.

Step 102) The space division unit 12 hierarchically divides the multidimensional space into subspaces.

[0026] Step 103) The representative vector extraction unit 13 extracts, from the group of feature vectors existing in each subspace divided in step 102, the feature vector that represents that subspace (the representative vector).

Step 104) The representative vector management unit 14 manages the representative vectors extracted in step 103.

[0027] Step 105) The distance calculation unit 15 calculates the distance between the feature vector of the input search key and each representative vector managed by the representative vector management unit 14.

Step 106) On the basis of the calculated distances, the search result management unit 16 finds, by neighborhood search, feature vectors similar to the feature vector of the search key within the subset of feature vectors in the multidimensional space narrowed down by the representative vectors.

[0028] Step 107) The retrieved feature vectors are output.

[0029]

EXAMPLES

Examples of the present invention are described below with reference to the drawings. In the feature vector management unit 11, for example, each feature vector is stored in a file, and the feature vector management unit 11 manages the feature vectors with unique identifiers.

[0030] The space division unit 12 uses a tree-like index construction method to divide the multidimensional space into subspaces at a plurality of levels. FIG. 5 shows an example in which one tree-like index is applied in the space division unit of one embodiment of the present invention to divide a two-dimensional space down to the second level. In the figure, 21 and 22 denote subspaces at the first level, and 211, 212, 221, and 222 denote subspaces at the second level. For the division, the upper limit on the number of branches to the level directly below is set to 2 (a binary tree). In the example of the figure, rectangles 21 and 22 are the subspaces at the first level (the root level is taken as level 0), and 211, 212, 221, and 222 are the subspaces at the second level.

[0031] FIG. 6 shows the tree-like index constructed in this way. The figure schematically represents the final tree-like index obtained by spatially dividing the feature vectors of FIG. 5. In a conventional search method using a tree-like index, when a search key is given, the distances to the candidate subspaces 21, 22, 211, and so on are calculated from the root downward, and the search proceeds while the subspaces are sorted by those distances.

[0032] The representative vector extraction unit 13 extracts a representative vector from each divided subspace. For example, at the second level in FIG. 6, suppose the method is applied that finds the center point of all the points in the subspace indicated by the feature vectors existing in that subspace and takes the point closest to the center point as the representative vector. Then, in subspace 211, feature vector b is extracted from feature vectors a, b, and c as the one closest to the center point, and in subspace 221, feature vector g is extracted from feature vectors g, h, and i.

[0033] On the other hand, when the method that uses the center of gravity instead of the center point is applied, feature vector b is extracted in subspace 211 and feature vector i is extracted in subspace 221. The representative vector management unit 14 stores the representative vectors extracted by the representative vector extraction unit 13 for each level in the same manner as the feature vector management unit 11, for example in a file as pairs of a feature vector identifier and the values of each dimension, and manages them as a representative vector management list.

[0034] FIG. 7 shows an example of the representative vector management lists of the representative vector management unit in one embodiment of the present invention. The lists shown in the figure schematically represent how the representative feature vectors are managed for each level when, for the set of feature vectors in FIG. 5, the center point of all the points in each subspace is found and the point closest to that center point is taken as the representative vector of the subspace. FIG. 7(A) shows the representative vector management list for the first level, and FIG. 7(B) shows the representative vector management list for the second level.

[0035] The search result management unit 16 first determines, for the feature vector of the given search key, which layer's representative vector management list to use, based on the required search response time and the number of feature vectors included in the representative vector management list of each layer (for example, the second layer in FIG. 6 is selected). Next, the distance calculation unit 15 calculates the distance between the feature vector of the search key and every feature vector managed by the selected representative vector management list (in the case of FIG. 6, the representative vectors of subspaces 211, 212, 221, and 222). While the distances are being calculated, the search result management unit 16 sorts the list of search results, thereby constructing and managing the search result list.
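The two steps of this paragraph, choosing a layer and then scanning its management list while keeping the results sorted, can be sketched as follows. The function names, the linear cost model (`cost_per_vector` seconds per distance calculation), the fallback to the smallest list, and the sample data are assumptions for illustration; the text does not specify how the response time is weighed against list size.

```python
import bisect
import math

def choose_layer(layer_sizes, time_budget, cost_per_vector):
    """Pick the largest management list whose full scan is estimated to fit the budget.

    `layer_sizes[i]` is the number of representative vectors in layer i + 1.
    Falls back to the smallest list if none fits (an assumed policy).
    """
    fitting = [i for i, size in enumerate(layer_sizes)
               if size * cost_per_vector <= time_budget]
    if fitting:
        return max(fitting, key=lambda i: layer_sizes[i])
    return min(range(len(layer_sizes)), key=lambda i: layer_sizes[i])

def quasi_neighbor_search(management_list, query, k):
    """Scan one management list, keeping results sorted by distance as they arrive."""
    results = []  # (distance, identifier) pairs in ascending order of distance
    for vid, coords in management_list:
        bisect.insort(results, (math.dist(coords, query), vid))
        del results[k:]  # retain only the k nearest found so far
    return results

# Hypothetical two-layer structure: each entry is (identifier, coordinates).
layers = [
    [("r1", (2.0, 2.0)), ("r2", (7.0, 6.0))],
    [("b", (2.0, 2.5)), ("e", (3.0, 7.0)), ("g", (5.0, 5.0)), ("k", (8.0, 2.0))],
]
layer = choose_layer([len(l) for l in layers], time_budget=0.005, cost_per_vector=0.001)
top = quasi_neighbor_search(layers[layer], query=(4.0, 4.0), k=2)
```

Inserting each result into an already sorted list mirrors the text's description of sorting at the same time as the distance calculation, so no separate sort pass is needed after the scan.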

[0036] The quasi-neighbor search result constructed in this way is returned. Although the above embodiment has been described based on the configuration of FIG. 3, the present invention can also be realized easily by building the components shown in that figure as a program, storing the program on a disk device connected to the computer used as the quasi-neighbor data search device, or on a portable storage medium such as a floppy (registered trademark) disk or a CD-ROM, and installing it when the present invention is carried out.

[0037] The present invention is not limited to the above embodiment, and various modifications and applications are possible within the scope of the claims.

[0038]

[Effects of the Invention] As described above, according to the present invention, the representative vectors, which are a subset extracted from all the feature vectors existing in the multidimensional space without distorting the bias of their distribution, can be stored and managed in a contiguous area. By sequentially obtaining the distance between each representative vector and the feature vector of the search key, a quasi-neighbor search with respect to the feature vector of the search key can be performed.

[Brief description of the drawings]

FIG. 1 is a diagram for explaining the principle of the present invention.

FIG. 2 is a diagram showing the principle configuration of the present invention.

FIG. 3 is a configuration diagram of the quasi-neighbor data search device of the present invention.

FIG. 4 is a flowchart of the operation of the quasi-neighbor data search device of the present invention.

FIG. 5 is an example in which a two-dimensional space is divided into two layers by applying a tree-like index in the space division unit according to an embodiment of the present invention.

FIG. 6 is an example of a tree-like search according to an embodiment of the present invention.

FIG. 7 is an example of the representative vector management list of the representative vector management unit according to an embodiment of the present invention.

[Explanation of symbols]

11 Feature vector management means, feature vector management unit
12 Space division means, space division unit
13 Representative vector extraction means, representative vector extraction unit
14 Representative vector management means, representative vector management unit
15 Distance calculation means, distance calculation unit
16 Search result management means, search result management unit

Continued from the front page

(72) Inventor: Fumikazu Konishi, c/o Nippon Telegraph and Telephone Corporation, 2-3-1 Otemachi, Chiyoda-ku, Tokyo

F-terms (reference): 5B075 ND35 NR15 PP22 PQ74 PR06 QM08; 5L096 DA02 FA60 FA62 FA66 HA09 JA11 KA09 MA07

Claims


1. A neighborhood data search method in which a feature amount of each data item is expressed as a vector in a multidimensional space (hereinafter referred to as a feature vector) and the top k search results are returned in ascending order of distance from the feature vector represented by a search key, the method comprising: managing the feature vectors existing in the multidimensional space with unique identifiers; hierarchically dividing the multidimensional space into subspaces; extracting, from the group of feature vectors existing in each divided subspace, a representative vector representing that subspace; managing the extracted representative vectors of the subspaces with unique identifiers for each layer; calculating the distance between each representative vector and the feature vector of the search key; and, based on the calculated distances, outputting as a search result feature vectors similar to the feature vector of the search key, targeting a subset of the feature vectors existing in the multidimensional space that has been narrowed down by the representative vectors.

2. The neighborhood data search method according to claim 1, wherein the center point of all the points in the subspace indicated by the feature vectors existing in the target subspace is obtained, and the point closest to the center point is taken as the representative vector of the subspace.

3. The neighborhood data search method according to claim 2, wherein a center of gravity is used instead of said center point.

4. The neighborhood data search method according to claim 2, wherein, as the feature vectors for which the center point or the center of gravity is calculated and the feature vectors that can become the representative vector, only the representative vectors extracted in the layer immediately below the subspace are targeted, instead of all the feature vectors existing in the target subspace.

5. The neighborhood data search method according to claim 1, wherein the plurality of representative vectors in each layer are managed as a representative vector management list for each layer, and, when a requested time for the search response is given, which layer's representative vector management list to use is determined based on the number of representative vectors included in the representative vector management lists of the plurality of layers and the requested time for the search response.

6. A neighborhood data search device in which a feature amount of each data item is expressed as a vector in a multidimensional space (hereinafter referred to as a feature vector) and the top k search results are returned in ascending order of distance from the feature vector represented by a search key, the device comprising: feature vector management means for managing the feature vectors existing in the multidimensional space with unique identifiers; space division means for hierarchically dividing the multidimensional space into subspaces; representative vector extraction means for extracting, from the group of feature vectors existing in each subspace divided by the space division means, a representative vector representing that subspace; representative vector management means for managing the extracted representative vectors of the subspaces with unique identifiers for each layer; distance calculation means for calculating the distance between each representative vector managed by the representative vector management means and the feature vector of the search key; and search result management means for performing, based on the distances calculated by the distance calculation means, a neighborhood search for feature vectors similar to the feature vector of the search key on a subset of the feature vectors existing in the multidimensional space that has been narrowed down by the representative vectors, and returning them as a search result.

7. A storage medium storing a neighborhood data search program in which a feature amount of each data item is expressed as a vector in a multidimensional space (hereinafter referred to as a feature vector) and the top k search results are returned in ascending order of distance from the feature vector represented by a search key, the program comprising: a feature vector management process of managing the feature vectors existing in the multidimensional space with unique identifiers; a space division process of hierarchically dividing the multidimensional space into subspaces; a representative vector extraction process of extracting, from the group of feature vectors existing in each subspace divided by the space division process, a representative vector representing that subspace; a representative vector management process of managing the representative vectors of the subspaces extracted by the representative vector extraction process with unique identifiers for each layer; a distance calculation process of calculating the distance between each representative vector managed by the representative vector management process and the feature vector of the search key; and a search result management process of performing, based on the distances calculated by the distance calculation process, a neighborhood search for feature vectors similar to the feature vector of the search key on a subset of the feature vectors existing in the multidimensional space that has been narrowed down by the representative vectors, and returning them as a search result.

8. The storage medium storing the neighborhood data search program according to claim 7, wherein the representative vector extraction process includes a process of obtaining the center point of all the points in the subspace indicated by the feature vectors existing in the target subspace and taking the point closest to the center point as the representative vector of the subspace.

9. The storage medium according to claim 8, wherein the representative vector extracting process includes a process using a center of gravity instead of the center point.

10. The storage medium storing the neighborhood data search program according to claim 8, wherein the representative vector extraction process targets, as the feature vectors for which the center point or the center of gravity is calculated and the feature vectors that can become the representative vector, only the representative vectors extracted in the layer immediately below the subspace, instead of all the feature vectors existing in the target subspace.

11. A storage medium storing the described neighborhood data search program, the program including: a process of managing the plurality of representative vectors in each layer as a representative vector management list for each layer; and a process of determining, when a requested time for the search response is given, which layer's representative vector management list to use, based on the number of representative vectors included in the representative vector management lists of the plurality of layers and the requested time for the search response.