JP2009294855A

JP2009294855A - Similar data retrieval system

Info

Publication number: JP2009294855A
Application number: JP2008147060A
Authority: JP
Inventors: Daisuke Matsubara; 大輔松原; Atsushi Hiroike; 敦廣池
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2008-06-04
Filing date: 2008-06-04
Publication date: 2009-12-17
Anticipated expiration: 2028-06-04
Also published as: JP5155025B2

Abstract

PROBLEM TO BE SOLVED: To reduce the disk-scanning load by arranging feature vectors on secondary storage so that highly similar feature vectors are gathered close together, and so that the number of data items gathered in a neighborhood is of order O(N_c), depending on the total number of data items N_c.

SOLUTION: When new data is registered, a prescribed number of clusters whose cluster averages are close, in inter-vector distance, to the feature vector extracted from the new data are retrieved as proximity clusters, and optimization by the k-means method is executed on the members of the retrieved proximity clusters together with the new data. When the number of members of a cluster exceeds the upper limit, the cluster is divided into two, and the newly created cluster is added to the set of proximity clusters. The cluster arrangement is updated using an energy function that incorporates both the relation between adjacent clusters and information indicating where each cluster should be placed in the whole cluster array.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a similar data search system, and in particular to a method for optimizing data placement on a computer for similar data search.

As networks move to broadband and storage capacity grows, the volume of images and video accessible to users increases daily. This raises the problem of how to reach the desired item within a vast body of image and video content. Conventionally, the problem was addressed by attaching text information to content, manually or semi-automatically, to enable keyword search. However, this is feasible for only part of a large body of content, and text information alone cannot accurately represent the information an image carries.

Similar image search techniques have therefore been proposed. These automatically extract information contained in the image itself, such as its color distribution and shapes, and search by evaluating the similarity between images on the basis of image features expressed as high-dimensional vectors. The core of similar image search is similar vector search, which finds data at a small distance from the search key data in the vector space, that is, a neighborhood search in the vector space. For example, in similarity search of still images, histograms of the color distribution, shape patterns and the like are used as the feature vectors that characterize the data.

A neighborhood search requires the distances between vectors. If N is the number of data items subject to distance calculation and M is the dimensionality of the feature vectors, N distance calculations between the search key data and the target data are needed, and the time for each is proportional to M. A neighborhood search implemented as a linear search therefore needs computation time proportional to N × M. For large-scale, high-dimensional data this time becomes enormous, so speeding up nearest neighbor search is essential.
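The N × M cost of a linear scan can be seen in a minimal sketch. The function name `linear_nearest` and the use of Euclidean distance are assumptions made for illustration; the text does not fix a distance measure here.

```python
import math


def linear_nearest(query, vectors):
    """Return (index, distance) of the vector closest to `query`.

    The loop performs N distance calculations, each costing O(M),
    so the whole scan is O(N * M).
    """
    best_index, best_sq = -1, math.inf
    for i, v in enumerate(vectors):
        # Squared Euclidean distance over the M dimensions.
        sq = sum((a - b) ** 2 for a, b in zip(query, v))
        if sq < best_sq:
            best_index, best_sq = i, sq
    return best_index, math.sqrt(best_sq)
```

Every registered vector is touched once per query, which is the cost the clustering-based arrangement described later tries to avoid.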

Several methods have been proposed to speed up nearest neighbor search by narrowing down, according to the query data, the data subject to distance calculation. A group of methods collectively called multidimensional indexing extends the concept of balanced trees, used in the database field, to multidimensional spaces. In multidimensional indexing, regions of the space are managed in a tree structure; given query data, the leaf defining the region that contains the query is found in O(log N) time. In the field of pattern recognition, on the other hand, speed-ups based on clustering, which groups nearby items in advance, are often used. The k-means method is the usual classification technique.

In typical database systems, feature vectors are often stored on secondary storage, so similar vectors must be searched while performing I/O against the feature vectors on secondary storage as needed. I/O on secondary storage takes far longer than I/O on primary storage, however, so speeding up the distance calculation between feature vectors is pointless if access to the target feature vectors is slow.

To improve the I/O speed for data on secondary storage, the following approaches have been proposed.

In general, feature vectors with high similarity are likely to be referenced together, both during a search and when the search target data is updated. Arranging highly similar feature vectors as close together as possible on secondary storage is therefore expected to reduce the disk-scanning load.

Moreover, a disk access normally reads all the data in a certain range at once. If similar data are gathered close together on secondary storage, data that was read from disk during a search but did not appear in the final search result remains loaded in the disk cache. When several searches are then made with similar search key data, similar data is already in the disk cache, so the number of disk accesses falls and the search is faster. Repeated searches with similar key data tend to occur when an image shown on the search result screen is used as the new search key.

Disk access can therefore be expected to speed up if the secondary storage medium is regarded as a one-dimensional array and the data is placed so that similar feature vectors lie close together on that array while dissimilar feature vectors lie in distant regions of it.

However, existing tree structures, hierarchical clustering and similar methods cannot place similar feature vectors close together on a one-dimensional array. For example, when a tree is built with each cluster as a leaf node at the lowest level, the clusters under the same node can be said to be similar. A tree, however, is an abstract structure that defines only the parent-child relationships between nodes; it does not provide an arrangement in which similar feature vectors are placed near one another.

Patent Document 1 proposes a method that uses a self-organizing map, a kind of neural network, to map feature vectors onto a one-dimensional array while preserving their distances in the high-dimensional vector space. The method exploits the property of self-organizing maps that data is placed in a low-dimensional space while neighborhood relations in the high-dimensional space are preserved as far as possible. If the output layer of the self-organizing map has k units, all search target feature vectors are divided into k clusters, and similar clusters are placed close together on the one-dimensional array. The similarity between clusters is defined by the distance between their reference vectors. Each search target feature vector is stored in the cluster nearest to it. The reference vector of each cluster is held in memory, and the feature vectors belonging to each cluster are stored in a contiguous region of secondary storage.

Patent Document 2 uses a clustering method based on the k-means method to create clusters that group similar feature vectors, and places the data so that similar clusters lie in neighboring regions of a one-dimensional array. Only the representative value of each cluster is held in primary storage; the feature vectors are stored on secondary storage. The average of the feature vectors in a cluster is used as the cluster representative value and is hereinafter called the cluster average. To keep the secondary-storage region for each cluster's members contiguous, a predetermined amount of storage is reserved when the cluster is created. The maximum number of members per cluster is limited accordingly.

However, if all feature vectors were re-clustered every time a new feature vector is registered, registration could not be completed in a practical processing time. The following sequential clustering is therefore performed. When new data is registered, a predetermined number of clusters whose cluster averages are close, in inter-vector distance, to the new feature vector extracted from the new data are retrieved as proximity clusters, and optimization by the k-means method is executed only on the feature vectors in the retrieved proximity clusters and the new feature vector. In addition, an upper limit is set on the number of members of each cluster; when the limit is exceeded, the cluster is split in two and the newly created cluster is added to the set of proximity clusters. The number of clusters therefore grows as the number of registered data items grows.
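The sequential clustering step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure of Patent Document 2 itself: the names `register`, `lloyd` and `split_in_two`, the seeding of the split with two far-apart members, the repeated splitting until every cluster fits, and the removal of emptied clusters are all choices made here for the example.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def lloyd(points, centers, iters=5):
    """k-means (Lloyd) iterations restricted to `points`; returns member lists."""
    groups = [[] for _ in centers]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            k = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            groups[k].append(p)
        # A group that received no points keeps its previous center.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return groups


def split_in_two(members):
    """Split an oversized cluster with 2-means seeded by two far-apart members."""
    seed_a = members[0]
    seed_b = max(members, key=lambda p: sq_dist(p, seed_a))
    a, b = lloyd(members, [seed_a, seed_b])
    if not a or not b:  # degenerate case (identical vectors): halve instead
        half = len(members) // 2
        a, b = members[:half], members[half:]
    return a, b


def register(x, clusters, n_near=2, max_members=4):
    """Register vector x; `clusters` is a list of member lists, updated in place."""
    if not clusters:
        clusters.append([x])
        return
    # Proximity clusters: the n_near clusters whose averages are closest to x.
    order = sorted(range(len(clusters)), key=lambda c: sq_dist(x, mean(clusters[c])))
    near = order[:n_near]
    pool = [p for c in near for p in clusters[c]] + [x]
    for c, g in zip(near, lloyd(pool, [mean(clusters[c]) for c in near])):
        clusters[c] = g
    # Split any proximity cluster that exceeds the member limit.
    work = list(near)
    while work:
        c = work.pop()
        if len(clusters[c]) > max_members:
            clusters[c], new = split_in_two(clusters[c])
            clusters.append(new)
            work += [c, len(clusters) - 1]
    clusters[:] = [c for c in clusters if c]  # assumption: drop emptied clusters
```

Only the members of the `n_near` proximity clusters are touched per registration, so the cost of adding one vector does not grow with the total number of registered vectors beyond the scan over cluster averages.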

Similar clusters are then placed in neighboring regions of the one-dimensional array using the energy function E₁ shown below.

Equation (1) is the energy function E₁ that defines the procedure for rearranging cluster positions. N_c is the number of clusters, and v_i is the average vector of the cluster at the i-th position. The energy function is defined as the sum of the squared distances between the cluster averages at consecutive positions.
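The equation itself is not reproduced in this text. From the definition given here (the sum of squared distances between the cluster averages at consecutive positions), Equation (1) can plausibly be written as follows; this is a reconstruction from the prose, not the published formula.

```latex
E_1 = \sum_{i=1}^{N_c - 1} \left\lVert v_i - v_{i+1} \right\rVert^2
```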

However, the boundary conditions for the two end positions, that is, the first position and the N_c-th position, are defined on one side only. Specifically, the distance between the cluster averages of the nonexistent cluster at position 0 and the cluster at position 1 is taken to be 0, and likewise the distance between the nonexistent cluster at position N_c + 1 and the cluster at position N_c. Under these conditions, the change in the energy function when the cluster at the i-th position and the cluster at the j-th position are exchanged is given by Equation (2), where j > i.
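Equation (2) is likewise not reproduced in this text. The following sketch computes the same quantity directly from the definition of the energy function: only the links that touch positions i and j can change, and links that reach outside the array contribute 0, matching the boundary conditions above. The function name `swap_delta` and the 0-based positions are assumptions of the sketch.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def swap_delta(means, i, j):
    """Energy change if the clusters at 0-based positions i < j trade places."""
    assert 0 <= i < j < len(means)

    def before(p):
        return means[p]

    def after(p):
        # Cluster average found at position p once i and j have been exchanged.
        if p == i:
            return means[j]
        if p == j:
            return means[i]
        return means[p]

    def link(a, b, get):
        # Links to the nonexistent positions beyond either end count as 0.
        if a < 0 or b >= len(means):
            return 0.0
        return sq_dist(get(a), get(b))

    # A set removes the duplicate link when i and j are adjacent (j == i + 1).
    links = {(i - 1, i), (i, i + 1), (j - 1, j), (j, j + 1)}
    return sum(link(a, b, after) - link(a, b, before) for a, b in links)
```

Because at most four links are evaluated, the cost of testing one exchange is independent of the number of clusters.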

If the positions of the clusters in the proximity cluster set currently being updated are changed, on the basis of this energy change, so that the energy function decreases, a state is reached in which clusters at adjacent positions in the array are relatively close to each other.
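One way to realize such an update is a greedy pairwise exchange over the positions being updated. The text does not specify the search strategy, so the loop below is an assumption; for brevity it also recomputes the full energy instead of using the local energy change.

```python
from itertools import combinations


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def energy(means):
    """Sum of squared distances between cluster averages at consecutive positions."""
    return sum(sq_dist(means[k], means[k + 1]) for k in range(len(means) - 1))


def reorder(means, targets):
    """Exchange pairs of positions in `targets` as long as an exchange lowers the energy."""
    means = list(means)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(sorted(targets), 2):
            before = energy(means)
            means[i], means[j] = means[j], means[i]
            if energy(means) < before:
                improved = True
            else:
                means[i], means[j] = means[j], means[i]  # undo the exchange
    return means
```

Each accepted exchange strictly lowers the energy, so the loop terminates, but it stops at a local minimum rather than a guaranteed global one.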

By repeating the above processing every time new data is registered, similar clusters come to be placed in neighboring regions of the one-dimensional array.

Patent Document 1: JP 2004-46612 A
Patent Document 2: JP 2007-334402 A

Conventional data placement optimization methods that aim to place similar vector data in neighboring regions of a storage medium have the following problems.

When high-dimensional vector data is mapped onto a one-dimensional array with a self-organizing map, the addition of new data cannot be handled flexibly. Mapping with a self-organizing map takes a large amount of processing time, so remapping every time new data is added is impractical. For example, when the people appearing in images captured by a surveillance camera are the search target, the person images are generated continuously; new data must be registered one item after another, and remapping would have to be completed within a very short time.

When high-dimensional vector data is placed on a one-dimensional array using an energy function defined as the sum of squared distances between vector data at consecutive positions, feature vectors with high similarity can be made adjacent to some extent. However, for clusters separated by more than a certain distance on the one-dimensional array, the similarity is unrelated to that distance and is effectively random. In other words, the method is effective only for clusters within a limited range of the array. Furthermore, the number of clusters gathered in a neighborhood is constant, O(1), regardless of the total number of clusters, so it becomes relatively smaller as the number of clusters grows. As a result, locally optimized groups of clusters end up scattered over several places. The reason is that the cluster positions are determined from an energy function computed only from the distances between adjacent clusters, which leaves each cluster's position within the whole one-dimensional array undetermined.

An object of the present invention is therefore to provide means for placing high-dimensional vector data on a one-dimensional array so that, in order to improve the I/O speed for data on a storage medium, the number of data items gathered in a neighborhood is of order O(N_c), depending on the total number of data items N_c.

When high-dimensional vector data is placed on a one-dimensional array (hereinafter called the cluster array) using an energy function, preventing cluster positions within the whole cluster array from being undetermined requires adding to the energy function not only the relation between adjacent clusters but also information on where in the whole cluster array each cluster should be placed. To this end, the distribution of the feature vectors is used so that similar clusters gather in neighboring regions of the cluster array and dissimilar clusters are placed in regions far apart in the cluster array.

When new data is registered, sequential clustering is performed using the method of Patent Document 2. A predetermined number of clusters whose cluster averages are close, in inter-vector distance, to the feature vector extracted from the new data are retrieved as proximity clusters, and optimization by the k-means method is executed only on the feature vectors in the retrieved proximity clusters and the new feature vector. In addition, an upper limit is set on the number of members of each cluster; when the limit is exceeded, the cluster is split in two and the newly created cluster is added to the set of proximity clusters. The number of clusters therefore grows as the number of registered data items grows.

When the number of clusters N_c is small (for example, N_c < 4), the cluster arrangement of the clusters being updated is updated using the same energy function E₁ as in Patent Document 2. In addition, pairs of cluster average and cluster position are saved as cluster information. Here, the cluster position is the array index on the cluster array.

When the number of clusters N_c has increased (for example, N_c ≥ 4), the squared inter-vector distance between the current cluster average and a saved past cluster average is added to the energy function E₁. Specifically, among the saved past clusters, the past cluster array from a certain point in time is selected. The cluster numbers in that past cluster array are then converted so that the minimum becomes 1 and the maximum becomes N_c. For each cluster in the current cluster array, the cluster in the past cluster array whose converted cluster number is closest to the current cluster's number is obtained, and the squared inter-vector distance between the two cluster averages is added to the energy function. The cluster information is also saved, as in the case where N_c is small.
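The extended energy can be sketched as follows. The text only states that past cluster numbers are converted so that the minimum becomes 1 and the maximum becomes N_c; the linear rescaling used here, the use of a single past cluster array, the 0-based positions and the function names are assumptions of the sketch.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def adjacent_term(means):
    """E1: squared distances between cluster averages at consecutive positions."""
    return sum(sq_dist(means[k], means[k + 1]) for k in range(len(means) - 1))


def past_term(current, past):
    """Squared distance from each current average to its matching past average.

    Past positions 0..len(past)-1 are rescaled linearly onto 0..len(current)-1,
    and each current position is matched to the nearest rescaled past position.
    `past` must be non-empty.
    """
    n_c, n_p = len(current), len(past)
    scale = (n_c - 1) / (n_p - 1) if n_p > 1 else 0.0
    total = 0.0
    for i, mean_vec in enumerate(current):
        nearest = min(range(n_p), key=lambda k: abs(k * scale - i))
        total += sq_dist(mean_vec, past[nearest])
    return total


def energy_with_history(current, past):
    return adjacent_term(current) + past_term(current, past)
```

The added term ties each position of the current array to the region of feature space that occupied the corresponding part of an earlier, smaller array, which is what anchors a cluster's position within the whole array rather than only relative to its neighbors.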

In the present invention, clusters whose cluster representative values are close, in the vector space, to the feature vector extracted from newly registered data are retrieved, and the newly registered data and the data contained in these similar clusters are classified into a plurality of clusters. An upper limit is set on the number of members of each cluster, and a cluster that exceeds the limit is split. The feature vectors contained in each cluster are stored in a contiguous region of the storage medium. Each cluster is assigned a unique consecutive number, which is managed as its cluster number. A cluster representative value, which is vector data, is calculated from the feature vectors contained in each cluster, and the cluster representative value and cluster number of each cluster are saved. For a cluster with a given number, a first inter-vector distance is calculated between its cluster representative value and those of the preceding and following clusters, and a second inter-vector distance is calculated between its cluster representative value and that of a saved cluster. The storage positions on the storage medium of the feature vectors contained in the clusters are exchanged so that the sum of the first and second inter-vector distances becomes small, whereby clusters with similar cluster representative values are stored in neighboring regions of the storage medium.

According to the data placement optimization method for similar data search of the present invention, the storage medium is regarded as a one-dimensional array and similar data are placed in neighboring regions of that array, which improves the I/O speed for similar data on the storage medium. Furthermore, the number of similar data items gathered in a neighboring region is O(N_c), depending on the total number of data items N_c.

Data placement processing in a similarity search system for images is described below as one embodiment of the present invention.

FIG. 1 shows an example system configuration of a similar data search system according to the present invention. A server computer 110, on which the search engine runs, is connected via a communication infrastructure 120 to client computers 130, on which application programs run, and provides services such as search to the client computers 130. The communication infrastructure 120 is a network (for example, an IP network) that connects the server computer 110 and the client computers 130. The server computer 110 includes at least an interface (I/F) 151, a CPU 152, a memory 153, and secondary storage such as a hard disk 154, all connected to one another.

The I/F 151 is connected to the communication infrastructure 120 and is used for communication between the server computer 110 and the client computers 130. The CPU 152 is a processor that executes programs stored in the memory 153. The memory 153 is a storage device that stores the programs executed by the CPU 152 and the data referenced by the CPU 152. The memory 153 of this embodiment is a so-called main storage device, for example a randomly accessible semiconductor storage device. It stores at least the server program and data that implement the search server process 111.

The hard disk 154 is a storage device consisting of one or more hard disk drives (HDDs). The hard disk 154 of this embodiment stores information on the feature vectors of the images stored in the image server 140 as feature data 114 and cluster management information 115. The hard disk 154 of this embodiment may be replaced by an optical disk device, a semiconductor storage device such as flash memory, or any other type of storage device.

A feature vector is a numerical representation, as vector data, of the features of an image stored in the image server 140. Feature vectors can be calculated by various known methods. For example, so-called edge pattern features can be used. To extract edge pattern features, a number of characteristic edge patterns are first defined in advance. A face image is then divided into a grid, and the number of edge patterns contained in each region is counted. A histogram is generated from these counts and turned into a multidimensional vector, which gives the feature vector. Alternatively, the whole area of the search target image may be divided into a grid, and a multidimensional vector built from the color histograms of the divided regions may be used as the feature vector. A combination of these multidimensional vectors may also be used as the feature vector.
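The grid-and-histogram construction can be sketched as follows. For brevity the sketch uses a single grayscale channel with values 0 to 255 instead of a color histogram, and the function name `grid_histogram` and its parameters `grid` and `bins` are assumptions made for illustration.

```python
def grid_histogram(image, grid=2, bins=4):
    """Concatenate per-cell intensity histograms into one feature vector.

    `image` is a list of rows of integers in 0..255. The image is divided
    into grid x grid cells, and each cell contributes `bins` counts, so the
    result has grid * grid * bins dimensions.
    """
    height, width = len(image), len(image[0])
    feature = []
    for gy in range(grid):
        for gx in range(grid):
            hist = [0] * bins
            for y in range(gy * height // grid, (gy + 1) * height // grid):
                for x in range(gx * width // grid, (gx + 1) * width // grid):
                    hist[image[y][x] * bins // 256] += 1
            feature.extend(hist)
    return feature
```

For example, a 2 × 2 grid with 4 bins per cell gives a 16-dimensional vector; finer grids and more bins raise the dimensionality M that the distance calculation has to process.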

In this embodiment, each feature vector is classified into one of a plurality of clusters. Feature vectors that are close to one another should preferably be classified into the same cluster. The feature vectors may be classified into clusters by any method; in this embodiment they are classified by the k-means method.

The feature data 114 contains pairs of a data ID identifying each of the one or more images classified into a cluster and the feature vector of the image data identified by that ID. The images and feature vectors classified into a cluster are also referred to as cluster members.

The cluster management information 115 contains pairs of a cluster ID identifying each cluster and the representative value of the cluster identified by that ID. Both the current information and the information from the times when the total number of clusters N_c was a specific number are stored. In this embodiment, the cluster information is stored when the number of clusters N_c is a power of 2.
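The power-of-2 rule for saving cluster information can be sketched as follows. The dictionary `history`, the function names, and the choice to keep only the first snapshot taken at each cluster count are assumptions of the sketch; the text does not say whether a later snapshot at the same count replaces an earlier one.

```python
def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def maybe_snapshot(history, cluster_means):
    """Save (position, average) pairs when the cluster count is a power of 2.

    `history` maps a cluster count to the snapshot taken at that count.
    Positions are 1-based to match the cluster numbering in the text.
    """
    n_c = len(cluster_means)
    if is_power_of_two(n_c) and n_c not in history:
        history[n_c] = [(pos, list(m)) for pos, m in enumerate(cluster_means, start=1)]
```

Saving only at powers of 2 keeps the number of stored snapshots logarithmic in the number of clusters.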

In this embodiment, the representative value of each cluster is the average vector (cluster average) of the feature vectors contained in the cluster. A value other than the average vector may also be used as the representative value. For example, the feature vector of a cluster member close to the average vector may be used, or the seed value specified when the k-means optimization is executed may be used. As a result of the k-means optimization, each feature vector belongs to the cluster whose representative value is closest to that feature vector.

The image server 140 is a computer that has a storage device for storing image data and is connected to the communication infrastructure 120. It inputs and outputs image data when the client computer 130 or the server computer 110 accesses image data through the communication infrastructure 120. The image server 140 may be included in the server computer 110 or in the client computer 130.

The search server process 111 in the server computer 110 manages the clustered search targets (that is, the search targets classified into clusters). While the system is running, the cluster management information 115 is loaded into the memory 153 of the server computer as cluster management information 112. Each item of cluster information 113 holds the cluster ID, the average vector that is the representative value of the cluster identified by that ID, the list of data IDs identifying the cluster members, and so on. The feature vectors of the cluster members are managed together on the hard disk 154 as the feature data 114. For this reason, the cluster information 113 in the memory 153 also holds the position on the hard disk where the feature vectors of the cluster members are stored.
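The in-memory cluster information described here can be pictured as a small record per cluster. The class name `ClusterInfo`, its field names, the byte-offset representation of the disk position, and the helper `clusters_to_read` (including its ordering of reads by offset) are hypothetical and serve only to illustrate how a search can consult the in-memory averages before touching the disk.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterInfo:
    cluster_id: int
    mean: list                                     # representative value (cluster average)
    data_ids: list = field(default_factory=list)   # IDs of the cluster members
    disk_offset: int = 0                           # hypothetical: start of the members' vectors on disk


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def clusters_to_read(query, clusters, n_near=2):
    """Choose the n_near clusters nearest to the query using only in-memory averages.

    The result is ordered by disk offset so that the reads proceed in one
    direction over the storage medium (an assumption of this sketch).
    """
    nearest = sorted(clusters, key=lambda c: sq_dist(query, c.mean))[:n_near]
    return sorted(nearest, key=lambda c: c.disk_offset)
```

Only the feature vectors of the selected clusters then need to be read from the hard disk, and when similar clusters are stored in neighboring regions those reads fall close together.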

尚、クラスタ管理情報１１２は、ハードディスク１５４上に記録されたクラスタ管理情報のコピーである。このため、クラスタに対する更新が生じた場合、クラスタ管理情報１１２だけでなく、ハードディスク１５４上のクラスタ管理情報１１５も更新される。しかし、検索処理においてクラスタ管理情報１１５が直接参照されることはない。 The cluster management information 112 is a copy of the cluster management information recorded on the hard disk 154. Therefore, when an update to the cluster occurs, not only the cluster management information 112 but also the cluster management information 115 on the hard disk 154 is updated. However, the cluster management information 115 is not directly referenced in the search process.

クライアント計算機１３０は、通信基盤１２０に接続される計算機である。図１には２つのクライアント計算機１３０を示すが、本類似データ検索システムは任意の数のクライアント計算機１３０を備えてもよい。尚、クライアント計算機１３０と同等の機能をサーバ計算機１１０が備えている場合、全ての処理をサーバ計算機で行っても良い。 The client computer 130 is a computer connected to the communication infrastructure 120. Although two client computers 130 are shown in FIG. 1, the similar data search system may include any number of client computers 130. If the server computer 110 has a function equivalent to that of the client computer 130, all processing may be performed by the server computer.

クライアント計算機１３０は、いかなる構成の計算機であってもよい。図１には、典型的なクライアント計算機１３０の構成を示す。すなわち、図１のクライアント計算機１３０は、ＣＰＵ１３１、メモリ１３２、Ｉ／Ｆ１３３、入力装置１３４及び出力装置１３５を備える。 The client computer 130 may be a computer having any configuration. FIG. 1 shows a configuration of a typical client computer 130. That is, the client computer 130 of FIG. 1 includes a CPU 131, a memory 132, an I / F 133, an input device 134, and an output device 135.

ＣＰＵ１３１は、メモリ１３２に格納されたプログラムを実行するプロセッサである。メモリ１３２は、ＣＰＵ１３１によって実行されるプログラム等を格納する記憶装置である。Ｉ／Ｆ１３３は、通信基盤１２０に接続され、クライアント計算機１３０とサーバ計算機１１０との間の通信に使用されるインターフェースである。入力装置１３４は、クライアント計算機１３０のユーザから入力を受け付ける装置である。入力装置１３４は、例えば、キーボード、ポインティングデバイス又は画像スキャナ等を含んでもよい。出力装置１３５は、クライアント計算機１３０のユーザに情報を表示する装置である。具体的には、例えば、類似データ検索の結果として取得された画像が出力装置１３５に表示される。出力装置１３５は、例えばＣＲＴ又は液晶ディスプレイのような画像表示装置である。 The CPU 131 is a processor that executes a program stored in the memory 132. The memory 132 is a storage device that stores programs executed by the CPU 131. The I / F 133 is an interface that is connected to the communication infrastructure 120 and is used for communication between the client computer 130 and the server computer 110. The input device 134 is a device that receives input from the user of the client computer 130. The input device 134 may include, for example, a keyboard, a pointing device, an image scanner, or the like. The output device 135 is a device that displays information to the user of the client computer 130. Specifically, for example, an image acquired as a result of the similar data search is displayed on the output device 135. The output device 135 is an image display device such as a CRT or a liquid crystal display.

次に、本実施の形態の検索システムにおける類似データ検索時の処理について説明する。図２は、本発明の実施の形態において、類似データ検索時に実行される処理の実行部を示すブロック図である。同図に示すように、本実施の形態の類似データ検索処理は、画像入力部２１０、画像特徴量抽出部２１１、類似クラスタ検索部２１２、類似特徴量検索部２１３、検索結果出力部２１４で構成される。 Next, processing at the time of similar data search in the search system of the present embodiment will be described. FIG. 2 is a block diagram showing the units that execute the processing performed during a similar data search in the embodiment of the present invention. As shown in the figure, the similar data search processing of the present embodiment comprises an image input unit 210, an image feature amount extraction unit 211, a similar cluster search unit 212, a similar feature amount search unit 213, and a search result output unit 214.

画像入力部２１０では、検索目標画像の入力処理を行う。ここで、検索目標画像は、検索しようとする物体や人物が写された画像データや、検索したいテクスチャが描かれた画像データである。これらの画像データは、デジタルカメラや補助記憶装置、マウス等によるユーザの描画によって供給される。 The image input unit 210 performs input processing for the search target image. Here, the search target image is image data showing an object or person to be searched for, or image data depicting a texture to be searched for. Such image data is supplied from a digital camera, from an auxiliary storage device, or by the user drawing with a mouse or the like.

画像特徴量抽出部２１１では、検索目標画像から画像特徴量を抽出する。本実施例では、画像特徴量としてエッジパターン特徴量と呼ばれるものを用いる。エッジパターン特徴量を抽出するためには、まず、特徴的なエッジパターンを予め複数設定する。そして、画像データに対して格子状に構図分割を行い、各領域内に含まれるエッジパターン数を計数する。このエッジパターン数からヒストグラムを生成し、多次元ベクトルとすることによって、画像特徴量を作成する。他にも色ヒストグラムを画像特徴量として用いても良い。 The image feature amount extraction unit 211 extracts an image feature amount from the search target image. In this embodiment, what is called an edge pattern feature amount is used as the image feature amount. In order to extract the edge pattern feature quantity, first, a plurality of characteristic edge patterns are set in advance. Then, composition division is performed on the image data in a grid pattern, and the number of edge patterns included in each region is counted. An image feature quantity is created by generating a histogram from the number of edge patterns and using it as a multidimensional vector. In addition, a color histogram may be used as the image feature amount.
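
The grid-and-histogram step described above can be sketched as follows. This is an illustrative sketch only: it assumes that edge-pattern detection has already produced, for every pixel, the index of the matching predefined pattern (or -1 when none matches). The function name `edge_pattern_feature` and that input format are assumptions, not part of the specification.

```python
def edge_pattern_feature(labels, grid_rows, grid_cols, n_patterns):
    """Count edge patterns per grid cell and flatten the counts into one vector.

    labels: 2D list of pattern indices (0..n_patterns-1), or -1 where no pattern matched.
    """
    height, width = len(labels), len(labels[0])
    feature = [0] * (grid_rows * grid_cols * n_patterns)
    for y, row in enumerate(labels):
        for x, pattern in enumerate(row):
            if pattern < 0:
                continue  # no edge pattern at this pixel
            # Index of the grid cell that contains pixel (x, y).
            cell = (y * grid_rows // height) * grid_cols + (x * grid_cols // width)
            feature[cell * n_patterns + pattern] += 1
    return feature
```

The resulting vector has grid_rows × grid_cols × n_patterns dimensions, one histogram bin per (cell, pattern) pair.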

類似クラスタ検索部２１２では、クラスタ管理情報１１２の持つクラスタＩＤとクラスタ平均を用いて、検索目標画像から抽出した画像特徴量ベクトルとベクトル空間中で距離の近いクラスタ平均を持つクラスタを所定の個数検索する。 The similar cluster search unit 212 uses the cluster IDs and cluster averages held in the cluster management information 112 to retrieve a predetermined number of clusters whose cluster averages are close, in the vector space, to the image feature vector extracted from the search target image.

類似特徴量検索部２１３では、特徴量データ１１４の持つ特徴量ベクトルを用いて、前記類似クラスタ検索部２１２で取得した類似クラスタ内に含まれるクラスタメンバの中から、検索目標画像から抽出した画像特徴量ベクトルとベクトル空間中で距離の近い特徴量ベクトルを持つデータを検索する。 The similar feature amount search unit 213 uses the feature vectors held in the feature data 114 to search, among the cluster members contained in the similar clusters obtained by the similar cluster search unit 212, for data whose feature vectors are close, in the vector space, to the image feature vector extracted from the search target image.
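
The two-stage search performed by units 212 and 213 can be sketched as below. The function name `search_similar`, the dictionary-based inputs, and the use of squared Euclidean distance are assumptions for illustration; the specification only requires "close distance in the vector space".

```python
import heapq


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def search_similar(query, cluster_means, cluster_members, n_clusters=2, n_results=3):
    """cluster_means: {cluster_id: mean}; cluster_members: {cluster_id: {data_id: vector}}."""
    # Stage 1 (unit 212): pick the clusters whose averages are nearest to the query.
    near = heapq.nsmallest(n_clusters, cluster_means,
                           key=lambda c: sq_dist(query, cluster_means[c]))
    # Stage 2 (unit 213): scan only the members of those clusters.
    candidates = ((sq_dist(query, vec), data_id)
                  for c in near
                  for data_id, vec in cluster_members[c].items())
    return [data_id for _, data_id in heapq.nsmallest(n_results, candidates)]
```

Only the members of the selected clusters are read in stage 2, which is why placing similar clusters close together on disk reduces scanning cost.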

検索結果出力部２１４では、前記類似特徴量検索部２１３で検索されたデータの画像データを画像サーバ１４０から取得し、検索結果として出力する。尚、画像サーバ１４０を用いずに、データＩＤを検索結果として表示しても良い。 The search result output unit 214 acquires the image data of the data searched by the similar feature amount search unit 213 from the image server 140 and outputs it as a search result. Note that the data ID may be displayed as a search result without using the image server 140.

ここで、画像特徴量抽出部２１１、類似クラスタ検索部２１２、類似特徴量検索部２１３は、サーバ計算機１１０のメモリ１５３に展開されたプログラムによって検索サーバプロセス１１１の一部として実現され、その処理が実行される。 Here, the image feature amount extraction unit 211, the similar cluster search unit 212, and the similar feature amount search unit 213 are implemented as part of the search server process 111 by a program loaded into the memory 153 of the server computer 110, and their processing is executed there.

次に、本実施の形態の類似データ検索システムにおける新規データ登録時の処理について説明する。図３は、本発明の実施の形態において、新規データ登録時に実行される処理の実行部を示すブロック図である。同図に示すように、本実施の形態の新規データ登録処理は、画像入力部３１０、画像特徴量抽出部３１１、類似クラスタ検索部３１２、クラスタリング部３１３、クラスタ分割部３１４、エネルギー関数計算部３１５、データ格納部３１６、で構成される。 Next, processing at the time of new data registration in the similar data search system of the present embodiment will be described. FIG. 3 is a block diagram showing the units that execute the processing performed when new data is registered in the embodiment of the present invention. As shown in the figure, the new data registration processing of the present embodiment comprises an image input unit 310, an image feature amount extraction unit 311, a similar cluster search unit 312, a clustering unit 313, a cluster division unit 314, an energy function calculation unit 315, and a data storage unit 316.

画像入力部３１０では、新規登録画像の入力処理を行う。画像データは、デジタルカメラや補助記憶装置、マウス等によるユーザの描画によって供給される。画像特徴量抽出部３１１では、登録画像を対象に、画像特徴量抽出部２１１と同様に画像特徴量を抽出する。類似クラスタ検索部３１２、クラスタリング部３１３、クラスタ分割部３１４では、特許文献２の方法を用いて逐次クラスタリングを行う。 The image input unit 310 performs input processing for a newly registered image. The image data is supplied from a digital camera, from an auxiliary storage device, or by the user drawing with a mouse or the like. The image feature amount extraction unit 311 extracts an image feature amount from the registered image in the same manner as the image feature amount extraction unit 211. The similar cluster search unit 312, the clustering unit 313, and the cluster division unit 314 perform sequential clustering using the method of Patent Document 2.

類似クラスタ検索部３１２では、類似クラスタ検索部２１２と同様に、クラスタ管理情報１１２の持つクラスタＩＤとクラスタ平均を用いて、新規登録画像から抽出した画像特徴量ベクトルとベクトル空間中で距離の近いクラスタ平均を持つクラスタを所定の個数検索する。尚、検索結果として得るクラスタの数は、類似クラスタ検索部２１２と異なっても良い。クラスタリング部３１３では、類似クラスタ検索部３１２で取得した類似クラスタ内に含まれるクラスタメンバの特徴量ベクトルのみを特徴量データ１１４から取得し、これらの特徴量データと新規登録画像の特徴量ベクトルのみを対象に、k-means法に基づいたクラスタリングを行う。 Like the similar cluster search unit 212, the similar cluster search unit 312 uses the cluster IDs and cluster averages held in the cluster management information 112 to retrieve a predetermined number of clusters whose cluster averages are close, in the vector space, to the image feature vector extracted from the newly registered image. Note that the number of clusters obtained as the search result may differ from that of the similar cluster search unit 212. The clustering unit 313 acquires from the feature data 114 only the feature vectors of the cluster members contained in the similar clusters obtained by the similar cluster search unit 312, and performs clustering based on the k-means method on only these feature vectors and the feature vector of the newly registered image.

クラスタ分割部３１４では、クラスタの分割を行う。クラスタに含まれるクラスタメンバ数には、予め最大数が設定されており、クラスタメンバ数が最大値を超えたクラスタを、二つに分割する。この処理によって新しいクラスタが生成されるため、登録データ数が増加するにつれて、クラスタ数が増加することになる。クラスタを生成する際には、各クラスタのメンバのハードディスク上の格納領域を、物理的にも連続的に確保するために、所定の量のディスク領域を確保する。このときに確保するディスク領域の量によって、各クラスタメンバの最大数が制限されることになる。 The cluster division unit 314 performs cluster division. A maximum number is set in advance for the number of cluster members included in the cluster, and a cluster whose number of cluster members exceeds the maximum value is divided into two. Since a new cluster is generated by this process, the number of clusters increases as the number of registered data increases. When a cluster is generated, a predetermined amount of disk area is secured in order to physically and continuously secure storage areas on the hard disks of the members of each cluster. The maximum number of cluster members is limited by the amount of disk area secured at this time.

エネルギー関数計算部３１５では、後述するエネルギー関数に基づいて、エネルギーを計算する。データ格納部３１６では、前記クラスタリング部３１３でクラスタリングを行った特徴量ベクトルを、特徴量データ１１４に格納する。このとき、後述するエネルギー関数が最も小さくなる位置に格納する。さらに、クラスタ数が特定の数の時に、全クラスタのクラスタＩＤとクラスタ平均をクラスタ管理情報１１２に保存する。 The energy function calculation unit 315 calculates the energy based on an energy function described later. The data storage unit 316 stores the feature vectors clustered by the clustering unit 313 in the feature data 114. At this time, they are stored at the position that minimizes the energy function described later. Further, when the number of clusters reaches a specific number, the cluster IDs and cluster averages of all clusters are saved in the cluster management information 112.

ここで、画像特徴量抽出部３１１、類似クラスタ検索部３１２、クラスタリング部３１３、クラスタ分割部３１４、エネルギー関数計算部３１５、データ格納部３１６は、サーバ計算機１１０のメモリ１５３に展開されたプログラムによって検索サーバプロセス１１１の一部として実現され、その処理が実行される。 Here, the image feature amount extraction unit 311, the similar cluster search unit 312, the clustering unit 313, the cluster division unit 314, the energy function calculation unit 315, and the data storage unit 316 are implemented as part of the search server process 111 by a program loaded into the memory 153 of the server computer 110, and their processing is executed there.

図４は、本発明の実施の形態において新規データ登録時にサーバ上で実行される処理を示すフローチャートである。ここでは、類似クラスタ検索部３１２、クラスタリング部３１３、クラスタ分割部３１４、で実行される処理について、詳細に説明する。 FIG. 4 is a flowchart showing processing executed on the server when new data is registered in the embodiment of the present invention. Here, processing executed by the similar cluster search unit 312, the clustering unit 313, and the cluster dividing unit 314 will be described in detail.

図４に示す処理は、検索サーバプロセス１１１を実現するサーバ・プログラムの一部として実行される。従って、図４に示す処理は、ＣＰＵ１５２によって実行される。 The process shown in FIG. 4 is executed as part of the server program that implements the search server process 111. Therefore, the process shown in FIG. 4 is executed by the CPU 152.

ＣＰＵ１５２は、登録対象の新規データｘが与えられると、まず、近接クラスタを検索し、近接クラスタの集合Ｃ*を取得する（Ｓ４１０）。具体的には、ＣＰＵ１５２は、各クラスタの平均ベクトルと新規データｘとを比較し、新規データｘと距離が近い平均ベクトルによって代表されるクラスタから順に、所定の数のクラスタを近接クラスタの集合Ｃ*として取得する。次に、ＣＰＵ１５２は、近接クラスタの集合Ｃ*の中の最近接クラスタｃ*（すなわち、新規データｘと最も距離が近い平均ベクトルによって代表されるクラスタ）に、新規データｘを追加する（Ｓ４２０）。 When new data x to be registered is given, the CPU 152 first searches for neighboring clusters and acquires a set C* of neighboring clusters (S410). Specifically, the CPU 152 compares the average vector of each cluster with the new data x and acquires, as the set C* of neighboring clusters, a predetermined number of clusters in order starting from the cluster represented by the average vector closest to the new data x. Next, the CPU 152 adds the new data x to the nearest cluster c* in the set C* (that is, the cluster represented by the average vector closest to the new data x) (S420).

次に、ＣＰＵ１５２は、パラメータｔ及びパラメータｉを、それぞれ、「０」及び「１」に初期化する（Ｓ４２１、Ｓ４２２）。パラメータｔは、k-means法の更新の反復回数を計数するために使用される。パラメータｉは、近接クラスタの集合Ｃ*に要素として含まれるクラスタを指示するために使用される。その後、ステップ４３０以降に示す、k-means法による最適化のループに入る。 Next, the CPU 152 initializes the parameter t and the parameter i to “0” and “1”, respectively (S421, S422). The parameter t is used to count the number of iterations of the k-means method update. The parameter i is used to indicate a cluster included as an element in the set C * of neighboring clusters. Thereafter, an optimization loop based on the k-means method shown in step 430 and subsequent steps is entered.

まず、ＣＰＵ１５２は、近接クラスタの集合Ｃ*の要素である各クラスタについて、クラスタのメンバ数が制限Ｍ_maxを超えるか否かを判定する。Ｍ_maxはクラスタ内メンバ数の最大値である。具体的には、ＣＰＵ１５２は、ステップ４３０において、パラメータｉが集合Ｃ*の要素数以下であるか否かを判定する。ステップ４３０において、パラメータｉが集合Ｃ*の要素数以下であると判定された場合、ＣＰＵ１５２は、集合Ｃ*のｉ番目の要素であるクラスタｃを対象として（Ｓ４３４）、クラスタｃのメンバ数がＭ_maxを超えるか否かを判定する（Ｓ４３１）。尚、最適化ループに入った時点では、新規データｘが追加された最近接クラスタｃ*以外のクラスタは、メンバ数制限を超えないことが前提となる。 First, for each cluster that is an element of the set C* of neighboring clusters, the CPU 152 determines whether the number of members of the cluster exceeds the limit M_max. M_max is the maximum number of members in a cluster. Specifically, in step 430 the CPU 152 determines whether the parameter i is equal to or less than the number of elements in the set C*. If it is determined in step 430 that the parameter i is equal to or less than the number of elements in the set C*, the CPU 152 takes the cluster c that is the i-th element of the set C* as the target (S434) and determines whether the number of members of the cluster c exceeds M_max (S431). At the time of entering the optimization loop, it is assumed that no cluster other than the nearest cluster c*, to which the new data x was added, exceeds the member number limit.

仮に最近接クラスタｃ*のメンバ数がＭ_maxを超えた場合、ＣＰＵ１５２は、そのクラスタを２分割し（Ｓ４３２）、新たに生成されたクラスタｄを近接クラスタの集合Ｃ*の要素に加える（Ｓ４３３）。クラスタを２分割する方法としては、種々の方法が考えられる。本実施の形態では、そのクラスタ内のベクトル分布に関して主軸を求め、各メンバのベクトルの主軸への射影が、クラスタ平均ベクトルの射影のどちら側に存在するかを判定することによって、メンバを二つのクラスタに分割する。 If the number of members of the nearest cluster c* exceeds M_max, the CPU 152 divides that cluster into two (S432) and adds the newly generated cluster d to the set C* of neighboring clusters (S433). Various methods of dividing a cluster into two are conceivable. In the present embodiment, the principal axis of the vector distribution within the cluster is obtained, and the members are divided into two clusters by determining on which side of the projection of the cluster average vector the projection of each member's vector onto the principal axis lies.
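
The principal-axis split can be illustrated with the sketch below. The specification does not say how the principal axis is computed; power iteration on the covariance is used here purely as an assumption, and the function name `split_by_principal_axis` is hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def split_by_principal_axis(vectors, iterations=50):
    """Split vectors into two groups by the sign of their projection onto the principal axis."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
    centered = [[v[k] - mean[k] for k in range(dim)] for v in vectors]
    # Power iteration (assumed method): axis <- normalize(sum_i (c_i . axis) c_i).
    # Start from the centered vector farthest from the mean.
    axis = max(centered, key=lambda c: dot(c, c))
    for _ in range(iterations):
        proj = [dot(c, axis) for c in centered]
        new_axis = [sum(p * c[k] for p, c in zip(proj, centered)) for k in range(dim)]
        norm = dot(new_axis, new_axis) ** 0.5
        if norm == 0.0:
            break  # degenerate case: all members coincide with the mean
        axis = [a / norm for a in new_axis]
    left, right = [], []
    for v, c in zip(vectors, centered):
        # The mean projects to 0 after centering, so the sign decides the side.
        (left if dot(c, axis) < 0 else right).append(v)
    return left, right
```

Which group is called `left` or `right` depends on the arbitrary sign of the axis, so callers should not rely on the order of the two returned groups.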

ステップ４３３が実行された後、ＣＰＵ１５２の処理は、ステップ４３１に戻る。ステップ４３１では、分割後のクラスタのメンバ数がＭ_maxを超えているか否かが判定される。分割後のクラスタのメンバ数がＭ_maxを超えていると判定された場合、そのクラスタをさらに分割するために、処理はステップ４３２に進む。一方、分割後のクラスタのメンバ数がＭ_maxを超えていないと判定された場合、次のクラスタについてステップ４３１の判定を実行するために、ＣＰＵ１５２は、パラメータｉの値に１を加算して（Ｓ４３５）、ステップ４３０に戻る。 After step 433 is executed, the processing of the CPU 152 returns to step 431. In step 431, it is determined whether the number of members of the cluster after the division exceeds M_max. If it is determined that the number of members of the cluster after the division exceeds M_max, the process proceeds to step 432 in order to divide that cluster further. On the other hand, if it is determined that the number of members of the cluster after the division does not exceed M_max, the CPU 152 adds 1 to the value of the parameter i (S435) in order to perform the determination of step 431 on the next cluster, and returns to step 430.

ステップ４３０において、パラメータｉが集合Ｃ*の要素数を超えたと判定された場合、集合Ｃ*の要素である全てのクラスタのメンバ数がＭ_max以内であることが確認された。この場合、ＣＰＵ１５２は、k-means法による最適化の反復回数ｔをチェックする（Ｓ４４０）。 If it is determined in step 430 that the parameter i exceeds the number of elements in the set C*, it has been confirmed that the number of members of every cluster in the set C* is within M_max. In this case, the CPU 152 checks the number of iterations t of the optimization by the k-means method (S440).

本システムにおいて、図４に示す最適化は、あくまでクラスタの部分集合を対象としたものであり、クラスタ全体での最適化を意味しない。また、データの追加は、その後も繰り返し行われることを想定しており、その度に最適化が実行される。従って、ある時点での最適化を極端に重視する必要はなく、反復の最大数ｔ_maxは、数回程度で十分である。 In this system, the optimization shown in FIG. 4 targets only a subset of the clusters and does not mean optimization over all clusters. In addition, data is assumed to be added repeatedly afterwards, and the optimization is executed each time. Therefore, there is no need to place extreme weight on the optimization at any single point in time, and a maximum number of iterations t_max of only a few is sufficient.

ステップ４４０において、反復回数を示すパラメータｔが反復の最大数ｔ_max以上であると判定された場合、k-means法による最適化が所定の回数実行されたため、図４の処理が終了する。あるいは、ステップ４４０において、集合Ｃ*が変化していないと判定された場合、さらに最適化を実行する必要がないと考えられる。従って、この場合も、図４の処理が終了する。 If it is determined in step 440 that the parameter t indicating the number of iterations is equal to or greater than the maximum number of iterations t_max, the optimization by the k-means method has been executed the predetermined number of times, so the processing of FIG. 4 ends. Likewise, if it is determined in step 440 that the set C* has not changed, no further optimization is considered necessary, so the processing of FIG. 4 also ends in this case.

一方、ステップ４４０において、反復回数を示すパラメータｔが反復の最大数ｔ_maxより小さく、かつ、集合Ｃ*が変化していると判定された場合、クラスタの最適化を実行する必要があるため、ＣＰＵ１５２は、k-means法によって集合Ｃ*を更新する（Ｓ４５０）。 On the other hand, if it is determined in step 440 that the parameter t indicating the number of iterations is smaller than the maximum number of iterations t_max and that the set C* has changed, cluster optimization needs to be executed, so the CPU 152 updates the set C* by the k-means method (S450).

ステップ４５０の処理は、通常のk-means法と同様である。すなわち、近接クラスタに含まれる全データは、その時点での最も近接したクラスタ平均を持つクラスタに配分される。これによって、各近接クラスタのメンバ、及び、クラスタ平均が更新され、ステップ４３０に戻る。このとき、ステップ４５１で反復回数を示すパラメータｔの値に１が加算される。最適化ループに入った時点とは異なり、今回は、大きくクラスタの状態が変化した場合、複数のクラスタがメンバ数の上限を超える可能性がある。また、２分割しただけでは不十分であるため、再度分割が必要となる場合、あるいは、新たに生成されたクラスタが上限を超える場合も生じる可能性がある。このため、ＣＰＵ１５２は、全てのクラスタのメンバ数が上限Ｍ_max以下となるように処理（ステップ４３０からステップ４３５）した後、ステップ４４０に移行する。ステップ４４０で、反復数が最大数ｔ_maxに達したか、あるいは、近接クラスタの集合Ｃ*の状態に全く変化がない場合、処理を終了する。 The processing in step 450 is the same as in the ordinary k-means method. That is, all data contained in the neighboring clusters is redistributed to the cluster whose cluster average is closest at that time. As a result, the members and the cluster average of each neighboring cluster are updated, and the process returns to step 430. At this time, 1 is added to the value of the parameter t indicating the number of iterations in step 451. Unlike at the time of entering the optimization loop, if the state of the clusters has changed significantly, several clusters may now exceed the upper limit on the number of members. In addition, a single division into two may be insufficient so that a further division is required, or a newly generated cluster may itself exceed the upper limit. For this reason, the CPU 152 performs the processing of steps 430 to 435 so that the number of members of every cluster becomes equal to or less than the upper limit M_max, and then proceeds to step 440. In step 440, if the number of iterations has reached the maximum number t_max, or if the state of the set C* of neighboring clusters has not changed at all, the processing ends.
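
The flow of FIG. 4 (S410 to S451) can be summarized in the following sketch. It is a simplified illustration under several assumptions: clusters are plain lists of vectors, every existing cluster is non-empty, the split helper `split_in_two` divides along the widest coordinate rather than along the principal axis used in the embodiment, and all function and parameter names are hypothetical.

```python
def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def mean_of(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def split_in_two(members):
    """Stand-in split (assumption): halve along the coordinate with the widest spread."""
    dim = max(range(len(members[0])),
              key=lambda k: max(v[k] for v in members) - min(v[k] for v in members))
    ordered = sorted(members, key=lambda v: v[dim])
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]


def register(x, clusters, n_near=2, m_max=4, t_max=3):
    """Add vector x to `clusters` (a list of non-empty member lists) and re-optimize locally."""
    means = [mean_of(c) for c in clusters]
    # S410: indices of the n_near clusters whose averages are closest to x.
    near = sorted(range(len(clusters)), key=lambda i: sq_dist(x, means[i]))[:n_near]
    clusters[near[0]].append(x)                      # S420: add x to the nearest cluster
    for t in range(t_max + 1):
        i = 0
        while i < len(near):                         # S430-S435: enforce the size limit
            c = near[i]
            if len(clusters[c]) > m_max:
                clusters[c], new_cluster = split_in_two(clusters[c])   # S432
                clusters.append(new_cluster)
                near.append(len(clusters) - 1)       # S433: new cluster joins the set
            else:
                i += 1
        if t == t_max:                               # S440: iteration limit reached
            break
        # S450: reassign every member of the neighboring clusters to the nearest average.
        live = {c: mean_of(clusters[c]) for c in near if clusters[c]}
        reassigned = {c: [] for c in near}
        for c in near:
            for v in clusters[c]:
                reassigned[min(live, key=lambda k: sq_dist(v, live[k]))].append(v)
        if all(reassigned[c] == clusters[c] for c in near):
            break                                    # S440: the set did not change
        for c in near:
            clusters[c] = reassigned[c]
    return clusters
```

Only the clusters listed in `near` are ever touched, which mirrors the text's point that the optimization is local to a subset of clusters. The sketch does not remove clusters that become empty after reassignment.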

以上の処理によって、登録データ数が増加するにつれて、クラスタ数も増加することになる。新クラスタが追加されるのは、クラスタ内のメンバ数がクラスタメンバ最大数を超えたときのため、（クラスタメンバ最大数）回に１回程度の頻度となる。図４に示した一連の処理は、メモリ１５３上の作業領域で実行され、近接クラスタの集合が更新される。 With the above processing, the number of clusters increases as the number of registered data items increases. Because a new cluster is added only when the number of members in a cluster exceeds the maximum number of cluster members, this happens roughly once per (maximum number of cluster members) registrations. The series of processing shown in FIG. 4 is executed in a work area on the memory 153, and the set of neighboring clusters is updated.

次に、本類似検索システムにおけるデータ配置処理について説明する。一般に、類似性が高いクラスタは、更新時、及び、検索時に同時に参照される可能性が高い。従って、類似性が高いクラスタ同士が、２次記憶媒体上でなるべく近傍に集まるように配列すれば、ディスク走査の負荷を低減できるはずである。そこで、２次記憶媒体を１次元配列とみなし、類似したクラスタ同士を１次元配列上で近傍に配置し、類似していないクラスタ同士は１次元配列上で遠い領域に配置する。 Next, a data arrangement process in the similarity search system will be described. In general, a cluster having high similarity is highly likely to be referred to at the time of update and search. Therefore, if the clusters having high similarity are arranged as close to each other as possible on the secondary storage medium, the load on the disk scanning should be reduced. Therefore, the secondary storage medium is regarded as a one-dimensional array, similar clusters are arranged in the vicinity on the one-dimensional array, and dissimilar clusters are arranged in a distant area on the one-dimensional array.

クラスタ数Ｎ_cが少数の場合（例えばＮ_c＜４）は、更新対象クラスタに対して、特許文献２と同様のエネルギー関数Ｅ₁を用いてクラスタ配置の更新を行う。 When the number of clusters N_c is small (for example, N_c < 4), the cluster arrangement of the clusters to be updated is updated using the same energy function E₁ as in Patent Document 2.

クラスタ数Ｎ_cが増加した場合（例えばＮ_c≧４）、式（３）に示されるエネルギー関数Ｅ₂を用いて、クラスタ配置の更新を行う。 When the number of clusters N_c has increased (for example, N_c ≥ 4), the cluster arrangement is updated using the energy function E₂ shown in Expression (3).

式（３）は、クラスタ位置の再配列の手続きを定義するためのエネルギー関数Ｅ₂である。Ｎ_cはクラスタ数、ｖ_iはｉ番目の位置にあるクラスタの平均ベクトルであり、ｕ_lqはクラスタ数がｍのｌ乗（ｍ、ｌは自然数）の時のｑ番目の位置にあるクラスタの平均ベクトルを表す。以下では例として、ｍ＝２としたときについて説明する。ｐは式（４）を満たす最大の整数であり、ｐのオーダはＯ(log(Ｎ_c))となる。また、ｑは式（５）を満たす。尚、ｕ_lqはクラスタ管理情報１１５に保存されている。 Expression (3) is the energy function E₂ that defines the procedure for rearranging cluster positions. N_c is the number of clusters, v_i is the average vector of the cluster at the i-th position, and u_lq is the average vector of the cluster at the q-th position at the time when the number of clusters was m to the power of l (m and l are natural numbers). The case m = 2 is described below as an example. p is the largest integer that satisfies Expression (4), and p is of order O(log(N_c)). Further, q satisfies Expression (5). Note that u_lq is stored in the cluster management information 115.

本エネルギー関数Ｅ₂は、相前後する位置にあるクラスタ平均の２乗距離の総和を表す項と、逐次クラスタリングの結果として得られ、クラスタ管理情報１１５に保存している過去クラスタ平均と現在のクラスタ平均の２乗距離の総和を表す項の和として定義されている。具体的には、まず、保存している過去のクラスタについて、ある時点の過去クラスタ配列に着目する。次に、着目した過去クラスタ配列内のクラスタ番号を最小値が１、最大値がＮ_cとなるように、式（５）を用いて変換する。そして、現在クラスタ配列内のクラスタのクラスタ番号と、最も近い変換後のクラスタ番号を持つ過去クラスタ配列内のクラスタを取得し、それらのクラスタのクラスタ平均のベクトル間距離の２乗をエネルギー関数に加算する。 The energy function E₂ is defined as the sum of two terms: a term representing the sum of the squared distances between the cluster averages at adjacent positions, and a term representing the sum of the squared distances between the current cluster averages and the past cluster averages that were obtained as a result of sequential clustering and stored in the cluster management information 115. Specifically, for the stored past clusters, attention is first paid to the past cluster array at one point in time. Next, the cluster numbers in that past cluster array are converted using Expression (5) so that the minimum value is 1 and the maximum value is N_c. Then, for each cluster in the current cluster array, the cluster in the past cluster array whose converted cluster number is closest to its cluster number is obtained, and the square of the distance between the cluster average vectors of the two clusters is added to the energy function.

ただし、両端の位置、すなわち１番目の位置とＮ_c番目のクラスタについての境界条件は、片側のみを定義している。具体的には、存在しない０番目の位置のクラスタと１番目のクラスタとのクラスタ平均同士の距離は０とし、存在しないＮ_c＋１番目の位置のクラスタとＮ_c番目の位置のクラスタとのクラスタ平均同士の距離も０とする。 However, for the positions at both ends, that is, the clusters at the first position and the N_c-th position, the boundary condition is defined on one side only. Specifically, the distance between the cluster average of the non-existent cluster at the 0th position and that of the cluster at the first position is taken to be 0, and the distance between the cluster average of the non-existent cluster at the (N_c+1)-th position and that of the cluster at the N_c-th position is also taken to be 0.

本エネルギー関数Ｅ₂では、ｉ番目のクラスタとｉ−１番目のクラスタ及びｉ＋１番目のクラスタの類似度が大きければ、前半の項が小さくなり、ｉ番目のクラスタが過去のクラスタ平均に対応した位置に配置されれば、後半の項は小さくなる。 In the energy function E₂, the first term becomes smaller when the i-th cluster is highly similar to the (i−1)-th and (i+1)-th clusters, and the second term becomes smaller when the i-th cluster is placed at a position corresponding to the past cluster averages.
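
Expressions (3) to (5) are not reproduced in this text, so the following sketch is only one possible reading of the verbal description of E₂, not the patent's formula. It assumes that each adjacent pair is counted once in the first term, that past cluster numbers are rescaled linearly onto the range 1 to N_c, and that both terms carry equal weight. The names `energy_e2`, `nearest_past_index`, and `snapshots` are hypothetical.

```python
def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_past_index(i, n_current, n_past):
    """1-based past position whose rescaled number (range 1..n_current) is nearest to i."""
    if n_past == 1 or n_current == 1:
        return 1
    scale = (n_current - 1) / (n_past - 1)   # assumed linear rescaling
    return min(range(1, n_past + 1), key=lambda q: abs(1 + (q - 1) * scale - i))


def energy_e2(means, snapshots):
    """means: current cluster averages in array order; snapshots: list of past mean arrays."""
    n = len(means)
    # First term: squared distances between averages at adjacent positions.
    # The missing neighbors at positions 0 and n+1 contribute 0 (boundary condition).
    adjacent = sum(sq_dist(means[i], means[i + 1]) for i in range(n - 1))
    # Second term: squared distance of each current average to its matching past average.
    anchor = 0.0
    for past in snapshots:
        for i in range(1, n + 1):
            q = nearest_past_index(i, n, len(past))
            anchor += sq_dist(means[i - 1], past[q - 1])
    return adjacent + anchor
```

With four current averages in sorted order and one two-cluster snapshot, a well-ordered array scores lower than a shuffled one, which is the behavior the text attributes to the two terms.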

このとき、ｉ番目の位置にあるクラスタとｊ番目の位置にあるクラスタを交換した場合のエネルギー関数の変化量は、式（６）によって算出される。ただし、ｊ＞ｉとする。 At this time, the amount of change in the energy function when the cluster at the i-th position and the cluster at the j-th position are exchanged is calculated by Expression (6). However, j> i.

上記のエネルギー変化量に基づいてエネルギー関数が減少するように、現在更新対象となっている近接クラスタ集合中のクラスタの位置を更新する。逐次クラスタリング処理によってクラスタ数が増加した場合、新規クラスタはクラスタ配列の末尾に配置されていたとみなして、クラスタ配列の再配列処理を行う。 The position of the cluster in the adjacent cluster set that is the current update target is updated so that the energy function decreases based on the energy change amount. When the number of clusters is increased by the sequential clustering process, it is assumed that the new cluster is arranged at the end of the cluster arrangement, and the cluster arrangement is rearranged.
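
The position update can be pictured as a greedy pairwise-swap search. Since Expression (6) is not reproduced in this text, the sketch below does not use a closed-form energy change; it simply evaluates the energy before and after each trial swap. The default `adjacent_energy` covers only the adjacent-position term, and all names are assumptions made for illustration.

```python
from itertools import combinations


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def adjacent_energy(means):
    """Sum of squared distances between cluster averages at adjacent positions."""
    return sum(sq_dist(means[i], means[i + 1]) for i in range(len(means) - 1))


def reorder(means, positions, energy=adjacent_energy):
    """Greedily swap the clusters at `positions` (0-based) while the energy decreases."""
    means = list(means)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(sorted(positions), 2):   # j > i, as in the text
            before = energy(means)
            means[i], means[j] = means[j], means[i]       # trial swap
            if energy(means) < before:
                improved = True                           # keep the swap
            else:
                means[i], means[j] = means[j], means[i]   # undo the swap
    return means
```

The loop terminates because every accepted swap strictly lowers the energy and the number of arrangements is finite. A production version would compute the change locally, as the text indicates, instead of re-evaluating the whole sum.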

さらに、更新された各クラスタに、クラスタ配列の順番に対応するように、全クラスタを通して一意的な連続番号であるクラスタ番号を割り振り、クラスタ平均と共に、クラスタ管理情報１１２で管理する。このとき割り振られる番号は、ステップ４１０で取得した最近接クラスタの集合Ｃ*に含まれるクラスタのクラスタ番号である。また、クラスタメンバ数の上限を超えたためにクラスタの分割が行われた場合は、増加したクラスタの数だけ新規にクラスタ番号を割り当てる。 Further, a cluster number that is a unique serial number is assigned to each updated cluster so as to correspond to the order of the cluster arrangement, and is managed by the cluster management information 112 together with the cluster average. The number assigned at this time is the cluster number of the cluster included in the set C * of the nearest clusters acquired in step 410. In addition, when a cluster is divided because the upper limit of the number of cluster members is exceeded, new cluster numbers are assigned by the increased number of clusters.

最後に、クラスタ数が特定の数の場合、全クラスタのクラスタ平均とクラスタ番号の組をクラスタ情報として保存する。本実施の形態では、クラスタ数が２のｌ乗（ｌは自然数）の時に保存することとする。 Finally, when the number of clusters is a specific number, a set of cluster averages and cluster numbers of all clusters is stored as cluster information. In this embodiment, it is assumed that the number of clusters is 2 to the l-th power (where l is a natural number).
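
The snapshot rule can be sketched as follows. The sketch assumes that natural numbers start at 1 (so snapshots are taken at 2, 4, 8, ... clusters) and that only the first arrangement seen at each size is kept. The names `is_power_of_two` and `maybe_snapshot` are hypothetical.

```python
def is_power_of_two(n):
    """True for 2, 4, 8, ... (2 to the power l with l >= 1)."""
    return n >= 2 and n & (n - 1) == 0


def maybe_snapshot(means, snapshots):
    """Store a copy of the current cluster averages, keyed by cluster count, when due.

    means: cluster averages listed in cluster-number order.
    snapshots: dict mapping cluster count -> saved list of averages.
    """
    n = len(means)
    if is_power_of_two(n) and n not in snapshots:
        snapshots[n] = [list(v) for v in means]   # list index + 1 is the cluster number
    return snapshots
```

Because snapshots are taken only at powers of two, the number of stored arrays grows as O(log N_c), which is consistent with the order stated for p.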

以下では、現在のクラスタ平均と逐次クラスタリングの結果として得られた過去クラスタ平均の２乗距離の総和を表す項をエネルギー関数に加える理由について、詳細に説明する。 Hereinafter, the reason why a term representing the sum of the square distances of the current cluster average and the past cluster average obtained as a result of sequential clustering is added to the energy function will be described in detail.

まず、クラスタ配列の近傍領域に類似したクラスタが集まり、類似していないクラスタ同士がクラスタ配列の中で離れた領域に配置されているという関係が全てのクラスタに対して成立している理想的なクラスタ配列について考える。理想的なクラスタ配列に含まれる各クラスタについて、ベクトル空間内におけるクラスタ平均ベクトル同士の距離とクラスタ配列上の距離との関係を図５に示す。ここで、クラスタ配列上の距離とは、クラスタ配列の配列番号の差分を表す。さらに、図５に示す関係が成立している理想的なクラスタ配列が持つ階層的な構造を図６に表す。ここでは例として、クラスタ数Ｎ_cが１２の時について示す。最下層の矩形（６３０）が理想状態のクラスタ配列のクラスタ平均を表す。上層のクラスタ平均の平均値（６１０、６２０）とは、下層のクラスタ配列を複数の領域に区切った際に、各領域に含まれるクラスタ平均の平均値を表す。ここで、各層のクラスタ平均の平均値同士は、最下層のクラスタ配列と同様に図５に示す関係が成立し、このような関係は階層的に成立していると考えられる。 First, an ideal cluster where similar clusters gather in the neighborhood of the cluster array and dissimilar clusters are arranged in distant areas in the cluster array holds for all clusters. Consider a cluster arrangement. FIG. 5 shows the relationship between the distance between cluster average vectors in the vector space and the distance on the cluster array for each cluster included in the ideal cluster array. Here, the distance on the cluster array represents the difference between the array element numbers of the cluster array. Further, FIG. 6 shows a hierarchical structure of an ideal cluster arrangement in which the relationship shown in FIG. 5 is established. Here, as an example, the case where the number of clusters N _c is 12 is shown. The bottom rectangle (630) represents the cluster average of the cluster arrangement in the ideal state. The average value (610, 620) of the upper layer cluster average represents the average value of the cluster average included in each region when the lower layer cluster arrangement is divided into a plurality of regions. Here, the average values of the cluster averages of the respective layers have the relationship shown in FIG. 5 as in the lowermost cluster arrangement, and such a relationship is considered to be established hierarchically.

ここで、理想的な状態のクラスタ配列から得られるクラスタ平均の平均値は、画像特徴量分布全体を、クラスタ平均の平均値の個数と同じ数に分割する最適なクラスタリング結果とほぼ等しいと仮定することができる。例えば、クラスタ平均の平均値の数が４つならば、そのときのクラスタ平均の平均値は、画像特徴量ベクトルのベクトル空間全体を４つに分割するクラスタリング結果から得られるクラスタ平均に等しいと見做せる。なぜならば、理想状態のクラスタ平均の平均値は、画像特徴量の分布にしたがって、クラスタ配列を適切に分割しているからである。 Here, it can be assumed that the averages of cluster averages obtained from the cluster array in the ideal state are approximately equal to the result of an optimal clustering that divides the entire image feature distribution into as many parts as there are averages of cluster averages. For example, if there are four averages of cluster averages, they can be regarded as equal to the cluster averages obtained from a clustering that divides the entire vector space of image feature vectors into four. This is because, in the ideal state, the averages of cluster averages divide the cluster array appropriately in accordance with the distribution of the image feature amounts.

したがって、クラスタ平均の平均値を事前に知ることが出来れば、各クラスタ平均の平均値を基準として、それらの値に距離が近いクラスタ平均を持つクラスタを各クラスタ平均の平均値に対応する領域に集めることができる。これは、クラスタ平均とクラスタ平均の平均値とのベクトル間距離をエネルギー関数に加えることに相当する。この操作によって、実際にクラスタを配置する際に、クラスタ配列全体における凡その位置を決定することが可能になると考えられる。 Therefore, if the average value of the cluster average can be known in advance, a cluster having a cluster average whose distance is close to the average value of each cluster average is set as an area corresponding to the average value of each cluster average. Can be collected. This corresponds to adding an intervector distance between the cluster average and the average value of the cluster average to the energy function. By this operation, it is considered that the approximate position in the entire cluster arrangement can be determined when the cluster is actually arranged.

しかしながら、図６に示す構造は理想状態におけるものであるため、クラスタ平均の平均値は本質的に未知である。それゆえに、何らかの手段を用いてクラスタ平均の平均値を求める必要がある。 However, since the structure shown in FIG. 6 is in an ideal state, the average value of the cluster average is essentially unknown. Therefore, it is necessary to obtain the average value of the cluster average using some means.

そこで、前述したk-means法を用いた逐次クラスタリング処理に着目する。逐次クラスタリング処理では、新規データを逐次的に登録する際に、部分的なクラスタリングを繰り返し行うことによって、全体のクラスタリングを実行する。また、クラスタメンバ数の上限を予め決めているため、データ数が増加するにつれてクラスタ数も同様に増加するという特徴を持つ。 Therefore, attention is focused on the sequential clustering process using the k-means method described above. In the sequential clustering process, the entire clustering is performed by repeatedly performing partial clustering when new data is sequentially registered. In addition, since the upper limit of the number of cluster members is determined in advance, the number of clusters similarly increases as the number of data increases.

The sequence of cluster averages obtained in this way can be regarded as moving from a coarse approximation of the overall image feature distribution to a fine one as the number of registered data items, and hence the number of clusters, grows. That is, regions of the feature distribution that contain many image features contain many cluster averages, and regions that contain few image features contain only a few. This can be considered equivalent to representing the entire feature distribution discretely by a vector quantization in which the occurrence probability of image features is the same in every partition. Hence, assuming that the registered data are random, the cluster averages obtained by sequential clustering can be considered to approximate the result of clustering all image features.

Therefore, the cluster averages obtained by sequential clustering at the point where the entire image feature distribution is divided into as many clusters as there are averages of cluster averages in the ideal state can be used as approximations of those ideal averages of cluster averages.

From the above, if a term representing the sum of the squared distances between the cluster averages and the past cluster averages obtained as a result of sequential clustering is added to the energy function, similar clusters can be placed near each other on the cluster array, and dissimilar clusters can be placed in distant regions of the array.
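The role of the added term can be illustrated with a small sketch. This is not the energy function of Expression (3); it is a stand-in with two assumed terms: squared distances between cluster averages that are adjacent on the array, and squared distances between each cluster average and the past cluster average assigned to its region of the array. The equal-width mapping from array position to region and the `weight` parameter are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def energy(means, past_means, weight=1.0):
    """means[i]: cluster average at array position i (assumed layout).
    past_means[r]: stored past cluster average for region r of the array."""
    n, k = len(means), len(past_means)
    # Term 1: neighbours on the array should be close in vector space.
    adjacent = sum(sq_dist(means[i], means[i + 1]) for i in range(n - 1))
    # Term 2: each cluster should be close to the past average of its region.
    anchor = sum(sq_dist(means[i], past_means[i * k // n]) for i in range(n))
    return adjacent + weight * anchor

past = [[0.5], [10.5]]                    # two past cluster averages
ordered = [[0.0], [1.0], [10.0], [11.0]]
mirrored = [[11.0], [10.0], [1.0], [0.0]]
print(energy(ordered, past))              # 84.0
print(energy(mirrored, past))             # 484.0
```

In this example both arrangements have the same adjacent-distance term (83), so only the term involving the past cluster averages distinguishes them. That is the sense in which the past averages fix the approximate position of each cluster within the whole array.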

FIG. 7 is a flowchart showing the cluster position rearrangement process executed in the embodiment of the present invention. The processing executed by the energy function calculation unit 315 and the data storage unit 316 is described in detail here. The process shown in FIG. 7 is executed as part of the server program that implements the search server process 111. The process shown in FIG. 7 is therefore executed by the CPU 152.

First, the CPU 152 searches the set C* of proximity clusters currently being updated for the pair of clusters that gives the largest decrease in energy as calculated by Expression (3) (S710). Next, the CPU 152 determines whether a pair of clusters satisfying the condition of step 710 has been found (S720). If no such pair is found, the energy of the current cluster array is considered to be at its minimum and each cluster is considered to be optimally placed, so the process ends.

If such a pair is found, the CPU 152 updates the array by exchanging the positions of the two clusters (S730) and returns to step 710 to search for the next pair. Finally, the feature vectors of the cluster members are saved to the hard disk 154 according to the array positions thus obtained. Furthermore, when the number of clusters is 2 to the power l (l is a natural number) (S740), the pairs of cluster average and cluster position are saved as cluster information in the cluster management information 115 on the hard disk 154 (S750). The information may instead be saved when the number of clusters equals some other specific number rather than a power of 2.
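The loop of steps S710 to S750 can be sketched as a greedy pairwise exchange. This is a minimal illustration, not the embodiment's program: `energy` below is an assumed stand-in for Expression (3), which is not reproduced here, and the names `rearrange`, `maybe_snapshot`, `candidates`, and `history` are hypothetical.

```python
from itertools import combinations

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def energy(means, past_means):
    """Assumed stand-in for Expression (3): adjacent term plus past-average term."""
    n, k = len(means), len(past_means)
    adjacent = sum(sq_dist(means[i], means[i + 1]) for i in range(n - 1))
    anchor = sum(sq_dist(means[i], past_means[i * k // n]) for i in range(n))
    return adjacent + anchor

def rearrange(means, past_means, candidates):
    """Exchange positions among `candidates` until no exchange lowers the energy."""
    means = list(means)
    while True:
        current = energy(means, past_means)
        best_gain, best_pair = 0.0, None
        for i, j in combinations(candidates, 2):        # S710: find best pair
            means[i], means[j] = means[j], means[i]     # trial exchange
            gain = current - energy(means, past_means)
            means[i], means[j] = means[j], means[i]     # undo trial exchange
            if gain > best_gain:
                best_gain, best_pair = gain, (i, j)
        if best_pair is None:                           # S720: none found, stop
            return means
        i, j = best_pair                                # S730: exchange positions
        means[i], means[j] = means[j], means[i]

def maybe_snapshot(means, history):
    """S740/S750: record (position, average) pairs when the count is a power of 2."""
    n = len(means)
    if n >= 2 and n & (n - 1) == 0:
        history[n] = [(pos, list(m)) for pos, m in enumerate(means)]

history = {}
result = rearrange([[0.0], [10.0], [1.0], [11.0]], [[0.5], [10.5]], range(4))
maybe_snapshot(result, history)
print(result)   # [[0.0], [1.0], [10.0], [11.0]]
```

Each accepted exchange strictly lowers the energy and the number of arrangements is finite, so the loop terminates. The sketch recomputes the full energy for every trial exchange for clarity; a practical version would evaluate only the terms affected by the two exchanged positions.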

To clarify the difference between the cluster arrays produced by the present invention and by the method of Patent Document 2, an experiment was performed using 5 million vector data items, yielding a cluster array of 12,245 clusters. As in FIG. 5, FIGS. 8, 9, 10, and 11 show, for the clusters in the cluster array, the relationship between the distance between cluster average vectors in the vector space and the distance on the cluster array. To produce each plot, the pairs of clusters at the same distance on the cluster array were collected, the inter-vector distance between the cluster averages of each pair was calculated, and the average of those distances was plotted. The vertical axis is the inter-vector distance between cluster averages, and the horizontal axis is the distance on the cluster array.
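The plotted quantity can be computed as in the following sketch. It assumes the Euclidean distance as the inter-vector distance and a list `means` holding the cluster averages in array order; both the function name `lag_profile` and these choices are illustrative assumptions.

```python
import math

def lag_profile(means, max_lag):
    """For each array distance d (1 <= d <= max_lag < len(means)), average the
    vector distance between cluster averages that are d positions apart."""
    profile = {}
    for d in range(1, max_lag + 1):
        dists = [math.dist(means[i], means[i + d])
                 for i in range(len(means) - d)]
        profile[d] = sum(dists) / len(dists)
    return profile

# Usage: a perfectly ordered 1-D array gives a profile proportional to d.
means = [[float(i)] for i in range(10)]
print(lag_profile(means, 3))   # {1: 1.0, 2: 2.0, 3: 3.0}
```

A profile that grows with d indicates that array distance reflects vector-space distance, while a flat profile indicates that clusters at those array distances are effectively unrelated.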

FIG. 8 shows the relationship between clusters obtained with the method of Patent Document 2 for cluster counts N_c of 512, 2048, 8192, and 12245. FIG. 9 is an enlarged view showing only the relationship between clusters within a distance of 50 on the cluster array.

From the graphs of FIGS. 8 and 9, it can be seen that, for clusters placed closer than about 10 positions apart on the cluster array, the average distance between their cluster averages is proportional to the distance on the array. For clusters placed more than about 10 positions apart, the average distance between cluster averages is nearly constant. A constant average indicates that these clusters are distributed almost randomly, independent of their distance on the array. Thus, with the method of Patent Document 2, clusters of high similarity can to some extent be placed adjacent to each other, but the similarity between clusters separated by more than a certain distance on the array does not depend on that distance and is effectively random. Furthermore, FIG. 9 shows that the range over which the average distance between cluster averages is proportional to the array distance stays constant regardless of the total number of clusters. Consequently, the number of clusters gathered in a neighborhood becomes relatively smaller as the total number of clusters increases.

FIG. 10 shows the relationship between clusters obtained with the method of the present embodiment for cluster counts N_c of 512, 2048, 8192, and 12245. FIG. 11 is an enlarged view showing only the relationship between clusters within a distance of 50 on the cluster array. As in the experiment using the method of Patent Document 2, the pairs of clusters at the same distance on the cluster array were collected, the distance between the cluster averages of each pair was calculated, and the average of those distances was plotted. The vertical axis is the distance between cluster averages, and the horizontal axis is the distance on the cluster array.

From the graphs of FIGS. 10 and 11, it can be seen that the range over which the average distance between cluster averages is proportional to the distance on the cluster array grows in proportion to the total number of clusters: about 100 when N_c is 512, about 400 when N_c is 2048, about 1600 when N_c is 8192, and about 2400 when N_c is 12245. Therefore, the number of clusters gathered in a neighborhood can be said to be O(N), where N is the total number of clusters.
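As a quick check of this proportionality, dividing each approximate range read from the graphs by the corresponding number of clusters gives a nearly constant ratio. The symbol r for the proportional range is introduced here only for illustration and does not appear in the original.

```latex
\frac{r}{N_c}:\quad
\frac{100}{512} \approx 0.195,\quad
\frac{400}{2048} \approx 0.195,\quad
\frac{1600}{8192} \approx 0.195,\quad
\frac{2400}{12245} \approx 0.196
```

In each case the proportional range covers roughly one fifth of the cluster array.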

The embodiment of the present invention has been described above for the case where the vector data to be searched are image feature vectors, but the present invention can be applied to similarity search over any kind of vector data. One example of similar data search other than similar image search is an associative document search apparatus. An associative document search apparatus performs morphological analysis in advance on the group of documents to be searched, treats the meaning of each word as a unit vector, expands the meaning of each document into a multidimensional vector as the sum of those unit vectors, and judges the similarity between one document and another by the distance between their multidimensional vectors. This makes it possible to retrieve the document whose vector is closest to that of a query sentence entered by the user.
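The document-vector idea can be sketched as follows. This is a simplified illustration under assumptions: whitespace splitting stands in for morphological analysis, each distinct word is given its own axis (a one-hot unit vector), and the Euclidean distance is used. The function names are hypothetical.

```python
import math

def doc_vector(text, vocab):
    """Sum of one-hot unit vectors of the words in `text` (unknown words ignored)."""
    vec = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec

def nearest_document(query, documents):
    """Return the index of the document whose vector is closest to the query's."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))   # assign one axis per word
    vectors = [doc_vector(doc, vocab) for doc in documents]
    q = doc_vector(query, vocab)
    return min(range(len(documents)), key=lambda i: math.dist(q, vectors[i]))

documents = [
    "cluster array on disk",
    "image feature vector search",
    "morphological analysis of words",
]
print(nearest_document("image search", documents))   # 1
```

Once documents are represented as vectors in this way, the clustering and cluster arrangement described for image feature vectors apply unchanged.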

FIG. 1: Block diagram showing a configuration example of the similarity search system of the present invention.
FIG. 2: Block diagram showing a configuration example of the similar data search process in the similarity search system of the present invention.
FIG. 3: Block diagram showing a configuration example of the new data registration process in the similarity search system of the present invention.
FIG. 4: Flowchart showing the process executed at the time of data registration.
FIG. 5: Diagram showing the ideal relationship between the distance between cluster average vectors and the distance on the cluster array, for arbitrary clusters in the cluster array.
FIG. 6: Diagram showing the structure of a cluster array for which the ideal relationship shown in FIG. 5 holds.
FIG. 7: Flowchart showing the cluster position rearrangement process.
FIG. 8: Diagram showing the relationship between the distance between cluster average vectors and the distance on the cluster array, for arbitrary clusters in a cluster array obtained by the conventional method.
FIG. 9: Enlarged view of the neighborhood region of FIG. 8.
FIG. 10: Diagram showing the relationship between the distance between cluster average vectors and the distance on the cluster array, for arbitrary clusters in a cluster array obtained by the present invention.
FIG. 11: Enlarged view of the neighborhood region of FIG. 10.

Explanation of symbols

110: Server computer
111: Search server process
112: Cluster management information
113: Cluster information
114: Feature data of cluster members
115: Recorded cluster management information
120: Communication infrastructure
130: Client computer
131, 152: CPU
132, 153: Memory
133, 151: Interface
134: Input device
135: Output device
140: Image server
154: Hard disk
210: Image input unit
211: Image feature quantity extraction unit
212: Similar cluster search unit
213: Similar feature quantity search unit
214: Search result output unit
310: Image input unit
311: Image feature quantity extraction unit
312: Similar cluster search unit
313: Clustering unit
314: Cluster dividing unit
315: Energy function calculation unit
316: Data storage unit
610, 620: Averages of the cluster averages of an ideal cluster array
630: Cluster average of an ideal cluster array

Claims

In a similar data search system that searches data similar to search query data from registered search target data,
A feature quantity extraction unit that extracts feature quantities from data and generates a feature quantity vector;
A clustering unit that classifies feature vectors of search target data into a plurality of clusters;
A data storage unit that manages the cluster number, which is a unique serial number assigned to each cluster, and the cluster representative value of each cluster, stores the feature quantity vectors contained in each cluster in a continuous area of a storage medium, and, when the number of clusters reaches a predetermined number, stores the cluster representative values and cluster numbers of the clusters;
A similar cluster search unit that searches for clusters having cluster representative values that are close, in the vector space, to a feature quantity vector extracted from given data;
A cluster dividing unit that divides a cluster when the number of members of the cluster exceeds a preset upper limit; and
An energy function calculation unit that calculates an inter-vector distance 1 between the cluster representative value of a cluster having a specific number and the cluster representative values of the preceding and following clusters, and an inter-vector distance 2 between the cluster representative value of the cluster having the specific number and the cluster representative value of a stored cluster,
Wherein, when new data is additionally registered as a search target, the similar cluster search unit searches the clusters stored in the storage medium for a predetermined number of clusters whose cluster representative values are close, in the vector space, to the feature quantity vector of the new data extracted by the feature quantity extraction unit; the clustering unit clusters the feature quantity vector of the new data together with the feature quantity vectors included in the retrieved predetermined number of clusters and reclassifies them into a plurality of clusters; the energy function calculation unit obtains storage positions on the storage medium for the reclassified clusters such that the sum of the inter-vector distance 1 and the inter-vector distance 2 is reduced; and the feature quantity vectors included in the reclassified clusters are stored at the obtained storage positions.

The similar data search system according to claim 1, further comprising a similar feature amount search unit and a search result output unit,
Wherein the similar cluster search unit searches the clusters stored in the storage medium for clusters having cluster representative values that are close, in the vector space, to the feature quantity vector extracted from the search query data,
The similar feature quantity search unit searches, from among the cluster members included in the clusters retrieved by the similar cluster search unit, for data having a feature quantity vector that is close in distance to the feature quantity vector extracted from the search query data, and
The search result output unit outputs the data retrieved by the similar feature quantity search unit.

A similar data search method for searching registered search target data for data similar to search query data, using a similar data search system having: a storage unit that stores data classified into a plurality of clusters assigned unique consecutive numbers as cluster numbers together with their feature quantity vectors, the cluster numbers and cluster representative values of the current clusters, and the cluster numbers and cluster representative values of the clusters at the time the number of clusters was a predetermined number; a feature quantity extraction unit; a similar cluster search unit; a clustering unit that classifies the feature quantity vectors of search target data into a plurality of clusters; a cluster dividing unit; a data storage unit; and an energy function calculation unit that calculates an inter-vector distance 1 between the cluster representative value of a cluster having a specific number and the cluster representative values of the preceding and following clusters, and an inter-vector distance 2 between the cluster representative value of the cluster having the specific number and the cluster representative value of a stored cluster, the method comprising:
A step of extracting a feature quantity from the newly registered data by the feature quantity extraction unit to generate a feature quantity vector;
A predetermined number of clusters having cluster representative values that are close in distance in the vector space to the feature vector extracted from the newly registered data by the similar cluster search unit based on the current cluster information stored in the storage unit Searching for
Clustering by the clustering unit with respect to feature vectors of the newly registered data and feature vectors of cluster members included in the predetermined number of similar clusters, and classifying into a plurality of clusters;
Dividing the cluster by the cluster dividing unit when the number of members of the cluster exceeds the preset upper limit;
Rearranging, by the energy function calculation unit, the storage positions on the storage medium of the feature quantity vectors included in the plurality of clusters classified by the clustering unit so that the sum of the inter-vector distance 1 and the inter-vector distance 2 is reduced;
The data storage unit stores the feature vector included in the cluster in a continuous area of the storage medium, and each time the number of clusters reaches a predetermined number, the cluster representative value and the cluster number of the cluster are stored in the storage unit. A method for retrieving similar data, comprising the step of storing.

The similar data search method according to claim 3, wherein the similar data search system further includes a similar feature amount search unit and a search result output unit,
Searching for a cluster having a cluster representative value that is close in distance in the vector space with the feature vector extracted from the search query data from the current clusters stored in the storage medium by the similar cluster search unit;
A step of searching, by the similar feature quantity search unit, from among the cluster members included in the retrieved clusters, for data having a feature quantity vector that is close, in the vector space, to the feature quantity vector extracted from the search query data; and
A step of outputting, by the search result output unit, the data retrieved by the similar feature quantity search unit.