JP2017021606A

JP2017021606A - Method, device, and program for searching for dynamic images

Info

Publication number: JP2017021606A
Application number: JP2015139166A
Authority: JP
Inventors: 泰男松山; Yasuo Matsuyama; 雅史森脇; Masafumi Moriwaki; 良太横手; Ryota Yokote; 輝樹堀江; Teruki Horie; 晶滉鹿野; Masaaki Kano; 弘道岩瀬; Hiromichi Iwase
Original assignee: Waseda University
Current assignee: Waseda University
Priority date: 2015-07-10
Filing date: 2015-07-10
Publication date: 2017-01-26

Abstract

PROBLEM TO BE SOLVED: To perform searching or grouping that uses a dynamic image itself as the query more precisely, more easily, and at lower cost than conventional techniques. SOLUTION: A feature quantity extraction part 22 extracts the feature quantities of a query dynamic image and of a target dynamic image. A dynamic image representative frame selection part 23 acquires, in chronological order, representative frames of the query dynamic image and of the target dynamic image from those feature quantities, using one of a time-partition k-means method, a time-split k-means method, a time-bound pairwise nearest neighbor method, and a time-partition pairwise nearest neighbor method. A similarity calculation part 24 then obtains the similarity between the query dynamic image and the target dynamic image from the similarity between their representative frames, and a dynamic image similar to the query dynamic image is searched for on the basis of the obtained similarity. SELECTED DRAWING: Figure 1

Description

The present invention relates to a moving image search method, a moving image search device, and a program therefor.

An enormous amount of moving image data has accumulated on the Internet. Text-based search is a commonly used way of searching these moving images: appropriate text information is associated with each moving image, and a desired moving image is found by text search. However, this approach has the following problems.
(a) The association (labeling) of moving images with text information has to be done by hand, which is costly.
(b) Inappropriate labels with poor generality may be attached, which can lower the accuracy and efficiency of the search.

Content-based search, which uses the content of the image itself as the query, is a promising way of solving these problems.

As a content-based search method, a technique has been proposed that calculates the similarity between a query image and a search target image from whole-image information, such as the change and complexity of luminance over the whole image, the complexity of the edge amount, and simple composition information based on the luminance distribution, or from a combination of these (see, for example, Patent Document 1). Another proposed technique divides each of a plurality of search target images into element images of a predetermined size, selects representative element images from the divided element images, and searches for images using a search reference image drawn with those representative element images (see, for example, Patent Document 2).

However, these proposed techniques search for still images using a still image as the query, and do not consider searching with a moving image as the query.

Patent Document 1: JP 2001-56820 A
Patent Document 2: JP 2002-140331 A

As described above, the prior art discloses only techniques that use text or a still image as the query. Moving image search based on moving image content, with a moving image as the query, has been neither disclosed nor studied. An object of the present invention is to provide a moving image search method, a moving image search device, and a program therefor that can perform a search using a moving image itself as the query with high accuracy while keeping the computational cost low.

The present invention is a moving image search method performed by a moving image search device that calculates the similarity of a target moving image to a query moving image and searches for a moving image similar to the query moving image. The method comprises: extracting a first feature quantity for each frame from the frames of the query moving image; extracting a second feature quantity for each frame from the frames of the target moving image; acquiring first representative frames in time series from the extracted first feature quantities of the query moving image, using any one of a time-partition k-means method, a time-split k-means method, a time-bound pairwise nearest neighbor method, and a time-partition pairwise nearest neighbor method; acquiring second representative frames in time series from the extracted second feature quantities of the target moving image, using any one of the time-partition k-means method, the time-split k-means method, the time-bound pairwise nearest neighbor method, and the time-partition pairwise nearest neighbor method; and obtaining the similarity between the query moving image and the target moving image from the similarity between the first representative frames of the query moving image and the second representative frames of the target moving image, and thereby searching for the moving image similar to the query moving image.

The present invention is also a moving image search device that calculates the similarity of a target moving image, which is the object of the search, to a query moving image and searches for a moving image similar to the query moving image. The device comprises: a feature quantity extraction unit that extracts, from the frames of the query moving image and the target moving image, a first feature quantity for each frame of the query moving image and a second feature quantity for each frame of the target moving image; a representative frame selection unit that acquires, in time series, first representative frames from the extracted first feature quantities and second representative frames from the extracted second feature quantities, using any one of a time-partition k-means method, a time-split k-means method, a time-bound pairwise nearest neighbor method, and a time-partition pairwise nearest neighbor method; and a moving image search unit that obtains the similarity between the query moving image and the target moving image from the similarity between the first representative frames of the query moving image and the second representative frames of the target moving image, and searches for the moving image similar to the query moving image.

With the moving image search method, moving image search device, and program of the present invention described above, a search using a moving image as the query can be performed with high accuracy while keeping the computational cost low.

A block diagram showing the system configuration of a moving image search device according to an embodiment of the present invention.
An explanatory diagram of the operation flow of the moving image search device according to the embodiment.
An explanatory diagram showing the difference between the clustering that considers time-series information, as applied in the embodiment, and prior-art clustering that does not consider time-series information.
A flowchart showing the representative frame selection algorithm of the embodiment based on the affinity propagation method with a time window (TB-AP, Time-Bound Affinity Propagation).
An explanatory diagram of the difference in message exchange between the embodiment and the prior art.
An explanatory diagram showing the link matrix for message exchange in the embodiment.
An explanatory diagram showing, on the time axis, the time window in TB-AP and the time blocks in the modified k-means method and the harmonic competition method of the embodiment.
An explanatory diagram of the time-partition k-means method (TP k-means) in the embodiment.
An explanatory diagram of the time-split k-means method (TS k-means) in the embodiment, showing the state in which all frames have been divided in time order into k equal parts and the centroid of each cluster has been obtained.
An explanatory diagram of TS k-means, showing the state in which, for each frame, the distances to the centroid of its current cluster and to the centroids of the temporally adjacent preceding and following clusters have been calculated.
An explanatory diagram of TS k-means, showing the state after re-clustering.
An explanatory diagram of TS k-means, showing the state in which clusters have been split between frames whose time series is discontinuous.
An explanatory diagram of TS k-means, showing the state in which the centroid of each cluster has been calculated after the clusters were split.
An explanatory diagram showing the merging of data points in the pairwise nearest neighbor method (PNN) in the embodiment.
An explanatory diagram of the time-partition pairwise nearest neighbor method (TP-PNN) in the embodiment.
An explanatory diagram of the time-bound pairwise nearest neighbor method (TB-PNN) in the embodiment.
A flowchart showing the similarity calculation algorithm using the M distance in the embodiment.
An explanatory diagram showing how the original moving image data used in the experiments was generated.
An explanatory diagram showing the structure of the moving image data used in the experiments.
A plot in which each of five Gaussian clusters is divided in two along the time direction to give ten groups, used to examine the appearance of representative points (typical still images) that reflect temporal order and to examine the computation speed.
A diagram showing, as a matrix, the distances between the typical still images of moving image v_A and moving image v_B.
A diagram showing an example of the global alignment table obtained in the experiments.
A diagram showing an example of the local alignment table obtained in the experiments.
A diagram corresponding to the global alignment table in which the cells whose calculation was omitted are filled in black.
A graph showing the precision-recall curve for moving image search obtained in the experiments.
As another experimental example, a graph showing precision-recall curves for moving image search obtained with different representative frame selection algorithms, for five clustering methods when the similarity was calculated directly, without post-processing to delete redundant representative frames at clustering time.
As another experimental example, a graph showing precision-recall curves for moving image search obtained with different representative frame selection algorithms, for five clustering methods when post-processing to delete redundant representative frames was performed at clustering time.
As another experimental example, a graph showing the precision-recall curve for moving image search obtained using CSD as the feature quantity.
As another experimental example, a graph showing the precision-recall curve for moving image search obtained using Frame Signature as the feature quantity.
An example of a hardware configuration diagram for the case where a computer is used as the processing device of the moving image search device according to the embodiment.

The moving image search method and the moving image search device of the present invention are described below with reference to the accompanying drawings.

FIG. 1 schematically shows an example of the system configuration of the moving image search device according to the present embodiment. In the figure, 1 is a storage device such as a hard disk drive used as a database that stores a collection of moving images, 2 is a processing device such as a computer that searches for similar images, 3 is a display that shows the moving image used for the search, the moving images to be searched, and the search results, and 4 is an input device such as a mouse or keyboard. The storage device 1 is connected to the processing device 2 so that it can be read. The processing device 2 can display moving images stored in the storage device 1 on the display 3 as needed. The storage device 1 may be built into the processing device 2, or may be constructed on a computer such as a server connected to the processing device 2 via communication means. The query images and the search target images may also be stored in separate storage devices; the arrangement is not particularly limited. The processing device 2 is connected to the input device 4 and can accept user input as appropriate.

To process the query moving image and the target moving image, the processing device 2 includes a moving image capturing unit 21, a feature quantity extraction unit 22, and a moving image representative frame selection unit 23. The moving image capturing unit 21 captures, from the moving images stored in the storage device 1, the query moving image selected by the user via the input device 4 or a target moving image to be searched. The feature quantity extraction unit 22 extracts the feature quantities of the query moving image or the target moving image captured by the moving image capturing unit 21. The moving image representative frame selection unit 23 selects one or more representative frames within the query moving image or the target moving image on the basis of the feature quantities extracted by the feature quantity extraction unit 22. The moving image capturing unit 21, the feature quantity extraction unit 22, and the moving image representative frame selection unit 23 may each be provided separately for the query moving image and for the target moving image.

To search for target moving images similar to the query moving image, the processing device 2 further includes a similarity calculation unit 24 and a similar moving image display control unit 25. The similarity calculation unit 24 compares the feature quantity of each representative frame in the query moving image with the feature quantity of each representative frame in the target moving image and calculates the similarity. The similar moving image display control unit 25 can display one or more target moving images on the display 3 in descending order of the similarity calculated by the similarity calculation unit 24.

When a computer is used as the processing device 2, some or all of the functions needed for the processing can be realized by installing application software for similar moving image search, and the processing device 2 then executes processing in accordance with the installed application software.

FIG. 2 shows an outline of the processing performed by the processing device 2 in the present embodiment. In the figure, the left path represents the processing flow for the query moving image, and the right path represents the processing flow for the target moving image. The processing in FIG. 2 ultimately yields the similarity of the target moving image to the query moving image.

The processing flow for the query moving image on the left path of FIG. 2 is as follows. In S110, the query moving image is read from the database in the storage device 1. In S120, the feature quantities of the query moving image are extracted. As the feature quantity, the Color Structure Descriptor (CSD), one of the image feature quantities proposed in MPEG-7, or the Video Signature specified by ISO/IEC (Multimedia content description interface, International Standard ISO/IEC 15938-3, Amendment 4 (2010)) can be used. In S130, representative frames of the query moving image are determined as typical still images. The representative frames are determined using a technique that takes the elapsed time of the moving image into account. Depending on the number of frames in the query moving image, a plurality of representative frames is normally selected, and the set of feature quantities of the typical still images corresponding to the selected representative frames is taken as the feature quantity of the query moving image.

The processing flow for the target moving image on the right path of FIG. 2 follows the same procedure as that for the query moving image. In S140, the target moving image is read from the database in the storage device 1. In S150, the feature quantities of the target moving image are extracted; the CSD or Video Signature described above can be used. In S160, representative frames that serve as the typical still images of the target moving image are determined. As described above, the representative frames are determined using a technique that takes the elapsed time of the moving image into account. Depending on the number of frames in the target moving image, a plurality of representative frames is normally selected, and the set of feature quantities of the typical still images corresponding to the selected representative frames is taken as the feature quantity of the target moving image. The left-path and right-path processing of FIG. 2 may be performed in either order, or in parallel.

As techniques that take the elapsed time of the moving image into account, the moving image representative frame selection unit 23 can use time-bound affinity propagation (TB-AP), time-partition k-means (TP-k-means), time-split k-means (TS-k-means), the time-bound pairwise nearest neighbor method (TB-PNN), or the time-partition pairwise nearest neighbor method (TP-PNN). These are described in detail later.

Instead of the TP-k-means method, it is also possible to apply its generalization, harmonic competition with time blocks introduced. The harmonic competition method is described in, for example, Matsuyama, Yasuo, "Harmonic Competition: A Self-Organizing Multiple Criteria Optimization," IEEE Transactions on Neural Networks, Vol. 7, No. 3, pp. 652-668, May 1996.

Once the representative frames (typical still images) have been selected for the query moving image and the target moving image, the similarity calculation of S170 is performed. Because the number of typical still images depends on each moving image, the similarity between the query moving image and the target moving image is calculated using a weighted M distance that allows gaps to be inserted. The M distance is described in detail later. In S180, target moving images with high similarity are retrieved on the basis of the similarity score calculated in S170 and displayed on the display 3.

The three processes in the above flow, namely the extraction of feature quantities in S120 and S150, the selection of representative frames in S130 and S160, and the calculation of similarity in S170, are described below.

・Extraction of feature quantities for frames
In the present embodiment, the feature quantity of each frame contained in the moving image is extracted first. The extracted feature quantities are expressed as a time-ordered sequence of vectors $\{x_t\}_{t=1}^{n}$. For example, so that the application software used for similar image search can handle moving images with color information, each vector can be expressed using the MPEG-7 CSD described above, with every vector being a normalized CSD histogram.

However, the feature quantity of each frame of the moving image is not limited to CSD; any feature quantity may be used as long as it is a vector quantity. The CSD histogram feature quantities described above are normalized under the following conditions.
・Normalization conditions
(i) Every component of each vector is non-negative.
(ii) The components of each vector sum to 1.0.
This is obtained by the following procedure.
(a) Each frame of the moving image is converted into a vector quantity representing its features (feature quantity extraction).
(b) A technique that can normalize each vector is adopted, so that the similarity comparison method remains general-purpose, including when it is implemented as software.

Step (b) above can in practice always be achieved by using the minimum value and the sum of each vector. CSD is merely one example of the feature vector in step (a); an even simpler alternative, which considers only the image frame as a whole, is a histogram.
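The normalization in step (b) can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: the function name `normalize_feature`, the choice to shift by the minimum only when a component is negative, and the uniform fallback for an all-zero vector are assumptions made for the example.

```python
import numpy as np


def normalize_feature(vector):
    """Return a copy of `vector` whose components are non-negative and sum to 1.0."""
    v = np.asarray(vector, dtype=float)
    # Assumption: shift by the minimum only when some component is negative,
    # so that an already non-negative histogram keeps its shape (condition i).
    v = v - min(v.min(), 0.0)
    total = v.sum()
    if total == 0.0:
        # Assumption: fall back to a uniform vector when every component is zero.
        return np.full(v.shape, 1.0 / v.size)
    # Dividing by the sum makes the components add up to 1.0 (condition ii).
    return v / total
```

For a raw histogram such as `[2, 6, 2]`, the function simply rescales the counts so that they sum to 1.0; a vector with negative components is first shifted so that its smallest component becomes zero.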

Although CSD is a simple feature quantity, it is preferable because it gives good results. Feature quantities obtained by other techniques can also be used in place of CSD, at the cost of more computation. For example, SIFT (Scale-Invariant Feature Transform), SURF (Speeded Up Robust Features), FREAK (Fast Retina Keypoint), or bag of visual words can be used as the feature quantity of each frame of the moving image. The Video Signature mentioned earlier can also be used. These feature quantities other than CSD, or feature quantities including CSD, may also be combined.

For example, taking the aforementioned {CSD, FREAK, bag of visual words, SIFT, Video Signature, ...} as a collection of individually normalized feature quantities and writing each of them as NormalizedFeatureVector(i), the composite feature vector CompositeFeatureVector defined by Equation 1 can also be used as the feature quantity of a moving image frame. When a feature quantity that does not need normalization is used, NormalizedFeatureVector outputs the input feature quantity unchanged.

Here, the weights $w_i$ in the above equation are assumed to satisfy the normalization conditions $w_i \ge 0$ and $\sum_{i=0}^{N} w_i = 1$. Equation 1 is only one example of how to obtain the composite feature vector CompositeFeatureVector; for instance, a product of feature quantities can also be used as the CompositeFeatureVector. The CompositeFeatureVector is therefore given by the general expression of Equation 2, which puts into a formula the statement that the CompositeFeatureVector is obtained by combining several individual feature quantities FeatureVector(i) and finally normalizing the result.
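Equations 1 and 2 do not survive in this text, so the following is only a sketch of one combination that is consistent with the stated weight conditions. It assumes that the individual feature vectors may have different lengths and therefore concatenates the weighted, normalized vectors; the function names and the use of concatenation are assumptions for illustration, not the patent's definition.

```python
import numpy as np


def l1_normalize(vector):
    """Scale a non-negative vector so that its components sum to 1.0."""
    v = np.asarray(vector, dtype=float)
    total = v.sum()
    if total <= 0.0:
        raise ValueError("feature vector must have a positive sum")
    return v / total


def composite_feature_vector(features, weights):
    """Concatenate w_i * NormalizedFeatureVector(i) over all features (assumed form)."""
    w = np.asarray(weights, dtype=float)
    if len(features) != len(w):
        raise ValueError("one weight is required per feature vector")
    # Conditions stated in the text: w_i >= 0 and the weights sum to 1.
    if np.any(w < 0.0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    parts = [wi * l1_normalize(f) for wi, f in zip(w, features)]
    return np.concatenate(parts)
```

Under these assumptions the result is itself non-negative and sums to 1.0, so it meets the same normalization conditions as a single feature vector. If every feature had the same dimensionality, a weighted sum of the normalized vectors would meet the conditions as well.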

・Selection of representative frames that preserve the time series
After the feature quantity of each frame has been extracted, representative frames are selected. Here, the processing device 2 clusters the frames of the moving image on the basis of the extracted feature quantities and selects a frame that represents each cluster. The frame representing a cluster is called a typical still image; in the present embodiment, time-series information is taken into account when selecting typical still images.

図３は、本実施の形態で用いる時系列情報を考慮した時間拘束アフィニティ伝播法（TB-AP）と通常のアフィニティ伝播法（AP）におけるクラスタリングの違いを説明する図である。図中上段にある時系列データは、特徴量として抽出した時間ｔ＝１から時間ｔ＝１０までのｎ＝１０個の時系列の特徴ベクトル｛ｘ_ｔ｝^１０ _ｔ＝１を表している。ここでの各ボックスは各時間ｔにおけるフレームを表し、例えば時間ｔ＝１のフレームには１の数字を、時間ｔ＝５のフレームには５の数字が付してある。また、図で同じグラデーションで描いたフレームは、互いに類似しているフレーム（例えば動画の中の１シーン等）を表し、同じクラスに属するものとする。図示の例では、｛１，２，１０｝のクラス、｛３，４，５｝のクラス、｛６，７，８，９｝のクラスと、類似フレームのセットが３種類存在する。 FIG. 3 is a diagram for explaining a difference in clustering between the time-constrained affinity propagation method (TB-AP) considering the time series information used in the present embodiment and the normal affinity propagation method (AP). The time series data in the upper part of the figure represents n = 10 time series feature vectors {x _t } ¹⁰ _{t = 1} from time t = 1 to time t = 10 extracted as feature quantities. Each box here represents a frame at each time t. For example, a number 1 is assigned to a frame at time t = 1, and a number 5 is assigned to a frame at time t = 5. Also, the frames drawn with the same gradation in the figure represent similar frames (for example, one scene in a moving image) and belong to the same class. In the illustrated example, there are three types of sets of similar frames: {1,2,10} class, {3,4,5} class, {6,7,8,9} class.

これらのフレームを特徴量別にクラスター化する際、通常のアフィニティ伝播法（AP）を用いた場合には、図中左側下段に示すように、典型静止画として例えばフレーム｛２，４，７｝が選出される。しかし、これは時系列情報が考慮されておらず、動画像を検索するラベルとして適切ではない。時系列が正しく考慮されることにより、図中右側下段に示すように、典型静止画は｛２，４，７，１０｝と選出される。動画像の経過時間を考慮する時間拘束アフィニティ伝播法（TB-AP）では、そのような時系列を考慮したクラスタリングが可能となる。つまり図３の例では、時間拘束アフィニティ伝播法を適用すると、典型静止画が正しく｛（１，２，０）（１，４，１），（１，７，２），（０，１０，０）｝と選出される。 When these frames are clustered by feature, the ordinary affinity propagation method (AP) selects, for example, frames {2, 4, 7} as typical still images, as shown in the lower left of the figure. However, this selection ignores time-series information and is not suitable as a label for searching moving images. When the time series is taken into account correctly, the typical still images are selected as {2, 4, 7, 10}, as shown in the lower right of the figure. The time-constrained affinity propagation method (TB-AP), which considers the elapsed time of the moving image, makes such time-aware clustering possible. That is, in the example of FIG. 3, applying the time-constrained affinity propagation method correctly selects the typical still images {(1, 2, 0), (1, 4, 1), (1, 7, 2), (0, 10, 0)}.

なお、これらの典型静止画のフレーム数は動画像の長さやコンテンツに応じて可変する。したがって、各々の典型静止画は、時間順序におけるその左右に、責任のある（responsible）フレームを含んでいる。 Note that the number of frames of these typical still images varies according to the length of the moving image and the content. Thus, each typical still image includes a responsible frame on its left and right in time order.

以下、時系列を保つ典型静止画を選出するためのアルゴリズムの一つであるTB-APのアルゴリズムについて、図４のフローチャートを参照しながら説明する。 The TB-AP algorithm, one of the algorithms for selecting typical still images that preserve the time series, is described below with reference to the flowchart of FIG. 4.

Ｓ２１０：Similarityの計算 S210: Calculation of Similarity
ここでは、時間ｔ＝１から時間ｔ＝ｎまでの動画像フレームについて、特徴量として抽出されたｎ個の時系列の特徴ベクトル｛ｘ_ｔ｝^ｎ _ｔ＝１が与えられる。前述のように、各々の特徴ベクトル｛ｘ_ｔ｝^ｎ _ｔ＝１は、合計が１となる正規化されたCSDヒストグラムである。時間ｔ＝ｉの特徴ベクトルｘ_ｉから、時間ｔ＝ｊの特徴ベクトルｘ_ｊへの関係性を表わすSimilarity測度ｓ（ｘ_ｉ，ｘ_ｊ）を、ｓ（ｉ，ｊ）として定義する。そして、特徴ベクトルｘ_ｉが特徴ベクトルｘ_ｊよりも時間ｔ＝ｋの特徴ベクトルｘ_ｋに類似している時かつその時に限り、ｓ（ｋ，ｉ）＞ｓ（ｋ，ｊ）を満たすような性質を持つものとする。 Here, n time-series feature vectors {x_t}^n_{t=1}, extracted as features of the moving image frames from time t = 1 to time t = n, are given. As described above, each feature vector x_t is a normalized CSD histogram whose components sum to 1. The Similarity measure s(x_i, x_j), which represents the relationship from the feature vector x_i at time t = i to the feature vector x_j at time t = j, is written s(i, j). It is assumed to have the property that s(k, i) > s(k, j) holds if and only if the feature vector x_i is more similar to the feature vector x_k at time t = k than the feature vector x_j is.

本実施の形態では、以下の数３の式で計算したSimilarity測度ｓ（ｉ，ｊ）を用いることとする。 In the present embodiment, the similarity measure s (i, j) calculated by the following equation (3) is used.

ただし、bar（Ｄ）（以下、数３の右辺第１項のように、記号Ｄにオーバーラインが付いたものを、本文中ではbar（Ｄ）と表記する）は、例えば入力装置４によりユーザーが設定可能な定数パラメータである。 Here, bar(D) (the symbol D with an overline, as in the first term on the right-hand side of Equation 3, is written bar(D) in the text) is a constant parameter that the user can set, for example through the input device 4.

Similarityは、「値が大きいほど関係性も大きく、値が小さいほど関係性も小さい」というルールに則った値の設定であれば、数３の式に限らずどのようなものでも構わない。また、bar（Ｄ）の選び方はTB-APの結果に影響しないので、ここでは０としておくが、当該パラメータは動画像間を比較する上では重要な要素となり、０以外の値をとりうる。 As long as the similarity is set according to the rule that “the larger the value is, the greater the relationship is, and the smaller the value is, the smaller the relationship is”, the similarity is not limited to the equation 3 but may be any type. The selection of bar (D) does not affect the result of TB-AP, and is set to 0 here. However, this parameter is an important factor in comparing moving images, and can take a value other than 0.
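The rule stated above (larger value, stronger relationship) can be illustrated with a small sketch. Equation 3 itself is not reproduced in this text, so the form used here, s(i, j) = bar(D) minus the Euclidean distance between x_i and x_j, is only an assumption that is consistent with bar(D) being the first term on the right-hand side. The function names `similarity` and `similarity_matrix` are hypothetical.

```python
import math

def similarity(x_i, x_j, d_bar=0.0):
    # Assumed form, not the verbatim Equation 3:
    # s(i, j) = d_bar - ||x_i - x_j||_2, so closer vectors score higher.
    return d_bar - math.dist(x_i, x_j)

def similarity_matrix(features, d_bar=0.0):
    # Full n x n table of pairwise similarities for a list of feature vectors.
    n = len(features)
    return [[similarity(features[i], features[j], d_bar) for j in range(n)]
            for i in range(n)]

# Three toy "histograms": frames 0 and 2 are identical, frame 1 differs.
S = similarity_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

With bar(D) = 0, identical frames get similarity 0 and all other pairs get negative values, so the ordering property required in S210 holds for this choice.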

Ｓ２２０：行列Ａを初期化 S220: Initialize matrix A
ここでは、初期値を零行列にした行列Ａ＝［ａ_ｉｊ］を用意する（ａ_ｉｊ←０）。この行列Ａは、availabilityを表す行列である。availabilityは、典型静止画候補のデータポイントｊからデータポイントｉへ送られるメッセージで、データポイントｉがデータポイントｊの典型静止画として選ぶことがどれだけ適切かを表す値である。 Here, a matrix A = [a_ij] whose initial value is the zero matrix is prepared (a_ij ← 0). The matrix A represents availability. Availability is a message sent from data point j, a typical still image candidate, to data point i, and its value expresses how appropriate it is to select data point i as the typical still image of data point j.

Ｓ２３０：行列Ａを用いて行列Ｒを算出（または更新） S230: Calculate (or update) matrix R using matrix A
ここでは先ず、スライドする時間窓の長さを、その中心が対角成分に位置するように、Ｗ＝２ω−１に設定した後、時間窓内での時系列を考慮して、特徴ベクトルｘ_ｉ，ｘ_ｊ，ｘ_ｋを選択する。 Here, the length of the sliding time window is first set to W = 2ω−1 so that its center lies on the diagonal component, and then the feature vectors x_i, x_j, and x_k are selected in consideration of the time series within the time window.

次に、responsibilityを表す行列Ｒの成分ｒ_ｉｊを、先に説明したavailability行列Ａを利用して計算更新する。responsibilityは、データポイントｉから典型静止画候補のデータポイントｊへ送られるメッセージで、データポイントｉがデータポイントｊの典型静止画としてどのくらい適切かを表す値である。このresponsibility行列Ｒの計算と、後述するavailability行列Ａの計算では、データポイントｉ，ｊを次の数４の範囲に設定して算出する。つまりTB-APでは、メッセージ交換を行なう範囲に数４に示す制限を設けている。 Next, the component r_ij of the matrix R, which represents responsibility, is computed and updated using the availability matrix A described above. Responsibility is a message sent from data point i to data point j, a typical still image candidate, and its value expresses how appropriate data point i is as the typical still image of data point j. In computing the responsibility matrix R, and the availability matrix A described later, the data points i and j are restricted to the range given by Equation 4 below. In other words, TB-AP limits the range over which messages are exchanged as shown in Equation 4.

responsibility行列Ｒは通常対称的であり、行列Ａを用いて数５に示すように更新される。 The responsibility matrix R is usually symmetric and is updated using the matrix A as shown in Equation 5.

数５において、ρ_ｉｊは、データポイントｉとデータポイントｊとの間のresponsibilityの伝播値である。また、λ∈（０，１）は設計パラメータとしてのダンピング係数であり、これを用いて後述するＳ２５０での収束を遅らせ、安定した結果を得ることができる。行列Ｒも初期値は零行列とされるが、例えばダンピング係数λが０であれば、responsibilityの伝播値ρ_ｉｊがそのまま行列Ｒの値として更新される。 In Equation 5, ρ_ij is the propagated responsibility value between data point i and data point j. λ ∈ (0, 1) is a damping coefficient, a design parameter; it slows the convergence checked in S250, described later, so that a stable result is obtained. The matrix R is also initialized to the zero matrix; if, for example, the damping coefficient λ were 0, the propagated responsibility value ρ_ij would be written into R unchanged.

Ｓ２４０：行列Ａを算出 S240: Calculate matrix A
ここでは、データポイントｉ，ｊを数４で示した範囲に制限し、次の数６を用いてavailability行列Ａの成分ａ_ｉｊを更新する。 Here, data points i and j are restricted to the range given by Equation 4, and the component a_ij of the availability matrix A is updated using Equation 6 below.

数６において、α_ｉｊは、データポイントｉとデータポイントｊとの間のavailabilityの伝播値である。 In Equation 6, α _ij is a propagation value of availability between data point i and data point j.

Ｓ２５０：行列Ａと行列Ｒの総和が収束したか否かを判定 S250: Determine whether the sum of matrix A and matrix R has converged
以上のように算出した行列Ａの成分ａ_ｉｊと行列Ｒの成分ｒ_ｉｊとを加算し、その値ａ_ｉｊ＋ｒ_ｉｊが収束の基準を満たすまで、Ｓ２３０とＳ２４０の手順を繰り返す。一方、ａ_ｉｊ＋ｒ_ｉｊの値が収束の基準を満たすと、次のＳ２６０の手順に移行する。 The component a_ij of matrix A and the component r_ij of matrix R computed as above are added, and steps S230 and S240 are repeated until the value a_ij + r_ij satisfies the convergence criterion. Once a_ij + r_ij satisfies the criterion, the process moves on to step S260.

Ｓ２６０：代表フレーム選出 S260: Select representative frames
ここでは、以下の数７に定義されるインデックスが、典型静止画のインデックスとして採用される。 Here, the index defined by Equation 7 below is adopted as the index of a typical still image.

つまり、ａ_ｉｊ＋ｒ_ｉｊの値が最大となるデータポイントでの静止画が、典型静止画として採用される。例えば、データポイントｉは、ａ_ｉｊ＋ｒ_ｉｊの値が最大となるデータポイントｊを中心とするクラスターに属する。最終的に代表フレームとして選出される典型静止画の集合は、こうしたインデックスを収集して見つけ出される。 That is, the still image at the data point where the value of a _ij + r _ij is the maximum is adopted as the typical still image. For example, the data point i belongs to a cluster centered on the data point j having the maximum value of a _ij + r _ij . A set of typical still images finally selected as representative frames is found by collecting such indexes.
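Steps S220 to S260 can be sketched roughly as follows. Equations 4 to 7 are not reproduced in this text, so the sketch makes several assumptions: the window condition is taken to be |i − j| < ω, the message updates are the standard affinity propagation updates restricted to that window, the damped update is r ← λr + (1 − λ)ρ (which matches the remark that λ = 0 writes ρ directly into R), and a fixed iteration count stands in for the convergence test of S250. The function name `tb_ap` is hypothetical, and ω is assumed to be at least 2.

```python
def tb_ap(S, omega, damping=0.5, iters=200):
    # S[i][j]: similarity, S[i][i]: preference. Returns the exemplar index per frame.
    n = len(S)
    # Assumed window: frame i exchanges messages only with j where |i - j| < omega.
    win = [[j for j in range(n) if abs(i - j) < omega] for i in range(n)]
    R = [[0.0] * n for _ in range(n)]  # responsibilities (S230)
    A = [[0.0] * n for _ in range(n)]  # availabilities (S220: zero matrix)
    for _ in range(iters):
        # S230: standard AP responsibility update, restricted to the window.
        for i in range(n):
            for k in win[i]:
                best = max((A[i][kk] + S[i][kk] for kk in win[i] if kk != k),
                           default=float("-inf"))
                rho = S[i][k] - best
                R[i][k] = damping * R[i][k] + (1 - damping) * rho
        # S240: standard AP availability update, restricted to the window.
        for k in range(n):
            pos = {i: max(0.0, R[i][k]) for i in win[k] if i != k}
            total = sum(pos.values())
            for i in win[k]:
                if i == k:
                    alpha = total
                else:
                    alpha = min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * alpha
    # S260: each frame takes the in-window index maximizing a_ij + r_ij.
    return [max(win[i], key=lambda k: A[i][k] + R[i][k]) for i in range(n)]
```

Because the final selection only ranges over the window of each frame, a frame can never be assigned an exemplar that lies ω or more frames away, which is the property that separates the time-distant frame 10 from frames 1 and 2 in the example of FIG. 3.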

上記一連のＳ２３０〜Ｓ２６０のフローの中で、時間窓を設定する上で必要なωは、行列Ａや行列Ｒを算出する際のメッセージ交換の範囲を制限するパラメータとなる。この制限下におけるメッセージ交換の様子を、図５に示す。同図において、上段の「AP」は、従来技術となるアフィニティ伝播法によるメッセージ交換の範囲を示している。また下段の「TB-AP」は、本実施の形態における時間拘束アフィニティ伝播法によるメッセージ交換の範囲を示している。ここでは、前記図３で説明したものと同じフレームが時系列に並べられているものとする。 In the above-described flow of S230 to S260, ω necessary for setting the time window is a parameter for limiting the range of message exchange when calculating the matrix A and the matrix R. FIG. 5 shows how messages are exchanged under this restriction. In the figure, “AP” in the upper part indicates a range of message exchange by the affinity propagation method as a conventional technique. The lower “TB-AP” indicates the range of message exchange by the time-constrained affinity propagation method in the present embodiment. Here, it is assumed that the same frames as described in FIG. 3 are arranged in time series.

通常の「AP」では、全てのデータポイントに対して、メッセージの交換が行われるが、本実施の形態では、時間窓で制限された近傍のデータポイントに対して、メッセージの交換が行なわれる。具体的には、図５に示す「AP」の例では、データポイントｉ＝５を中心として、データポイントｊ＝１からｊ＝１０の全範囲にデータ交換が行なわれる。一方、同図「TB-AP」の例では、パラメータω＝４に設定して、時間窓２ω―１＝７の範囲でメッセージ交換を行なう構成になっている。そのため、データポイントｉ＝５を中心として、データポイントｊ＝２からｊ＝８の範囲にメッセージ交換の範囲が制限される。 In normal “AP”, messages are exchanged for all data points, but in this embodiment, messages are exchanged for neighboring data points limited by a time window. Specifically, in the example of “AP” shown in FIG. 5, data exchange is performed over the entire range from data point j = 1 to j = 10 with data point i = 5 as the center. On the other hand, in the example of “TB-AP” in the figure, the parameter ω = 4 is set, and the message exchange is performed within the range of the time window 2ω-1 = 7. Therefore, the range of message exchange is limited to the range of data points j = 2 to j = 8 with data point i = 5 as the center.

図６は、本実施の形態のTB-APを適用したメッセージ交換のリンク行列を示している。ここでもパラメータω＝４に設定しており、データポイントｉを１から１０の何れかに選択したときに、どのデータポイントｊとの間でメッセージ交換が行なわれるのかを塗り潰して示している。 FIG. 6 shows a link matrix for message exchange to which the TB-AP of this embodiment is applied. Again, parameter ω = 4 is set, and when data point i is selected from any one of 1 to 10, it is shown with which data point j message exchange is performed.
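A link matrix of this kind can be generated in a few lines. The sketch below assumes that the restriction of Equation 4 amounts to |i − j| < ω, which is inferred from the window length W = 2ω − 1 and from the ω = 4 example above rather than taken from the equation itself. The function name `link_matrix` is hypothetical.

```python
def link_matrix(n, omega):
    # True where frames i and j (0-based) may exchange messages,
    # assuming the window condition |i - j| < omega.
    return [[abs(i - j) < omega for j in range(n)] for i in range(n)]

links = link_matrix(10, 4)
# Partners of data point i = 5, reported with 1-based frame numbers.
partners_of_5 = [j + 1 for j, ok in enumerate(links[4]) if ok]
# Total number of linked (i, j) pairs, which grows like omega * n rather than n * n.
total_links = sum(sum(row) for row in links)
```

For n = 10 and ω = 4 this reproduces the range j = 2 to j = 8 described for data point i = 5.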

通常のAPの特性として、メッセージ交換を行なったデータポイントのみが典型静止画の候補として最終的に選出される。これは、本実施の形態のTB-APでも同じことがいえる。本実施の形態のTB-APでは、メッセージを交換する範囲を時系列を考慮して制限しているため、結果としてメッセージ交換を行なわない時系列的に遠いデータポイント（例えば、図３に示す時間ｔ＝１０）の典型静止画も、クラスタリングにより代表フレームの一つとして選出できる。 As a characteristic of a normal AP, only data points for which messages have been exchanged are finally selected as typical still image candidates. The same can be said for the TB-AP of the present embodiment. In the TB-AP of the present embodiment, the range for exchanging messages is limited in consideration of time series. As a result, data points that are distant in time series not exchanging messages (for example, the time shown in FIG. 3). A typical still image at t = 10) can also be selected as one of the representative frames by clustering.

また、TB-APにおける典型静止画の採用に関して、連続する類似フレームについては時間窓を越えても同じ典型静止画をとることができる。具体的には、いま第ｉフレームに対する典型静止画ｅ_ｉを決定するとする。このとき、第ｉフレームより前にある第ｋフレームから第ｉ−１フレームまでの特徴ベクトルが、閾値などを用いて類似していると判定され、そして第ｋフレームに対する典型静止画ｅ_ｋと第ｉフレームに対する典型静止画ｅ_ｉも類似していると判定される場合、新たな典型静止画ｅ_ｉを生成せずに典型静止画ｅ_ｋをそのまま用いることができる。これによって、本発明の性能を向上させることができる。 Regarding the adoption of typical still images in TB-AP, consecutive similar frames can share the same typical still image even beyond the time window. Specifically, suppose the typical still image e_i for the i-th frame is being determined. If the feature vectors from the k-th frame, which precedes the i-th frame, to the (i−1)-th frame are judged to be similar, for example by a threshold, and the typical still image e_k for the k-th frame is also judged to be similar to the typical still image e_i for the i-th frame, then e_k can be used as is without generating a new typical still image e_i. This improves the performance of the present invention.

以上のように、本実施の形態では上記TB-APに基づく処理を類似動画像検索処理装置２が実行することにより、時系列を考慮した典型静止画の集合を、クエリ動画像および対象動画像よりそれぞれ抽出することができる。そしてこの抽出された代表フレームの特徴量を、当該動画像の特徴量として決定する。特に、本実施の形態で採用するTB-APでは、選択したデータポイントと共にスライドする時間窓を設けているので、メッセージ交換の回数が制限され、メッセージ伝達の複雑さはＯ（ωｎ）となる。一方、通常のAPでの複雑さはＯ（ｎ^２）である。つまり、APの計算コストはデータポイントの数の２乗に比例するため、データポイントの数が多くなると計算時間を要するといった問題点がある。それに対して、本実施の形態では一般的にω＜ｎであるため、TB-APはAPよりも計算コストを低く抑えることができる。 As described above, in the present embodiment, the similar moving image search processing device 2 executes the processing based on the TB-AP, so that a set of typical still images taking time series into consideration is obtained as a query moving image and a target moving image. Respectively. Then, the extracted feature amount of the representative frame is determined as the feature amount of the moving image. In particular, the TB-AP employed in this embodiment has a time window that slides with the selected data point, so the number of message exchanges is limited, and the complexity of message transmission is O (ωn). On the other hand, the complexity of a normal AP is O (n ² ). In other words, since the calculation cost of the AP is proportional to the square of the number of data points, there is a problem that a calculation time is required when the number of data points increases. On the other hand, in the present embodiment, generally, ω <n, so that TB-AP can reduce the calculation cost lower than AP.

なお、類似度に関するデータの正規化と距離測度に関して、上記Similarity測度ｓ（ｉ，ｊ）は、アルゴリズムが収束すればどのようなものを用いてもよい。パラメータλを適切に設定すれば、数３におけるSimilarity測度ｓ（ｉ，ｊ）の形式は、アルゴリズムの収束をもたらす。 Regarding the normalization of the data related to the similarity and the distance measure, the Similarity measure s (i, j) may be any as long as the algorithm converges. If the parameter λ is set appropriately, the form of the Similarity measure s (i, j) in Equation 3 results in convergence of the algorithm.

bar（Ｄ）の基準となる選択肢は次の通りである：
（ａ）全データ距離の平均値：これは、データのサイズが小さい場合にのみ許される。
（ｂ）各々の特徴ベクトルｘ_ｔが各成分の総和が１になる非負成分だけを有するように正規化されている場合、bar（Ｄ）は√２，√（（ｄ−２）／（２ｄ）），または√（１−（１／ｄ））のいずれかである：
ただし、ｄはシンプレックスに存在する特徴ベクトルｘ_ｔの次元を表わす。√２はシンプレックスのエッジであり、√（（ｄ−２）／（２ｄ））≒１／√２は内球の半径であり、√（１−（１／ｄ））≒１は外球の半径を表す。なお、≒は近似記号を意味し、ｄが十分大きい場合に当該近似が有効となる。
また、以上の基準の他にシステムの用途に応じた最適な値を実験的に求めることもできる。
The candidate reference values for bar(D) are as follows:
(a) The mean of all data distances. This is acceptable only when the data size is small.
(b) When each feature vector x_t is normalized so that it has only non-negative components that sum to 1, bar(D) is one of √2, √((d−2)/(2d)), or √(1−(1/d)).
Here, d is the dimension of the feature vector x_t, which lies on a simplex. √2 is the edge of the simplex, √((d−2)/(2d)) ≈ 1/√2 is the radius of the inner sphere, and √(1−(1/d)) ≈ 1 is the radius of the outer sphere. The symbol ≈ denotes approximation, which is valid when d is sufficiently large.
Besides these criteria, an optimal value suited to the intended use of the system can also be determined experimentally.

例えば、前述のCSDではHSV（色相−彩度−明度）色空間表現を用いており、Ｈ，Ｓ，Ｖの各値の比率を、Ｈ：Ｓ：Ｖ＝１２：８：８に量子化するのが好ましいとされる。この場合、各画像は１２×８×８＝７６８色で表現され、CSDによる特徴量は768次元のヒストグラムとなるので、ｄ＝768となり、外球の半径を表す値をbar（Ｄ）として用いる。 For example, the CSD described above uses the HSV (hue, saturation, value) color space, and it is considered preferable to quantize H, S, and V in the ratio H:S:V = 12:8:8. In this case each image is represented with 12 × 8 × 8 = 768 colors and the CSD feature is a 768-dimensional histogram, so d = 768, and the value representing the radius of the outer sphere is used as bar(D).
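As a quick numerical check, the three candidate values of option (b) can be evaluated for d = 768. The sketch below simply evaluates the expressions given in the text; the function name `d_bar_candidates` and the dictionary keys are hypothetical.

```python
import math

def d_bar_candidates(d):
    # The three reference values listed for simplex-normalized feature vectors.
    return {
        "edge": math.sqrt(2.0),
        "inner_radius": math.sqrt((d - 2) / (2 * d)),
        "outer_radius": math.sqrt(1 - 1 / d),
    }

c = d_bar_candidates(768)
```

For d = 768 the inner-radius expression is within about 0.001 of 1/√2 and the outer-radius expression is within about 0.001 of 1, which illustrates why the approximations are described as valid for large d.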

次に、TB-APの代替手法として、動画像の経過時間の要素を考慮したクラスタリング手法について説明する。始めに改良したk-means法（ｋ平均法）やharmonic competition法について、図７を参照しながら説明する。図７左側には、TB-APにおける時間軸上でスライドする時間窓Ｗが示されており、図７右側には、改良したk-means法やharmonic competition法における時間軸上でスライドする時間窓Ｗが示されている。 Next, clustering methods that take the elapsed time of the moving image into account are described as alternatives to TB-AP. First, the improved k-means method and the harmonic competition method are described with reference to FIG. 7. The left side of FIG. 7 shows the time window W that slides along the time axis in TB-AP, and the right side of FIG. 7 shows the time window W that slides along the time axis in the improved k-means method and the harmonic competition method.

TB-APは、時間軸に沿って重なり合うようにスライドする時間窓Ｗ中の一連の画像フレームを計算対象とする。これに対して、代替技術である改良したk-meansやharmonic competitionでは、重なりの無い時間窓Ｗの画像フレームを計算対象としている。つまりTB-APでは、時間窓Ｗのずらし幅が時間窓Ｗのフレーム長さよりも小さく、一乃至複数フレームずつ時間窓Ｗがスライディングするのに対し、改良したk-meansやharmonic competitionでは、時間窓Ｗのずらし幅と時間窓Ｗのフレーム長さが一致し、時間窓Ｗの単位で時間窓Ｗがスライディングしている。このような手法を時間拘束（Time-Bound）に対して時間区分（Time-Partition）と記載し、前記改良したk-meansを時間区分ｋ平均法（Time-Partition k-means：TP k-means）と適宜記載する。 TB-AP uses a series of image frames in a time window W that slides so as to overlap along a time axis as a calculation target. On the other hand, in the improved k-means and harmonic competition, which are alternative techniques, image frames in the time window W without overlap are targeted for calculation. That is, in TB-AP, the shift width of the time window W is smaller than the frame length of the time window W, and the time window W slides by one or more frames, whereas in the improved k-means and harmonic competition, the time window The shift width of W matches the frame length of the time window W, and the time window W is sliding in units of the time window W. Such a method is described as time-partition with respect to time-bound, and the improved k-means is referred to as time-partition k-means (TP k-means). ) As appropriate.

改良したk-means（TP k-means）やharmonic competitionを用いた場合、時間軸に沿って画像フレームを時間窓Ｗに分割し、分割された時間窓Ｗの中でクラスタリングがなされる。ここで得られる代表点ベクトルは、元の画像そのものに対応するものではなく、k-meansやharmonic competitionに基づく計算により生成された類似画像（人工画像）の特徴ベクトルである。したがって、その代表点ベクトルと最も類似した特徴を持つフレームを、典型静止画として選出するという、二段階の計算手順が必要となる。この二段階の計算により得られた典型静止画は、TB-APの典型静止画と近似している。 When improved k-means (TP k-means) or harmonic competition is used, an image frame is divided into time windows W along the time axis, and clustering is performed in the divided time windows W. The representative point vector obtained here does not correspond to the original image itself, but is a feature vector of a similar image (artificial image) generated by calculation based on k-means or harmonic competition. Therefore, a two-step calculation procedure is required in which a frame having features most similar to the representative point vector is selected as a typical still image. The typical still image obtained by this two-stage calculation is close to the typical still image of TB-AP.

以下にTP k-meansの処理フローについて、図８を参照して説明する。始めに時間軸に沿って時間窓Ｗで動画像を分割し、時間窓Ｗで区切られた複数のフレームに区分けする。区分けされた時間窓Ｗ内でフレームをｋ等分し、それぞれのクラスターにフレームを割り振る。続いて、それぞれのクラスターに属するフレームの特徴ベクトルを基にクラスターの重心を求める。重心は特徴ベクトルの算術平均により求めることができる。各フレームについて、フレームの特徴ベクトルと、時間窓内全ての重心との間の距離を計算し、その結果を基に再クラスタリングする。再クラスタリング後、再度クラスターの重心を求め、時間窓内全ての重心と各フレームとの間の距離の再計算と再クラスタリングを、処理が収束するまで行なう。各クラスターにおいて、収束後の重心に最も近いフレームを典型静止画として選出する。この処理を、時間窓を移動させつつ行なう。 The processing flow of TP k-means is described below with reference to FIG. 8. First, the moving image is divided along the time axis by the time window W into groups of frames delimited by the window. Within each time window W, the frames are divided into k equal parts and assigned to the corresponding clusters. Next, the centroid of each cluster is computed from the feature vectors of the frames belonging to it; the centroid is the arithmetic mean of those feature vectors. For each frame, the distances between its feature vector and all centroids in the time window are computed, and the frames are reclustered based on the result. After reclustering, the cluster centroids are computed again, and the recalculation of the distances between all centroids in the time window and each frame, followed by reclustering, is repeated until the process converges. In each cluster, the frame closest to the converged centroid is selected as the typical still image. This process is carried out while moving the time window.
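The flow above can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance is used as the distance between feature vectors, an iteration cap replaces an explicit convergence criterion, and the names `mean`, `kmeans_window`, and `tp_kmeans` are hypothetical.

```python
import math

def mean(vectors):
    # Component-wise arithmetic mean of equal-length vectors (the centroid).
    return [sum(c) / len(vectors) for c in zip(*vectors)]

def kmeans_window(frames, k, iters=100):
    # k-means inside one window; returns local indices of the typical still images.
    n = len(frames)
    k = min(k, n)
    labels = [i * k // n for i in range(n)]  # initial split into k equal time slices
    for _ in range(iters):
        cents = {c: mean([frames[i] for i in range(n) if labels[i] == c])
                 for c in set(labels)}
        new = [min(cents, key=lambda c: math.dist(frames[i], cents[c]))
               for i in range(n)]
        if new == labels:  # converged: no frame changed cluster
            break
        labels = new
    reps = []
    for c in sorted(set(labels)):
        members = [i for i in range(n) if labels[i] == c]
        cen = mean([frames[i] for i in members])
        # Second stage: pick the real frame nearest to the artificial centroid.
        reps.append(min(members, key=lambda i: math.dist(frames[i], cen)))
    return sorted(reps)

def tp_kmeans(frames, window, k):
    # Non-overlapping windows of `window` frames, each clustered independently.
    reps = []
    for start in range(0, len(frames), window):
        local = kmeans_window(frames[start:start + window], k)
        reps.extend(start + i for i in local)
    return reps
```

The last step inside `kmeans_window` corresponds to the two-stage procedure mentioned earlier: the centroid is an artificial vector, so the frame most similar to it is chosen as the typical still image.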

次に、k-meansを用いた他のクラスタリング法である時間分割ｋ平均法（Time-Split k-means：TS k-means）について、図９Ａ〜図９Ｅを参照して説明する。なお、これらの各図において、１〜25の連続する数字は、時系列な動画像フレームの時間ｔに相当する。前記TP k-meansでは、あらかじめ決められたサイズの時間窓Ｗで動画像を区切り、その時間窓Ｗの中でk-meansに基づくクラスタリングを行なっている。これに対して、このTS k-meansは、動画像の全フレームに対してk-meansに基づくクラスタリングを行なう手法である。ただし、時系列を考慮したフレームの割り振りとクラスターの再分割を行なう点が、一般的なk-meansと異なる。 Next, time-split k-means (TS k-means), which is another clustering method using k-means, will be described with reference to FIGS. 9A to 9E. Note that, in each of these drawings, consecutive numbers 1 to 25 correspond to time t of time-series moving image frames. In the TP k-means, a moving image is divided by a time window W of a predetermined size, and clustering based on k-means is performed in the time window W. In contrast, TS k-means is a technique for performing clustering based on k-means for all frames of a moving image. However, it differs from general k-means in that frame allocation and cluster subdivision are performed in consideration of time series.

TS k-meansでは以下のような処理が行なわれる。始めに、動画像の全フレームを時系列に基づいてｋ等分し、それぞれのクラスターにフレームを割り振る。続いて、それぞれのクラスターに属するフレームの特徴ベクトルを基にクラスターの重心を求める。重心は特徴ベクトルの算術平均により求めることができる。例として図９Ａでは、全部で25個の動画像フレームを時系列に５等分し（ｋ＝５）、曲線で囲まれたそれぞれのクラスターで重心１〜重心５を算出している。次に各フレームについて、フレームの特徴ベクトルと、現在属するクラスターの重心、及び、時系列の近い前後のクラスターの重心との間の距離を計算し、その結果を基に再クラスタリングする。これは図９Ｂに示すように、例えば時間ｔ＝18のフレームについて、現在属するクラスターの重心４と、時系列的に隣接する前後のクラスターの重心３及び重心５との間の距離をそれぞれ計算する。その結果、このフレームは重心３との間の距離が最も近いので、再クラスタリングでは図９Ｃに示すように、重心４ではなく重心３のクラスターに割り振られる。 TS k-means proceeds as follows. First, all frames of the moving image are divided into k equal parts in time order, and the frames are assigned to the corresponding clusters. Next, the centroid of each cluster is computed from the feature vectors of the frames belonging to it; the centroid is the arithmetic mean of those feature vectors. In the example of FIG. 9A, 25 moving image frames in total are divided into five equal parts in time order (k = 5), and centroids 1 to 5 are computed for the clusters enclosed by curves. Then, for each frame, the distances between its feature vector and the centroid of its current cluster and the centroids of the temporally adjacent preceding and following clusters are computed, and the frames are reclustered based on the result. As shown in FIG. 9B, for the frame at time t = 18, for example, the distances to centroid 4 of its current cluster and to centroids 3 and 5 of the temporally adjacent clusters are computed. Because this frame is closest to centroid 3, reclustering assigns it to the cluster of centroid 3 rather than centroid 4, as shown in FIG. 9C.

次に、再クラスタリングされた各クラスター内で時系列の不連続性を判断し、不連続がある場合にはクラスターを分割する。これは図９Ｄに示すように、例えば重心３と重心４のクラスターについて、これらのクラスターが時系列的に不連続なフレームの間で分割される。クラスターが分割された場合にはクラスターの数はｋより増えるが、再クラスタリングによりクラスターが消滅した場合には数が減ることもありうる。再クラスタリング及び分割後、再度クラスターの重心を求め、重心の属するクラスター及び、時系列が前後のクラスターに属する各フレームとの間の距離の再計算と再クラスタリングを、処理が収束するまで行なう。図９Ｅに示す例では、重心３と重心４のクラスターを分割した後に、７つのクラスターでそれぞれ重心１〜重心７を算出し、時間ｔ＝１〜25の各フレームについて、上述した距離の計算と再クラスタリングを収束するまで繰り返す。そして、各クラスターにおいて収束後の重心に最も近いフレームを典型静止画として選出する。 Next, each reclustered cluster is checked for discontinuities in the time series, and a cluster that contains a discontinuity is split. As shown in FIG. 9D, for example, the clusters of centroid 3 and centroid 4 are split between the frames where the time series is discontinuous. Splitting increases the number of clusters beyond k, while the number can also decrease when a cluster disappears through reclustering. After reclustering and splitting, the cluster centroids are computed again, and the recalculation of the distances between each centroid and the frames belonging to its own cluster and to the temporally preceding and following clusters, followed by reclustering, is repeated until the process converges. In the example shown in FIG. 9E, after the clusters of centroid 3 and centroid 4 are split, centroids 1 to 7 are computed for the seven clusters, and the distance calculation and reclustering described above are repeated for each frame at time t = 1 to 25 until convergence. Finally, in each cluster the frame closest to the converged centroid is selected as the typical still image.
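The TS k-means procedure can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure of FIGS. 9A to 9E: Euclidean distance is used, an iteration cap replaces an explicit convergence criterion, and splitting is implemented by relabeling every maximal run of equal labels as its own cluster. The names `mean`, `split_runs`, and `ts_kmeans` are hypothetical.

```python
import math

def mean(vectors):
    # Component-wise arithmetic mean of equal-length vectors (the centroid).
    return [sum(c) / len(vectors) for c in zip(*vectors)]

def split_runs(labels):
    # Relabel so that each maximal run of equal labels is its own cluster,
    # numbered in time order. This splits temporally discontinuous clusters.
    out, cur = [], -1
    for i, lab in enumerate(labels):
        if i == 0 or lab != labels[i - 1]:
            cur += 1
        out.append(cur)
    return out

def ts_kmeans(frames, k, iters=100):
    n = len(frames)
    k = min(k, n)
    labels = [i * k // n for i in range(n)]  # k equal parts in time order
    for _ in range(iters):
        m = labels[-1] + 1  # current number of clusters
        cents = [mean([frames[i] for i in range(n) if labels[i] == c])
                 for c in range(m)]
        new = []
        for i, lab in enumerate(labels):
            # Only the current cluster and its temporal neighbours are candidates.
            cands = [c for c in (lab - 1, lab, lab + 1) if 0 <= c < m]
            new.append(min(cands, key=lambda c: math.dist(frames[i], cents[c])))
        new = split_runs(new)
        if new == labels:
            break
        labels = new
    reps = []
    for c in range(labels[-1] + 1):
        members = [i for i in range(n) if labels[i] == c]
        cen = mean([frames[i] for i in members])
        reps.append(min(members, key=lambda i: math.dist(frames[i], cen)))
    return labels, reps
```

Because `split_runs` keeps labels contiguous and ordered in time, the neighbours of cluster c are always clusters c − 1 and c + 1, which keeps the candidate set for each frame small.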

続いて、改良したペアワイズ・ニアレスト・ネイバー法（pairwise nearest neighbor：PNN）を用いたクラスタリング手法について説明する。PNNはまず、与えられた全てのデータポイント間の距離を計算する。次に、距離の近いデータポイント同士から順に併合する。これを併合後のデータポイントが全て一定の距離以上になるまで繰り返していく。これにより、類似度の高いデータポイント群を一つの集合としてまとめたクラスター群を求めるアルゴリズムである。通常のPNNについては「W. Equitz, “A new vector quantization clustering algorithm,” IEEE Trans ASSP, vol. 37, pp. 1568-1575, 1989」などに記載されている。通常のPNNでは、ペアの特徴量の平均値がそのペアを表すものとして用いられるが、本実施の形態で説明する改良したPNNでは、後述するようにペアの一方を残し、他方を除去する点が異なる。また、時系列を考慮した時間窓Ｗの設定のやり方に応じて、Time-Bound PNN（TB-PNN）とTime-Partition PNN（TP-PNN）とがある。 Next, a clustering method using an improved pairwise nearest neighbor (PNN) method will be described. PNN first calculates the distance between all given data points. Next, data points are merged in order from the closest data points. This is repeated until all the data points after merging are more than a certain distance. This is an algorithm for obtaining a cluster group in which data points having high similarity are collected as one set. The normal PNN is described in “W. Equitz,“ A new vector quantization clustering algorithm, ”IEEE Trans ASSP, vol. 37, pp. 1568-1575, 1989”. In an ordinary PNN, the average value of feature values of a pair is used to represent the pair. However, in the improved PNN described in the present embodiment, one of the pairs is left and the other is removed as described later. Is different. Further, there are a Time-Bound PNN (TB-PNN) and a Time-Partition PNN (TP-PNN) depending on how the time window W is set in consideration of the time series.

図１０は、改良したPNNの併合を説明する図である。図中、○や●は各々のフレームの特徴量に対応したデータポイントｘ_ｉを表し、■は全データポイントの重心を表す。PNNでは、与えられた全データポイント間の距離を計算した後、それらのデータポイントの組み合わせの中で、最も距離の近いペアを見つける。図１０に示す例では、ａとｂのデータポイントが、最も距離の近いペアに相当する。次に、そのペアを併合する。本実施の形態に係る改良されたPNNにおける「併合」とは、全データポイントの重心に近いものをそのまま残し、遠いものを削除することをいう。図１０に示す例では、ｂのデータポイントを削除して、ａのデータポイントを残す。なお、通常のPNNでは、ａのデータポイントとｂのデータポイントの重心（中点）を計算し、それをａとｂの代表値としている。その後、残ったデータポイント間の距離が一定値（δ）以上になるまで繰り返し、残ったデータポイントに対応するフレームが典型静止画となる。それぞれの典型静止画は、その典型静止画に併合されたフレームを、それぞれのメンバーであるとして扱う。 FIG. 10 is a diagram explaining merging in the improved PNN. In the figure, ○ and ● represent the data points x_i corresponding to the features of the individual frames, and ■ represents the centroid of all data points. PNN first computes the distances between all given data points and then finds the closest pair among all combinations of those points. In the example of FIG. 10, data points a and b form the closest pair. That pair is then merged. In the improved PNN of this embodiment, "merging" means keeping the point of the pair that is closer to the centroid of all data points and deleting the one that is farther. In the example of FIG. 10, data point b is deleted and data point a is kept. In ordinary PNN, by contrast, the centroid (midpoint) of data points a and b is computed and used as the representative value of a and b. This is repeated until the distances between the remaining data points are at least a fixed value (δ), and the frames corresponding to the remaining data points become the typical still images. Each typical still image treats the frames merged into it as its members.
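The merging rule of the improved PNN can be sketched as follows, without any time window. The sketch assumes Euclidean distance and computes the centroid of all data points once, before any merge; the function name `improved_pnn` and the returned `members` mapping are hypothetical.

```python
import math

def improved_pnn(points, delta):
    # Returns (survivors, members): indices of the kept data points and,
    # for each survivor, the indices of the frames merged into it.
    centroid = [sum(c) / len(points) for c in zip(*points)]
    alive = list(range(len(points)))
    members = {i: [i] for i in alive}
    while len(alive) > 1:
        # Closest pair among the remaining data points.
        d, a, b = min((math.dist(points[a], points[b]), a, b)
                      for x, a in enumerate(alive) for b in alive[x + 1:])
        if d >= delta:  # all remaining points are at least delta apart
            break
        # "Merge": keep the point nearer the global centroid, delete the other.
        keep, drop = sorted((a, b), key=lambda i: math.dist(points[i], centroid))
        members[keep].extend(members.pop(drop))
        alive.remove(drop)
    return alive, members

survivors, members = improved_pnn([[0.0], [0.2], [4.0], [4.1], [9.0]], delta=1.0)
```

Because one real data point of each pair is kept, every survivor corresponds directly to an existing frame, so no second step is needed to map an averaged vector back to a frame.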

図１１は、上述のTP-PNNを説明する図である。TP k-meansと同様に、改良したTP-PNNでは、あらかじめ決められたサイズの時間窓Ｗで動画像を区切り、その時間窓Ｗの中でPNNに基づくクラスタリングを行なっている。このとき通常のPNNと異なる点は、まず時間窓Ｗごとに重心（各フレームの特徴量の算術平均）を求め、２つのデータポイント(フレームの特徴量)を併合する際に、その重心に距離の近いデータポイントをそのまま残し、遠いデータポイントを削除することである。この処理を隣接するペアの距離が所定の値δより大きくなるまで繰り返し、残ったデータポイントに対応するフレームを典型静止画とする。各典型静止画には、それまでに併合によって削除されたフレーム群が属する。TP-PNNでは、この処理を各時間窓（重複しない）に対して行なう。 FIG. 11 is a diagram for explaining the above-described TP-PNN. Similar to TP k-means, in the improved TP-PNN, a moving image is divided by a time window W having a predetermined size, and clustering based on the PNN is performed in the time window W. At this time, the difference from normal PNN is that the center of gravity (arithmetic average of feature values of each frame) is first calculated for each time window W, and the distance to the center of gravity when two data points (frame feature values) are merged. The data points that are close to each other are left as they are, and the data points that are far are deleted. This process is repeated until the distance between adjacent pairs becomes greater than a predetermined value δ, and the frame corresponding to the remaining data point is set as a typical still image. Each typical still picture belongs to a group of frames deleted by merging so far. In TP-PNN, this process is performed for each time window (not overlapping).

これに対し、TB-PNNでは図１２に示すように、TB-APと同様にスライドする時間窓の制限をかけ、各データポイントの時間窓の範囲のみとの距離計算を行なう。そして、時間窓により制限されたデータポイントの組み合わせのうち、距離の近いものから順にデータポイントの併合を行なう。ただし、併合を行なう際、TP-PNNと異なり、動画フレーム全体の重心に近いものをそのまま残し、遠いものを削除する。この処理を隣接するペアの距離が所定の値δ以上となるまで繰り返す。これもTP-PNNと同様に残ったデータポイントに対応するフレームを典型静止画とする。各典型静止画には、それまでに併合によって削除されたフレーム群が属する。 On the other hand, as shown in FIG. 12, TB-PNN limits the sliding time window in the same manner as TB-AP, and calculates the distance from only the time window range of each data point. Then, the data points are merged in order from the shortest distance among the combinations of data points limited by the time window. However, when merging, unlike TP-PNN, the one close to the center of gravity of the entire video frame is left as it is, and the one far away is deleted. This process is repeated until the distance between adjacent pairs becomes a predetermined value δ or more. Similarly to TP-PNN, a frame corresponding to the remaining data point is set as a typical still image. Each typical still picture belongs to a group of frames deleted by merging so far.

以上説明したように、動画像の経過時間の要素を考慮した時間的な制限を導入して１乃至複数の代表フレームを取得することで、通常のAPよりも検索精度を上げることができる。また、通常のk-means及びPNNの各手法に対しても同様の結果が期待できる。TB-APでは、時間窓の中の一連の画像フレームを計算対象として、そこから一段階で代表フレームを選出することが可能となり、計算手順を簡素化できる。他のTP-k-means、TS k-means、TB-PNN、TP-PNNの４手法は、TB-APに比べ計算量が少ないという利点がある。精度に関して、TB-APより若干劣る場合もあるが些少であり、いずれも高い精度を保つことが可能である。 As explained above, acquiring one or more representative frames under a temporal constraint that takes the elapsed time of the moving image into account raises search accuracy above that of ordinary AP. Similar results can be expected relative to the ordinary k-means and PNN methods. TB-AP takes the series of image frames in the time window as its calculation target and can select representative frames from them in a single stage, which simplifies the calculation procedure. The other four methods, TP k-means, TS k-means, TB-PNN, and TP-PNN, have the advantage of requiring less computation than TB-AP. Their accuracy may be slightly lower than that of TB-AP, but the difference is small, and all of them can maintain high accuracy.

上記した時系列を保つ代表フレームの選出に際して、時間窓の幅が小さすぎる場合や、動画像中でほとんど変化の無いシーンが続く場合などで、連続する類似のフレームにより必要以上に代表フレームが選出される可能性がある。このような場合には、時間順序が連続する代表フレーム同士の差を求め、その差が閾値ε以下となるような場合は、後続の代表フレームを消去し、グループをまとめることで、冗長な代表フレームを削除することができる。 When selecting representative frames that preserve the time series as described above, more representative frames than necessary may be selected from a run of similar consecutive frames, for example when the time window is too narrow or when a scene with almost no change continues in the moving image. In such cases, the difference between representative frames that are consecutive in time order is computed, and if the difference is at or below a threshold ε, the later representative frame is deleted and the groups are merged. This removes redundant representative frames.
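The post-processing step above can be illustrated with a minimal sketch. The function name, the `distance` callback, and the choice to compare each representative with the most recently kept one (rather than with its immediate predecessor in the original list) are assumptions made for illustration.

```python
def prune_redundant_representatives(reps, groups, distance, eps):
    """Merge time-consecutive representatives whose difference is <= eps.

    reps     : representative frames in time order (any objects)
    groups   : groups[i] lists the frame indices owned by reps[i]
    distance : hypothetical function returning a non-negative difference
    eps      : the threshold (epsilon in the text)
    """
    if not reps:
        return [], []
    kept_reps = [reps[0]]
    kept_groups = [list(groups[0])]
    for rep, group in zip(reps[1:], groups[1:]):
        if distance(kept_reps[-1], rep) <= eps:
            # The later representative is redundant: delete it and hand
            # its frames over to the preceding representative's group.
            kept_groups[-1].extend(group)
        else:
            kept_reps.append(rep)
            kept_groups.append(list(group))
    return kept_reps, kept_groups
```

With scalar stand-ins for frames, `prune_redundant_representatives([0.0, 0.01, 0.5, 0.52, 0.9], [[0, 1], [2], [3, 4], [5], [6]], lambda a, b: abs(a - b), 0.05)` keeps three representatives and folds the groups of the two deleted ones into their predecessors.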

・動画像間の解析による類似度の算出 Calculation of similarity by analysis between moving images
次に、異なる典型静止画の集合を有する異なる動画像同士の比較を行なうときに用いる動画像間の解析による類似度の算出手法について説明する。コンテンツに基づいた動画像間の検索をするためには、クエリの動画像とターゲットの動画像の類似度を比較する必要がある。ここでは、以下に説明する２つの方法の何れかで行なう。本実施の形態では比較対象を動画像とするが、代表フレームのようなexemplarを持つ時系列データであれば、これに限るものではない。 Next, a method is described for calculating the similarity used when comparing different moving images, each having its own set of typical still images. To perform content-based search between moving images, the similarity between the query moving image and the target moving image must be evaluated. Here, this is done by one of the two methods described below. In the present embodiment the objects compared are moving images, but the method applies equally to any time-series data that has exemplars such as representative frames.

（１）大域アライメントを求めるＭ距離を適用した類似度の比較 (1) Comparison of similarity using the M-distance for global alignment
本実施の形態では、類似度を比較するための尺度としてＭ距離（M-distance）と呼ぶ距離を用いる。大域アライメント用のＭ距離は、文字列の類似度を測るためのレーベンシュタイン距離（L-distance）や、生命情報学において大域アライメントを求めるためのNeedleman−Wunschアルゴリズムを一般化したものである。したがって、この大域アライメント用のＭ距離は、レーベンシュタイン距離や、Needleman−Wunschアルゴリズムのそれぞれを特例として含んでいる。 In this embodiment, a distance called the M-distance is used as the measure for comparing similarity. The M-distance for global alignment generalizes the Levenshtein distance (L-distance), which measures the similarity of character strings, and the Needleman-Wunsch algorithm, which finds global alignments in bioinformatics. The M-distance for global alignment therefore includes both the Levenshtein distance and the Needleman-Wunsch algorithm as special cases.

以下、大域アライメント用のＭ距離を利用したアルゴリズムについて、図１３のフローチャートを参照しながら詳しく説明する。 The algorithm using the M-distance for global alignment is described in detail below with reference to the flowchart of FIG. 13.

Ｓ３１０：クエリ動画像と対象動画像の各典型静止画を用意 S310: Prepare the typical still images of the query moving image and the target moving image
クエリ動画像に相当する動画像ｖ_Ａに対して、典型静止画の集合｛ｅ^Ａ _ｉ｝と、当該典型静止画が所有するフレーム数の集合｛Ｅ^Ａ _ｉ｝（ｉ＝１，２，…）が与えられる。また、動画像ｖ_Ａと比較される対象動画像に相当する動画像ｖ_Ｂに対しても、同様の集合｛ｅ^Ｂ _ｊ｝，｛Ｅ^Ｂ _ｊ｝（ｊ＝１，２，…）が与えられる。ここでは、数３におけるSimilarity測度ｓ（ｉ，ｊ）が、代表フレーム間類似度として選択される。 For the moving image v_A corresponding to the query moving image, a set of typical still images {e^A_i} and a set {E^A_i} (i = 1, 2, ...) of the numbers of frames owned by those typical still images are given. Similar sets {e^B_j} and {E^B_j} (j = 1, 2, ...) are given for the moving image v_B, the target moving image compared with v_A. Here, the similarity measure s(i, j) of Equation 3 is chosen as the similarity between representative frames.

Ｓ３２０：テーブルを作成し、次の動的計画法の手順に従って経路を元に辿る。 S320: Create a table and trace the path back according to the following dynamic programming procedure.
動画像ｖ_Ａの典型静止画の数をｍとし、動画像ｖ_Ｂの典型静止画の数をｎとしたときに、（ｍ＋１）×（ｎ＋１）の行列を大域アライメントのテーブルとして作成する。行列の各成分ｆ（ｉ，ｊ）は、動的計画法により次の手順で算出される。 Let m be the number of typical still images of the moving image v_A and n the number of typical still images of the moving image v_B. An (m + 1) × (n + 1) matrix is created as the global alignment table. Each component f(i, j) of the matrix is computed by dynamic programming in the following steps.

先ず、Ｓ３２１では、設計パラメータとして設定したギャップペナルティーｇが選択される。このギャップペナルティーｇを導入することで、クエリ動画像の典型静止画の数ｍと、対象動画像の典型静止画の数ｎが異なっていても、これらの時系列を考慮した典型静止画同士の類似性を、正しく判定できるようになる。 First, in S321, the gap penalty g, set as a design parameter, is selected. Introducing this gap penalty g makes it possible to judge correctly the similarity between typical still images while respecting their time order, even when the number m of typical still images of the query moving image differs from the number n of typical still images of the target moving image.

次のＳ３２２では、｛ｉ＝０｝の行と、｛ｊ＝０｝の列について、ギャップｇにフレームの総数を掛け合わせて、段階的に値が下降するように、テーブルを構成する行列の各成分を計算する。具体的には、｛ｉ＝０｝の行は、次の数８に基づいて行列成分ｆ（０，ｊ）を算出する。 In the next step S322, for the row {i = 0} and the column {j = 0}, each component of the matrix forming the table is computed by multiplying the gap penalty g by the total number of frames, so that the values decrease step by step. Specifically, for the row {i = 0}, the matrix component f(0, j) is computed from Equation 8 below.

また、｛ｊ＝０｝の列は、次の数９に基づいて行列成分ｆ（ｉ，０）を算出する。 For the column of {j = 0}, the matrix component f (i, 0) is calculated based on the following equation (9).

さらにＳ３２３では、（ｉ，ｊ）＝（１，１）の位置から開始し、｛ｉ＝０｝の行と｛ｊ＝０｝の列以外の残りの行列成分ｆ（ｉ，ｊ）を、次の数１０に基づいて算出する。 Furthermore, in S323, starting from the position of (i, j) = (1, 1), the remaining matrix components f (i, j) other than the row of {i = 0} and the column of {j = 0} are It calculates based on following Formula 10.

その後、最大値を与える成分に向かって矢印を描く。この計算は、全ての行列成分が算出されるまで続けられる。ここでｒ（ｉ，ｊ）は、典型静止画の所有度を反映する重みを表し、その値は例えば、次の数１１に基づいて算出される。 Then, an arrow is drawn toward the component that gives the maximum value. This calculation is continued until all matrix components are calculated. Here, r (i, j) represents a weight reflecting the ownership of the typical still image, and the value is calculated based on the following equation 11, for example.

数１０に示すように、ここでの行列成分ｆ（ｉ，ｊ）は、行または列の数が大きくなる順に向けて逐次的に生成される。そして、ｉ行ｊ列の行列成分ｆ（ｉ，ｊ）は、（ｉ−１）行ｊ列を経由しｉ行ｊ列に至る経路、ｉ行（ｊ−１）列を経由しｉ行ｊ列に至る経路、（ｉ−１）行（ｊ−１）列を経由しｉ行ｊ列に至る経路のうち、コストの最も大きい経路が最適経路として選択され、そのときの数値が当該行列の成分として逐次的に決められていく。 As Equation 10 shows, the matrix components f(i, j) are generated sequentially in order of increasing row or column index. For the component f(i, j) in row i and column j, three paths are considered: the path reaching (i, j) via row (i−1), column j; the path via row i, column (j−1); and the path via row (i−1), column (j−1). The path with the largest cost is selected as the optimal path, and its value is assigned to that component.
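Steps S321 to S323 can be sketched as a Needleman-Wunsch-style table fill. Equations 8 to 11 are not reproduced in this text, so the concrete forms used here are assumptions: border cells accumulate `-g` times the frame counts, a gap costs `g` times the frame count of the skipped typical still image, and the ownership weight r(i, j) is taken as the mean of the two frame counts. The input `sim` is assumed to hold similarities (larger means more alike).

```python
def global_alignment_table(sim, frames_a, frames_b, g):
    """Hypothetical sketch of the global alignment table (S321-S323).

    sim      : m x n matrix, sim[i][j] = similarity of exemplar i of v_A
               and exemplar j of v_B (larger = more similar)
    frames_a : frame counts E^A_i owned by each exemplar of v_A
    frames_b : frame counts E^B_j owned by each exemplar of v_B
    g        : gap penalty (design parameter)
    """
    m, n = len(frames_a), len(frames_b)
    f = [[0.0] * (n + 1) for _ in range(m + 1)]
    arrow = [[None] * (n + 1) for _ in range(m + 1)]

    # S322: borders decrease step by step (assumed form of Equations 8, 9).
    for j in range(1, n + 1):
        f[0][j] = f[0][j - 1] - g * frames_b[j - 1]
        arrow[0][j] = "left"
    for i in range(1, m + 1):
        f[i][0] = f[i - 1][0] - g * frames_a[i - 1]
        arrow[i][0] = "up"

    # S323: remaining cells take the maximum over the three incoming paths.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Assumed ownership weight r(i, j): mean of the frame counts.
            r = (frames_a[i - 1] + frames_b[j - 1]) / 2.0
            candidates = [
                (f[i - 1][j - 1] + r * sim[i - 1][j - 1], "diag"),
                (f[i - 1][j] - g * frames_a[i - 1], "up"),
                (f[i][j - 1] - g * frames_b[j - 1], "left"),
            ]
            f[i][j], arrow[i][j] = max(candidates, key=lambda c: c[0])
    return f, arrow
```

The `arrow` table records, for every cell, which neighbor supplied the maximum, which is what the traceback in S330 follows.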

Ｓ３３０：大域アライメントの決定 S330: Determination of the global alignment
テーブルを構成する全ての行列成分の算出が終了した後、行列ｆ（ｉ，ｊ）の最終行最終列の成分から矢印を元に辿ることで、大域アライメントの経路を得ることができる。 After all matrix components of the table have been computed, the global alignment path is obtained by tracing the arrows back from the component in the last row and last column of the matrix f(i, j).
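The traceback can be sketched independently of how the table was filled. The representation of arrows as the strings `"diag"`, `"up"`, and `"left"`, with `None` marking a cell that has no outgoing arrow, is an assumption for illustration.

```python
def trace_back(arrow, start):
    """Follow arrows from `start` until a cell without an arrow is reached.

    arrow : (m+1) x (n+1) table of "diag", "up", "left", or None
    start : (i, j) cell to start from, e.g. the last row and last column
            for a global alignment
    Returns the path as a list of (i, j) cells in forward order.
    """
    i, j = start
    path = [(i, j)]
    while arrow[i][j] is not None:
        move = arrow[i][j]
        if move == "diag":
            i, j = i - 1, j - 1
        elif move == "up":
            i -= 1
        else:  # "left"
            j -= 1
        path.append((i, j))
    path.reverse()
    return path
```

Starting from the maximum-valued cell instead of the last cell gives the corresponding path for a local alignment, provided cells reset to zero carry no arrow.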

Ｓ３４０：類似度の算出 S340: Calculation of the similarity
動画像ｖ_Ａと動画像ｖ_Ｂの類似度ｕ（Ａ，Ｂ）を、次の数１２に基づいて算出する。 The similarity u(A, B) between the moving images v_A and v_B is calculated from Equation 12 below.

ただしｈ（ｘ）は正の増加関数とし、ｆ_ｌａｓｔは当該行列の最終行最終列の数値である。また、ｗは平均化関数とする。当該平均化関数は例えば、Σ_ｉＥ_ｉ ^ＡとΣ_ｉＥ_ｉ ^Ｂとの算術平均とすることができる。 Here, h(x) is a positive increasing function, and f_last is the value in the last row and last column of the matrix. In addition, w is an averaging function; for example, it can be the arithmetic mean of Σ_i E^A_i and Σ_i E^B_i.

（２）局所アライメントを求めるＭ距離を適用した類似度の比較 (2) Comparison of similarity using the M-distance for local alignment
比較すべき動画像が互いに大きく相違している場合、上述した大域アライメントは人間の感覚からずれたものとなる。このような場合には、局所アライメントが利用できる。局所アライメントは、配列が全体としては似ておらず、部分的類似を見つけたい場合に有効である。以下に記述するアルゴリズムは、上記の大域的アライメントによる方法を、局所アライメントに適応させたもので、類似度の尺度として局所アライメント用のＭ距離を用いる。この局所アライメント用のＭ距離は、生命情報学における局所アライメントを求めるためのSmith-Watermanアルゴリズムの一般化となっている。したがって、本実施の形態で適用する局所アライメント用のＭ距離は、Smith-Watermanアルゴリズムを特例として含んでいる。 When the moving images to be compared differ greatly from each other, the global alignment described above departs from human perception. In such cases, local alignment can be used. Local alignment is effective when the sequences are not similar as a whole and partial similarities are sought. The algorithm described below adapts the global alignment method above to local alignment and uses the M-distance for local alignment as the similarity measure. This M-distance for local alignment is a generalization of the Smith-Waterman algorithm, which finds local alignments in bioinformatics. The M-distance for local alignment used in this embodiment therefore includes the Smith-Waterman algorithm as a special case.

以下、局所アライメント用のＭ距離を利用したアルゴリズムについて、図８のフローチャートを参照しながら詳しく説明する。ただし、既に説明した大域アライメント用のＭ距離と重複する手順が多いため、違いを記述するに留める。 The algorithm using the M-distance for local alignment is described in detail below with reference to the flowchart of FIG. 8. Because many steps overlap with the M-distance for global alignment already described, only the differences are given.

Ｓ３１０：クエリ動画像と対象動画像の各典型静止画を用意 S310: Prepare the typical still images of the query moving image and the target moving image
これは、大域アライメント用のＭ距離と同じ処理手順である。 This is the same procedure as for the M-distance for global alignment.

Ｓ３２０：テーブルを作成し、次の動的計画法の手順に従って経路を元に辿る。 S320: Create a table and trace the path back according to the following dynamic programming procedure.
動画像ｖ_Ａの典型静止画の数をｍとし、動画像ｖ_Ｂの典型静止画の数をｎとしたときに、（ｍ＋１）×（ｎ＋１）の行列を局所アライメントのテーブルとして作成する。行列の各成分は、動的計画法により次の手順で算出される。 Let m be the number of typical still images of the moving image v_A and n the number of typical still images of the moving image v_B. An (m + 1) × (n + 1) matrix is created as the local alignment table. Each component of the matrix is computed by dynamic programming in the following steps.

先ず、Ｓ３２１では、設計パラメータとして設定したギャップペナルティーｇが選択される。これは、大域アライメント用のＭ距離と同じ処理手順である。 First, in S321, a gap penalty g set as a design parameter is selected. This is the same processing procedure as the M distance for global alignment.

次のＳ３２２では、｛ｉ＝０｝の行と、｛ｊ＝０｝の列について、各成分をすべて０に設定する。 In next step S322, all components are set to 0 for the row of {i = 0} and the column of {j = 0}.

さらにＳ３２３では、（ｉ，ｊ）＝（１，１）の位置から開始し、｛ｉ＝０｝の行と｛ｊ＝０｝の列以外の残りの行列成分ｆ（ｉ，ｊ）を、次の数１３に基づいて算出する。 Furthermore, in S323, starting from the position of (i, j) = (1, 1), the remaining matrix components f (i, j) other than the row of {i = 0} and the column of {j = 0} are It calculates based on following Formula 13.

その後、０ではない最大値を与える成分に向かって矢印を描く。数１３の右辺では数１０と異なり、ｍａｘ記号の中の最左成分を０としている。これにより、スコア値が負になった場合に値を０にリセットし、アライメントを新しく始めることができる。 Then, an arrow is drawn toward the component that gives the maximum value that is not zero. On the right side of Equation 13, unlike the Equation 10, the leftmost component in the max symbol is 0. Thereby, when the score value becomes negative, the value can be reset to 0 and the alignment can be newly started.
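The local variant differs from the global one only in the zero border and the extra 0 inside the max, as in the Smith-Waterman algorithm. The sketch below makes the same assumptions as for the global case, since Equation 13 is not reproduced in this text: a gap costs `g` times the frame count of the skipped typical still image, and the weight r(i, j) is the mean of the two frame counts. Both forms are hypothetical.

```python
def local_alignment_table(sim, frames_a, frames_b, g):
    """Hypothetical sketch of the local alignment table (S321-S323).

    Returns (f, arrow, f_max, position of f_max).
    """
    m, n = len(frames_a), len(frames_b)
    # S322: row i = 0 and column j = 0 are all zero.
    f = [[0.0] * (n + 1) for _ in range(m + 1)]
    arrow = [[None] * (n + 1) for _ in range(m + 1)]
    f_max, f_max_pos = 0.0, (0, 0)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Assumed ownership weight r(i, j): mean of the frame counts.
            r = (frames_a[i - 1] + frames_b[j - 1]) / 2.0
            candidates = [
                (0.0, None),  # reset: a negative score restarts the alignment
                (f[i - 1][j - 1] + r * sim[i - 1][j - 1], "diag"),
                (f[i - 1][j] - g * frames_a[i - 1], "up"),
                (f[i][j - 1] - g * frames_b[j - 1], "left"),
            ]
            # max() returns the first maximal candidate, so a cell whose
            # best score is 0 keeps no arrow.
            f[i][j], arrow[i][j] = max(candidates, key=lambda c: c[0])
            if f[i][j] > f_max:
                f_max, f_max_pos = f[i][j], (i, j)
    return f, arrow, f_max, f_max_pos
```

The returned position of `f_max` is the cell from which the traceback of S330 would start.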

Ｓ３３０：局所アライメントの決定 S330: Determination of the local alignment
テーブルを構成する全ての行列成分の算出が終了した後、行列ｆ（ｉ，ｊ）の中の最大値ｆ_ｍａｘの成分から矢印を元に辿ることで、局所アライメントの経路を得ることができる。 After all matrix components of the table have been computed, the local alignment path is obtained by tracing the arrows back from the component holding the maximum value f_max in the matrix f(i, j).

Ｓ３４０：類似度の算出 S340: Calculation of the similarity
大域アライメント用のＭ距離と同じ処理手順を用い、動画像ｖ_Ａと動画像ｖ_Ｂの類似度ｕ（Ａ，Ｂ）を、数１２に基づいて算出する。 Using the same procedure as for the M-distance for global alignment, the similarity u(A, B) between the moving images v_A and v_B is calculated from Equation 12.

以上述べたように、フレームに対する特徴量の抽出、時系列を保つ代表フレームの選出、及び、動画像間の解析による類似度の算出の手順を、類似動画像検索処理装置２が処理実行して、動画像ｖ_Ａに対する個々の動画像ｖ_Ｂの類似度ｕ（Ａ，Ｂ）を算出する。これにより、クエリ動画像に類似する類似動画像を対象動画像の中から検索することができる。 As described above, the similar moving image search processing device 2 carries out the steps of extracting feature quantities from frames, selecting representative frames that preserve the time series, and calculating similarity by analysis between moving images, and thereby computes the similarity u(A, B) of each moving image v_B to the moving image v_A. In this way, moving images similar to the query moving image can be retrieved from among the target moving images.

次に、上記動画像検索装置を用いた実験の詳細と結果を述べる。 Next, the details and results of the experiment using the moving image retrieval apparatus will be described.

先ず、実験データについて、検索のエンドユーザーは人間であるので、類似度の最終判断は人間のもつ感性に強く依存する。したがって、類似度の判断は、題目にできるだけ依存しないことが望ましい。しかし、単なるセットでは、プレーンな機器でも判断するのが容易になり過ぎるので、実験では次の条件を満足するデータセットを用意した。 First, as for the experimental data, since the end user of the search is a human, the final determination of the similarity is strongly dependent on the sensitivity of the human. Therefore, it is desirable that the determination of similarity is not dependent on the subject as much as possible. However, with a simple set, it would be too easy to judge even with a plain device, so a data set that satisfies the following conditions was prepared in the experiment.

・元の動画像データは、ラベル付けされていないものとする。 The original moving image data is unlabeled.
・検索における適合率および再現率を、自動的に判断できるものとする。 Precision and recall of the search can be evaluated automatically.
・図３の右下に示すような時間に依存する性質を判別できるように、各動画像は、シーンの時間変化を捉えた情報を有するものとする。 Each moving image carries information capturing the temporal change of the scene, so that time-dependent properties such as those shown in the lower right of FIG. 3 can be distinguished.

図１４は、元の動画像データの生成過程を表したものである。グループＡの動画像は「鳥瞰図」を表し、グループＢの動画像は「鳥」を表す。実験において、これらのグループＡ，Ｂより20の動画クラスが作られる（図１５を参照）。各クラス１〜２０はそれぞれ21個の動画像を構成するので、互いに異なる合計で420個のテスト動画を使用した。 FIG. 14 shows the generation process of the original moving image data. The group A moving image represents a “bird's eye view”, and the group B moving image represents a “bird”. In the experiment, 20 animation classes are created from these groups A and B (see FIG. 15). Since each class 1 to 20 constitutes 21 moving images, a total of 420 different test videos were used.

・事前実験 Preliminary experiment
APアルゴリズムの収束はデータに依存するので、数５と数６におけるパラメータλの範囲を確認する必要がある。試行データの集合として、５集団のガウス型クラスターを生成し、さらに、それぞれの時間方向について２集団に分け、１０集団とした。図１６は、これらのデータを図示したものである。 Because the convergence of the AP algorithm depends on the data, the range of the parameter λ in Equations 5 and 6 must be confirmed. As the trial data set, five Gaussian clusters were generated, and each was further split into two groups along the time direction, giving ten groups. FIG. 16 illustrates these data.

図１６に示す試行データの集合に関して、本実施の形態で提案するTB-APと、通常のAPに対して実験を行なった。主な目的は収束のためのパラメータλの範囲と、時間窓パラメータωの効果とを確認することにある。繰り返しの実験により、パラメータλ∈［0.3，1.0−ε］の選択はデータの特徴に依存する。ただしεは正の微少量であり、図１６の実験ではε＝0.02とした。 Experiments were carried out on the trial data set of FIG. 16 with the TB-AP proposed in the present embodiment and with ordinary AP. The main purpose was to confirm the range of the parameter λ for convergence and the effect of the time window parameter ω. Repeated experiments showed that the choice of the parameter λ ∈ [0.3, 1.0 − ε] depends on the characteristics of the data. Here ε is a small positive quantity; ε = 0.02 was used in the experiment of FIG. 16.

表１は、収束の傾向と計算時間を示したものである。収束は、初期値で正規化されたａ（ｉ，ｋ）およびｒ（ｉ，ｋ）の総和により、それが10^−５未満であるか否かによりチェックされた。表１では、時間窓Ｗのパラメータωを変化させたときの典型静止画の数「exemplars」、反復回数「iterations」、メッセージ伝達数「message passing」をそれぞれ示している。表１より、例えばω＝10のとき、本実施の形態のTB-APは通常のAPのメッセージ伝達と比べて６倍高速であることがわかる。 Table 1 shows the convergence behavior and the computation time. Convergence was checked by whether the sum of a(i, k) and r(i, k), normalized by the initial value, fell below 10^−5. Table 1 lists the number of typical still images ("exemplars"), the number of iterations ("iterations"), and the number of messages passed ("message passing") as the parameter ω of the time window W is varied. Table 1 shows that, for ω = 10 for example, the TB-AP of the present embodiment is 6 times faster than ordinary AP in terms of message passing.

ω＝１の場合は、全フレームが典型静止画であることを意味する。ω＝ｎ＝100は、時間依存性のある典型静止画抽出を含まない通常のAPの場合に相当する。これは収束傾向を見るため比較例として求めた。ω＝５，１０，２０，５０とした場合を見ると、メッセージ伝達数に単調増加が認められ、これは時間窓のサイズに起因するためＣＰＵタイムに直接関わる。しかし、通常のAPのメッセージ伝達数に対して、TB-APのメッセージ伝達数の比率は小さいまま保たれているので、これらは望ましい範囲にあるといえる。 When ω = 1, it means that all frames are typical still images. ω = n = 100 corresponds to the case of a normal AP that does not include time-dependent typical still image extraction. This was obtained as a comparative example in order to see the convergence tendency. In the case of ω = 5, 10, 20, 50, a monotonic increase is observed in the number of message transmissions, which is directly related to the CPU time because of the size of the time window. However, since the ratio of the number of message transmissions of TB-AP to the number of message transmissions of normal AP is kept small, it can be said that these are in the desirable range.

準備した図１５の動画像セット（クラス１〜２０）に対して、信頼できるフレームを伴う典型静止画を見つけ出すのにTB-APを適用した。これは、各動画像に「数値的な注釈」を付与するのに重要な手順であり、図１５に示す整理されていない動画像の集合を、数値的なタグという観点から構造化することと等価である。こうしたタグは、オンラインとオフラインの何れかで算出できる。 TB-AP was applied to the prepared moving image set (classes 1 to 20) of FIG. 15 to find a typical still image with a reliable frame. This is an important procedure for assigning “numerical annotations” to each moving image. The unorganized set of moving images shown in FIG. 15 is structured in terms of numerical tags. Is equivalent. Such tags can be calculated either online or offline.

こうして、TB-APにより得られた、比較対象とする二つのクラスの動画像（動画像ｖ_Ａと動画像ｖ_Ｂ）の各典型静止画について、それらのSimilarity測度ｓ（ｉ，ｊ）から得られる距離行列の一例を図１７に示す。同図において、動画像ｖ_Ａは４つの典型静止画を有し、動画像ｖ_Ｂは３つの典型静止画を有するので、当該行列は３×４行列である。表中の数字は、動画像ｖ_Ａと動画像ｖ_Ｂとの典型静止画（代表フレーム）間の距離行列の各成分を表し、例えば動画像ｖ_Ａの代表フレーム２と、動画像ｖ_Ｂの代表フレーム３との距離は0.26となる。当該距離は各代表フレーム間の類似度に対応して数値化したもので、ここでは距離が小さいほど代表フレーム同士が類似したものとなり、反対に距離が大きいほど代表フレーム同士が互いに相違したものとなる。 FIG. 17 shows an example of the distance matrix obtained from the similarity measure s(i, j) for the typical still images, obtained by TB-AP, of moving images from the two classes being compared (moving images v_A and v_B). In the figure, the moving image v_A has four typical still images and the moving image v_B has three, so the matrix is a 3 × 4 matrix. The numbers in the table are the components of the distance matrix between the typical still images (representative frames) of v_A and v_B; for example, the distance between representative frame 2 of v_A and representative frame 3 of v_B is 0.26. The distance is a numerical value corresponding to the similarity between representative frames: the smaller the distance, the more similar the representative frames, and the larger the distance, the more they differ.

図１８は、上述した大域アライメント用のＭ距離を適用して作成したテーブルの一例である。この大域アライメント用のテーブルを作成する上で、アルゴリズムのパラメータは、bar（Ｄ）＝１／√（１−（１／ｄ）），ｄ＝３×256＝768，ｇ＝0.05にそれぞれ設定した。同図において、テーブルの右下に記載される数字26.60が最終成分の数値ｆ_ｌａｓｔであり、そこから元に辿る最適経路を大域アライメントとして網掛けで示している。その後、動画像ｖ_Ａと動画像ｖ_Ｂの類似度ｕ（Ａ，Ｂ）を求めるために、算出された数値ｆ_ｌａｓｔは平均化関数で正規化される。例えば、ギャップペナルティーｇとフレーム数の集合との積の総和で正規化する場合、図１８において、その総和は、0.05×（８＋12＋10＋７）＋0.05×（12＋７＋10）＝1.85＋1.45＝3.30であるので、正規化された類似度ｕ（Ａ，Ｂ）は26.60／（1.85＋1.45）＝8.06と算出される。また適合パターンは、当該行列の最終行最終列より矢印を辿って探すことができる。こうして、動画像ｖ_Ａと動画像ｖ_Ｂとの間で、大域アライメント用の重み付きＭ距離による適切な類似度ｕ（Ａ，Ｂ）の算出が可能となる。 FIG. 18 is an example of a table created by applying the M-distance for global alignment described above. In creating this global alignment table, the algorithm parameters were set to bar(D) = 1/√(1 − (1/d)), d = 3 × 256 = 768, and g = 0.05. In the figure, the number 26.60 at the lower right of the table is the value f_last of the final component, and the optimal path traced back from it is shown shaded as the global alignment. To obtain the similarity u(A, B) between the moving images v_A and v_B, the computed value f_last is then normalized by the averaging function. For example, when normalizing by the sum of the products of the gap penalty g and the sets of frame counts, the sum in FIG. 18 is 0.05 × (8 + 12 + 10 + 7) + 0.05 × (12 + 7 + 10) = 1.85 + 1.45 = 3.30, so the normalized similarity is u(A, B) = 26.60 / (1.85 + 1.45) = 8.06. The matching pattern can be found by following the arrows from the last row and last column of the matrix. In this way, an appropriate similarity u(A, B) between v_A and v_B can be computed with the weighted M-distance for global alignment.
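The normalization arithmetic for FIG. 18 can be checked with a few lines of code. The function name is hypothetical, and h(x) of Equation 12 is taken here to be the identity, which is an assumption; the frame counts, g, and f_last are the values quoted in the text.

```python
def normalized_similarity(f_last, frames_a, frames_b, g):
    """Normalize f_last by the sum of g times the frame counts of both videos."""
    w = g * sum(frames_a) + g * sum(frames_b)
    return f_last / w


# Values quoted for FIG. 18: g = 0.05, f_last = 26.60.
u = normalized_similarity(26.60, [8, 12, 10, 7], [12, 7, 10], 0.05)
print(round(u, 2))  # 8.06
```

The normalizer is 0.05 × 37 + 0.05 × 29 = 3.30, so the quotient is 26.60 / 3.30 ≈ 8.06.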

図１９は、上述した局所アライメント用のＭ距離を適用して作成したテーブルの一例である。アルゴリズムのパラメータは、bar（Ｄ）＝√（１−（１／ｄ）），ｄ＝３×256＝768，ｇ＝0.05と設定した。局所アライメント用のテーブルでは、大域アライメント用のテーブルでスコアがマイナスになる成分、すなわち数８や数９で算出した成分が、すべて０にリセットされることが確認できる。ここでは最大値ｆ_ｍａｘが26.95であり、その最大値ｆ_ｍａｘの成分から元に辿る最適経路を、局所アライメントとして網掛けで示している。こうして、比較する動画像が大きく異なるものであっても、動画像ｖ_Ａと動画像ｖ_Ｂとの間で、局所アライメント用の重み付きＭ距離による適切な類似度ｕ（Ａ，Ｂ）の算出が可能となる。 FIG. 19 is an example of a table created by applying the M distance for local alignment described above. The algorithm parameters were set as bar (D) = √ (1− (1 / d)), d = 3 × 256 = 768, g = 0.05. In the local alignment table, it can be confirmed that the components whose scores are negative in the global alignment table, that is, the components calculated in the equations 8 and 9, are all reset to zero. Here, the maximum value f _max is 26.95, and the optimum path traced from the component of the maximum value f _max is indicated by shading as local alignment. Thus, even if the moving images to be compared are greatly different, the appropriate similarity u (A, B) is calculated between the moving image v _A and the moving image v _B by the weighted M distance for local alignment. Is possible.

なお、大域アライメントにおける好ましい変形例として、上述したＭ距離による類似度の算出は、その計算量を削減するために、図２０に示すように、シーケンス全体の割合に対して任意の幅ｒだけを計算に用いることもできる。これは、Ｍ距離をコンテンツベースの動画像検索の類似度算出に特化させることで高速化した算出手法である。図２０において、黒く塗りつぶされた部分は、計算を省略した個所を示す。またこの図は、大域アライメントによる図１８のテーブルに対応する。この場合、計算の範囲は以下の数１４によって定められる。 As a preferred variation of the global alignment, the similarity computation based on the M-distance described above can, in order to reduce the amount of computation, be restricted to an arbitrary width r expressed as a proportion of the whole sequence, as shown in FIG. 20. This speeds up the computation by specializing the M-distance to similarity calculation for content-based moving image search. In FIG. 20, the blacked-out cells are those whose computation is omitted. The figure corresponds to the global alignment table of FIG. 18. In this case, the computation range is determined by Equation 14 below.

上記数１４において、ｍは動画像ｖ_Ａの代表フレーム数であり、ｎは動画像ｖ_Ｂの代表フレーム数である。幅ｒは、類似度比較における計算範囲の絞り具合を表すパラメータであって、計算範囲を数１４のように絞ることで、通常の全ての行列成分を算出する計算と比べて、100×（１−ｒ）^２％の計算量削減が可能となる。図２０の例では、計算範囲を絞ったにもかかわらず、計算範囲を絞っていない図１８の計算結果と同じ結果を算出していることが確認できる。 In Equation 14, m is the number of representative frames of the moving image v_A and n is the number of representative frames of the moving image v_B. The width r is a parameter expressing how tightly the computation range is narrowed in the similarity comparison. Narrowing the range as in Equation 14 reduces the amount of computation by 100 × (1 − r)^2 % compared with the ordinary computation of all matrix components. In the example of FIG. 20, it can be confirmed that the same result as in FIG. 18, where the range is not narrowed, is obtained despite the narrowed range.
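Equation 14 is not reproduced in this text, so the exact band is unknown. One band shape that is consistent with the stated saving of 100 × (1 − r)^2 % is a diagonal band in normalized coordinates, |i/m − j/n| ≤ r; the sketch below assumes that form (both function names are hypothetical) and counts the fraction of cells skipped.

```python
def in_band(i, j, m, n, r):
    """Assumed band test: cell (i, j) is computed if |i/m - j/n| <= r."""
    return abs(i / m - j / n) <= r


def skipped_fraction(m, n, r):
    """Fraction of the m x n interior cells that fall outside the band."""
    skipped = sum(
        1
        for i in range(1, m + 1)
        for j in range(1, n + 1)
        if not in_band(i, j, m, n, r)
    )
    return skipped / (m * n)


# For a large table the skipped fraction approaches (1 - r) ** 2.
print(skipped_fraction(200, 200, 0.3), (1 - 0.3) ** 2)
```

In the unit square the region with |x − y| > r consists of two triangles of total area (1 − r)^2, which is why this band shape reproduces the quoted reduction for large m and n.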

次に、前述した時間窓のパラメータωと、類似度ｕ（Ａ，Ｂ）を算出した後の再現率や適合率との関係について説明する。ここでは、全ての動画像が類似度の高い順に分類された後、再現率と適合率が次の数１５のように定義される。 Next, the relationship between the above-described time window parameter ω and the recall and relevance ratio after calculating the similarity u (A, B) will be described. Here, after all the moving images are classified in descending order of similarity, the recall rate and the matching rate are defined as in the following equation (15).

再現率（recall）：＝検索によって見つかった正解動画像の数／同じクラスに属する動画像数 Recall := (number of correct moving images found by the search) / (number of moving images belonging to the same class)
適合率（precision）：＝検索によって見つかった正解動画像の数／検索された上位動画像数 Precision := (number of correct moving images found by the search) / (number of top-ranked moving images retrieved)

再現率とは、検索要求を満たす全動画像に対しての、検索要求を満たす検索結果の割合を意味し、検索の網羅性を表す指標である。一方、適合率とは、全検索結果に対しての、検索要求を満たす検索結果の割合を表し、検索の正確さを表す指標である。一般に再現率や適合率が高いほど検索の性能が高いとみなされる。 Recall is the proportion of search results satisfying the search request among all moving images that satisfy the request, and is an index of the completeness of the search. Precision is the proportion of search results satisfying the search request among all search results, and is an index of the accuracy of the search. In general, the higher the recall and precision, the better the search performance.
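The two definitions can be sketched for a ranked result list. The function name and the representation of results as class labels are assumptions for illustration; `class_size` stands for the number of moving images that belong to the query's class among the search targets.

```python
def precision_recall_at_k(ranked_labels, query_label, class_size, k):
    """Precision and recall over the top-k results of a ranked list.

    ranked_labels : class labels of the targets, sorted by descending similarity
    query_label   : class label of the query moving image
    class_size    : number of targets belonging to the query's class
    k             : number of top-ranked results retrieved
    """
    hits = sum(1 for label in ranked_labels[:k] if label == query_label)
    precision = hits / k
    recall = hits / class_size
    return precision, recall
```

Sweeping `k` from 1 to the length of the list and plotting the resulting pairs gives a precision-recall curve of the kind discussed in the text.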

図２１は、実験で得られた適合率−再現率曲線の結果を示している。同図において、縦軸は適合率を表し、横軸は再現率を表す。曲線がグラフ中右上に行くほど、検索性能が高い。ここでは、TB-APを適用した時間窓パラメータが「ω＝５」と「ω＝10」の各実験を載せている。図中枠内に記載されるように、数１２で示した関数ｈ（ｘ）を定義し、そこで使用する係数ａをａ＝１に設定した。また比較として、通常のAPを適用した「Plain AP」の実験結果も併せて掲載した。 FIG. 21 shows the result of the precision-recall rate curve obtained in the experiment. In the figure, the vertical axis represents the precision, and the horizontal axis represents the recall. The higher the curve goes to the upper right of the graph, the higher the search performance. Here, each experiment with time window parameters “ω = 5” and “ω = 10” to which TB-AP is applied is shown. As described in the frame in the figure, the function h (x) expressed by Equation 12 was defined, and the coefficient a used therein was set to a = 1. For comparison, the experimental results of “Plain AP” using ordinary AP are also shown.

これらの実験結果から、再現率として20％までを要求する関心領域（図中「ROI」で囲んだ領域）において、本実施の形態のTB-APによる適合率は非常に高く、処理能力は満足のいくものであることが読み取れる。一方、通常のAPは時間軸を考慮していないため、本発明のTB-APと比べてその性能は劣ることが見てとれる。 From these experimental results, in the region of interest that requires a recall rate of up to 20% (the region surrounded by “ROI” in the figure), the precision of TB-AP in this embodiment is very high, and the processing capability is satisfactory. It can be read that On the other hand, since the normal AP does not consider the time axis, it can be seen that the performance is inferior to that of the TB-AP of the present invention.

次に、代表フレームを選出する際に用いたクラスタリング手法の違いによる検索結果の違いについて示す。図２２Ａおよび図２２Ｂは、前述した５つのクラスタリング手法における適合率−再現率曲線の実験結果を示している。 Next, differences in search results due to differences in clustering methods used when selecting representative frames will be described. FIG. 22A and FIG. 22B show the experimental results of the precision-recall rate curve in the five clustering methods described above.

実験データとして、互いに異なる2100個の動画像を用意した。次にこれらを21個ずつに分け、合計100個の動画セットにする。各セットにおいて、21個の動画像のうちの一つをクエリとし、このクエリの10％以上のフレームを残りの20個の動画像の中にそれぞれ一様乱数によって得られた位置に挿入した。これによりクエリを除く2000個のテスト動画を被検索対象群とした。 As experiment data, 2100 different moving images were prepared. Next, divide these into 21 pieces to make a total of 100 movie sets. In each set, one of the 21 moving images was used as a query, and 10% or more frames of this query were inserted into the remaining 20 moving images at positions obtained by uniform random numbers. As a result, 2000 test videos excluding the query were set as the search target group.
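The construction of a test video can be sketched as follows. The text only states that 10% or more of the query's frames are inserted at a uniformly random position; that the inserted frames form one contiguous segment, and how its length and start are drawn, are assumptions of this sketch, as are all names.

```python
import math
import random


def insert_query_segment(query, target, min_ratio=0.10, rng=None):
    """Insert a contiguous segment of `query` into `target` at a random position.

    query, target : lists of frames (any objects)
    min_ratio     : minimum share of the query's frames to insert (10% in the text)
    rng           : optional random.Random instance for reproducibility
    Returns (new target, insertion position, segment length).
    """
    rng = rng or random.Random()
    min_len = max(1, math.ceil(min_ratio * len(query)))
    seg_len = rng.randint(min_len, len(query))      # at least 10% of the query
    start = rng.randint(0, len(query) - seg_len)    # where the segment begins
    segment = query[start:start + seg_len]
    pos = rng.randint(0, len(target))               # uniform insertion position
    return target[:pos] + segment + target[pos:], pos, seg_len
```

Passing a seeded `random.Random` makes the generated test set repeatable between runs.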

上記の実験データに対して、フレームに対する特徴量の抽出、時系列を保つ代表フレームの選出、及び動画像間の解析による類似度の算出を行ない、クエリに関連する動画像の検索を行なった。特徴量はCSDを用い、類似度は局所アライメントを基に求めた。 With respect to the above experimental data, feature amounts for frames were extracted, representative frames for keeping time series were selected, and similarity was calculated by analysis between moving images, and a moving image related to the query was searched. The feature quantity was CSD, and the similarity was obtained based on local alignment.

図２２Ａは、クラスタリング時に冗長代表フレームの削除の事後処理を行なうことなく、そのまま類似度の算出を行なった場合の５つのクラスタリング手法における適合率−再現率曲線を表す。５つの手法とも概ね良好な結果が得られており、差はほとんど見られない。 FIG. 22A shows relevance ratio-recall ratio curves in five clustering methods when similarity is calculated without performing post-processing of redundant representative frame deletion during clustering. All of the five methods yielded good results with almost no difference.

図２２Ｂは、クラスタリング時に冗長代表フレームの削除の事後処理を行なったときの５つのクラスタリング手法における適合率−再現率曲線を表す。このとき閾値ε＝0.05としている。代表フレームの数が概ね半分に減ることで、事後処理の無いものに比べれば適合率が下がっているが差は小さい。 FIG. 22B shows relevance ratio-recall ratio curves in five clustering methods when post-processing of redundant representative frame deletion is performed during clustering. At this time, the threshold ε = 0.05. By reducing the number of representative frames to almost half, the matching rate is reduced compared to those without post processing, but the difference is small.

上記５つのクラスタリング手法では、TB-APに比べて他のTP k-means、TS k-means、TP-PNN、及びTB-PNNは計算量が少ないため、処理速度を大幅に早くできる。また、局所最適化を避けるためには、初期値の設定で工夫する必要があるが、TB-AP、TP-PNN、及びTB-PNNについては、そのアルゴリズムから不適切な局所最適化は生じない。 In the above five clustering methods, other TP k-means, TS k-means, TP-PNN, and TB-PNN have a smaller calculation amount than TB-AP, so that the processing speed can be significantly increased. Moreover, in order to avoid local optimization, it is necessary to devise by setting the initial value, but for TB-AP, TP-PNN, and TB-PNN, inappropriate local optimization does not occur from the algorithm. .

次に、別の実験例として、前記実験例において用いたCSDのかわりに、Video Signatureに係る特徴量を本発明に適用した実験例を以下に示す。Video SignatureはISO/IEC標準の中で仕様が公開されているものであり（Multimedia content description interface, International Standard of ISO/IEC 15938-3, Amendment 4 (2010)）、コンテンツの同一性を判断するための一意に識別可能な特徴量を規定している。コンテンツの複製（コピー）を高精度に検知することができるため，インターネット上の不正コピーや不正流通の検知に用いることができるものと期待されている。 Next, as another experimental example, an experimental example in which the feature amount related to the video signature is applied to the present invention instead of the CSD used in the experimental example will be described below. Video Signature is a specification published in the ISO / IEC standard (Multimedia content description interface, International Standard of ISO / IEC 15938-3, Amendment 4 (2010)), to determine the identity of content. The feature quantity that can be uniquely identified is defined. Since it is possible to detect content duplication (copying) with high accuracy, it is expected that it can be used to detect unauthorized copying and unauthorized distribution on the Internet.

Video Signatureの規格では、特徴量のデータは、フレーム単位の特徴を記述するFrame Signature、Frame Confidence及びWord、並びに連続する複数フレームから構成される区間単位の特徴を記述するBagOfWordsから構成されている。本実験例では、この中のFrame Signature（フレームシグネチャ）に着目して計算を行なっている。なお、これはFrame Signatureに限定するという意味ではなく、Video Signatureの規格にある他の特徴量を用いたり、また、先に述べたように複数の特徴量を合成特徴量ベクトルとして利用することもできる。 In the Video Signature standard, the feature amount data is composed of Frame Signature, Frame Confidence, and Word that describe the feature of each frame, and BagOfWords that describes the feature of each section composed of a plurality of continuous frames. In this experimental example, calculation is performed by paying attention to the frame signature. Note that this does not mean that it is limited to Frame Signature, but other feature quantities in the Video Signature standard can be used, or multiple feature quantities can be used as composite feature quantity vectors as described above. it can.

Frame Signature は、特徴量として、映像の各フレームを部分領域に分割し、部分領域間の輝度の差分を三値化し、380次元のベクトルとして表わすものである。このとき、フレーム内の中央領域の方を周辺領域よりも密に部分領域としてサンプリングするようにしている。本実験例では、各フレームを32×32の部分領域に分割し、ＹＣｂＣｒの色彩空間の中のＹの輝度の平均値を各部分領域の輝度情報として用いた。 The frame signature is a feature quantity that divides each frame of the video into partial areas, ternary the luminance difference between the partial areas, and represents it as a 380-dimensional vector. At this time, the central region in the frame is sampled as a partial region more densely than the peripheral region. In this experimental example, each frame was divided into 32 × 32 partial areas, and the average value of Y luminance in the color space of YCbCr was used as luminance information of each partial area.

As in the earlier clustering experiment, 2100 mutually different moving images were first prepared as experimental data. These were divided into groups of 21, giving a total of 100 video sets. In each set, one of the 21 moving images was used as the query, and at least 10% of the frames of this query were inserted into each of the remaining 20 moving images at a position obtained from a uniform random number. The 2000 test videos, excluding the queries, formed the search target group. Rather than simply inserting part of the query, the following five types of processing were applied to the moving image to be inserted in order to make it harder to retrieve.
1) Frame rate increase (frame deletion) (frame)
The frame rate is increased by deleting one frame out of every 2, 3, or 6 frames (determined by a uniform random number) of the moving image to be inserted.
2) Grayscale conversion (gray)
Each frame image of the moving image to be inserted is converted to grayscale by the formula
Y = 0.11448 × R + 0.58661 × G + 0.29891 × B
Here, R, G, and B (red, green, and blue) are the components of the color space representing each frame image, and each takes an integer value from 0 to 255.
3) Luminance reduction (br)
The R, G, and B values of each frame image of the moving image to be inserted are multiplied by a variable P representing the luminance reduction rate. The value of P is a uniform random number between 0.6 and 0.9.
4) Size change (reduction and enlargement) (resize)
The size of each frame image of the moving image to be inserted is changed by a factor taken as a uniform random number from the range 0.5 to 2.0.
5) JPEG (Joint Photographic Experts Group) conversion (jpeg)
JPEG conversion compresses an image by discarding part of the information in its frequency components.
In the experiment, the compression rate was raised to lower the quality of the moving image to be inserted. The experiment used the OpenCV image processing library, and compression was performed by setting the variable CV_IMWRITE_JPEG_QUALITY, which determines the JPEG quality, to a uniform random number between 20 and 80. CV_IMWRITE_JPEG_QUALITY takes a value from 0 to 100, with 100 giving the highest image quality.
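The random parameter choices for the five types of processing can be sketched as follows. This is an illustration under stated assumptions, not the code used in the experiment: the function names are hypothetical, frames and pixels are represented as plain Python lists and tuples, the grayscale coefficients are copied exactly as printed above, and the resizing and JPEG encoding themselves (performed with OpenCV in the experiment) are represented only by their sampled parameters.

```python
import random


def drop_frames(frames, rng):
    """Delete one frame out of every 2, 3, or 6 (period chosen uniformly)."""
    period = rng.choice([2, 3, 6])
    return [f for i, f in enumerate(frames) if (i + 1) % period != 0]


def to_gray(pixel):
    """Grayscale value using the coefficients exactly as printed in the text."""
    r, g, b = pixel
    return 0.11448 * r + 0.58661 * g + 0.29891 * b


def darken(pixel, p):
    """Multiply R, G, and B by the luminance reduction rate p."""
    r, g, b = pixel
    return (int(r * p), int(g * p), int(b * p))


def sample_parameters(rng):
    """Draw the remaining random parameters from the ranges given in the text."""
    return {
        "brightness_p": rng.uniform(0.6, 0.9),   # luminance reduction rate P
        "resize_scale": rng.uniform(0.5, 2.0),   # frame size factor
        "jpeg_quality": rng.randint(20, 80),     # value for the JPEG quality variable
    }
```

Passing a seeded `random.Random` instance as `rng` makes a set of test videos reproducible, which is a design choice of this sketch rather than something stated in the text.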

For the above experimental data, feature amounts were extracted from the frames, representative frames preserving the time series were selected, and similarities were calculated by analysis between moving images, in order to search for moving images related to the query. As a comparative example, an experiment using CSD as the feature amount was also performed.

FIGS. 23A and 23B show the precision-recall curves obtained in the experiment. FIG. 23A shows the case where CSD was used, and FIG. 23B shows the case where Frame Signature was used. TP-PNN was used to obtain the typical still images. The gap penalty was g = 0.2 in both cases. With CSD, δ = 0.1 and bar(D) = 0.05 were used with the Euclidean distance; with Frame Signature, δ = 320 and bar(D) = 15 were used with the absolute-value distance. Here, δ is the threshold of the merge condition in TP-PNN. For a fair comparison, δ was set so that the compression rate (the number of representative frames obtained) was equivalent, and the value of bar(D) was chosen so that each method gave its best result.
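The two distance measures named above can be written as follows. This is a minimal sketch with hypothetical function names; it assumes that the absolute-value distance is the sum of absolute component differences (the L1 distance), which for ternary vectors takes integer values.

```python
import math


def euclidean_distance(u, v):
    """Euclidean (L2) distance, as used here with the CSD feature amount."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def absolute_distance(u, v):
    """Sum of absolute differences (L1), assumed for Frame Signature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))
```

Because each element of a ternary vector lies in {-1, 0, +1}, the absolute-value distance between two 380-dimensional vectors is an integer between 0 and 760, which is consistent with the integer-scale values of δ and bar(D) reported for Frame Signature.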

As shown in FIG. 23A, retrieval with CSD becomes difficult when the luminance is lowered, and is almost impossible when the video is converted to grayscale. JPEG conversion also lowered the accuracy. High accuracy was maintained, however, for frame rate conversion and size conversion. In contrast, as shown in FIG. 23B, it was confirmed that with Video Frame Signature the differences in accuracy degradation across all the tested conversions were smaller than with CSD.

As described above, adopting Frame Signature has the advantage of being more robust to modifications such as luminance changes than CSD. In addition, because the data size of the feature vector is smaller than that of CSD, distance calculation at the time of comparison can be performed at high speed.

The system configuration and algorithms of the moving image search apparatus of this embodiment have been presented above. Because moving images are large in size, each typical still image and the set of frames for which it is responsible are normally calculated offline. Such sets can therefore be used as numerical labels for the structural information of a large data set. Furthermore, considering that each typical still image has frames for which it is responsible, this embodiment proposes, as the M distance, a novel similarity calculation method that includes the Levenshtein distance. The M distance can be used for both global alignment and local alignment of the selected typical still images.

In the experimental example described above that examined the differences in search results among the five clustering methods, the representative frames of the query moving image and the target moving image were selected using the same clustering method. This allows the representative frames to be selected under the same criterion and prevents the inconsistencies that would arise in the similarity comparison if representative frames were selected under different criteria. However, the same clustering method need not always be used: considering the data size of the moving images to be searched, the processing speed of clustering, the search accuracy, and so on, the representative frames of the query moving image and the target moving image may be selected using different clustering methods.

The moving image search processing by the processing device 2 described above can be executed by an electronic circuit using an FPGA (field-programmable gate array) or the like, but all or part of its functions can also be executed by a computer. FIG. 24 shows an example of a hardware configuration in which a computer 100 is used as the processing device 2. In the computer 100, a CPU (Central Processing Unit) 102 that performs various kinds of control, a main storage device 103, an auxiliary storage device 104, a communication interface 105, a drive device 106, and the like are interconnected via a bus 101.

The CPU 102 can read a program stored in the auxiliary storage device 104 and execute the moving image search processing according to the present invention in accordance with its instructions. The main storage device 103 is a storage device such as a random access memory; it temporarily stores part of the programs and data stored in the auxiliary storage device 104 and also temporarily stores the results of processing by the CPU 102. The auxiliary storage device 104 is a storage device such as a hard disk drive or an SSD (Solid State Drive), and stores the program for executing the moving image search processing according to the present invention and the data used in the processing. The communication interface 105 is used to communicate with an external server or an external storage device via a network. Through the communication interface 105, the computer 100 can download programs from an external server and read data from and write data to an external storage device. The drive device 106 is used to read data from and write data to a recording medium 107. The recording medium 107 records programs and data; examples include a ROM (Read Only Memory) using a CD (Compact Disk), a DVD (Digital Versatile Disk), or the like.
An electronic device memory such as a USB (Universal Serial Bus) memory can also be used as the recording medium 107. This hardware is shown only as an example; elements may be selected as necessary, and other elements may be added.

To summarize, this embodiment proposes a moving image search method using a system of a moving image search apparatus that calculates the similarity of a target moving image to be searched with respect to a query moving image and searches for moving images similar to the query moving image.

In particular, in this method, the feature amount extraction unit 22 incorporated in the processing device 2 extracts a first feature amount for each frame from the frames of the query moving image and a second feature amount for each frame from the frames of the target moving image. Next, the moving image representative frame selection unit 23 incorporated in the processing device 2 acquires first representative frames in time series for the query moving image by applying any one of TP k-means, TS k-means, TB-PNN, or TP-PNN to the first feature amount extracted by the feature amount extraction unit 22, and acquires second representative frames in time series for the target moving image by applying any one of TP k-means, TS k-means, TB-PNN, or TP-PNN to the second feature amount extracted by the feature amount extraction unit 22. Then, the similarity calculation unit 24 and the similar moving image display control unit 25, which are incorporated in the processing device 2 and serve as the moving image search unit, obtain the similarity between the query moving image and the target moving image from the inter-representative-frame similarity between the first representative frames in the query moving image and the second representative frames in the target moving image, and search for moving images similar to the query moving image.

Adopting this method makes it possible to obtain one or more first representative frames and second representative frames from the query moving image and the target moving image, respectively, under a temporal constraint that takes the elapsed time of the moving image into account. By using these time-series first and second representative frames to judge the similarity between the query moving image and the target moving image, moving images similar to the query moving image can be retrieved from among the target moving images with higher accuracy than with ordinary AP. Similar results can also be expected with the k-means and PNN methods. In addition, TP k-means, TS k-means, TB-PNN, and TP-PNN each require less computation than TB-AP, so that search and grouping using a moving image itself as the query can be performed easily and with high accuracy at low computational cost.

In the above method, the feature amount extraction unit 22 preferably uses Frame Signature, which is defined as a feature amount of Video Signature, as the first feature amount and the second feature amount. Compared with CSD, this maintains high search accuracy even when modifications such as luminance changes are present. In addition, because the data size is smaller than that of CSD, distance calculation at the time of comparison can be performed at high speed.

In the search for moving images similar to the query moving image, the number of frames E^A_i represented by the i-th first representative frame in the query moving image and the number of frames E^B_j represented by the j-th second representative frame in the target moving image are obtained, and the similarity measure s(i, j) between the i-th first representative frame in the query moving image and the j-th second representative frame in the target moving image is obtained as the inter-representative-frame similarity.

Then, with a gap penalty g set, and with m denoting the number of first representative frames in the query moving image and n the number of second representative frames in the target moving image, each component f(i, j) of an (m + 1) × (n + 1) matrix is calculated as follows. Each component f(0, j) in the row i = 0 is calculated based on Equation 8, and each component f(i, 0) in the column j = 0 is calculated based on Equation 9. Further, with r(i, j) denoting a weight calculated according to the numbers of frames E^A_i and E^B_j, the remaining matrix components f(i, j), other than the row i = 0 and the column j = 0, are calculated based on Equation 10. Once the value f_last in the last row and last column of the matrix has been calculated, it is used to calculate the similarity u(A, B) between the query moving image and the target moving image, for example by Equation 11, and moving images similar to the query moving image are searched for.
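The matrix computation described above has the shape of a dynamic-programming global alignment, and the following minimal sketch illustrates that shape only. Equations 8 to 11 are not reproduced in this part of the text, so the boundary values (multiples of -g), the recurrence (a plain Needleman-Wunsch-style maximum), and the default weight r(i, j) = 1 are all assumptions made for illustration; the function name is hypothetical, and the conversion from f_last to u(A, B) is omitted.

```python
def global_alignment_score(s, g, r=None):
    """Fill an (m+1) x (n+1) matrix and return its last-row, last-column value.

    s[i-1][j-1] is the similarity s(i, j) between the i-th representative
    frame of the query and the j-th representative frame of the target.
    g is the gap penalty. r(i, j) is an optional weight function; the default
    weight of 1.0 is a placeholder, not the weighting defined in the patent.
    """
    m, n = len(s), len(s[0])
    if r is None:
        def r(i, j):
            return 1.0
    f = [[0.0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        f[0][j] = -g * j  # assumed boundary row (stand-in for Equation 8)
    for i in range(1, m + 1):
        f[i][0] = -g * i  # assumed boundary column (stand-in for Equation 9)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            f[i][j] = max(
                f[i - 1][j - 1] + r(i, j) * s[i - 1][j - 1],  # pair frame i with frame j
                f[i - 1][j] - g,  # skip a query representative frame
                f[i][j - 1] - g,  # skip a target representative frame
            )
    return f[m][n]
```

With g = 0.2, two perfectly matching sequences of two representative frames score 2.0 in this sketch, while a query with one representative frame aligned against three target frames is charged two gap penalties.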

With this method, even when the numbers of first representative frames and second representative frames differ between the query moving image and the target moving image, introducing the gap penalty g allows the similarity between the representative frames and the similarity of their time series to be judged correctly in the search. A global alignment path can also be obtained from the components of the matrix f(i, j) obtained in this way. Moreover, the similarity calculation adapted to moving image data obtained here includes both the Levenshtein distance and the Needleman-Wunsch algorithm as mere special cases, and can therefore also be applied to methods for measuring the similarity of bioinformatic sequences.

As another technique, in the search for moving images similar to the query moving image, the numbers of frames E^A_i and E^B_j and the similarity measure s(i, j) are obtained, a gap penalty g is set, and, for the components f(i, j) of the (m + 1) × (n + 1) matrix, each component f(0, j) in the row i = 0 and each component f(i, 0) in the column j = 0 are all set to 0. Then, with r(i, j) denoting a weight calculated according to the numbers of frames E^A_i and E^B_j, the remaining matrix components f(i, j), other than the row i = 0 and the column j = 0, are calculated based on Equation 13. Once the maximum value f_max among the elements of the matrix has been calculated, it may be used to calculate the similarity u(A, B) between the query moving image and the target moving image, for example by Equation 12 with f_last replaced by f_max, and moving images similar to the query moving image may be searched for.

In this case as well, even when the numbers of first representative frames and second representative frames differ between the query moving image and the target moving image, introducing the gap penalty g allows the similarity between the representative frames and the similarity of their time series to be judged correctly in the search. A local alignment path can also be obtained from the components of the matrix f(i, j) obtained in this way. Moreover, the similarity calculation adapted to moving image data obtained here includes the Smith-Waterman algorithm as a mere special case, and can therefore also be applied to methods for measuring the similarity of bioinformatic sequences.
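The local variant can be sketched in the same way. Equation 13 is not reproduced in this part of the text, so the recurrence below, including the floor at 0 taken from the Smith-Waterman algorithm, and the default weight r(i, j) = 1 are assumptions made for illustration only; the function name is hypothetical, and the conversion from f_max to u(A, B) is omitted.

```python
def local_alignment_score(s, g, r=None):
    """Fill an (m+1) x (n+1) matrix with zero boundaries and return its maximum.

    s[i-1][j-1] is the similarity s(i, j); negative values mark dissimilar
    representative frames. g is the gap penalty. r(i, j) is an optional weight
    function; the default of 1.0 is a placeholder, not the patent's weighting.
    """
    m, n = len(s), len(s[0])
    if r is None:
        def r(i, j):
            return 1.0
    # Row i = 0 and column j = 0 stay at 0, as stated in the text.
    f = [[0.0] * (n + 1) for _ in range(m + 1)]
    f_max = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            f[i][j] = max(
                0.0,  # assumed Smith-Waterman-style restart of the alignment
                f[i - 1][j - 1] + r(i, j) * s[i - 1][j - 1],
                f[i - 1][j] - g,
                f[i][j - 1] - g,
            )
            f_max = max(f_max, f[i][j])
    return f_max
```

Because the boundaries are zero and the maximum is taken over the whole matrix, a short query segment embedded anywhere in a longer target can score highly in this sketch, which matches the experimental setting in which part of the query is inserted into each test video.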

The effects described above are likewise achieved by a moving image search apparatus capable of executing the above processing and by a moving image search program for causing a computer to execute the above processing.

Embodiments of the present invention have been described above, but the present invention is in no way limited to these embodiments and can of course be implemented in various forms without departing from the gist of the present invention. For example, the parameter values described above can be changed as appropriate via the input device 4.

22 Feature amount extraction unit
23 Moving image representative frame selection unit
24 Similarity calculation unit (moving image search unit)
25 Similar moving image display control unit (moving image search unit)

Claims

A moving image search method performed by a moving image search apparatus that calculates a similarity of a target moving image to be searched with respect to a query moving image and searches for a moving image similar to the query moving image, wherein
a first feature amount is extracted for each frame from the frames of the query moving image,
a second feature amount is extracted for each frame from the frames of the target moving image,
for the query moving image, a first representative frame is acquired in time series by applying any one of the time-segment k-average method, the time-division k-average method, the time-constrained pair-wise nearest neighbor method, or the time-segment pair-wise nearest neighbor method to the extracted first feature amount,
for the target moving image, a second representative frame is acquired in time series by applying any one of the time-segment k-average method, the time-division k-average method, the time-constrained pair-wise nearest neighbor method, or the time-segment pair-wise nearest neighbor method to the extracted second feature amount, and
the similarity between the query moving image and the target moving image is obtained from an inter-representative-frame similarity between the first representative frame in the query moving image and the second representative frame in the target moving image, and the moving image similar to the query moving image is searched for.

The moving image search method according to claim 1, wherein a frame signature is used as the first feature amount and the second feature amount.

3. The moving image search method according to claim 1, wherein the acquisition of the first representative frame and the acquisition of the second representative frame are performed using the same method.

The moving image search method according to any one of claims 1 to 3, wherein, in the search for the similar moving image,
the number of frames E^A_i represented by the i-th first representative frame in the query moving image, the number of frames E^B_j represented by the j-th second representative frame in the target moving image, and the inter-representative-frame similarity s(i, j) between the i-th first representative frame in the query moving image and the j-th second representative frame in the target moving image are obtained,
a gap penalty g is set,
when the number of the first representative frames in the query moving image is m and the number of the second representative frames in the target moving image is n, for each component f(i, j) of an (m + 1) × (n + 1) matrix,
each component f(0, j) in the row i = 0 is calculated based on the following Equation 1,
each component f(i, 0) in the column j = 0 is calculated based on the following Equation 2,
the remaining matrix components f(i, j), other than the row i = 0 and the column j = 0, are calculated based on the following Equation 3, where r(i, j) is a weight calculated according to the numbers of frames E^A_i and E^B_j, and
the similarity between the query moving image and the target moving image is calculated using the value in the last row and last column of the matrix.

The moving image search method according to any one of claims 1 to 3, wherein, in the search for the similar moving image,
the number of frames E^A_i represented by the i-th first representative frame in the query moving image, the number of frames E^B_j represented by the j-th second representative frame in the target moving image, and the inter-representative-frame similarity s(i, j) between the i-th first representative frame in the query moving image and the j-th second representative frame in the target moving image are obtained,
a gap penalty g is set,
when the number of the first representative frames in the query moving image is m and the number of the second representative frames in the target moving image is n, for each component f(i, j) of an (m + 1) × (n + 1) matrix,
each component f(0, j) in the row i = 0 and each component f(i, 0) in the column j = 0 are all set to 0,
the remaining matrix components f(i, j), other than the row i = 0 and the column j = 0, are calculated based on the following Equation 4, where r(i, j) is a weight calculated according to the numbers of frames E^A_i and E^B_j, and
the similarity between the query moving image and the target moving image is calculated using the maximum value among the components of the matrix.

A moving image search apparatus for calculating a similarity of a target moving image to be searched with respect to a query moving image and searching for a moving image similar to the query moving image,
a feature amount extraction unit that extracts, from the frames of the query moving image and the target moving image, a first feature amount for each frame of the query moving image and a second feature amount for each frame of the target moving image;
a representative frame selection unit that, for the query moving image and the target moving image, acquires in time series a first representative frame related to the first feature amount and a second representative frame related to the second feature amount by applying any one of the time-segment k-average method, the time-division k-average method, the time-constrained pair-wise nearest neighbor method, or the time-segment pair-wise nearest neighbor method to the extracted first feature amount and second feature amount; and
a moving image search unit that obtains the similarity between the query moving image and the target moving image from an inter-representative-frame similarity between the first representative frame in the query moving image and the second representative frame in the target moving image, and searches for the moving image similar to the query moving image.

The moving image search apparatus according to claim 6, wherein the feature amount extraction unit uses a frame signature as the first feature amount and the second feature amount.

The moving image search apparatus according to claim 6 or 7, wherein the moving image search unit
obtains the number of frames E^A_i represented by the i-th first representative frame in the query moving image, the number of frames E^B_j represented by the j-th second representative frame in the target moving image, and the inter-representative-frame similarity s(i, j) between the i-th first representative frame in the query moving image and the j-th second representative frame in the target moving image,
sets a gap penalty g,
when the number of the first representative frames in the query moving image is m and the number of the second representative frames in the target moving image is n, for each component f(i, j) of an (m + 1) × (n + 1) matrix,
calculates each component f(0, j) in the row i = 0 based on the following Equation 5,
calculates each component f(i, 0) in the column j = 0 based on the following Equation 6,
calculates the remaining matrix components f(i, j), other than the row i = 0 and the column j = 0, based on the following Equation 7, where r(i, j) is a weight calculated according to the numbers of frames E^A_i and E^B_j, and
calculates the similarity between the query moving image and the target moving image using the value in the last row and last column of the matrix.

The moving image search apparatus according to claim 6, wherein the moving image search unit
obtains the number of frames E^A_i represented by the i-th first representative frame in the query moving image, the number of frames E^B_j represented by the j-th second representative frame in the target moving image, and the inter-representative-frame similarity s(i, j) between the i-th first representative frame in the query moving image and the j-th second representative frame in the target moving image,
sets a gap penalty g,
when the number of the first representative frames in the query moving image is m and the number of the second representative frames in the target moving image is n, for each component f(i, j) of an (m + 1) × (n + 1) matrix,
sets each component f(0, j) in the row i = 0 and each component f(i, 0) in the column j = 0 to 0,
calculates the remaining matrix components f(i, j), other than the row i = 0 and the column j = 0, based on the following Equation 8, where r(i, j) is a weight calculated according to the numbers of frames E^A_i and E^B_j, and
calculates the similarity between the query moving image and the target moving image using the maximum value among the components of the matrix.

A moving image search program for causing a computer to execute the moving image search process according to any one of claims 1 to 5.