JP2016014990A

JP2016014990A - Moving image search method, moving image search device, and program thereof

Info

Publication number: JP2016014990A
Application number: JP2014136257A
Authority: JP
Inventors: 泰男松山; Yasuo Matsuyama; 雅史森脇; Masafumi Moriwaki; 良太横手; Ryota Yokote
Original assignee: Waseda University
Current assignee: Waseda University
Priority date: 2014-07-01
Filing date: 2014-07-01
Publication date: 2016-01-28

Abstract

PROBLEM TO BE SOLVED: To perform search or grouping that uses a moving image itself as the query easily and accurately, at a lower calculation cost than conventional methods.
SOLUTION: First feature amount extraction means 12 and second feature amount extraction means 22 extract the feature amounts of a query moving image and a target moving image, respectively. Query moving image representative frame selection means 13 and target moving image representative frame selection means 23 cluster the frames of the query moving image and the target moving image, respectively, from the relations between feature amounts within a time width limited by a time window, and slide the time window along the time axis, so that a representative frame is acquired for each scene in time series. Similarity calculation means 31 and similar moving image display control means 32 then obtain the similarity between the query moving image and the target moving image from the inter-representative-frame similarity between their representative frames, and search for moving images similar to the query moving image.

Description

The present invention relates to a moving image search method, a moving image search device, and a program thereof.

A huge amount of moving image data is accumulated on the Internet. Text-based search is a commonly used method for searching these moving images: appropriate text information is associated with each moving image, and the desired moving image is found by a text search. However, this method has the following problems.
(a) The association (labeling) of moving images with text information must be done manually, which is costly.
(b) Inappropriate labels with poor generality may be attached, which can reduce the accuracy and efficiency of the search.

As a way to solve these problems, content-based search, which uses the content of the image itself as the query, is considered promising.

As a content-based search method, a method has been proposed that calculates the similarity between a query image and a search target image from information about the entire image, such as the change and complexity of luminance over the entire image, the complexity of the edge amount, and simple composition information based on the luminance distribution, or from a combination of these (see, for example, Patent Document 1). A method has also been proposed in which each of a plurality of search target images is divided into element images of a predetermined size, representative element images are selected from the divided element images, and images are searched using a search reference image drawn with these representative element images (see, for example, Patent Document 2).

However, these proposed methods search for still images using a still image as the query, and do not consider searching with a moving image as the query.

Patent Document 1: JP 2001-56820 A
Patent Document 2: JP 2002-140331 A

As described above, the prior art discloses only methods that use text or a still image as the query; no moving image search based on moving image content, with a moving image as the query, has been disclosed or studied. An object of the present invention is to provide a moving image search method, a moving image search device, and a program thereof that can perform a search using a moving image itself as the query with high accuracy while keeping the calculation cost low.

The present invention provides a means of numerically calculating the mutual relationships, that is, the topology, among a large number of moving images. This makes it possible to label big data consisting of moving images by machine learning, and to perform search and grouping with the moving image itself as the query. The label is formed by adding to a header the number of representative frames, their frame numbers, and the number of frames each representative frame governs. Such formatting is easy and offers a path to standardization.

That is, the present invention is a moving image search method performed by a moving image search device that calculates the similarity of a target moving image to a query moving image and searches for moving images similar to the query moving image, the method comprising:
a first step of extracting a feature amount from each frame of the query moving image and the target moving image;
a second step of, for the query moving image and the target moving image, clustering the frames from the relations between the feature amounts within a time width limited by a time window, and sliding the time window along the time axis, thereby acquiring a representative frame for each scene in time series; and
a third step of obtaining the similarity between the query moving image and the target moving image from the inter-representative-frame similarity between the representative frames of the query moving image and the representative frames of the target moving image, and searching for moving images similar to the query moving image.

In the above moving image search method, the third step:
acquires the number of frames E^A_i represented by the i-th representative frame of the query moving image, the number of frames E^B_j represented by the j-th representative frame of the target moving image, and the inter-representative-frame similarity s(i, j) between the i-th representative frame of the query moving image and the j-th representative frame of the target moving image;
sets a gap penalty g;
where m is the number of representative frames in the query moving image and n is the number of representative frames in the target moving image, for the components f(i, j) of an (m + 1) × (n + 1) matrix,
calculates each component f(0, j) of the row i = 0 according to Equation 1;

calculates each component f(i, 0) of the column j = 0 according to Equation 2;

where r(i, j) is a weight calculated from the numbers of frames E^A_i and E^B_j, calculates the remaining matrix components f(i, j), other than the row i = 0 and the column j = 0, according to Equation 3; and

calculates the similarity between the query moving image and the target moving image using the value in the last row and last column of the matrix.
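As an illustration of the table described above, the following Python sketch fills an (m + 1) × (n + 1) matrix and returns the entry in the last row and last column. Equations 1 to 3 are not reproduced in this text, so the boundary values and the recurrence used here are assumptions borrowed from the classic Needleman-Wunsch global alignment, not the patent's own formulas. The function name `global_alignment_score` and the convention of passing s and r as precomputed m × n matrices are likewise hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def global_alignment_score(s: Matrix, r: Matrix, g: float) -> float:
    """Global alignment score over representative frames.

    s[i][j]: similarity between query representative frame i and target frame j.
    r[i][j]: weight derived from the frame counts E^A_i and E^B_j (assumed given).
    g:       gap penalty.
    ASSUMPTION: Needleman-Wunsch-style boundaries and recurrence.
    """
    m = len(s)
    n = len(s[0]) if m else 0
    # Row/column 0 stand for "nothing aligned yet".
    f: List[List[float]] = [[0.0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):      # assumed form of the i = 0 row
        f[0][j] = f[0][j - 1] - g
    for i in range(1, m + 1):      # assumed form of the j = 0 column
        f[i][0] = f[i - 1][0] - g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = f[i - 1][j - 1] + r[i - 1][j - 1] * s[i - 1][j - 1]
            gap_in_target = f[i - 1][j] - g
            gap_in_query = f[i][j - 1] - g
            f[i][j] = max(match, gap_in_target, gap_in_query)
    return f[m][n]  # value in the last row and last column
```

Because gaps are allowed, the two moving images may have different numbers of representative frames; the weight matrix lets representative frames that stand for many frames contribute more to the score.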

Alternatively, the third step may:
acquire the number of frames E^A_i represented by the i-th representative frame of the query moving image, the number of frames E^B_j represented by the j-th representative frame of the target moving image, and the inter-representative-frame similarity s(i, j) between the i-th representative frame of the query moving image and the j-th representative frame of the target moving image;
set a gap penalty g;
where m is the number of representative frames in the query moving image and n is the number of representative frames in the target moving image, for the components f(i, j) of an (m + 1) × (n + 1) matrix,
set every component f(0, j) of the row i = 0 and every component f(i, 0) of the column j = 0 to 0;
where r(i, j) is a weight calculated from the numbers of frames E^A_i and E^B_j, calculate the remaining matrix components f(i, j), other than the row i = 0 and the column j = 0, according to Equation 4; and

calculate the similarity between the query moving image and the target moving image using the maximum value among the components of the matrix.
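The variant above, with a zero first row and column and the maximum matrix entry as the result, has the shape of a local alignment. The sketch below illustrates it in Python. Equation 4 is not reproduced in this text, so the recurrence is an assumption borrowed from the Smith-Waterman algorithm, including the clamp at zero. It also assumes that s(i, j) is a signed score (positive for similar frames, negative for dissimilar ones). The name `local_alignment_score` is hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def local_alignment_score(s: Matrix, r: Matrix, g: float) -> float:
    """Best locally aligned run of representative frames.

    ASSUMPTION: Smith-Waterman-style recurrence; s holds signed scores.
    Row 0 and column 0 are all zero, as stated in the text.
    """
    m = len(s)
    n = len(s[0]) if m else 0
    f: List[List[float]] = [[0.0] * (n + 1) for _ in range(m + 1)]
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = f[i - 1][j - 1] + r[i - 1][j - 1] * s[i - 1][j - 1]
            f[i][j] = max(0.0, match, f[i - 1][j] - g, f[i][j - 1] - g)
            best = max(best, f[i][j])  # track the maximum over all components
    return best
```

A local score of this kind can remain high when only part of the target moving image resembles the query, which a score read from the last row and column would penalize.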

The present invention is also a moving image search device that calculates the similarity of a target moving image to be searched to a query moving image and searches for moving images similar to the query moving image, the device comprising:
feature amount extraction means for extracting a feature amount from each frame of the query moving image and the target moving image;
representative frame selection means for, for the query moving image and the target moving image, clustering the frames from the relations between the feature amounts within a time width limited by a time window, and sliding the time window along the time axis, thereby acquiring a representative frame for each scene in time series; and
moving image search means for obtaining the similarity between the query moving image and the target moving image from the inter-representative-frame similarity between the representative frames of the query moving image and the representative frames of the target moving image, and searching for moving images similar to the query moving image.

In the above moving image search device, the moving image search means is configured to:
acquire the number of frames E^A_i represented by the i-th representative frame of the query moving image, the number of frames E^B_j represented by the j-th representative frame of the target moving image, and the inter-representative-frame similarity s(i, j) between the i-th representative frame of the query moving image and the j-th representative frame of the target moving image;
set a gap penalty g;
where m is the number of representative frames in the query moving image and n is the number of representative frames in the target moving image, for the components f(i, j) of an (m + 1) × (n + 1) matrix,
calculate each component f(0, j) of the row i = 0 according to Equation 5;

calculate each component f(i, 0) of the column j = 0 according to Equation 6;

where r(i, j) is a weight calculated from the numbers of frames E^A_i and E^B_j, calculate the remaining matrix components f(i, j), other than the row i = 0 and the column j = 0, according to Equation 7; and

calculate the similarity between the query moving image and the target moving image using the value in the last row and last column of the matrix.

Alternatively, the moving image search means may be configured to:
acquire the number of frames E^A_i represented by the i-th representative frame of the query moving image, the number of frames E^B_j represented by the j-th representative frame of the target moving image, and the inter-representative-frame similarity s(i, j) between the i-th representative frame of the query moving image and the j-th representative frame of the target moving image;
set a gap penalty g;
where m is the number of representative frames in the query moving image and n is the number of representative frames in the target moving image, for the components f(i, j) of an (m + 1) × (n + 1) matrix,
set every component f(0, j) of the row i = 0 and every component f(i, 0) of the column j = 0 to 0;
where r(i, j) is a weight calculated from the numbers of frames E^A_i and E^B_j, calculate the remaining matrix components f(i, j), other than the row i = 0 and the column j = 0, according to Equation 8; and

calculate the similarity between the query moving image and the target moving image using the maximum value among the components of the matrix.

With the moving image search method, moving image search device, and program of the present invention described above, a search using a moving image as the query can be performed with high accuracy while keeping the calculation cost low.

FIG. 1 is a block diagram showing the system configuration of a moving image search device in an embodiment of the present invention.
FIG. 2 is an explanatory diagram of the operation flow processed by the processing device of FIG. 1.
FIG. 3 is an explanatory diagram showing the difference between the clustering that takes time-series information into account, applied in an embodiment of the present invention, and prior-art clustering that does not.
FIG. 4 is a flowchart showing the representative frame selection algorithm based on affinity propagation with a time window (TBAP, Time-Bound Affinity Propagation) in an embodiment of the present invention.
FIG. 5 is an explanatory diagram showing message exchange in an embodiment of the present invention and in the prior art.
FIG. 6 is an explanatory diagram showing the link matrix of the message exchange in an embodiment of the present invention.
FIG. 7 is an explanatory diagram showing, on the time axis, the time window in TBAP and the time blocks in the improved k-means method and harmonic competition method.
FIG. 8 is a flowchart showing the similarity calculation algorithm based on the M distance.
FIG. 9 is an explanatory diagram showing the generation process of the original moving image data used in the experiment.
FIG. 10 is an explanatory diagram showing the structure of the moving image data of FIG. 9.
FIG. 11 is a plot in which each of five Gaussian clusters is divided in two in the time direction to give ten groups, used to examine the appearance of representative points (typical still images) reflecting temporal order and the calculation speed.
FIG. 12 is a diagram showing, as a matrix, the distances between the typical still images of moving image v_A and moving image v_B.
FIG. 13 is a diagram showing an example of the global alignment table obtained in the experiment.
FIG. 14 is a diagram showing an example of the local alignment table obtained in the experiment.
FIG. 15 is a diagram corresponding to the global alignment table of FIG. 13, in which the cells whose calculation was omitted are filled in black.
FIG. 16 is a graph showing the precision-recall curve for the moving image search obtained in the experiment.

Hereinafter, preferred embodiments of the moving image search method of the present invention, and of a device that implements it, are described in detail with reference to the accompanying drawings.

FIG. 1 schematically shows a preferred example of the system configuration of the moving image search device in this embodiment. In the figure, 1 is a database that stores a set of moving images; 2 is a processing device, such as a computer, that searches for similar images; 3 is a display that shows the moving images used for the search, the moving images to be searched, and the search results; and 4 is input means such as a mouse and keyboard. The database 1 is connected to the processing device 2 so that it can be read. The processing device 2 can display the moving images stored in the database 1 on the display 3, which serves as display means, as necessary. The database 1 may be a storage medium (such as a hard disk) built into or externally attached to the processing device 2, or may be constructed on a server connected to the processing device 2 via communication means. The query images and the search target images may also be stored in separate databases; the form is not particularly limited. The processing device 2 is also connected to the input means 4 and can accept input from the user as appropriate.

Application software for similar moving image search is installed on the processing device 2. By executing the installed program, the processing device 2 reads moving images from the database 1, calculates and analyzes the feature amounts of the moving images it has read, and then searches the target moving images for those similar to the query moving image. Installing the application software also allows the processing device 2 to function as each of the means described below.

As functional means for processing the query moving image, the processing device 2 includes query moving image capturing means 11, first feature amount extraction means 12, and query moving image representative frame selection means 13. The query moving image capturing means 11 reads, from the moving images stored in the database 1, the query moving image selected by the user through the input means 4. The first feature amount extraction means 12 extracts the feature amounts of the query moving image read by the query moving image capturing means 11. The query moving image representative frame selection means 13 selects one or more representative frames in the query moving image on the basis of the feature amounts extracted by the first feature amount extraction means 12.

As functional means for processing the target moving image, the processing device 2 includes target moving image capturing means 21, second feature amount extraction means 22, and target moving image representative frame selection means 23. The target moving image capturing means 21 reads, from the moving images stored in the database 1, a target moving image to be searched. The second feature amount extraction means 22 extracts the feature amounts of the target moving image read by the target moving image capturing means 21. The target moving image representative frame selection means 23 selects one or more representative frames in the target moving image on the basis of the feature amounts of the target moving image extracted by the second feature amount extraction means 22.

In addition, as means for searching for target moving images similar to the query moving image, the processing device 2 includes similarity calculation means 31 and similar moving image display control means 32. The similarity calculation means 31 compares the feature amount of each representative frame in the query moving image with the feature amount of each representative frame in the target moving image and calculates the similarity. The similar moving image display control means 32 displays one or more target moving images on the display 3 in descending order of the similarity calculated by the similarity calculation means 31.

FIG. 2 shows an outline of the processing performed by the processing device 2 of the present invention. In the figure, the left path shows the processing steps for the query moving image and the right path shows the processing steps for the target moving image. The processing of FIG. 2 ultimately yields the similarity of the target moving image to the query moving image.

The processing steps for the query moving image on the left path of FIG. 2 are as follows. In step S110, the query moving image is read from the database 1. In step S120, the feature amounts of the query moving image are extracted. The Color Structure Descriptor (CSD), one of the image feature amounts proposed in MPEG-7, can be used as the feature amount. In step S130, the representative frames of the query moving image are determined as typical still images. The representative frames are determined by a method that takes the elapsed time of the moving image into account, such as Time-Bound Affinity Propagation (TBAP). Although it depends on the number of frames in the query moving image, a plurality of representative frames is usually selected, and the set of feature amounts of the typical still images corresponding to the selected representative frames is taken as the feature amount of the query moving image.

The processing steps for the target moving image on the right path of FIG. 2 follow the same procedure as those for the query moving image. In step S140, the target moving image is read from the database 1. In step S150, the feature amounts of the target moving image are extracted; the CSD described above can be used as the feature amount. In step S160, the representative frames are determined as typical still images, again by a method such as TBAP that takes the elapsed time of the moving image into account. Although it depends on the number of frames in the target moving image, a plurality of representative frames is usually selected, and the set of feature amounts of the typical still images corresponding to the selected representative frames is taken as the feature amount of the target moving image.

As an alternative to TBAP, the target moving image representative frame selection means 23 can also apply a k-means method into which time blocks are introduced, or its generalization, harmonic competition into which time blocks are introduced. The harmonic competition method is described in Reference 1 and elsewhere; the improved k-means method and harmonic competition method proposed in the present invention are described in detail later.

Reference 1: Matsuyama, Yasuo, "Harmonic Competition: A Self-Organizing Multiple Criteria Optimization", IEEE Transactions on Neural Networks, Vol. 7, No. 3, pp. 652-668, May 1996.

Once the representative frames (typical still images) have been selected for the query moving image and the target moving image, the similarity calculation of step S170 is performed. Because the number of typical still images depends on each moving image, the similarity between the query moving image and the target moving image is calculated by a weighted M distance that allows gaps to be inserted. The M distance is described in detail later. In step S180, target moving images with high similarity are retrieved on the basis of the similarity scores calculated in step S170 and displayed on the display 3.
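The retrieval in step S180 amounts to ordering the target moving images by their similarity scores. A minimal Python sketch follows, assuming the scores from step S170 are already available as a mapping from a video identifier to a score. The function name `rank_by_similarity`, the identifiers, and the `top_k` cutoff are hypothetical and are not taken from the source.

```python
from typing import Dict, List, Tuple


def rank_by_similarity(scores: Dict[str, float], top_k: int = 10) -> List[Tuple[str, float]]:
    """Return the top_k target videos in descending order of similarity."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Example with made-up scores for three target videos.
results = rank_by_similarity({"video_a": 0.2, "video_b": 0.9, "video_c": 0.5}, top_k=2)
```

Here `results` holds the two best matches, which is the list a display step would present to the user.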

Among the processing steps above, feature amount extraction in steps S120 and S150 is referred to as stage 1, representative frame selection in steps S130 and S160 as stage 2, and similarity calculation in step S170 as stage 3. Each of stages 1 to 3 is described in detail below.

- Stage 1: Extraction of feature amounts for each frame
In stage 1, the feature amount of each frame contained in the query moving image or the target moving image (hereinafter simply called the moving image) is extracted. The extracted feature amounts are expressed as a time-ordered sequence of vectors {x_t}, t = 1, ..., n. For example, so that the application software used for similar image search can handle moving images with color information, each vector can be expressed with the MPEG-7 CSD described above, so that every vector is a normalized CSD histogram.

However, the feature amount of each frame is not limited to one based on the CSD; any vector quantity may be used as long as it satisfies the following normalization conditions.
- Normalization conditions
(i) Every component of each vector is non-negative.
(ii) The components of each vector sum to 1.0.
Such a feature amount is obtained by the following procedure.
(a) Each frame of the moving image is converted into a vector quantity representing its features (feature amount extraction).
(b) A method capable of normalizing each vector is adopted, so that the similarity comparison method, and its implementation as software, remain general-purpose.

Procedure (b) is in practice always possible by using the minimum value and the sum of each vector. The CSD is merely one example of the feature vector in procedure (a); a still simpler example, computed over the whole image frame only, is a histogram.
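One way to read the remark about the minimum value and the sum is sketched below in Python: shift the vector by its minimum when that minimum is negative, so that condition (i) holds, then divide by the sum, so that condition (ii) holds. The source does not spell out the exact formula, so this shift-then-scale rule, the uniform fallback for an all-zero vector, and the function name `normalize_feature` are assumptions.

```python
from typing import List, Sequence


def normalize_feature(x: Sequence[float]) -> List[float]:
    """Map a raw feature vector to a non-negative vector that sums to 1.0.

    ASSUMPTION: shift by the minimum only when it is negative, then divide
    by the sum; an all-zero vector becomes the uniform distribution.
    """
    if len(x) == 0:
        raise ValueError("feature vector must not be empty")
    lowest = min(x)
    shifted = [v - lowest for v in x] if lowest < 0 else list(x)
    total = sum(shifted)
    if total == 0:
        return [1.0 / len(shifted)] * len(shifted)
    return [v / total for v in shifted]
```

A histogram that is already non-negative is only rescaled, so its shape is preserved; vectors with negative components are shifted first, which changes the ratios between components.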

Although the CSD is a simple feature amount, it is preferable because it gives good results. Feature amounts obtained by other methods can be used instead of the CSD, at the cost of more computation. For example, SIFT (Scale-Invariant Feature Transform), SURF (Speeded Up Robust Features), FREAK (Fast Retina Keypoint), or bag of visual words can be used as the feature amount of each frame. These feature amounts other than the CSD, or feature amounts including the CSD, can also be combined.

For example, treating the aforementioned {CSD, FREAK, bag of visual words, SIFT, ...} as a set of individually normalized features, and writing each individually normalized feature as NormalizedFeatureVector(i), the composite feature vector CompositeFeatureVector defined in Equation 9 below can also be used as the feature of a moving image frame.

In the above equation, the composite feature vector CompositeFeatureVector is subject to the normalization conditions w_i ≥ 0 and Σ_{i=0}^{N} w_i = 1. Equation 9 is only one example of how to obtain CompositeFeatureVector; for instance, a product of features can also be used as CompositeFeatureVector. The composite feature vector is therefore given in general by Equation 10 below, which expresses formally that CompositeFeatureVector is obtained by combining several individual features FeatureVector(i) and normalizing the result.

- Stage 2: Selection of representative frames that preserve the time series
This stage is performed after Stage 1 has extracted, by any of various methods, a normalized feature for each frame of the moving image. Here, the processing device 2 clusters the frames in the moving image on the basis of these features and selects a frame that represents each cluster. The frame representing a cluster is called a typical still image, and time-series information must be taken into account when selecting it. Before the specific algorithm for this stage is described, several points should be noted.

FIG. 3 illustrates the difference in clustering between this embodiment and ordinary affinity propagation (AP). The time-series data in the upper part of the figure represents the n = 10 time-series feature vectors {x_t}_{t=1}^{10}, from time t = 1 to time t = 10, extracted in Stage 1. Each box represents the frame at time t; for example, the frame at time t = 1 is labeled 1 and the frame at time t = 5 is labeled 5. Frames drawn with the same shading are similar to one another (for example, one scene in the moving image) and belong to the same class. In the illustrated example there are three sets of similar frames: the class {1, 2, 10}, the class {3, 4, 5}, and the class {6, 7, 8, 9}.

If ordinary affinity propagation (AP) is used to cluster these frames by feature, frames {2, 4, 7}, for example, are selected as the typical still images, as shown at the lower left of the figure. This selection ignores time-series information and is therefore unsuitable as a label for searching moving images. When the time series is properly taken into account, the typical still images are selected as {2, 4, 7, 10}, as shown at the lower right of the figure. The time-constrained affinity propagation method (TBAP), which takes the elapsed time of the moving image into account, and the improved k-means or harmonic competition methods that approximate it, make such time-aware clustering possible. In the example of FIG. 3, applying time-constrained affinity propagation correctly selects the typical still images as {(1, 2, 0), (1, 4, 1), (1, 7, 2), (0, 10, 0)}, where each triple appears to give the frame number of the typical still image flanked by the numbers of frames it is responsible for on its left and right. Such clustering is achieved not only by TBAP but also by the improved k-means and harmonic competition methods, by placing a limit on the time window.

Note that the number of frames of these typical still images varies with the length and content of the moving image. Each typical still image therefore has the frames for which it is responsible on its left and right in time order.

Hereinafter, the TBAP algorithm, one of the algorithms for selecting typical still images that preserve the time series, is described in detail with reference to the flowchart of FIG. 4.

Step S210: Calculation of similarity
Here, n time-series feature vectors {x_t}_{t=1}^{n}, extracted in Stage 1 for the moving image frames from time t = 1 to time t = n, are given. As described above, each feature vector x_t is a normalized CSD histogram whose components sum to 1. The similarity measure s(x_i, x_j), which represents the relationship from the feature vector x_i at time t = i to the feature vector x_j at time t = j, is written s(i, j). It is assumed to have the property that s(k, i) > s(k, j) if and only if the feature vector x_i is more similar to the feature vector x_k at time t = k than the feature vector x_j is.

In this embodiment, the similarity measure s(i, j) calculated by Equation 11 below is used.

Here, bar(D) (hereinafter, the symbol D with an overline, as in the first term on the right side of Equation 11, is written bar(D) in the text) is a constant parameter that the user can set, for example via the input means 4.

The similarity is not limited to Equation 11; any definition may be used as long as it follows the rule that a larger value means a stronger relationship and a smaller value means a weaker one. The choice of bar(D) does not affect the result of TBAP, so it is set to 0 here; however, this parameter is an important factor when moving images are compared with each other, and it can take values other than 0.

Step S220: Initialize the matrix A
Here, a matrix A = [a_ij] whose initial value is the zero matrix is prepared (a_ij ← 0). This matrix A represents availability. Availability is a message sent from data point j, a candidate typical still image, to data point i, and is a value representing how appropriate it would be for data point i to choose data point j as its typical still image.

Step S230: Calculate (or update) the matrix R using the matrix A
First, the length of the sliding time window is set to W = 2ω − 1 so that its center lies on the diagonal component; the feature vectors x_i, x_j, x_k are then selected with the time series inside the time window taken into account.

Next, the component r_ij of the matrix R representing responsibility is calculated and updated using the availability matrix A described above. Responsibility is a message sent from data point i to data point j, a candidate typical still image, and is a value representing how well suited data point j is to serve as the typical still image for data point i. In the calculation of this responsibility matrix R, and in the calculation of the availability matrix A described later, the data points i and j are restricted to the range given by Equation 12 below. In other words, TBAP imposes the limit shown in Equation 12 on the range over which messages are exchanged.

The responsibility matrix R is usually symmetric and is updated using the matrix A as shown in Equation 13.

In Equation 13, ρ_ij is the propagation value of the responsibility between data point i and data point j. Further, λ ∈ (0, 1) is a damping coefficient serving as a design parameter; it slows the convergence in step S250 described later so that a stable result is obtained. The initial value of the matrix R is also the zero matrix; if, for example, the damping coefficient λ is 0, the responsibility propagation value ρ_ij is taken directly as the updated value of the matrix R.

Step S240: Calculate the matrix A
Here, the data points i and j are restricted to the range given by Equation 12, and the component a_ij of the availability matrix A is updated using Equation 14 below.

In Equation 14, α_ij is the propagation value of the availability between data point i and data point j.

- Step S250: Determine whether the sum of the matrix A and the matrix R has converged
The component a_ij of the matrix A and the component r_ij of the matrix R calculated as above are added, and steps S230 and S240 are repeated until the value a_ij + r_ij satisfies the convergence criterion. Once the value of a_ij + r_ij satisfies the convergence criterion, the process moves on to step S260.

- Step S260: Select the representative frames
Here, the index defined by Equation 15 below is adopted as the index of a typical still image.

That is, the still image at the data point where the value of a_ij + r_ij is largest is adopted as the typical still image. For example, data point i belongs to the cluster centered on the data point j for which a_ij + r_ij is largest. The set of typical still images finally selected as representative frames is found by collecting these indices.
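Steps S210 to S260 can be sketched as ordinary affinity propagation whose messages are confined to a band around the diagonal. This is a minimal sketch under stated assumptions, not the patented equations, since Equations 11 to 15 are not reproduced in this text. The similarity bar(D) − ‖x_i − x_j‖, the window condition |i − j| < ω, the median-based self-similarity (`preference`), and the fixed iteration count used in place of the convergence test of step S250 are all assumptions, and the function name `tbap` is hypothetical.

```python
import numpy as np

def tbap(X, omega, damping=0.5, iters=200, preference=None, d_bar=0.0):
    """Affinity propagation with message exchange limited to |i - j| < omega."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    if n < 2 or omega < 2:
        raise ValueError("need at least two frames and omega >= 2")
    idx = np.arange(n)
    band = np.abs(idx[:, None] - idx[None, :]) < omega        # assumed window
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    S = np.where(band, d_bar - dist, -np.inf)                 # assumed similarity (S210)
    if preference is None:                                    # common AP convention, assumed
        preference = np.median(S[band & ~np.eye(n, dtype=bool)])
    S[idx, idx] = preference

    A = np.zeros((n, n))                                      # S220: availability starts at zero
    R = np.zeros((n, n))
    for _ in range(iters):
        # S230: responsibility r_ij = s_ij - max over j' != j of (a_ij' + s_ij')
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[idx, best]
        AS[idx, best] = -np.inf
        second = AS.max(axis=1)
        rho = S - first[:, None]
        rho[idx, best] = S[idx, best] - second
        R = damping * R + (1 - damping) * rho
        # S240: availability from the positive responsibilities in each column
        Rp = np.maximum(R, 0)
        Rp[idx, idx] = R[idx, idx]
        alpha = Rp.sum(axis=0)[None, :] - Rp
        diag = alpha[idx, idx].copy()
        alpha = np.minimum(alpha, 0)
        alpha[idx, idx] = diag
        A = damping * A + (1 - damping) * alpha
    # S260: each frame picks the in-window index that maximizes a_ij + r_ij
    return np.where(band, A + R, -np.inf).argmax(axis=1)

# Toy series shaped like FIG. 3: frames {1,2,10}, {3,4,5}, {6,7,8,9} are alike.
X = np.eye(3)[[0, 0, 1, 1, 1, 2, 2, 2, 2, 0]]
labels = tbap(X, omega=4)
print(labels + 1)   # 1-based index of the typical still image chosen by each frame
```

With ω = 4, frame 10 can only choose among frames 7 to 10, so it cannot be merged with frames 1 and 2 even though their features are identical; this is the behaviour that the time constraint is meant to produce.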

In the series of steps S230 to S260 above, the value ω needed to set the time window is the parameter that limits the range of message exchange when the matrices A and R are calculated. FIG. 5 shows how messages are exchanged under this limit. In the figure, "AP" in the upper part shows the range of message exchange in the conventional affinity propagation method, and "TBAP" in the lower part shows the range of message exchange in the time-constrained affinity propagation method of this embodiment. The same frames as those described for FIG. 3 are assumed to be arranged in time series.

In the prior art, messages are exchanged among all data points, whereas in this embodiment messages are exchanged only among neighboring data points limited by the time window. Specifically, in the "AP" example of FIG. 5, messages are exchanged over the entire range from data point j = 1 to j = 10, centered on data point i = 5. In the "TBAP" example of the same figure, the parameter is set to ω = 4 and messages are exchanged within a time window of 2ω − 1 = 7. The range of message exchange is therefore limited to data points j = 2 to j = 7, centered on data point i = 5.

FIG. 6 shows the link matrix of message exchange when the TBAP of this embodiment is applied. Here too the parameter is set to ω = 4, and for each choice of data point i from 1 to 10, the data points j with which messages are exchanged are shown as filled cells.
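A link matrix of this kind can be sketched as a boolean band around the diagonal. The sketch assumes a symmetric window, |i − j| ≤ ω − 1, which is how a window of length 2ω − 1 centered on the diagonal would read; Equation 12 is not reproduced in this text, so the exact bounds may differ (under this assumption, frame i = 5 is linked to j = 2 through 8, whereas the text above states j = 2 to j = 7). The function name `link_matrix` is hypothetical.

```python
import numpy as np

def link_matrix(n, omega):
    """Boolean n x n matrix: True where frames i and j exchange messages."""
    idx = np.arange(n)
    # Assumed window: at most omega - 1 frames on either side of frame i.
    return np.abs(idx[:, None] - idx[None, :]) <= omega - 1

links = link_matrix(10, 4)
print(links.astype(int))
print("TBAP links:", int(links.sum()), "AP links:", links.size)
```

For n = 10 and ω = 4 the band contains 58 links against 100 for the full matrix, and each row holds at most 2ω − 1 = 7 of them, so the number of messages grows in proportion to ωn rather than n^2.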

A characteristic of ordinary AP is that only data points that have exchanged messages are finally selected as candidates for typical still images. The same is true of the TBAP of this embodiment. Because TBAP limits the range of message exchange in consideration of the time series, a typical still image for a data point that is distant in time and with which no messages are exchanged (for example, time t = 10 in FIG. 3) can also be selected by the clustering as one of the representative frames.

Regarding the choice of typical still images in TBAP, consecutive similar frames can share the same typical still image even beyond the time window. Specifically, suppose the typical still image e_i for the i-th frame is to be determined. If the feature vectors from the k-th frame to the (i−1)-th frame, which precede the i-th frame, are judged to be similar (using a threshold, for example), and the typical still image e_k for the k-th frame and the typical still image e_i for the i-th frame are also judged to be similar, then e_k can be used as is without generating a new e_i. This improves the performance of the invention.

As described above, in this embodiment the similar moving image search processing device 2 executes the processing steps based on TBAP, so that a set of typical still images that takes the time series into account can be extracted from each of the query moving image and the target moving image. The features of the extracted representative frames are then taken as the features of the moving image, and this stage ends. In particular, because the TBAP adopted in this embodiment uses a time window that slides together with the selected data point, the number of message exchanges is limited and the complexity of message passing is O(ωn). The complexity of ordinary AP, by contrast, is O(n^2). That is, the computational cost of AP is proportional to the square of the number of data points, so computation time becomes a problem as the number of data points grows. Since ω < n in general in this embodiment, TBAP keeps the computational cost lower than AP, and its search speed is also faster.

Next, the improved k-means method and the harmonic competition method, which are alternatives to TBAP, are described with reference to FIG. 7. The left side of the figure shows the time window W sliding along the time axis in TBAP, and the right side shows the time window W sliding along the time axis in the improved k-means and harmonic competition methods.

TBAP operates on the series of image frames inside a time window W that slides along the time axis with overlap. In contrast, the alternative techniques, improved k-means and harmonic competition, operate on image frames in non-overlapping time windows W. That is, in TBAP the shift of the time window W is smaller than its frame length and the window slides by one or several frames at a time, whereas in improved k-means and harmonic competition the shift equals the frame length of the window, so the window advances one whole window at a time.

For example, when improved k-means or harmonic competition is used in the procedure of Stage 2, the image frames are divided into time windows W along the time axis and clustering is performed within each window. The representative point vector obtained here does not correspond to an original image itself; it is the feature vector of a similar (artificial) image generated by the k-means or harmonic competition computation. A two-step calculation is therefore needed, in which the frame whose features are most similar to the representative point vector is selected as the typical still image. The typical still images obtained by this two-step calculation approximate those of TBAP.

The flowchart of FIG. 4 described above shows the TBAP procedure using the sliding time window W, but the same procedure can be applied to k-means and harmonic competition by setting the length of the time window W to 2ω − 1.

In this way, the concept of the time window W in TBAP is also applied to k-means and harmonic competition: the time window W is slid along the time axis without overlap, a representative point vector is obtained within each time window W, and the frame whose features are closest to that representative point vector is selected as the typical still image.
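The two-step calculation can be sketched as follows. Plain Lloyd-style k-means stands in for the "improved k-means" method, whose details are not given in this text; the number of clusters per window `k`, the evenly spaced initialization, and the function names `kmeans` and `windowed_exemplars` are assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain Lloyd iterations; centers start at evenly spaced frames."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return centers

def windowed_exemplars(X, window, k):
    """Cluster each non-overlapping window, then snap each centroid to a real frame."""
    X = np.asarray(X, dtype=float)
    exemplars = []
    for start in range(0, len(X), window):
        block = X[start:start + window]
        # Step 1: representative point vectors (artificial, not actual frames).
        centers = kmeans(block, min(k, len(block)))
        # Step 2: the frame closest to each representative point vector.
        for c in centers:
            nearest = int(np.linalg.norm(block - c, axis=1).argmin())
            exemplars.append(start + nearest)
    return sorted(set(exemplars))

# Toy series shaped like FIG. 3, window length 2*omega - 1 = 7 with omega = 4.
X = np.eye(3)[[0, 0, 1, 1, 1, 2, 2, 2, 2, 0]]
ex = windowed_exemplars(X, window=7, k=2)
print([e + 1 for e in ex])   # 1-based frame numbers of the typical still images
```

Because the windows do not overlap, frame 10 is clustered only with frames 8 and 9 and is selected as a typical still image of its own.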

Regarding data normalization and the distance measure, any similarity measure s(i, j) may be used as long as the algorithm converges. If the parameter λ is set appropriately, the form of the similarity measure s(i, j) in Equation 3 leads to convergence of the algorithm.

Since each feature vector x_t is normalized to have only non-negative components that sum to 1, the candidate reference values for bar(D) are as follows:
(a) The mean of all data distances: this is acceptable only when the data size is small.
(b) One of √2, √((d−2)/(2d)), or √(1−(1/d)), where d is the dimension of the feature vector x_t, which lies on a simplex. √2 is the edge of the simplex, √((d−2)/(2d)) ≈ 1/√2 is the radius of the inner sphere, and √(1−(1/d)) ≈ 1 is the radius of the outer sphere. Here ≈ denotes approximation, which holds when d is sufficiently large.

For example, the CSD described above uses the HSV (hue, saturation, value) color space representation, and the H, S, and V values are preferably quantized at a ratio of H:S:V = 12:8:8. In this case each image is represented by 12 × 8 × 8 = 768 colors and the CSD feature is a 768-dimensional histogram, so the CSD size is d = 768 and the value representing the radius of the outer sphere is used as bar(D).
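The three candidate values in option (b) can be checked numerically for d = 768. The sketch below measures them directly on the simplex of non-negative vectors that sum to 1: the distance between two vertices, the distance from the centroid to the midpoint of an edge (which coincides with the value the text gives for the inner sphere), and the distance from the centroid to a vertex. The variable names are chosen for this example only.

```python
import numpy as np

d = 768                                   # 12 x 8 x 8 quantized HSV colors
centroid = np.full(d, 1.0 / d)            # uniform histogram
v0 = np.zeros(d); v0[0] = 1.0             # two vertices of the simplex
v1 = np.zeros(d); v1[1] = 1.0
edge_mid = (v0 + v1) / 2                  # midpoint of the edge joining them

edge = np.linalg.norm(v0 - v1)            # sqrt(2)
inner = np.linalg.norm(edge_mid - centroid)   # sqrt((d - 2) / (2d))
outer = np.linalg.norm(v0 - centroid)         # sqrt(1 - 1/d)

print(edge, inner, outer)
d_bar = outer                             # value used as bar(D) in this embodiment
```

For d = 768 the outer radius is within about 0.001 of 1 and the inner value within about 0.001 of 1/√2, which is the sense in which the approximations above hold for large d.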

- Stage 3: Calculation of similarity by analysis between moving images
In Stage 3, different moving images having different sets of typical still images are compared with each other. To search among moving images on the basis of content, the similarity between the query moving image and the target moving image must be compared. Here, different moving images are compared, using the information analyzed within each moving image in Stages 1 and 2, by one of the two methods described below. In this embodiment the objects of comparison are moving images, but the method is not limited to them and applies to any time-series data that has exemplars such as representative frames.

(1) Comparison of similarity using the M-distance for global alignment
In this embodiment, a distance called the M-distance is used as the measure for comparing similarity. The M-distance for global alignment generalizes the Levenshtein distance (L-distance), which measures the similarity of character strings, and the Needleman-Wunsch algorithm, which finds global alignments in bioinformatics. The M-distance for global alignment therefore includes the Levenshtein distance and the Needleman-Wunsch algorithm as special cases.

Hereinafter, the algorithm using the M-distance for global alignment is described in detail with reference to the flowchart of FIG. 8.

Step S310: Prepare the typical still images of the query moving image and the target moving image
For the moving image v_A corresponding to the query moving image, a set of typical still images {e_i^A} and a set {E_i^A} (i = 1, 2, ...) of the numbers of frames owned by those typical still images are given. Similar sets {e_j^B} and {E_j^B} (j = 1, 2, ...) are given for the moving image v_B, the target moving image to be compared with v_A. Here, the similarity measure s(i, j) of Equation 11 is selected as the similarity between representative frames.

Step S320: Create a table and trace back the path according to the following dynamic programming procedure.
Let m be the number of typical still images of the moving image v_A and n the number of typical still images of the moving image v_B. An (m+1) × (n+1) matrix is created as the global alignment table. Each component f(i, j) of the matrix is calculated by dynamic programming in the following procedure.

First, in step S321, the gap penalty g, set as a design parameter, is selected. Introducing this gap penalty g makes it possible to judge correctly the similarity between time-ordered typical still images even when the number m of typical still images of the query moving image differs from the number n of typical still images of the target moving image.

In the next step S322, for the row {i = 0} and the column {j = 0}, each component of the matrix forming the table is calculated by multiplying the gap g by the total number of frames, so that the values decrease stepwise. Specifically, for the row {i = 0}, the matrix component f(0, j) is calculated from Equation 16 below.

For the column {j = 0}, the matrix component f(i, 0) is calculated from Equation 17 below.

Further, in step S323, starting from the position (i, j) = (1, 1), the remaining matrix components f(i, j), other than those in the row {i = 0} and the column {j = 0}, are calculated from Equation 18 below.

After that, an arrow is drawn toward the component that gives the maximum value. This calculation continues until all matrix components have been calculated. Here r(i, j) is a weight reflecting the degree of ownership of the typical still images, and its value is calculated, for example, from Equation 19 below.

As Equation 18 shows, the matrix components f(i, j) are generated sequentially in order of increasing row or column number. For the component f(i, j) in row i and column j, there are three paths: the path that reaches row i, column j via row (i−1), column j; the path via row i, column (j−1); and the path via row (i−1), column (j−1). The path with the largest value among them is selected as the optimal path, and that value is adopted as the matrix component, one component after another.

Step S330: Determine the global alignment
After all matrix components forming the table have been calculated, the global alignment path is obtained by tracing the arrows back from the component in the last row and last column of the matrix f(i, j).

Step S340: Calculate the similarity
The similarity u(A, B) between the moving image v_A and the moving image v_B is calculated from Equation 20 below.

Here h(x) is a positive increasing function, and f_last is the value in the last row and last column of the matrix. Further, w is an averaging function; for example, it can be the arithmetic mean of Σ_i E_i^A and Σ_i E_i^B.
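Steps S310 to S330 can be sketched as a Needleman-Wunsch style table whose match and gap terms are weighted by the numbers of frames owned by the typical still images. Equations 16 to 20 are not reproduced in this text, so the specific forms below are assumptions made for illustration: border values −g times the cumulative frame count, the ownership weight r(i, j) taken as the mean of E_i^A and E_j^B, gap costs g·E, and the similarity bar(D) − ‖e_i^A − e_j^B‖. The function name `global_m_distance` is hypothetical, and the final mapping to u(A, B) through h and w is omitted.

```python
import numpy as np

def global_m_distance(eA, EA, eB, EB, g=0.5, d_bar=1.0):
    """Fill the (m+1) x (n+1) table and trace back the global alignment path."""
    eA, eB = np.asarray(eA, float), np.asarray(eB, float)
    m, n = len(eA), len(eB)
    s = d_bar - np.linalg.norm(eA[:, None, :] - eB[None, :, :], axis=2)  # assumed
    f = np.zeros((m + 1, n + 1))
    f[1:, 0] = -g * np.cumsum(EA)          # S322: borders decrease stepwise
    f[0, 1:] = -g * np.cumsum(EB)
    move = np.zeros((m + 1, n + 1), dtype=int)   # 0 = diagonal, 1 = up, 2 = left
    for i in range(1, m + 1):              # S323: fill the remaining cells
        for j in range(1, n + 1):
            r = (EA[i - 1] + EB[j - 1]) / 2      # assumed ownership weight
            options = (f[i - 1, j - 1] + r * s[i - 1, j - 1],
                       f[i - 1, j] - g * EA[i - 1],
                       f[i, j - 1] - g * EB[j - 1])
            move[i, j] = int(np.argmax(options))
            f[i, j] = options[move[i, j]]
    path, i, j = [], m, n                  # S330: follow the arrows back
    while i > 0 or j > 0:
        step = move[i, j] if (i > 0 and j > 0) else (1 if i > 0 else 2)
        if step == 0:
            path.append((i, j))
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return float(f[m, n]), path[::-1]

# Two identical sequences of three distinct typical still images.
e = np.eye(3)
f_last, path = global_m_distance(e, [2, 3, 1], e, [2, 3, 1])
print(f_last, path)
```

In this toy case every typical still image aligns with its counterpart, so the path runs along the diagonal and f_last is the sum of the ownership weights of the matched pairs.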

(2) Comparison of similarity using the M-distance for local alignment
When the moving images to be compared differ greatly from each other, the global alignment described above deviates from human perception. In such cases local alignment can be used. Local alignment is effective when the sequences are not similar as a whole and partial similarities are to be found. The algorithm described below adapts the global alignment method above to local alignment and uses the M-distance for local alignment as the measure of similarity. This M-distance for local alignment generalizes the Smith-Waterman algorithm, which finds local alignments in bioinformatics. The M-distance for local alignment applied in this embodiment therefore includes the Smith-Waterman algorithm as a special case.

Hereinafter, the algorithm using the M-distance for local alignment is described in detail with reference to the flowchart of FIG. 8. Because many of its steps overlap with those of the M-distance for global alignment already described, only the differences are noted.

Step S310: Prepare the typical still images of the query moving image and the target moving image
This is the same procedure as for the M-distance for global alignment.

Step S320: Create a table and trace back the path according to the following dynamic programming procedure
Let m be the number of typical still images of the moving image v_A and n the number of typical still images of the moving image v_B. An (m + 1) × (n + 1) matrix is created as the local alignment table. Each component of the matrix is calculated by dynamic programming in the following procedure.

First, in step S321, the gap penalty g set as a design parameter is selected. This is the same procedure as for the M distance for global alignment.

In the next step S322, all components in row i = 0 and column j = 0 are set to 0.

Then, in step S323, starting from position (i, j) = (1, 1), the remaining matrix components f(i, j), other than those in row i = 0 and column j = 0, are calculated based on the following Equation 21.

An arrow is then drawn toward the component that gives the nonzero maximum value. Unlike Equation 18, the right-hand side of Equation 21 has 0 as the leftmost term inside the max. As a result, when the score would become negative, the value is reset to 0 and a new alignment can begin.
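The reset-to-zero behavior of step S323 can be illustrated with a minimal Python sketch. Equation 21 itself is not reproduced in this text, so the recurrence below is an assumption in the style of Smith-Waterman: `score[i][j]` is a hypothetical stand-in for the weighted similarity term between typical still images, and the gap cost is simplified to a constant `gap` rather than a penalty weighted by frame counts.

```python
from typing import List, Tuple


def local_alignment_table(
    score: List[List[float]], gap: float
) -> Tuple[List[List[float]], float, Tuple[int, int]]:
    """Fill an (m+1) x (n+1) local alignment table.

    score[i][j] is an assumed, precomputed match score between typical
    still image i of v_A and typical still image j of v_B (0-based).
    Returns the table, its maximum value f_max, and the position of f_max.
    """
    m = len(score)
    n = len(score[0]) if score else 0
    # Step S322: row i = 0 and column j = 0 are all zero.
    f = [[0.0] * (n + 1) for _ in range(m + 1)]
    f_max, f_max_pos = 0.0, (0, 0)
    # Step S323: fill the remaining components starting from (1, 1).
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            f[i][j] = max(
                0.0,                                   # reset: start a new alignment
                f[i - 1][j - 1] + score[i - 1][j - 1],  # align i with j
                f[i - 1][j] - gap,                     # gap (assumed constant cost)
                f[i][j - 1] - gap,                     # gap (assumed constant cost)
            )
            if f[i][j] > f_max:
                f_max, f_max_pos = f[i][j], (i, j)
    return f, f_max, f_max_pos
```

Tracing back from `f_max_pos` until a zero component is reached would give the local alignment path of step S330; the traceback is omitted here to keep the sketch short.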

Step S330: Determine the local alignment
After all matrix components of the table have been calculated, the local alignment path is obtained by tracing the arrows back from the component holding the maximum value f_max in the matrix f(i, j).

Step S340: Calculate the similarity
Using the same procedure as for the M distance for global alignment, the similarity u(A, B) between the moving image v_A and the moving image v_B is calculated based on Equation 20.

The similar moving image search processing device 2 executes the procedure of stages 1 to 3 above and calculates the similarity u(A, B) of each moving image v_B to the moving image v_A. A moving image similar to the query moving image can thereby be retrieved from among the target moving images.

Note that, in combination with stage 3, the TBAP used in stage 2 above may be replaced by an AP that is not constrained in time. Besides AP, the existing harmonic competition method or the existing k-means method can also be used as the clustering method for similarity evaluation.

Next, the details and results of experiments using the above moving image search device are described.

First, regarding the experimental data: because the end user of a search is a human, the final judgment of similarity depends strongly on human sensibility. It is therefore desirable that the similarity judgment depend as little as possible on the subject matter. With a simplistic data set, however, the judgment would be too easy even for a plain machine, so the experiment used a data set that satisfies the following conditions.

・ The original moving image data is unlabeled.
・ Precision and recall of the search can be judged automatically.
・ Each moving image carries information that captures how its scenes change over time, so that time-dependent properties such as those shown at the lower right of FIG. 3 can be distinguished.

FIG. 9 shows how the original moving image data was generated. The moving images of group A represent "bird's-eye views", and those of group B represent "birds". In the experiment, 20 video classes were created from groups A and B (see FIG. 10). Each of classes 1 to 20 consists of 21 moving images, so a total of 420 distinct test videos were used.

・ Preliminary experiment
Because the convergence of the AP algorithm depends on the data, the range of the parameter λ in Equations 13 and 14 must be checked. As the set of trial data, five Gaussian clusters were generated, and each was further split into two groups along the time direction, giving ten groups. FIG. 11 illustrates these data.

For the set of trial data shown in FIG. 11, experiments were run with the TBAP proposed in this embodiment and with ordinary AP. The main purpose was to check the range of the parameter λ needed for convergence and the effect of the time window parameter ω. Repeated experiments showed that the choice of λ ∈ [0.3, 1.0 − ε] depends on the characteristics of the data. Here ε is a small positive quantity; in the experiment of FIG. 11, ε = 0.02.

Table 1 shows the convergence behavior and the computation time. Convergence was checked by whether the L1 norms of a(i, k) and r(i, k), normalized by their initial values, were below 10^-5. For varying values of the parameter ω of the time window W, Table 1 lists the number of typical still images ("exemplars"), the number of iterations ("iterations"), and the number of messages passed ("message passing"). Table 1 shows, for example, that at ω = 10 the TBAP of this embodiment is six times faster than ordinary AP in terms of message passing.

The case ω = 1 means that every frame is a typical still image. The case ω = n = 100 corresponds to ordinary AP, which involves no time-dependent extraction of typical still images; it was computed as a comparative example to observe the convergence behavior. For ω = 5, 10, 20, and 50, the number of messages passed increases monotonically. Because this increase stems from the size of the time window, it directly affects CPU time. Nevertheless, the ratio of TBAP's message count to that of ordinary AP stays small, so these settings can be regarded as being within a desirable range.

TBAP was applied to the prepared moving image set of FIG. 10 (classes 1 to 20) to find typical still images accompanied by reliable frames. This is an important step for attaching "numerical annotations" to each moving image, and it is equivalent to structuring the unorganized collection of moving images shown in FIG. 10 in terms of numerical tags. Such tags can be calculated either online or offline.

FIG. 12 shows an example of the distance matrix obtained from the similarity measure s(i, j) between the typical still images, obtained by TBAP, of the moving images of the two classes being compared (moving image v_A and moving image v_B). In this figure, v_A has four typical still images and v_B has three, so the matrix is a 3 × 4 matrix. Each number in the table is a component of the distance matrix between the typical still images (representative frames) of v_A and v_B; for example, the distance between representative frame 2 of v_A and representative frame 3 of v_B is 0.26. The distance quantifies the similarity between representative frames: the smaller the distance, the more similar the two representative frames, and the larger the distance, the more they differ.

FIG. 13 is an example of a table created by applying the M distance for global alignment described above. In creating this global alignment table, the algorithm parameters were set to bar(D) = 1/√(1 − (1/d)), d = 3 × 256 = 768, and g = 0.05. In the figure, the number 26.60 at the lower right of the table is the value f_last of the final component, and the optimal path traced back from it is shaded as the global alignment. To obtain the similarity u(A, B) between the moving image v_A and the moving image v_B, the calculated value f_last is then normalized by the averaging function. For example, when normalizing by the sum of the products of the gap g and the frame counts, the sum in FIG. 13 is 0.05 × (8 + 12 + 10 + 7) + 0.05 × (12 + 7 + 10) = 1.85 + 1.45 = 3.30, so the normalized similarity is u(A, B) = 26.60 / (1.85 + 1.45) = 8.06. The matching pattern can be found by following the arrows from the last row and last column of the matrix. In this way, an appropriate similarity u(A, B) between v_A and v_B can be calculated with the weighted M distance for global alignment.
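The normalization arithmetic of the FIG. 13 example can be checked with a few lines of Python. The variable names are illustrative, and assigning the four frame counts to v_A and the three to v_B is an inference from the stated numbers of typical still images, not something the text spells out.

```python
# Values quoted in the FIG. 13 example.
gap = 0.05                     # gap penalty g
frames_a = [8, 12, 10, 7]      # frame counts per typical still image (assumed to be v_A)
frames_b = [12, 7, 10]         # frame counts per typical still image (assumed to be v_B)
f_last = 26.60                 # last-row, last-column value of the table

# Normalizer: sum of the products of the gap and the frame counts.
norm = gap * sum(frames_a) + gap * sum(frames_b)   # 1.85 + 1.45

# Normalized similarity u(A, B).
u = f_last / norm

print(f"normalizer = {norm:.2f}, u(A, B) = {u:.2f}")
```

Rounded to two decimals, this reproduces the normalizer 3.30 and the similarity 8.06 given in the example.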

FIG. 14 is an example of a table created by applying the M distance for local alignment described above. The algorithm parameters were set to bar(D) = √(1 − (1/d)), d = 3 × 256 = 768, and g = 0.05. In the local alignment table, the components whose scores are negative in the global alignment table, that is, the components calculated by Equations 16 and 17, can all be seen to be reset to 0. Here the maximum value f_max is 26.95, and the optimal path traced back from the component holding f_max is shaded as the local alignment. In this way, even when the moving images being compared differ greatly, an appropriate similarity u(A, B) between v_A and v_B can be calculated with the weighted M distance for local alignment.

As a preferred modification, the similarity calculation based on the M distance can be restricted, in order to reduce the amount of computation, to a band of arbitrary width r expressed as a fraction of the whole sequence, as shown in FIG. 15. This speeds up the calculation by specializing the M distance to similarity calculation for content-based moving image search. In FIG. 15, the blacked-out cells indicate where the calculation was skipped. The figure corresponds to the global alignment table of FIG. 13. In this case, the calculation range is defined by the following Equation 22.

In Equation 22, m is the number of representative frames of the moving image v_A and n is the number of representative frames of the moving image v_B. The width r is a parameter expressing how tightly the calculation range is narrowed in the similarity comparison. Narrowing the range as in Equation 22 reduces the amount of computation by 100 × (1 − r)^2 % compared with calculating all matrix components. In the example of FIG. 15, the result is the same as that of FIG. 13, where the range was not narrowed, even though the calculation range was restricted.
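The stated saving of 100 × (1 − r)^2 % can be illustrated with a short Python sketch. Equation 22 is not reproduced in this text, so the band condition used here, |i/m − j/n| ≤ r, is an assumption chosen only because it skips roughly a fraction (1 − r)^2 of the cells; the function names are hypothetical.

```python
def in_band(i: int, j: int, m: int, n: int, r: float) -> bool:
    """Assumed band test |i/m - j/n| <= r, written without division
    so that boundary cells are not affected by floating-point rounding."""
    return abs(i * n - j * m) <= r * m * n


def skipped_fraction(m: int, n: int, r: float) -> float:
    """Fraction of the m x n interior cells that fall outside the band."""
    skipped = sum(
        1
        for i in range(1, m + 1)
        for j in range(1, n + 1)
        if not in_band(i, j, m, n, r)
    )
    return skipped / (m * n)


if __name__ == "__main__":
    for r in (0.25, 0.5, 0.75, 1.0):
        print(f"r = {r}: skipped {skipped_fraction(200, 200, r):.4f}, "
              f"(1 - r)^2 = {(1 - r) ** 2:.4f}")
```

For small tables such as the 3 × 4 example, the discrete cell count can differ noticeably from (1 − r)^2; the agreement improves as m and n grow.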

Next, the relationship between the time window parameter ω and the recall and precision obtained after calculating the similarity u(A, B) is described. Here, after all moving images have been ranked in descending order of similarity, recall and precision are defined as in the following Equation 23.

Recall := (number of correct moving images found by the search) / (number of moving images belonging to the same class)
Precision := (number of correct moving images found by the search) / (number of top-ranked moving images retrieved)

Recall is the proportion of search results satisfying the search request relative to all moving images that satisfy it, and is an index of the completeness of the search. Precision is the proportion of search results satisfying the search request relative to all search results, and is an index of the accuracy of the search. In general, the higher the recall and precision, the better the search performance is considered to be.
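The two definitions can be written as a small Python function. This is a minimal sketch: the function name and arguments are illustrative, and it assumes that each retrieved moving image carries a class label and that a result is "correct" when its label equals the query's class.

```python
from typing import Sequence, Tuple


def precision_recall(
    ranked_labels: Sequence[str], query_label: str, class_size: int, top_k: int
) -> Tuple[float, float]:
    """Precision and recall over the top_k results of a ranked list.

    ranked_labels: class labels of the results, in descending order of similarity.
    class_size:    number of moving images belonging to the query's class.
    """
    hits = sum(1 for label in ranked_labels[:top_k] if label == query_label)
    precision = hits / top_k      # correct results / retrieved top-ranked results
    recall = hits / class_size    # correct results / size of the query's class
    return precision, recall


if __name__ == "__main__":
    ranking = ["a", "a", "b", "a", "b"]
    print(precision_recall(ranking, "a", class_size=4, top_k=2))
    print(precision_recall(ranking, "a", class_size=4, top_k=4))
```

Sweeping `top_k` from 1 to the length of the ranking yields the points of a precision-recall curve.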

FIG. 16 shows the precision-recall curves obtained in the experiment. In the figure, the vertical axis is precision and the horizontal axis is recall. The closer a curve lies to the upper right of the graph, the higher the search performance. Results are shown for TBAP with time window parameters ω = 5 and ω = 10. As noted in the box in the figure, the function h(x) of Equation 20 was defined and its coefficient was set to a = 1. For comparison, the result for ordinary AP ("Plain AP") is also shown.

These results show that, in the region of interest where recall of up to 20% is required (the region marked "ROI" in the figure), the precision of the TBAP of this embodiment is very high and its performance is satisfactory. Ordinary AP, by contrast, does not take the time axis into account, and its performance is seen to be inferior to that of the TBAP of the present invention.

The system configuration and algorithms of the moving image search device of this embodiment have been presented above. Moving images are large in data size, so the typical still images and the sets of frames for which they are responsible are normally calculated offline. Such sets can therefore serve as numerical labels for the structural information of a large data collection. In addition, taking into account that each typical still image has frames for which it is responsible, this embodiment proposes, as the M distance, a new similarity calculation method that subsumes the Levenshtein distance. The M distance can be used for both global alignment and local alignment of the selected typical still images.

In particular, this embodiment presents the idea of performing moving image search by giving a large number of moving images soft labels that express their mutual relationships and then comparing those labels. Its features are as follows.
・ Multiple frames in each moving image are selected as typical still images.
・ Each typical still image has frames for which it is responsible, and thus has a number of neighboring frames that form its sphere of influence.
・ A calculation method is obtained that numerically compares the similarity between sets of typical still images that differ in number and in sphere of influence, and the method can also be accelerated. This is retrieval between moving images. The method is a complete generalization of methods for measuring the similarity of bioinformatic sequences and may lead to new developments in that field.
・ As a method for automatically computing the typical still images and their spheres of influence within a given moving image, a time window is introduced into ordinary affinity propagation, so that the temporal progression characteristic of moving images is reflected. This is retrieval within a single moving image (intra-video retrieval) and can be computed offline.

In this embodiment, a feature quantity (for example, a luminance histogram) is extracted from each frame of a moving image, and the frames are clustered, for example with AP, to obtain a representative frame for each scene. By using a time window of a predetermined width to limit the number of frames to which AP is applied, a time series of representative frames is extracted that tolerates gaps between the query moving image and the target moving image, and the search is performed by judging both the similarity between representative frames and the similarity of their time series. For this judgment, a new M distance incorporating a gap penalty g was proposed, and desirable recall and precision were ultimately achieved.

As described above, this embodiment proposes a moving image search method using the system of a moving image search device that calculates the similarity of each target moving image to a query moving image and searches for moving images similar to the query moving image. The method consists of the following three steps.

In the first step, the first feature quantity extraction means 12 and the second feature quantity extraction means 22, which serve as the feature quantity extraction means incorporated in the processing device 2, extract feature quantities from each frame of the query moving image and the target moving image, preferably using CSD. In the second step, the query moving image representative frame selection means 13 and the target moving image representative frame selection means 23, which serve as the representative frame selection means incorporated in the processing device 2, cluster the frames of the query moving image and the target moving image based on the relationships among the feature quantities obtained in the first step. The clustering is performed within the time width limited by a time window, preferably using TBAP, a modified k-means method, or a modified harmonic competition method, and the time window is slid along the time axis. Representative frames are thereby acquired in time series for each scene. In the third step, the similarity calculation means 31 and the similar moving image display control means 32, which serve as the moving image search means incorporated in the processing device 2, obtain the similarity between the query moving image and the target moving image from the inter-representative-frame similarity between the representative frames of the query moving image and those of the target moving image, and search for moving images similar to the query moving image.

With this method, while the time window is slid along the time axis over the query moving image and the target moving image, clustering based on the feature quantity of each frame can be performed on the frames within the time width limited by the window at lower computational cost. For moving images, which have a time axis, it therefore becomes possible to acquire a representative frame for each scene with the time series taken into account. By using these time-series representative frames to judge the similarity between the query moving image and the target moving image, moving images similar to the query can be retrieved from among the target moving images. Providing a means of numerically calculating the mutual relationships among a large number of moving images, that is, their topology, makes it possible to search and group moving images using a moving image itself as the query, easily and with high accuracy while keeping the computational cost low.

In the third step, the number of frames E^A_i represented by the i-th representative frame of the query image and the number of frames E^B_j represented by the j-th representative frame of the target moving image are obtained. The similarity measure s(i, j) between the i-th representative frame of the query image and the j-th representative frame of the target moving image, obtained by TBAP in the second step, is acquired as the inter-representative-frame similarity.

Then, with the gap penalty g set, let m be the number of representative frames in the query moving image and n the number of representative frames in the target moving image. For the components f(i, j) of the (m + 1) × (n + 1) matrix, each component f(0, j) in row i = 0 is calculated based on Equation 16, and each component f(i, 0) in column j = 0 is calculated based on Equation 17. With r(i, j) denoting the weight calculated from the frame counts E^A_i and E^B_j, the remaining matrix components f(i, j), other than those in row i = 0 and column j = 0, are calculated based on Equation 18. Once the value f_last in the last row and last column of the matrix has been calculated, it is used to calculate the similarity u(A, B) between the query moving image and the target moving image, for example by Equation 20, and moving images similar to the query moving image are searched for.

With this method, even when the query moving image and the target moving image have different numbers of representative frames, introducing the gap penalty g allows both the similarity between their representative frames and the similarity of their time series to be judged correctly in the search. A global alignment path can also be obtained from the components of the resulting matrix f(i, j). Moreover, this similarity calculation, which is suited to moving image data, includes the Levenshtein distance and the Needleman-Wunsch algorithm as mere special cases, and can also be applied to measuring the similarity of bioinformatic sequences.

As another approach, in the third step the frame counts E^A_i and E^B_j and the similarity measure s(i, j) are obtained and the gap penalty g is set. For the components f(i, j) of the (m + 1) × (n + 1) matrix, every component f(0, j) in row i = 0 and every component f(i, 0) in column j = 0 is set to 0. With r(i, j) denoting the weight calculated from the frame counts E^A_i and E^B_j, the remaining matrix components f(i, j), other than those in row i = 0 and column j = 0, are calculated based on Equation 21. Once the maximum value f_max among the matrix elements has been found, it is used to calculate the similarity u(A, B) between the query moving image and the target moving image, for example by Equation 20 with f_last replaced by f_max, and moving images similar to the query moving image are searched for.

In this case too, even when the query moving image and the target moving image have different numbers of representative frames, introducing the gap penalty g allows both the similarity between their representative frames and the similarity of their time series to be judged correctly in the search. A local alignment path can also be obtained from the components of the resulting matrix f(i, j). Moreover, this similarity calculation, which is suited to moving image data, includes the Smith-Waterman algorithm as a mere special case, and can also be applied to measuring the similarity of bioinformatic sequences.

In the moving image search method described above, it is preferable, particularly in the second step, to acquire the representative frames by TBAP. With TBAP, the series of image frames inside the time window is taken as the calculation target and the representative frames are selected from it in a single stage, which simplifies the calculation procedure.

The effects described above are likewise achieved by a moving image search device capable of executing the first to third steps, and by a moving image search program that causes the processing device 2, which is a computer, to function as that moving image search device.

Embodiments of the present invention have been described above by way of examples, but the present invention is in no way limited to these examples and can of course be implemented in various forms without departing from the gist of the invention. For example, the parameter values described above can be changed as appropriate by operating the input means 4.

The present invention can be used in, for example, the industry that manufactures the information processing methods and information processing devices described above. The concept of similarity presented in the present invention is a complete generalization of methods for measuring the similarity of bioinformatic sequences, and can also be applied in that field. In particular, the TBAP of the present invention is widely applicable to data carrying time-series information; in natural language processing, for example, it can be used for topic extraction that takes the time series into account. TBAP is thus expected to find use in a variety of fields.

12 First feature amount extraction means (feature amount extraction means)
13 Query moving image representative frame selection means (representative frame selection means)
22 Second feature amount extraction means (feature amount extraction means)
23 Target moving image representative frame selection means (representative frame selection means)
31 Similarity calculation means (moving image search means)
32 Similar moving image display control means (moving image search means)

Claims

A moving image search method performed by a moving image search apparatus that calculates a similarity of a target moving image with respect to a query moving image and searches for a moving image similar to the query moving image, the method comprising:
a first step of extracting a feature amount from each frame of the query moving image and the target moving image;
a second step of, for the query moving image and the target moving image, clustering the frames from the relationship between the feature amounts within a time width limited by a time window, and sliding the time window along the time axis, thereby acquiring a representative frame for each scene in time series; and
a third step of obtaining the similarity between the query moving image and the target moving image from the inter-representative-frame similarity between the representative frames of the query moving image and the representative frames of the target moving image, and searching for a moving image similar to the query moving image.

The moving image search method according to claim 1, wherein, in the third step,
the number of frames E^A_i represented by the i-th representative frame of the query moving image, the number of frames E^B_j represented by the j-th representative frame of the target moving image, and the inter-representative-frame similarity s(i, j) between the i-th representative frame of the query moving image and the j-th representative frame of the target moving image are obtained,
a gap penalty g is set,
where m is the number of representative frames in the query moving image and n is the number of representative frames in the target moving image, for the components f(i, j) of an (m + 1) × (n + 1) matrix,
each component f(0, j) in the row i = 0 is calculated based on the following equation 1,
each component f(i, 0) in the column j = 0 is calculated based on the following equation 2,
where r(i, j) is a weight calculated according to the numbers of frames E^A_i and E^B_j, the remaining matrix components f(i, j), other than the row i = 0 and the column j = 0, are calculated based on the following equation 3, and
the similarity between the query moving image and the target moving image is calculated using the value in the last row and last column of the matrix.
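Equations 1 to 3 are not reproduced in this text, so the following Python sketch only illustrates the general shape of such a calculation: a global, Needleman-Wunsch-style alignment over representative frames. The boundary rule (accumulated gap penalties scaled by frame counts), the weight r(i, j) = (E^A_i + E^B_j) / 2, and the three-way maximum are assumptions chosen for the example, not the claimed formulas; the function name is hypothetical.

```python
def global_alignment_score(s, ea, eb, g):
    """Fill an (m + 1) x (n + 1) matrix f and return f[m][n].

    s:  m x n list of lists, s[i][j] = similarity of representative frames.
    ea: frame counts E^A_i for the query (length m).
    eb: frame counts E^B_j for the target (length n).
    g:  gap penalty.
    All update rules below are assumed, not taken from the source.
    """
    m, n = len(ea), len(eb)
    f = [[0.0] * (n + 1) for _ in range(m + 1)]

    # Row i = 0 and column j = 0: accumulated, frame-weighted gap penalties.
    for j in range(1, n + 1):
        f[0][j] = f[0][j - 1] - g * eb[j - 1]
    for i in range(1, m + 1):
        f[i][0] = f[i - 1][0] - g * ea[i - 1]

    # Remaining components: best of match, gap in target, gap in query.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            r = (ea[i - 1] + eb[j - 1]) / 2.0  # assumed weight r(i, j)
            f[i][j] = max(
                f[i - 1][j - 1] + r * s[i - 1][j - 1],
                f[i - 1][j] - g * ea[i - 1],
                f[i][j - 1] - g * eb[j - 1],
            )
    return f[m][n]


if __name__ == "__main__":
    s = [[1.0, 0.0], [0.0, 1.0]]
    print(global_alignment_score(s, ea=[2, 3], eb=[2, 3], g=0.5))
```

Reading the score from the last row and last column forces every representative frame of both moving images to take part in the alignment, which suits comparisons of whole videos.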

The moving image search method according to claim 1, wherein, in the third step,
the number of frames E^A_i represented by the i-th representative frame of the query moving image, the number of frames E^B_j represented by the j-th representative frame of the target moving image, and the inter-representative-frame similarity s(i, j) between the i-th representative frame of the query moving image and the j-th representative frame of the target moving image are obtained,
a gap penalty g is set,
where m is the number of representative frames in the query moving image and n is the number of representative frames in the target moving image, for the components f(i, j) of an (m + 1) × (n + 1) matrix,
each component f(0, j) in the row i = 0 and each component f(i, 0) in the column j = 0 are all set to 0,
where r(i, j) is a weight calculated according to the numbers of frames E^A_i and E^B_j, the remaining matrix components f(i, j), other than the row i = 0 and the column j = 0, are calculated based on the following equation 4, and
the similarity between the query moving image and the target moving image is calculated using the maximum value among the components of the matrix.
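Equation 4 is not reproduced in this text. The Python sketch below shows one plausible reading, a local, Smith-Waterman-style alignment with zero boundaries and the matrix maximum as the score. The floor at 0, the weight r(i, j) = (E^A_i + E^B_j) / 2, and the frame-weighted gap terms are assumptions made for illustration, and the function name is hypothetical.

```python
def local_alignment_score(s, ea, eb, g):
    """Fill an (m + 1) x (n + 1) matrix f and return its largest component.

    s:  m x n list of lists, s[i][j] = similarity of representative frames
        (negative values mark dissimilar pairs).
    ea: frame counts E^A_i for the query (length m).
    eb: frame counts E^B_j for the target (length n).
    g:  gap penalty.
    All update rules below are assumed, not taken from the source.
    """
    m, n = len(ea), len(eb)
    # Row i = 0 and column j = 0 stay at 0, as stated in the claim.
    f = [[0.0] * (n + 1) for _ in range(m + 1)]
    best = 0.0

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            r = (ea[i - 1] + eb[j - 1]) / 2.0  # assumed weight r(i, j)
            f[i][j] = max(
                0.0,  # assumed floor: a poor prefix is dropped
                f[i - 1][j - 1] + r * s[i - 1][j - 1],
                f[i - 1][j] - g * ea[i - 1],
                f[i][j - 1] - g * eb[j - 1],
            )
            best = max(best, f[i][j])
    return best


if __name__ == "__main__":
    s = [[-1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]]
    print(local_alignment_score(s, ea=[1, 1], eb=[1, 1, 1], g=1.0))
```

Taking the maximum over the whole matrix rewards the best-matching stretch of scenes, so a short query can score well against a longer target that contains it.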

The moving image search method according to claim 1, wherein in the second step, the representative frame is acquired by a time-constrained affinity propagation method.

A moving image search apparatus that calculates a similarity of a target moving image to be searched with respect to a query moving image and searches for a moving image similar to the query moving image, the apparatus comprising:
feature amount extraction means for extracting a feature amount from each frame of the query moving image and the target moving image;
representative frame selection means for, for the query moving image and the target moving image, clustering the frames from the relationship between the feature amounts within a time width limited by a time window, and sliding the time window along the time axis, thereby acquiring a representative frame for each scene in time series; and
moving image search means for obtaining the similarity between the query moving image and the target moving image from the inter-representative-frame similarity between the representative frames of the query moving image and the representative frames of the target moving image, and searching for a moving image similar to the query moving image.

The moving image search apparatus according to claim 5, wherein the moving image search means
obtains the number of frames E^A_i represented by the i-th representative frame of the query moving image, the number of frames E^B_j represented by the j-th representative frame of the target moving image, and the inter-representative-frame similarity s(i, j) between the i-th representative frame of the query moving image and the j-th representative frame of the target moving image,
sets a gap penalty g,
where m is the number of representative frames in the query moving image and n is the number of representative frames in the target moving image, for the components f(i, j) of an (m + 1) × (n + 1) matrix,
calculates each component f(0, j) in the row i = 0 based on the following equation 5,
calculates each component f(i, 0) in the column j = 0 based on the following equation 6,
where r(i, j) is a weight calculated according to the numbers of frames E^A_i and E^B_j, calculates the remaining matrix components f(i, j), other than the row i = 0 and the column j = 0, based on the following equation 7, and
calculates the similarity between the query moving image and the target moving image using the value in the last row and last column of the matrix.

The moving image search apparatus according to claim 5, wherein the moving image search means
obtains the number of frames E^A_i represented by the i-th representative frame of the query moving image, the number of frames E^B_j represented by the j-th representative frame of the target moving image, and the inter-representative-frame similarity s(i, j) between the i-th representative frame of the query moving image and the j-th representative frame of the target moving image,
sets a gap penalty g,
where m is the number of representative frames in the query moving image and n is the number of representative frames in the target moving image, for the components f(i, j) of an (m + 1) × (n + 1) matrix,
sets each component f(0, j) in the row i = 0 and each component f(i, 0) in the column j = 0 to 0,
where r(i, j) is a weight calculated according to the numbers of frames E^A_i and E^B_j, calculates the remaining matrix components f(i, j), other than the row i = 0 and the column j = 0, based on the following equation 8, and
calculates the similarity between the query moving image and the target moving image using the maximum value among the components of the matrix.

The moving image search apparatus according to claim 5, wherein the representative frame selection means is configured to acquire the representative frame by a time-constrained affinity propagation method.

A moving image search program for causing a computer to function as the moving image search apparatus according to any one of claims 5 to 8.