JP5116017B2 - Video search method and system

Info

Publication number: JP5116017B2
Application number: JP2007226433A
Authority: JP
Inventors: 啓一郎帆足; 俊晃上向; 一則松本
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2007-08-31
Filing date: 2007-08-31
Publication date: 2013-01-09
Anticipated expiration: 2027-08-31
Also published as: JP2009060413A

Description

The present invention relates to a video search method and system, and more particularly to a video search method and system capable of high-accuracy video search even for videos that have no cut points or shots, or for low-quality videos in which cut points and shots are difficult to identify.

One known way of finding a required video among a large number of videos is to extract feature amounts relating to luminance, color, and the like (image feature amounts) from the search target videos in advance and to search for the required video based on these image feature amounts. Such video search methods can be broadly divided into two types: exact search, which strictly searches for a specific piece of video content itself, for example content illegally leaked onto the network, and fuzzy search, which searches for videos similar to a given video. The conventional video search methods are described here taking fuzzy search as the example.

The TREC Video Retrieval Workshop (TRECVID) is known as a workshop for evaluating video search technology. The experiments conducted at TRECVID include a search task, and Non-Patent Documents 1 and 2 describe the TRECVID search task.

In these search methods, the features used to find the shots that match a search query among the "shots" of the search target videos are mid-level feature amounts inferred from low-level feature amounts such as color information, together with text information attached to each shot (closed captions, speech recognition results, and the like). However, in these search methods the search targets are limited to relatively short individual shots.

Similarity search methods for videos composed of multiple shots, such as television programs, are disclosed in Non-Patent Documents 3, 4, 5, and 6 and in Patent Document 1.

Non-Patent Document 3 discloses a video search technique in which the unit of search is a segment composed of multiple shots, used for example to find the same topic across many news programs. There, the same topic is retrieved from news programs by calculating the similarity between videos with a similarity measure called the Earth Mover's Distance (EMD).

Non-Patent Document 4 and Patent Document 1 apply the technique of Non-Patent Document 3 and propose a video topic tracking system that automatically tracks the development of the same topic across news programs. These documents introduce a method that updates a user profile based on relevance feedback from the user and thereby improves the accuracy of subsequent topic tracking.

Non-Patent Document 5 proposes a technique that searches for the same topic using visual feature amounts extracted from the video in addition to the text information attached to news programs. Specifically, it proposes a method that identifies the persons appearing in a search target video by a technique based on face detection and uses the result to link items on the same topic. For videos in which no person appears, it proposes linking items on the same topic by applying, for example, detection of the same news footage (detecting whether the video was jointly distributed).

The conventional techniques described above are search methods whose effectiveness has been shown for professionally produced videos such as television programs. In recent years, however, "video sharing sites" typified by YouTube have become very popular. On such sites, ordinary users can upload videos they have shot themselves and make them available to other visitors. When uploading a video, the user can attach category information and "tags" that describe its content. A visitor looking for a video to watch can then search using this category information or tag information as the key.

Non-Patent Document 1: S.-F. Chang et al., "Columbia University TRECVID 2005 Video Search and High-level Feature Extraction", Proc. of TRECVID 2005, 2005.
Non-Patent Document 2: A. G. Hauptmann et al., "CMU Informedia's TRECVID 2005 Skirmishes", Proc. of TRECVID 2005, 2005.
Non-Patent Document 3: Y. Peng et al., "EMD-based video clip retrieval by many-to-many matching", Proc. CIVR 2005, pp. 71-81, 2005.
Non-Patent Document 4: M. Uddenfeldt et al., "Adaptive video news story tracking based on Earth Mover's Distance", Proc. ICME 2006, pp. 1029-1032, 2006.
Non-Patent Document 5: Y. Zhai et al., "Tracking news stories across different sources", Proceedings of ACM Multimedia 2005, pp. 2-9, 2005.
Non-Patent Document 6: Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas, "The earth mover's distance as a metric for image retrieval", Int. J. Comput. Vision, vol. 40, no. 2, pp. 99-121, 2000.
Patent Document 1: Japanese Patent Application No. 2006-93841

In each of the patent and non-patent documents above, the search targets are limited to professionally produced videos such as television programs. Professionally produced videos usually contain many shots as a result of editing, and the conventional techniques above presuppose that shots can be extracted for the search to be effective.

Shots in a video can be identified from cut points, but many of the videos uploaded to video sharing sites are low-quality videos shot by amateur users with mobile phones and similar devices. When the conventional search methods are applied to such videos, the low quality degrades cut point detection accuracy, so the shots obtained do not match the actual content of the video, and the search methods are unlikely to work effectively.

Moreover, many videos shot by ordinary users are thought to contain no shots at all in the first place. It is difficult to apply the conventional search methods to such videos as they are. For example, when the similarity between videos is calculated using EMD, the distances between the key frames of the shots extracted from the search target videos and the lengths of the individual shots are used as feature amounts. If a search target video has no cut points, only one shot is extracted, so the similarity between videos is in effect calculated from the distance between key frames alone. Consequently, if the key frames that happen to be extracted have similar features, a high similarity is calculated even when the content other than the key frames is completely different in nature, which degrades search accuracy. The same problem can equally make the calculated similarity low for videos that are in fact highly similar, so the user is likely to obtain unsatisfactory search results.

An object of the present invention is to solve the problems of the conventional techniques described above and to provide a video search method and system that enable clustering, and hence high-accuracy video search, even for videos that have no cut points or shots or for low-quality videos in which cut points and shots are difficult to identify.

To achieve the above object, the present invention is characterized by the following means.

The video search system of the present invention is characterized by comprising: frame feature amount extraction means for extracting feature amounts frame by frame from each search target video; clustering means for classifying the frames of each search target video into a plurality of clusters based on the extracted frame feature amounts; query video specification means for specifying a query video; similarity calculation means for calculating the similarity between each search target video and the query video based on the clustering results of the videos; and search result output means for outputting the calculated similarity.

According to the present invention, the following effects are achieved.

According to the video search method and system of the present invention, each search target video is analyzed frame by frame to extract feature amounts, frames with similar feature amounts are classified into the same cluster for each video, and the similarity is calculated by treating each cluster as a shot. High-accuracy video search is therefore possible even for videos that have no cut points or shots, or for low-quality videos in which cut points and shots are difficult to identify.

The best mode for carrying out the present invention is described in detail below with reference to the drawings. FIG. 1 is a functional block diagram showing the configuration of the main part of a video search system according to the present invention.

The video database (DB) 11 stores a large number of videos to be searched. The feature amount extraction module 12 analyzes each video stored in the video DB 11 in predetermined units and extracts image feature amounts. In this embodiment, each video is analyzed frame by frame and the color information of each frame is extracted. Based on the frame feature amounts extracted for each frame by the feature amount extraction module 12, the clustering module 13 performs, for each search target video, a clustering process that classifies closely related frames into the same group. This clustering process takes into account the nature of videos shot by individual users, which are expected to have few cut points. The clustering result is stored in the feature amount DB 14 for each video.

In the query video specification module 15, the user specifies the video that serves as the search query (hereinafter also called the query video). The user may specify the query video from among the search target videos stored in the video DB 11, or may input a query video separately. When a query video is input separately, the query video specification module 15 needs to be configured, as shown in FIG. 2, with a query video input module 15a that accepts the query video from outside, a query feature amount extraction module 15b that analyzes the input query video and extracts feature amounts, and a clustering module 15c that classifies closely related frames into the same group based on the query feature amounts extracted for each frame of the query video.

As described in detail later, the similarity calculation module 16 calculates the similarity between the query video and every search target video based on a comparison of the clusters of the videos. The search result output module 17 presents the search results for the specified query video to the user of the system. Specifically, search target videos with high similarity to the query video are displayed on the system's GUI.

Next, the operation of the present invention is described in detail with reference to flowcharts. The operation is broadly divided into a "search target video feature amount extraction process", which extracts feature amounts from the search target videos in advance, and a "video search process", which searches for videos similar to a query video based on the feature amounts of the search target videos and those of the query video.

FIG. 3 is a flowchart showing the procedure of the "search target video feature amount extraction process". In step S1, one of the search target videos registered in the video DB 11 is loaded into the feature amount extraction module 12. In step S2, the feature amount extraction module 12 executes feature extraction and extracts an image feature amount for each frame of the current search target video. The image feature amounts extracted here are, for example, the color layout information defined in MPEG-7 Visual (see ISO/IEC 15938-3, "Information technology --- Multimedia content description interface --- Part 8: Extraction and use of MPEG-7 descriptions," Dec. 2002) or a color histogram representing the color distribution in the frame image.
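As one illustration of the color histogram mentioned above, the following minimal Python sketch computes a normalized RGB histogram for a single frame. The function name color_histogram, the representation of a frame as a list of (r, g, b) tuples with 8-bit channels, and the parameter bins_per_channel are assumptions made for illustration; the specification does not prescribe them.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Return a normalized RGB histogram for one frame.

    pixels: list of (r, g, b) tuples, each channel in 0..255 (assumed format).
    The result has bins_per_channel ** 3 entries that sum to 1.0.
    """
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each 8-bit channel value to a bin index in 0..bins_per_channel-1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    total = len(pixels)
    if total == 0:
        return hist
    return [h / total for h in hist]
```

Because such a histogram is an ordinary numeric vector, centroid vectors and Euclidean distances are well defined for it, which is the property a histogram-based clustering criterion would rely on.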

In step S3, it is determined whether feature amount extraction from the current search target video is complete. If not, the process returns to step S2 and feature extraction is performed in the same way on the next frame. Because extracting image feature amounts from every frame of a search target video would require an enormous amount of computation, it is desirable to reduce the computation, for example by selecting only m of the n frames that make up the video according to a predetermined rule (for example, at equal intervals) as the frames from which feature amounts are extracted.
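The equal-interval selection of m frames out of n can be sketched as follows. The function name select_frame_indices and the exact rounding rule are assumptions for illustration; the specification only states that a predetermined rule such as equal intervals is used.

```python
def select_frame_indices(n, m):
    """Pick m frame indices out of n frames at (approximately) equal intervals.

    Returns a strictly increasing list of 0-based indices.
    If m >= n, every frame is selected.
    """
    if n <= 0 or m <= 0:
        return []
    if m >= n:
        return list(range(n))
    # Integer arithmetic: index i maps to floor(i * n / m).
    return [(i * n) // m for i in range(m)]
```

For example, select_frame_indices(10, 5) selects every second frame starting from frame 0.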

When feature amount extraction is complete for the current search target video, in step S4 the clustering module 13 executes frame clustering, and the frames from which image feature amounts were extracted are classified into a plurality of clusters based on those feature amounts. The purpose of this process is to construct pseudo "shots" by clustering frames with similar features in videos shot by ordinary users, which are expected to have few or no cut points.

FIG. 4 is a conceptual diagram of the frame clustering process. Of all the frames that make up a search target video, only the frames on which feature extraction was performed are classified into one of a plurality of frame clusters C1, C2, ... based on their image feature amounts. In the example of FIG. 4, three frames are classified into frame cluster C1 and four frames into frame cluster C2. Clustering in this way makes it possible to extract fine-grained feature amounts that reflect the content of the video even when the search target video has no cut points, and these can be used in the subsequent calculation of similarity between videos.

FIG. 5 is a flowchart showing the procedure of the clustering process executed in step S4. The description here takes as an example the case where 15 frame feature amounts have been extracted from a search target video.

Clustering algorithms include bottom-up hierarchical algorithms, which successively merge similar frames and clusters, and top-down (non-hierarchical) algorithms such as k-means, which perform clustering after the number of clusters has been fixed in advance. For video search, determining the number of clusters dynamically according to the content of the video is considered to represent the characteristics of the video itself better, so the bottom-up type is used as the example here.

In step S401, one of the frame feature amounts is read in, and in step S402 a cluster is generated with this frame feature amount as its only member. In step S403, the representative vector of that cluster is generated. In step S404, it is determined whether clusters and cluster representative vectors have been generated for all frame feature amounts of the current search target video. If not, the process returns to step S401, the next frame feature amount is read in, and the above steps are repeated.

Once the above processing is complete for all 15 frame feature amounts and the 15 clusters C1 to C15 and their representative vectors have been generated, in step S405 the distance between two clusters Ci and Cj is calculated for all clusters. This embodiment adopts Ward's method, a representative hierarchical clustering technique, and the distance d(Ci, Cj) between two clusters Ci and Cj is first calculated by equation (1). Here, E(Ci) denotes the sum of the squared distances between every data item belonging to cluster Ci and the representative vector (centroid) of cluster Ci.
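Equation (1) itself is not reproduced in this text. In the standard formulation of Ward's method, which is consistent with the definition of E(Ci) given above, the inter-cluster distance is the increase in the within-cluster sum of squares caused by merging the two clusters. The form below is offered on that assumption; the symbols D (the distance between a data item and a representative vector) and c_i (the representative vector of Ci) are introduced here for illustration.

```latex
d(C_i, C_j) = E(C_i \cup C_j) - E(C_i) - E(C_j),
\qquad
E(C_i) = \sum_{x \in C_i} D(x, c_i)^2
```

Under this form, two clusters are close when merging them adds little to the total squared deviation from the representative vectors.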

In step S406, among all combinations of the clusters existing at that point, the two clusters with the smallest inter-cluster distance d(Ci, Cj) are merged into one cluster. In step S407, the representative vector of the merged cluster is generated.

When the data belonging to a cluster are ordinary numeric vectors, the representative vector of the cluster can be obtained, for example, by calculating the centroid vector, and the distance between data items can be calculated as, for example, the Euclidean distance. When color layout information is used as the frame feature amount, however, the vector values are DCT coefficients, so simply adding the element values can give a theoretically unsound result.

In this embodiment, therefore, the distance between color layouts defined in Non-Patent Document 1 is adopted as the method of calculating the distance between data items. The cluster representative vector is formed by taking, for each element, the mode of that element over all data (vectors) belonging to the cluster. When an element has more than one mode, a rule such as adopting the element value of the frame with the smallest frame number is applied.
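The element-wise mode with the smallest-frame-number tie-break can be sketched as follows. The function name representative_vector and the input format, a list of (frame_number, vector) pairs, are assumptions for illustration.

```python
from collections import Counter


def representative_vector(frames):
    """Build a cluster representative vector from element-wise modes.

    frames: non-empty list of (frame_number, vector) pairs, where all
    vectors have the same length (assumed input format).
    Ties between modes are resolved in favor of the value that occurs in
    the frame with the smallest frame number.
    """
    ordered = sorted(frames, key=lambda f: f[0])  # ascending frame number
    dim = len(ordered[0][1])
    rep = []
    for k in range(dim):
        values = [vec[k] for _, vec in ordered]
        counts = Counter(values)
        top = max(counts.values())
        # First value, in frame-number order, whose count equals the maximum.
        rep.append(next(v for v in values if counts[v] == top))
    return rep
```

With quantized values such as DCT coefficients, repeated values are common, so the mode is usually well defined and the tie-break is only needed occasionally.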

When the frame feature amount is an image feature amount for which a centroid vector and Euclidean distance can be calculated, such as a color histogram, clustering may be performed on the basis of those measures instead.

In step S408, it is determined whether the clusters have been merged into a single cluster, and the merging process described above is repeated until they have. Once a single cluster remains, the process proceeds to step S409, where the number of clusters is determined and the individual clusters are extracted.

FIG. 6 schematically shows how the 15 frame feature amounts are merged step by step into a single cluster by repeating the cluster merging of step S406. The vertical axis shows the frame identifier, and the horizontal axis shows the inter-cluster distance at which the clusters were merged (the value of the left-hand side of equation (1)).

In this embodiment, to determine the number of clusters for each video, a threshold is set on the distance at which clusters are merged, and only clusters merged at a distance no greater than the threshold are extracted. In FIG. 6, if the vertical wavy line is the threshold, the 10 clusters merged at or below this distance are the clusters extracted from this video. In the example of FIG. 6, the frame feature amounts with frame identifiers "1", "8", and "14" are each extracted as a cluster with a single member.
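The bottom-up merging and threshold-based extraction of steps S401 to S409 can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: it uses the Euclidean distance and centroid variant that the text permits for features such as color histograms, it takes the standard Ward increase in squared error as the inter-cluster distance, and it stops merging once the smallest distance exceeds the threshold, which is intended to give the same clusters as merging down to one cluster and then cutting at that distance. All function names are hypothetical.

```python
def _centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def _sse(vectors):
    """Sum of squared Euclidean distances to the centroid (E in the text)."""
    c = _centroid(vectors)
    return sum(sum((x - y) ** 2 for x, y in zip(v, c)) for v in vectors)


def ward_distance(a, b):
    """Increase in squared error caused by merging clusters a and b."""
    return _sse(a + b) - _sse(a) - _sse(b)


def cluster_frames(features, threshold):
    """Group frame feature vectors into clusters of frame indices.

    features: list of equal-length numeric vectors, one per selected frame.
    threshold: largest merge distance at which clusters are still merged.
    """
    # S401-S404: start with one single-member cluster per frame feature.
    clusters = [[i] for i in range(len(features))]

    def members(cluster):
        return [features[i] for i in cluster]

    while len(clusters) > 1:
        # S405: evaluate the distance for every pair of current clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = ward_distance(members(clusters[i]), members(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # S409: remaining clusters are the extracted clusters
        # S406-S407: merge the closest pair (centroid is recomputed on demand).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

The pairwise search is quadratic in the number of clusters at every step, which is acceptable for the small number of frames left after subsampling but would need a more efficient update rule for long videos.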

Next, the procedure of the video search process is described with reference to the flowchart of FIG. 7. In step S31, it is determined whether a query video has been specified in the query video specification module 15. In this embodiment, a query video is judged to have been specified when one of the search target videos stored in the video DB 11 is designated as the query video or when the user newly inputs a query video.

When a query video is newly input, the process proceeds to step S32, where the query video is analyzed and frame feature amounts are extracted by the same procedure as that used for the search target videos. In step S33, the frames are clustered based on the frame feature amounts in the same way as described above. If one of the search target videos stored in the video DB 11 has been designated as the query video, its clustering result is already registered in the feature amount DB 14, so steps S32 and S33 are skipped.

In steps S34 and S35, the similarity between the query video and each search target video is calculated by applying the similarity calculation method based on the Earth Mover's Distance (EMD) disclosed in Non-Patent Document 6. In Non-Patent Document 6, however, the similarity (EMD) between videos is calculated from two feature amounts, namely the color information of the key frame of each shot extracted from the search target video and the shot length, whereas the present invention does not extract shots, so the method of Non-Patent Document 6 cannot be applied as it is. Specifically, feature amounts corresponding to the key-frame color information and the shot length of Non-Patent Document 6 need to be extracted from the frame clusters obtained as a result of the feature amount extraction.

In the present invention, therefore, the representative vector of each cluster is used as the feature amount in place of the color information of the key frame of a shot, and the number of data items belonging to each cluster is used as the feature amount in place of the shot length.

In step S34, the representative (feature) vector and the number of belonging data of each cluster are read in for one of the search target videos. In step S35, the similarity is calculated based on the EMD disclosed in Non-Patent Document 6, using as parameters the representative vector and number of belonging data of each cluster of the search target video and the representative vector and number of belonging data of each cluster of the query video. That is, the similarity is calculated based on the distances between the representative vectors of the videos and the distances between the numbers of belonging data.
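To make the role of the two parameters concrete, the sketch below computes an EMD between two videos, each described as a list of (representative vector, number of belonging data) pairs. It follows the usual definition of the EMD as the minimum transport cost divided by the total flow, solved here with a simple successive-shortest-path minimum-cost flow. The function name emd, the signature format, and the caller-supplied ground distance dist are assumptions for illustration; how the embodiment converts the EMD into a similarity score is not specified here, and a smaller EMD simply indicates more similar videos.

```python
def emd(sig_a, sig_b, dist):
    """Earth Mover's Distance between two cluster signatures.

    sig_a, sig_b: lists of (representative_vector, weight) pairs, where the
    weight is the number of data items in the cluster (assumed format).
    dist: ground distance function between two representative vectors.
    """
    m, n = len(sig_a), len(sig_b)
    s, t = m + n, m + n + 1
    graph = [[] for _ in range(m + n + 2)]  # edge: [to, capacity, cost, rev]

    def add_edge(u, v, cap, cost):
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    for i, (_, w) in enumerate(sig_a):
        add_edge(s, i, w, 0.0)
    for j, (_, w) in enumerate(sig_b):
        add_edge(m + j, t, w, 0.0)
    for i, (va, _) in enumerate(sig_a):
        for j, (vb, _) in enumerate(sig_b):
            add_edge(i, m + j, float("inf"), float(dist(va, vb)))

    # Total flow is the smaller of the two total weights.
    total = min(sum(w for _, w in sig_a), sum(w for _, w in sig_b))
    flow, cost = 0, 0.0
    inf = float("inf")
    while flow < total:
        # Bellman-Ford shortest path from s to t in the residual graph.
        cost_to = [inf] * len(graph)
        cost_to[s] = 0.0
        prev = [None] * len(graph)
        for _ in range(len(graph) - 1):
            changed = False
            for u in range(len(graph)):
                if cost_to[u] == inf:
                    continue
                for k, (v, cap, c, _) in enumerate(graph[u]):
                    if cap > 0 and cost_to[u] + c < cost_to[v] - 1e-12:
                        cost_to[v] = cost_to[u] + c
                        prev[v] = (u, k)
                        changed = True
            if not changed:
                break
        if cost_to[t] == inf:
            break
        # Find the bottleneck along the path, then push that much flow.
        push, v = total - flow, t
        while v != s:
            u, k = prev[v]
            push = min(push, graph[u][k][1])
            v = u
        v = t
        while v != s:
            u, k = prev[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        flow += push
        cost += push * cost_to[t]
    return cost / flow if flow else 0.0
```

Because the weights are cluster sizes, a large cluster in the query video can only be matched cheaply by clusters of comparable total size and similar representative vectors in the target video, which is how the cluster sizes stand in for shot lengths.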

In step S36, it is determined whether the similarity calculation is complete for all search target videos. If not, the process returns to step S34 and the above steps are repeated while switching to the next search target video.

When the similarity calculation is complete for all search target videos, the process proceeds to step S37, where the search target videos are sorted by similarity. In step S38, the several search target videos with the highest similarity are presented to the user as the search results.
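Steps S37 and S38 amount to a sort followed by a cut-off, as in the following sketch. The function name rank_results, the dictionary of video identifiers to similarity scores, and the parameter top_k are assumptions for illustration, as is the convention that a larger score means a more similar video.

```python
def rank_results(similarities, top_k=10):
    """Return the top_k (video_id, similarity) pairs, most similar first.

    similarities: dict mapping a video identifier to its similarity score,
    where a larger score means a more similar video (assumed convention).
    """
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

If a distance such as the EMD is used directly as the score, the sort order would be ascending instead.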

FIG. 1 is a functional block diagram showing the configuration of a video search system according to the present invention.
FIG. 2 is a functional block diagram showing another configuration of the video search system according to the present invention.
FIG. 3 is a flowchart showing the procedure of the search target video feature amount extraction process.
FIG. 4 is a conceptual diagram of the frame clustering process.
FIG. 5 is a flowchart showing the procedure of the clustering process.
FIG. 6 is a diagram showing how frames are clustered step by step.
FIG. 7 is a flowchart showing the procedure of the video search process.

Explanation of symbols

11 ... video DB, 12 ... feature extraction module, 13 ... clustering module, 14 ... feature DB, 15 ... query video specification module, 16 ... similarity calculation module, 17 ... search result output module

Claims

A video search system that searches a large number of search target videos for videos similar to a query video, comprising:
frame feature extraction means for extracting a feature amount in units of frames for each search target video;
clustering means for classifying the frames of each search target video into a plurality of clusters based on the extracted frame feature amounts;
query video specification means for specifying a query video;
similarity calculation means for calculating the similarity between each search target video and the query video based on the clustering result of each video; and
search result output means for outputting the calculation result of the similarity.

The video search system according to claim 1, wherein the query video specification means specifies one of the search target videos as the query video.

The video search system according to claim 1, wherein the query video specification means comprises:
means for inputting a query video;
means for extracting a query feature amount in units of frames from the query video; and
means for classifying the frames of the query video into a plurality of clusters based on the extracted query feature amounts.

The video search system according to any one of claims 1 to 3, wherein the similarity calculation means comprises:
means for obtaining the representative vector and the number of belonging data of each cluster of the search target video;
means for obtaining the representative vector and the number of belonging data of each cluster of the query video; and
means for calculating the similarity between the clusters of the videos based on the distance between the representative vectors and the distance between the numbers of belonging data.

A video search method for searching a large number of search target videos for videos similar to a query video, comprising:
a step of extracting a feature amount in units of frames for each search target video;
a step of classifying the frames of each search target video into a plurality of clusters based on the extracted frame feature amounts;
a step of specifying a query video;
a step of calculating the similarity between each search target video and the query video based on the clustering result of each video; and
a step of outputting the calculation result of the similarity.

The video search method according to claim 5, wherein, in the step of specifying the query video, one of the search target videos is specified as the query video.

The video search method according to claim 5, wherein the step of specifying the query video comprises:
a step of inputting a query video;
a step of extracting a query feature amount in units of frames from the query video; and
a step of classifying the frames of the query video into a plurality of clusters based on the extracted query feature amounts.

The video search method according to any one of claims 5 to 7, wherein the step of calculating the similarity comprises:
a step of obtaining the representative vector and the number of belonging data of each cluster of the search target video;
a step of obtaining the representative vector and the number of belonging data of each cluster of the query video; and
a step of calculating the similarity between the clusters of the videos based on the distance between the representative vectors and the distance between the numbers of belonging data.