JP2014099110A

JP2014099110A - Image search device, method, and program

Info

Publication number: JP2014099110A
Application number: JP2012251626A
Authority: JP
Inventors: Shinya Murata; 眞哉村田; Hidenao Nagano; 秀尚永野; Ryo Mukai; 良向井; Kunio Kashino; 邦夫柏野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-11-15
Filing date: 2012-11-15
Publication date: 2014-05-29
Anticipated expiration: 2032-11-15
Also published as: JP5851381B2

Abstract

PROBLEM TO BE SOLVED: To search more accurately for images showing an instance. SOLUTION: A feature aggregation and tabulation unit 12 creates an aggregated feature DB 21, in which local features extracted from instance images are aggregated, and an instance feature DB 22, which holds the appearance frequency of each aggregated feature in each instance image group. Local features are likewise extracted from each frame image. A feature matching and tabulation unit 15 creates a frame image feature DB 23 for each video and a video feature DB 24 from the result of matching each local feature extracted from each frame image against each aggregated feature in the aggregated feature DB 21. An inverse frame image frequency calculation unit 16 creates an inverse frame image frequency DB 25 holding the IFF (Inverse Frame Frequency) of each aggregated feature. A search ranking unit 17 refers to the instance feature DB 22, the inverse frame image frequency DB 25, and the video feature DB 24, calculates EBM25 (Extended Best Match 25), which extends BM25 (Best Match 25) by taking the instance feature DB 22 and the IFF into account, and creates a search result based on EBM25.

Description

The present invention relates to an image search device, method, and program.

Conventionally, images containing an instance, that is, a specific person, object, place, or the like, have been retrieved from large-scale image databases by using images showing that instance as a query. For example, a technique has been proposed that retrieves videos containing an instance based on the similarity of the appearance patterns of local features between a group of images showing the instance and a group of video frame images, in other words, a video search technique that uses a similarity measure between the query image group and a video (see, for example, Non-Patent Document 1).

Cai-Zhi Zhu et al., "Large Vocabulary Quantization for Searching Instances from Videos.", In Proc. of ICMR'12, 2012.

An object of the present invention is to provide an image search device, method, and program capable of searching for images showing an instance with higher accuracy by an approach different from the prior art.

To achieve the above object, the image search device of the present invention includes: feature extraction means for extracting a plurality of features from an instance image showing an instance serving as a query, and from each frame image of an image group consisting of a plurality of frame images to be searched; feature aggregation and tabulation means for tabulating, for each instance, a first appearance frequency in the plurality of instance images of each aggregated feature, the aggregated features being obtained by merging duplicate features among the plurality of features extracted by the feature extraction means from each of a plurality of instance images showing each of a plurality of instances; feature matching and tabulation means for matching each aggregated feature against the plurality of features extracted from the frame images and tabulating, for each aggregated feature, a second appearance frequency in each frame image and a third appearance frequency in the image group; inverse frame image frequency calculation means for calculating, for each aggregated feature and based on the second appearance frequency, an inverse frame image frequency that becomes higher as the appearance frequency of that aggregated feature in the image group becomes lower; and search result creation means for obtaining an evaluation value of each image group for the instance being searched for, and for creating a search result based on the evaluation value. The evaluation value is expressed as the sum of the importance of each aggregated feature, where the importance is determined from the inverse frame image frequency, the first appearance frequency, and the third appearance frequency of that aggregated feature and from an image group length related to the number of aggregated features contained in the image group.

According to the image search device of the present invention, the feature extraction means extracts a plurality of features from an instance image showing an instance serving as a query. Here, a plurality of features are extracted from each of a plurality of instance images showing each of a plurality of instances. The feature aggregation and tabulation means then tabulates, for each instance, the first appearance frequency in the plurality of instance images of each aggregated feature obtained by merging duplicate features among the features extracted from the instance images by the feature extraction means.

The feature extraction means also extracts a plurality of features from each frame image of an image group consisting of a plurality of frame images to be searched. The feature matching and tabulation means matches each aggregated feature against the features extracted from the frame images and tabulates, for each aggregated feature, the second appearance frequency in each frame image and the third appearance frequency in the image group. Further, the inverse frame image frequency calculation means calculates, for each aggregated feature and based on the second appearance frequency, an inverse frame image frequency that becomes higher as the appearance frequency of the aggregated feature in the image group becomes lower.

The search result creation means then obtains an evaluation value of each image group for the instance being searched for and creates a search result based on the evaluation value. The evaluation value is expressed as the sum of the importance of each aggregated feature, where the importance is determined from the inverse frame image frequency, the first appearance frequency, and the third appearance frequency of that aggregated feature and from the image group length related to the number of aggregated features contained in the image group.

Because the search result is created from an evaluation value that takes into account, as the importance of each aggregated feature, how often that aggregated feature appears in the instance images from which it was extracted, images showing an instance can be searched for with higher accuracy.

In addition, the feature aggregation and tabulation means may update the tabulated first appearance frequency based on additional features, that is, features extracted from an instance image showing an instance newly added as a search target that are not contained in the aggregated features; the feature matching and tabulation means may match the additional features against the plurality of features extracted from the frame images and update the second and third appearance frequencies; the inverse frame image frequency calculation means may recalculate the inverse frame image frequency based on the updated second appearance frequency; and the search result creation means may create a search result for the newly added instance based on the recalculated inverse frame image frequency, the updated first appearance frequency, and the updated third appearance frequency. In this way, images showing the added instance can also be searched for with high accuracy merely by processing the added portion.

The search result creation means may include in the search result the file name of the image group, or the file name of the image group together with the evaluation value of that image group. The search result only needs to be created based on the evaluation value, and search results of various forms can be created.

The image search method of the present invention is an image search method in an image search device including feature extraction means, feature aggregation and tabulation means, feature matching and tabulation means, inverse frame image frequency calculation means, and search result creation means. In this method, the feature extraction means extracts a plurality of features from an instance image showing an instance serving as a query and from each frame image of an image group consisting of a plurality of frame images to be searched; the feature aggregation and tabulation means tabulates, for each instance, a first appearance frequency in the plurality of instance images of each aggregated feature obtained by merging duplicate features among the plurality of features extracted by the feature extraction means from each of a plurality of instance images showing each of a plurality of instances; the feature matching and tabulation means matches each aggregated feature against the plurality of features extracted from the frame images and tabulates, for each aggregated feature, a second appearance frequency in each frame image and a third appearance frequency in the image group; the inverse frame image frequency calculation means calculates, for each aggregated feature and based on the second appearance frequency, an inverse frame image frequency that becomes higher as the appearance frequency of that aggregated feature in the image group becomes lower; and the search result creation means obtains an evaluation value of each image group for the instance being searched for and creates a search result based on the evaluation value. The evaluation value is expressed as the sum of the importance of each aggregated feature, where the importance is determined from the inverse frame image frequency, the first appearance frequency, and the third appearance frequency of that aggregated feature and from an image group length related to the number of aggregated features contained in the image group.

The image search program of the present invention is a program for causing a computer to function as each of the means constituting the image search device described above.

As described above, according to the image search device, method, and program of the present invention, the search result is created from an evaluation value that takes into account, as the importance of each aggregated feature, how often that aggregated feature appears in the instance images from which it was extracted. This has the effect that images showing an instance can be searched for with higher accuracy.

A schematic diagram showing the configuration of the video search device according to the first embodiment.
A diagram showing an example of the data structure of an instance image.
A diagram showing an example of the data structure of local features.
A diagram showing an example of the data structure of the aggregated feature DB.
A diagram showing an example of the data structure of the instance feature DB.
A diagram showing an example of the data structure of the frame image feature DB.
A diagram showing an example of the data structure of the video feature DB.
A diagram showing an example of the data structure of the inverse frame image frequency DB.
A diagram showing an example of the data structure of a search result.
A flowchart showing the content of the video search processing routine in the first embodiment.
A schematic diagram showing the configuration of the video search device according to the second embodiment.
A diagram showing an example of the data structure of the additional feature DB.
A diagram showing an example of the data structure of the updated aggregated feature DB.
A diagram showing an example of the data structure of the updated instance feature DB.
A diagram showing evaluation results.

Embodiments of the present invention will now be described in detail with reference to the drawings.

<Outline of each embodiment>
Each embodiment describes a case in which the image search device of the present invention is applied to a video search device that takes as input images showing an instance (hereinafter, "instance images"), searches a large-scale video database for videos containing the instance, and outputs a search result. The video search device according to each embodiment realizes an instance search system that searches for videos containing an instance (hereinafter also called "instance search") with high accuracy. It does so by focusing on three quantities: the inverse frame image frequency, which is an index of the importance of the local features of the instance image group; the appearance frequency of local features in the instance image group and in the video group; and the video length, which is the total number of local features contained in each video.

Each embodiment applies a ranking method called BM25 (Best Match 25), which is widely used for keyword search of Web pages, to instance search. In applying BM25, the IDF (Inverse Document Frequency), which is the index of keyword importance in BM25, is regarded as the importance of a local feature with respect to an instance and is used as an index for retrieving videos that contain many highly important local features. Furthermore, a query in instance search is expressed as a large vector with as many elements as the local features obtained from one or more instance images, whereas BM25 assumes a small vector as the query. BM25 is therefore extended from keyword-based document search into a form suited to video search, and the extended BM25 (called "EBM25" in each embodiment) is applied to instance search.

<First Embodiment>
The video search device 10 according to the first embodiment is configured as a computer including a CPU, a RAM, and a ROM storing a program for executing the video search processing routine described later. Functionally, as shown in FIG. 1, this computer can be represented as a configuration including an instance image feature extraction unit 11, a feature aggregation and tabulation unit 12, a frame image extraction unit 13, a frame image feature extraction unit 14, a feature matching and tabulation unit 15, an inverse frame image frequency calculation unit 16, a search ranking unit 17, and a search result output unit 18. The video search device 10 is also provided with a predetermined storage area for storing an aggregated feature database (DB) 21, an instance feature DB 22, a frame image feature DB 23, a video feature DB 24, and an inverse frame image frequency DB 25.

Instance images are input to the video search device 10 as the query for instance search. In the first embodiment, in order to create the instance feature DB 22, a plurality of instances are targeted, and an instance image group consisting of a plurality of instance images is input for each instance. For example, when there are q instances and h instance images are prepared for each instance, q instance image groups, a total of q × h instance images, are input. However, it suffices to input at least one instance image per instance. Alternatively, a video showing an instance may be input and each frame of that video used as the instance image group. The more instance images the instance image group for one instance contains, the higher the accuracy of the instance search.

FIG. 2 shows an example of the data structure of an instance image. In this example the instance image consists of n × m pixels, each pixel has an RGB value as its pixel value, and the data structure associates the position (x_n, y_m) of each pixel with its RGB value (r_nm, g_nm, b_nm).

The instance image feature extraction unit 11 receives the instance image groups input to the video search device 10, detects feature points in each instance image, and extracts, as a local feature, a feature vector describing the feature quantity of each detected feature point. For example, the instance image feature extraction unit 11 detects locations in the instance image where the luminance value changes sharply as feature points using the Harris-Laplace method (see C. Harris et al., "A combined corner and edge detector.", 4th Alvey Vision Conf., 1988). It then describes the feature quantity of each detected feature point using Compact Color SIFT (see K. Mikolajczyk et al., "Scale and affine invariant interest point detectors.", IJCV, 2004). Compact Color SIFT is a local feature quantity obtained by adding a 64-dimensional vector representing chromaticity to the 128-dimensional SIFT feature quantity relating to luminance.

FIG. 3 shows an example of the data structure of the local features. This example uses the Compact Color SIFT described above, and the data structure associates the identification number of each extracted local feature (feature 1, feature 2, ..., feature i) with its 192-dimensional Compact Color SIFT feature quantity (feature vector).

The feature aggregation and tabulation unit 12 merges duplicate local features among those extracted by the instance image feature extraction unit 11 into one. "Duplicate local features" refers to the case where identical local features are extracted from nearly identical instance images, for example when the frames of a video are used as the instance image group; such identical local features are regarded as duplicates. The feature aggregation and tabulation unit 12 creates the aggregated feature DB 21, whose entries are the merged local features (the aggregated features), and stores it in a predetermined storage area. FIG. 4 shows an example of the data structure of the aggregated feature DB 21. In the example of FIG. 4, the data structure associates the identification number of each aggregated feature (feature 1, feature 2, ..., feature j) with its 192-dimensional Compact Color SIFT feature quantity (feature vector). This data structure is the same as that of the local features extracted by the instance image feature extraction unit 11 (for example, FIG. 3), except that the feature aggregation and tabulation unit 12 has reduced the number of local features from i to j.

Based on the aggregated features, the feature aggregation and tabulation unit 12 also extracts, for each instance, the number of appearances of each aggregated feature in the instance image group showing that instance as the feature of the instance, creates the instance feature DB 22 from these counts, and stores it in a predetermined storage area. FIG. 5 shows an example of the data structure of the instance feature DB 22. FIG. 5 is an example with q instances and j aggregated features in total, and the data structure associates the identification number of each instance (instance 1, ..., instance q) with the number of appearances kf (keypoint frequency) of each aggregated feature.
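The aggregation and per-instance counting described above can be sketched as follows. This is only an illustrative sketch: the function names `cosine` and `aggregate_and_count` are hypothetical, and treating two features as duplicates when their cosine similarity reaches a threshold is an assumption made here, since the text only says that identical features are merged.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def aggregate_and_count(instance_features, threshold=0.95):
    """instance_features maps an instance id to its list of feature vectors.

    Returns (aggregated, kf): aggregated is the list of distinct feature
    vectors, and kf[instance_id][j] is the number of times aggregated
    feature j appears in that instance's image group (sparse; absent = 0).
    The duplicate test (cosine >= threshold) is an assumption.
    """
    aggregated = []
    kf = {}
    for inst_id, vectors in instance_features.items():
        counts = {}
        for vec in vectors:
            # Reuse an existing aggregated feature if one is close enough.
            match = next((j for j, agg in enumerate(aggregated)
                          if cosine(agg, vec) >= threshold), None)
            if match is None:
                aggregated.append(vec)
                match = len(aggregated) - 1
            counts[match] = counts.get(match, 0) + 1
        kf[inst_id] = counts
    return aggregated, kf
```

Here `aggregated` plays the role of the aggregated feature DB 21 and `kf` that of the instance feature DB 22, with zero counts left implicit.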

Note that the feature aggregation and tabulation unit 12 may aggregate further by clustering the aggregated local features to reduce the number of dimensions.

The frame image extraction unit 13 receives the video group input to the video search device 10 and extracts frame images from each video at a rate of, for example, 1 fps (one frame per second). The data structure of a frame image is the same as that of an instance image (for example, FIG. 2), so a detailed description is omitted.
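As a rough illustration of sampling at 1 fps, the sketch below computes which frame indices of a source video would be kept. The function name `sampled_frame_indices` and its parameters are hypothetical, and decoding the video itself is outside the scope of the sketch.

```python
def sampled_frame_indices(total_frames, native_fps, target_fps=1.0):
    """Indices of the source frames kept when sampling at target_fps."""
    if native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Number of source frames per sampled frame; never below one frame,
    # so the same source frame is not selected twice.
    step = max(1.0, native_fps / target_fps)
    indices = []
    k = 0
    while round(k * step) < total_frames:
        indices.append(round(k * step))
        k += 1
    return indices
```

For a 3-second clip recorded at 30 fps, this keeps one frame from the start of each second.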

The frame image feature extraction unit 14 differs from the instance image feature extraction unit 11 only in that the local features are extracted from the frame images extracted by the frame image extraction unit 13 rather than from instance images. The data structure of the extracted local features is also the same as that of the local features extracted by the instance image feature extraction unit 11 (for example, FIG. 3), so a detailed description is omitted.

The feature matching and tabulation unit 15 matches each local feature extracted by the frame image feature extraction unit 14 against each aggregated feature stored in the aggregated feature DB 21. With the Compact Color SIFT described above, for example, the cosine similarity between the 192-dimensional feature vectors (which takes a value from 0 to 1 and equals 1 for identical feature vectors) is used, and a local feature and an aggregated feature whose cosine similarity is at or above a predetermined value (for example, 0.95) are judged to match. When the aggregated feature DB 21 contains several aggregated features whose cosine similarity to a local feature extracted from a frame image is at or above the predetermined value, the aggregated feature with the largest cosine similarity is judged to be the one matching that local feature.
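The matching rule above (threshold on cosine similarity, then best match among the candidates) can be sketched as follows. The names `cosine` and `match_feature` are hypothetical, and short vectors stand in for the 192-dimensional feature vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_feature(local, aggregated, threshold=0.95):
    """Index of the aggregated feature matching `local`, or None.

    Only aggregated features with similarity >= threshold qualify;
    among those, the one with the largest similarity wins.
    """
    best_j, best_sim = None, -1.0
    for j, agg in enumerate(aggregated):
        sim = cosine(local, agg)
        if sim >= threshold and sim > best_sim:
            best_j, best_sim = j, sim
    return best_j
```

Counting the returned indices per frame image gives the per-frame appearance counts that the unit tabulates next.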

Matching between local features is not limited to cosine similarity; any measure of the distance or similarity between feature vectors may be used.

When the local features have been aggregated by clustering in the feature aggregation and tabulation unit 12, the feature matching and tabulation unit 15 can match the local features extracted from the frame images against the centroid of each cluster in the aggregated feature DB 21.

Based on the result of matching the local features extracted from the frame images against the aggregated features stored in the aggregated feature DB 21, the feature matching and tabulation unit 15 creates a frame image feature DB 23 for each video and stores it in a predetermined storage area. FIG. 6 shows an example of the data structure of the frame image feature DB 23. In the example of FIG. 6, the identification number of each frame image extracted from one video (frame image 1, ..., frame image N) is associated with the number of appearances Kf of each aggregated feature in that frame image.

Based on the same matching result between the local features extracted from the frame images and the aggregated features stored in the aggregated feature DB 21, the feature matching and tabulation unit 15 also creates the video feature DB 24 and stores it in a predetermined storage area. FIG. 7 shows an example of the data structure of the video feature DB 24. In the example of FIG. 7, the identification number of each video (video 1, ..., video v) is associated with the number of appearances KF of each aggregated feature in that video. The number of appearances KF can be obtained by summing, for each aggregated feature, the numbers of appearances in the frame image feature DB 23 created for that video over frame image 1 through frame image N.
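The summation from per-frame counts Kf to per-video counts KF can be illustrated with a short sketch. The function name `video_counts` and the sparse dictionary layout (one dictionary of counts per frame image) are assumptions made for illustration.

```python
def video_counts(frame_counts):
    """Sum per-frame counts Kf into per-video counts KF.

    frame_counts is a list with one dict per frame image, mapping an
    aggregated feature id to its number of appearances in that frame.
    """
    totals = {}
    for counts in frame_counts:
        for j, kf in counts.items():
            totals[j] = totals.get(j, 0) + kf
    return totals
```

One such result per video corresponds to one row of the video feature DB 24.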

The inverse frame image frequency calculation unit 16 refers to the frame image feature DB 23 of each video and calculates the inverse frame image frequency IFF_j (Inverse Frame Frequency) of aggregated feature j based on equation (1) below.

Here, N is the total number of frame images in one video, and n_j is the number of frame images in that video that contain aggregated feature j. n_j can be obtained by counting the frame images in the frame image feature DB 23 for which the number of appearances Kf_j of aggregated feature j is 1 or more. When aggregated feature j appears with high frequency in the frame image feature DB 23, it can be regarded as a local feature with low discriminative power for the instance, so IFF_j becomes small. Conversely, when aggregated feature j appears with low frequency in the frame image feature DB 23, it can be regarded as an aggregated feature with high discriminative power for the instance, so IFF_j becomes large.
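Equation (1) itself is not reproduced in this text, so the following is only a sketch under a stated assumption: it counts n_j as described and then applies the smoothed inverse document frequency commonly used with BM25, log((N - n_j + 0.5) / (n_j + 0.5)). The actual equation (1) may differ. The function names are hypothetical.

```python
import math

def frames_containing(frame_counts, j):
    """n_j: number of frame images in which feature j appears at least once."""
    return sum(1 for counts in frame_counts if counts.get(j, 0) >= 1)

def inverse_frame_frequency(frame_counts, j):
    """IFF_j for one video, using an ASSUMED BM25-style smoothed form.

    frame_counts is a list with one dict per frame image, mapping an
    aggregated feature id to its number of appearances Kf in that frame.
    """
    n_total = len(frame_counts)               # N
    n_j = frames_containing(frame_counts, j)  # n_j
    return math.log((n_total - n_j + 0.5) / (n_j + 0.5))
```

With this assumed form, a feature that appears in more than half of the frames receives a negative value, which is consistent with the description that frequent features should contribute little.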

The inverse frame image frequency calculation unit 16 calculates IFF_j for all of the aggregated features (feature 1, ..., feature j) stored in the aggregated feature DB 21, creates the inverse frame image frequency DB 25 from the results, and stores it in a predetermined storage area. FIG. 8 shows an example of the data structure of the inverse frame image frequency DB 25. In the example of FIG. 8, the identification number of each aggregated feature (feature 1, ..., feature j) is associated with the calculated IFF_j. An inverse frame image frequency DB 25 is created for each video.

For each instance, the search ranking unit 17 searches the input video group for candidate videos that may contain that instance and ranks the candidates. Specifically, the search ranking unit 17 first acquires from the instance feature DB 22 the appearance counts kf_1, ..., kf_j of the aggregate features, which are the instance features of instance q. It also acquires the inverse frame image frequencies IFF_1, ..., IFF_j of the aggregate features from the inverse frame image frequency DB 25 of video v. It further acquires from the video feature DB 24 the appearance counts KF_1, ..., KF_j of the aggregate features associated with video v. The search ranking unit 17 then calculates the evaluation value EBM25(q, v) of video v for instance q according to equation (2) below.

Here, k_1, k_2, and b_1 are tunable parameters; for example, k_1 = k_2 = 1.2 and b_1 = 0.75. vl denotes the video length and avvl the average video length: vl is the sum of the appearance counts KF of the aggregate features associated with video v, and avvl is the average of vl over the videos in the video feature DB 24. Σ_qj denotes the sum over aggregate features 1, ..., j of instance q. The evaluation value EBM25 of equation (2) extends BM25, shown in equation (3) below (see S. E. Robertson et al., "Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval," In Proc. of SIGIR'94, 1994), by additionally taking into account, as the importance of each aggregate feature, the appearance-frequency term ((k_2 + 1) kf_j) / (k_2 + kf_j) of the aggregate feature in the instance feature DB 22 and the inverse frame image frequency IFF_j.
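The following Python sketch shows one way the evaluation value could be computed from the quantities defined above. Equation (2) is not reproduced in this text, so the sketch assumes that each aggregate feature contributes the product of the quoted instance-side term, IFF_j, and the usual BM25 length-normalized frequency term for the video; that product form and the function name ebm25 are assumptions, not the patent's confirmed formula.

```python
def ebm25(kf, iff, KF, avvl, k1=1.2, k2=1.2, b1=0.75):
    """Score one video v against one instance q (assumed form of equation (2)).

    kf[j]  : appearance count of aggregate feature j in the instance image group (DB 22)
    iff[j] : inverse frame image frequency of feature j in video v (DB 25)
    KF[j]  : appearance count of feature j in video v (DB 24)
    avvl   : average video length over the videos in DB 24 (must be > 0)
    """
    vl = sum(KF)  # video length: total appearances of aggregate features in v
    length_norm = k1 * ((1.0 - b1) + b1 * vl / avvl)
    score = 0.0
    for kf_j, iff_j, KF_j in zip(kf, iff, KF):
        # Instance-side term quoted in the text.
        query_term = (k2 + 1.0) * kf_j / (k2 + kf_j)
        # ASSUMPTION: standard BM25 term-frequency form for the video side.
        video_term = (k1 + 1.0) * KF_j / (length_norm + KF_j)
        score += query_term * iff_j * video_term
    return score

# Same matching feature, but the second video is four times longer.
short_video = ebm25(kf=[1, 0], iff=[1.0, 1.0], KF=[1, 0], avvl=1.0)
long_video = ebm25(kf=[1, 0], iff=[1.0, 1.0], KF=[1, 3], avvl=1.0)
print(short_video, long_video)
```

Under these assumptions a feature absent from the instance (kf_j = 0) contributes nothing, and a longer video with the same matches scores lower because of the vl / avvl normalization.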

Both BM25 and EBM25 give a high score to a video that contains many aggregate features of high importance for the instance and whose video length is not too large. In BM25, however, the importance of each aggregate feature is defined only by its appearance frequency in the frame image feature DB 23, whereas EBM25 also takes into account its appearance frequency in the instance feature DB 22. EBM25 corresponds to the extension of BM25 used in the text retrieval field when the query is a document rather than keywords.

The search ranking unit 17 calculates the evaluation value EBM25 of each video (video 1, ..., video v) for instance q and creates a search result in which the videos are ranked in descending order of EBM25. A search result is created for each of the instances (instance 1, ..., instance q). FIG. 9 shows an example of the data structure of the search result. In FIG. 9, the identification number of each instance (instance 1, ..., instance q) is associated with the videos arranged in descending order of EBM25.

The search result is not limited to the ranking format described above. Only the video with the maximum evaluation value may be returned as the search result, or the videos whose evaluation value is equal to or greater than a predetermined value may be returned in arbitrary order. The search result may consist of video file names, or of video file names together with their EBM25 values. The search result can take various forms as long as it is based on the EBM25 value.
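The result formats mentioned above can be sketched as follows, assuming the EBM25 values for one instance are already available as a mapping from video file name to score. The function names and file names are hypothetical and serve only as an illustration.

```python
def ranked(scores):
    """Video file names in descending order of evaluation value."""
    return sorted(scores, key=scores.get, reverse=True)

def best_only(scores):
    """Only the video with the maximum evaluation value."""
    return max(scores, key=scores.get)

def above_threshold(scores, threshold):
    """File names and values of videos scoring at least `threshold` (unordered)."""
    return {name: value for name, value in scores.items() if value >= threshold}

# Hypothetical EBM25 values for one instance.
scores = {"video1.mp4": 0.8, "video2.mp4": 2.5, "video3.mp4": 1.1}

print(ranked(scores))
print(best_only(scores))
print(above_threshold(scores, 1.0))
```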

The search result output unit 18 outputs the search result created by the search ranking unit 17.

Next, the operation of the video search device 10 according to the first embodiment will be described. When a plurality of instance image groups indicating a plurality of instances are input to the video search device 10, the video search device 10 executes the video search processing routine shown in FIG. 10.

In step 100, the instance image feature extraction unit 11 receives the instance image groups input to the video search device 10, detects feature points in each instance image, and extracts, as local features, feature vectors describing the feature amounts of the detected feature points.

Next, in step 102, the feature aggregation totaling unit 12 merges overlapping local features among those extracted in step 100 into one, creates the aggregate feature DB 21 in which the merged local features are the aggregate features, and stores it in a predetermined storage area. Based on the aggregate features, the feature aggregation totaling unit 12 also creates, as the features of each instance, the instance feature DB 22 indicating the number of appearances of each aggregate feature in each instance image group, and stores it in a predetermined storage area.
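Step 102 can be illustrated with the following Python sketch. The criterion for deciding that two local features overlap is not stated in this passage, so the sketch assumes that two feature vectors overlap when their Euclidean distance is below a threshold. The function name, the threshold value, and the tuple representation of feature vectors are all hypothetical.

```python
import math

def aggregate_features(instance_groups, threshold=0.1):
    """instance_groups[q] is the list of local feature vectors of instance q.

    Returns (aggregated, counts): `aggregated` plays the role of the aggregate
    feature DB 21, and counts[q][j] is kf_j for instance q (instance feature DB 22).
    """
    aggregated = []
    assignments = []  # per instance: index of the aggregate feature for each local feature
    for group in instance_groups:
        indices = []
        for vec in group:
            for j, rep in enumerate(aggregated):
                # ASSUMPTION: "overlapping" means Euclidean distance below threshold.
                if math.dist(vec, rep) < threshold:
                    indices.append(j)
                    break
            else:
                # No existing aggregate feature matched, so register a new one.
                aggregated.append(vec)
                indices.append(len(aggregated) - 1)
        assignments.append(indices)
    counts = [[idx.count(j) for j in range(len(aggregated))] for idx in assignments]
    return aggregated, counts

groups = [
    [(0.0, 0.0), (0.0, 0.01), (1.0, 1.0)],  # instance 1
    [(1.0, 1.02)],                          # instance 2
]
aggregated, counts = aggregate_features(groups)
print(aggregated)
print(counts)
```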

Next, in step 104, the frame image extraction unit 13 receives the video group input to the video search device 10 and extracts frame images from each video. Next, in step 106, the frame image feature extraction unit 14 extracts local features from each frame image extracted in step 104.

Next, in step 108, the feature matching totaling unit 15 matches each local feature extracted from the frame images in step 106 against each aggregate feature in the aggregate feature DB 21 stored in step 102, based on the similarity between feature vectors. Based on the matching result, the feature matching totaling unit 15 creates, for each video, the frame image feature DB 23 indicating the number of appearances Kf of each aggregate feature in each frame image, and stores it in a predetermined storage area. Based on the matching result, the feature matching totaling unit 15 also creates the video feature DB 24 indicating the number of appearances KF of each aggregate feature in each video, and stores it in a predetermined storage area.
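A minimal sketch of the tabulation in step 108 is given below. It assumes, for illustration only, that a local feature matches an aggregate feature when the Euclidean distance between their vectors is below a threshold and that each local feature is counted toward its first match; the text only states that matching uses the similarity between feature vectors. All names are hypothetical.

```python
import math

def tabulate_video(frames, aggregated, threshold=0.1):
    """frames[i] is the list of local feature vectors of frame i of one video.

    Returns (Kf, KF): Kf[i][j] is the count of aggregate feature j in frame i
    (frame image feature DB 23), and KF[j] is the total over all frames of the
    video (one row of the video feature DB 24).
    """
    Kf = []
    for frame in frames:
        row = [0] * len(aggregated)
        for vec in frame:
            for j, rep in enumerate(aggregated):
                # ASSUMPTION: distance threshold as the similarity test.
                if math.dist(vec, rep) < threshold:
                    row[j] += 1
                    break  # count each local feature at most once
        Kf.append(row)
    # Column sums over frames give the per-video counts.
    KF = [sum(col) for col in zip(*Kf)] if Kf else [0] * len(aggregated)
    return Kf, KF

aggregated = [(0.0, 0.0), (1.0, 1.0)]
frames = [
    [(0.0, 0.01), (0.02, 0.0), (5.0, 5.0)],  # frame 1: two matches and one unmatched feature
    [(1.0, 1.0)],                            # frame 2
]
Kf, KF = tabulate_video(frames, aggregated)
print(Kf, KF)
```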

Next, in step 110, the inverse frame image frequency calculation unit 16 refers to the frame image feature DB 23, calculates the inverse frame image frequency IFF_j for all of the aggregate features (feature 1, ..., feature j) stored in the aggregate feature DB 21, creates the inverse frame image frequency DB 25 indicating the IFF of each aggregate feature, and stores it in a predetermined storage area. An inverse frame image frequency DB 25 is created for each video.

Next, in step 112, the search ranking unit 17 acquires from the instance feature DB 22 the appearance counts kf_1, ..., kf_j of the aggregate features, which are the instance features of instance q; acquires the inverse frame image frequencies IFF_1, ..., IFF_j of the aggregate features from the inverse frame image frequency DB 25 of each video; and acquires from the video feature DB 24 the appearance counts KF_1, ..., KF_j of the aggregate features associated with each video. The search ranking unit 17 then calculates the evaluation value EBM25 of each video for instance q and creates a search result in which the videos are ranked in descending order of EBM25. A search result is created for each instance.

Next, in step 114, the search result output unit 18 outputs the search result created in step 112, and the video search processing routine ends.

As described above, the video search device according to the first embodiment uses, as the evaluation value, EBM25, which additionally takes into account the appearance frequency in the instance image group and the inverse frame image frequency as the importance of each aggregate feature. This makes it possible to search for images showing an instance with higher accuracy.

<Second Embodiment>
The second embodiment describes a case in which a predetermined number of data items have already been accumulated in the instance feature DB 22 and videos are searched using a newly added instance image as the query. In the video search device according to the second embodiment, components identical to those of the video search device 10 according to the first embodiment are denoted by the same reference numerals, and their detailed description is omitted.

The video search device 210 according to the second embodiment is configured as a computer including a CPU, a RAM, and a ROM that stores a program for executing the video search processing routine described later. Functionally, as shown in FIG. 11, this computer can be represented as including an instance image feature extraction unit 11, a feature aggregation totaling unit 212, a frame image extraction unit 13, a frame image feature extraction unit 14, a feature matching totaling unit 215, an inverse frame image frequency calculation unit 16, a search ranking unit 17, and a search result output unit 18. The video search device 210 is also provided with predetermined storage areas for storing the aggregate feature DB 21, the instance feature DB 22, the frame image feature DB 23, the video feature DB 24, the inverse frame image frequency DB 25, and an additional feature DB 26.

The feature aggregation totaling unit 212 receives the local features of the new instance images extracted by the instance image feature extraction unit 11, refers to the aggregate feature DB 21 to exclude local features that overlap with existing aggregate features, and extracts the newly added features as additional features. The following describes a case in which an instance image group indicating instance q+1 is newly input and additional feature j+1 is added. The feature aggregation totaling unit 212 stores additional feature j+1 in the additional feature DB 26, which has the same data structure as the aggregate feature DB 21, and updates the aggregate feature DB 21 by adding additional feature j+1. FIG. 12 shows an example of the additional feature DB 26, and FIG. 13 shows an example of the updated aggregate feature DB 21.

The feature aggregation totaling unit 212 also updates the instance feature DB 22 with the added instance q+1 and additional feature j+1. FIG. 14 shows an example of the updated instance feature DB 22. In the example of FIG. 14, a column for additional feature j+1 and a row for instance q+1 have been added to the instance feature DB 22 before the update (FIG. 5). Because additional feature j+1 is a newly added aggregate feature, its appearance count kf_{j+1} is 0 for instance 1, ..., instance q.

The feature matching totaling unit 215 updates the frame image feature DB 23 and the video feature DB 24 based on the result of matching the local features extracted from the frame images against the additional features stored in the additional feature DB 26. The method of matching local features against additional features is the same as the method of matching local features against aggregate features in the feature matching totaling unit 15 of the first embodiment. To update the frame image feature DB 23, the appearance count Kf_{j+1} of additional feature j+1 in each frame image is counted, and a column for additional feature j+1 is added to the frame image feature DB 23. To update the video feature DB 24, the appearance counts Kf_{j+1} of additional feature j+1 in the updated frame image feature DB 23 of each video are summed from frame image 1 to frame image N to obtain the appearance count KF_{j+1} of additional feature j+1 for each video, and a column for additional feature j+1 is added to the video feature DB 24.
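The column update described above can be sketched as follows for a single video, assuming the per-frame counts Kf_{j+1} of the additional feature have already been obtained by matching. The function name and the list-based table layout are hypothetical.

```python
def add_feature_column(Kf, KF, new_counts):
    """Append the column for additional feature j+1 to one video's tables.

    Kf         : per-frame counts (frame image feature DB 23 of the video)
    KF         : per-video counts (the video's row in the video feature DB 24)
    new_counts : Kf_{j+1} for frame 1..N, from matching only the additional feature
    """
    if len(new_counts) != len(Kf):
        raise ValueError("one count per frame image is required")
    # New frame-level column: existing rows are extended, not recomputed.
    updated_Kf = [row + [count] for row, count in zip(Kf, new_counts)]
    # KF_{j+1} is the sum of Kf_{j+1} from frame image 1 to frame image N.
    updated_KF = KF + [sum(new_counts)]
    return updated_Kf, updated_KF

Kf = [[2, 0], [0, 1]]
KF = [2, 1]
new_Kf, new_KF = add_feature_column(Kf, KF, new_counts=[1, 3])
print(new_Kf, new_KF)
```

Only the additional feature is matched against the frame images, so the existing columns are carried over unchanged.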

Next, the operation of the video search device 210 according to the second embodiment will be described, focusing on the points that differ from the first embodiment.

In step 102 of the video search processing routine of FIG. 10, the feature aggregation totaling unit 212 refers to the aggregate feature DB 21 to extract additional features from the local features of the new instance images extracted in step 100, stores them in the additional feature DB 26, and updates the aggregate feature DB 21 by adding the new additional features. The feature aggregation totaling unit 212 also updates the instance feature DB 22 by adding the new instance and the additional features.

In step 108, the feature matching totaling unit 215 updates the frame image feature DB 23 and the video feature DB 24 by adding the additional features, based on the result of matching the local features extracted in step 106 against the additional features stored in the additional feature DB 26 in step 102.

In the subsequent processing, the databases updated as described above are referenced, the evaluation value EBM25 is calculated in the same manner as in the first embodiment, and the search result is created.

As described above, with the video search device according to the second embodiment, when a predetermined number or more of data items have been accumulated in the instance feature DB, processing only the newly added instance images is sufficient to search for images showing an instance with higher accuracy, as in the first embodiment.

<Evaluation results>
This section describes the results of evaluating search accuracy using the data sets of the TRECVID 2011 and 2012 instance search tasks. TRECVID is an annual competition in the field of video search, organized by the National Institute of Standards and Technology (NIST) in the United States. The TRECVID 2011 instance search task provides 25 instances, each with about 4 images on average, and the search target is a database of about 20,000 videos (overseas broadcast program videos). The TRECVID 2012 instance search task provides 21 instances, each with about 5 images on average, and the search target is a database of about 80,000 videos (Consumer Generated Media (CGM) on the Web). The average number of correct videos per instance is about 73 for TRECVID 2011 and about 59 for TRECVID 2012. The accuracy of the search result ranking was evaluated with the Mean Average Precision (MAP) metric. MAP is defined by equation (4) below.

Here, |Q| is the total number of instances, |R_q| is the number of correct videos for instance q, j is the rank of a video in the search result, and rel(q, j) is a function that returns 1 if the video at rank j is correct for q and 0 if it is incorrect. c(q, j) is the number of correct videos found from rank 1 through rank j. When the correct videos are lined up in order from rank 1, that is, at the best possible search accuracy, MAP = 100 (%).
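Equation (4) is not reproduced in this text. The Python sketch below assumes the standard definition of MAP that matches the symbols above: for each instance q, the precision c(q, j) / j is summed over the ranks j at which rel(q, j) = 1 and divided by |R_q|, and the result is averaged over the |Q| instances. The function names are hypothetical, and values are returned as fractions in the range 0 to 1 rather than as percentages.

```python
def average_precision(rel, num_correct):
    """rel[j-1] is rel(q, j); num_correct is |R_q|."""
    hits = 0      # c(q, j): correct videos found from rank 1 through rank j
    total = 0.0
    for rank, is_correct in enumerate(rel, start=1):
        if is_correct:
            hits += 1
            total += hits / rank   # precision at this rank
    return total / num_correct if num_correct else 0.0

def mean_average_precision(runs):
    """runs holds one (rel, |R_q|) pair per instance q."""
    return sum(average_precision(rel, n) for rel, n in runs) / len(runs)

perfect = average_precision([1, 1, 0], num_correct=2)  # correct videos at ranks 1 and 2
late = average_precision([0, 1], num_correct=1)        # the only correct video at rank 2
overall = mean_average_precision([([1, 1, 0], 2), ([0, 1], 1)])
print(perfect, late, overall)
```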

FIG. 15 shows the results of the search accuracy evaluation. KF denotes a method that ranks videos using only the sum of the appearance frequencies of the local features in the video, and IFF denotes a method that ranks videos using only the sum of the inverse frame image frequencies of the local features in the video. The results in FIG. 15 show that using IFF greatly improves search accuracy. The ranking improves further with BM25, which takes into account the appearance frequency of local features in the video and normalization by video length. EBM25 (the method of the present embodiments), which additionally takes into account the appearance frequency of local features in the instance images serving as the query and the IFF, achieves search accuracy exceeding that of BM25. The differences in MAP were confirmed to be statistically significant at the 1% significance level.

Video media are expected to increase explosively with the development of recording devices, social networking services, and the like, and robust video search technology that can cope with this situation is needed. Using the method of the present embodiments makes it possible to realize highly accurate instance search over a large-scale video database.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, although the above embodiments describe the case in which the images searched for an instance are videos, the present invention is also applicable when the search targets are still images. In that case, for example when 10,000 still images are input as search targets, EBM25 can be applied after replacing the number of appearances KF of feature points in a video with the number of appearances KF' of feature points in a still image, replacing vl, the sum of KF in a video, with vl', the sum of the numbers of appearances of feature points in a still image, and replacing avvl, the average of vl, with avvl', the average of vl'. The IFF is calculated by equation (1) over the set of 10,000 still images.

When the search targets are still images, search results in various forms based on the EBM25 value can be output, such as still image file names, or still image file names together with their EBM25 values. Furthermore, when the search targets include both videos and still images, a search result in which videos and still images are mixed may be output.

Although this specification describes embodiments in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium.

DESCRIPTION OF SYMBOLS
10, 210 Video search device
11 Instance image feature extraction unit
12, 212 Feature aggregation totaling unit
13 Frame image extraction unit
14 Frame image feature extraction unit
15, 215 Feature matching totaling unit
16 Inverse frame image frequency calculation unit
17 Search ranking unit
18 Search result output unit
21 Aggregate feature DB
22 Instance feature DB
23 Frame image feature DB
24 Video feature DB
25 Inverse frame image frequency DB
26 Additional feature DB

Claims

An image search device comprising:
feature extraction means for extracting a plurality of features from an instance image indicating an instance serving as a query and from each frame image of an image group consisting of a plurality of frame images to be searched;
feature aggregation totaling means for tabulating, for each instance, a first appearance frequency in the plurality of instance images of each aggregate feature obtained by aggregating overlapping features among the plurality of features extracted by the feature extraction means from each of a plurality of instance images indicating each of a plurality of instances;
feature matching totaling means for matching each of the aggregate features against the plurality of features extracted from the frame images and tabulating, for each of the aggregate features, a second appearance frequency in each frame image and a third appearance frequency in the image group;
inverse frame image frequency calculation means for calculating, for each of the aggregate features, based on the second appearance frequency, an inverse frame image frequency that increases as the appearance frequency of the aggregate feature in the image group decreases; and
search result creating means for obtaining an evaluation value of each image group for a search target instance, the evaluation value being represented by the sum of the importance of each aggregate feature, the importance being determined based on the inverse frame image frequency, the first appearance frequency, and the third appearance frequency of the aggregate feature and on an image group length related to the number of aggregate features included in the image group, and for creating a search result based on the evaluation value.

The image search device according to claim 1, wherein
the feature aggregation totaling means updates the tabulated first appearance frequency based on an additional feature that is not included in the aggregate features, among the features extracted from an instance image indicating an instance newly added as a search target,
the feature matching totaling means matches the additional feature against the plurality of features extracted from the frame images and updates the second appearance frequency and the third appearance frequency,
the inverse frame image frequency calculation means recalculates the inverse frame image frequency based on the updated second appearance frequency, and
the search result creating means creates a search result for the newly added instance based on the recalculated inverse frame image frequency, the updated first appearance frequency, and the updated third appearance frequency.

The image search device according to claim 1, wherein the search result creating means includes in the search result a file name of the image group, or a file name of the image group and the evaluation value of the image group.

An image search method in an image search device including feature extraction means, feature aggregation totaling means, feature matching totaling means, inverse frame image frequency calculation means, and search result creating means, wherein
the feature extraction means extracts a plurality of features from an instance image indicating an instance serving as a query and from each frame image of an image group consisting of a plurality of frame images to be searched,
the feature aggregation totaling means tabulates, for each instance, a first appearance frequency in the plurality of instance images of each aggregate feature obtained by aggregating overlapping features among the plurality of features extracted by the feature extraction means from each of a plurality of instance images indicating each of a plurality of instances,
the feature matching totaling means matches each of the aggregate features against the plurality of features extracted from the frame images and tabulates, for each of the aggregate features, a second appearance frequency in each frame image and a third appearance frequency in the image group,
the inverse frame image frequency calculation means calculates, for each of the aggregate features, based on the second appearance frequency, an inverse frame image frequency that increases as the appearance frequency of the aggregate feature in the image group decreases, and
the search result creating means obtains an evaluation value of each image group for a search target instance, the evaluation value being represented by the sum of the importance of each aggregate feature, the importance being determined based on the inverse frame image frequency, the first appearance frequency, and the third appearance frequency of the aggregate feature and on an image group length related to the number of aggregate features included in the image group, and creates a search result based on the evaluation value.

An image search program for causing a computer to function as each of the means constituting the image search device according to any one of claims 1 to 3.