JP2014137637A

JP2014137637A - Image processor and image processing program

Info

Publication number: JP2014137637A
Application number: JP2013004775A
Authority: JP
Inventors: Takahiro Mochizuki; 貴裕望月; Masato Fujii; 真人藤井
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2013-01-15
Filing date: 2013-01-15
Publication date: 2014-07-28
Anticipated expiration: 2033-01-15
Also published as: JP6034702B2

Abstract

PROBLEM TO BE SOLVED: To properly acquire the feature information of each scene included in a video. SOLUTION: An image processor for extracting the feature information of each scene included in a video comprises: sampling acquisition means for sampling predetermined frame images from a sample video; block feature information generation means for dividing each frame image acquired by the sampling acquisition means into blocks at each of one or more scales, and for generating feature information for each of the divided blocks; scene division means for dividing a target video, from which the feature information is to be generated, into scenes; and histogram generation means for generating, for each scene divided by the scene division means, a histogram based on the appearance ratio of each block by using the blocks acquired by the block feature information generation means.

Description

The present invention relates to an image processing apparatus and an image processing program for frame images included in a video.

Advances in recording technology have made it possible to store large amounts of video on hard disks. In addition, with the development of network environments, a wide variety of videos can be accessed through communication networks such as the Internet. Search techniques for quickly finding a desired video are therefore useful.

A common video search technique is keyword search on the video content (see, for example, Patent Document 1). However, when the amount of video becomes enormous, assigning accurate keywords and text information to each scene incurs a very high labor cost. Moreover, the assigned information reflects subjective differences between annotators, which may lower search accuracy. For this reason, “visual search” based on the similarity of image features has been actively studied as an approach different from keyword search. Conventional visual search operates on shots delimited by camera changes and, for the sake of speed, uses the similarity of representative frame images directly as the similarity of the shots.

Tomoki Masuda, Daisuke Yamamoto, Shigeki Ohira, Katashi Nagao, "Video Scene Retrieval Using Online Video Annotation", New Frontiers in Artificial Intelligence, Awarded Papers, LNAI 4914, Springer-Verlag, pp. 54-62 (2008)

However, a search in units of shots as described above may satisfy only part of the search intent, so a search mechanism that operates in units of “scenes” composed of multiple shots is needed. In a scene-based search, the appearance of frame images in the middle of a scene may differ greatly from that of the representative frame image. It is therefore difficult to treat a single frame image as the representative of a scene.

One conceivable approach is to use multiple images (for example, the representative images of all shots) to represent a scene. In that case, however, obtaining the similarity between scenes requires a brute-force calculation of the similarity between every pair of images, so the computational cost becomes very high.

The present invention has been made in view of the above problems, and an object thereof is to provide an image processing apparatus and an image processing program for appropriately acquiring feature information for each scene included in a video.

An image processing apparatus according to one aspect of the present invention is an image processing apparatus that extracts feature information of each scene included in a video, and includes: sampling acquisition means for sampling predetermined frame images from a sample video; block feature information generation means for dividing each frame image obtained by the sampling acquisition means into blocks at each of one or more scales and generating feature information for each divided block; scene division means for dividing a target video, for which the feature information is to be generated, into scenes; and histogram generation means for generating, for each scene divided by the scene division means, a histogram based on the appearance ratio of each block, using the blocks obtained by the block feature information generation means.

An image processing program according to one aspect of the present invention causes a computer to execute image processing for extracting feature information of each scene included in a video. The program causes the computer to function as: sampling acquisition means for sampling predetermined frame images from a sample video; block feature information generation means for dividing each frame image obtained by the sampling acquisition means into blocks at each of one or more scales and generating feature information for each divided block; scene division means for dividing a target video, for which the feature information is to be generated, into scenes; and histogram generation means for generating, for each scene divided by the scene division means, a histogram based on the appearance ratio of each block, using the blocks obtained by the block feature information generation means.

According to the present invention, feature information for each scene included in a video can be appropriately acquired.

FIG. 1 is a conceptual diagram of the multi-scale image piece word histogram.
FIG. 2 is a diagram showing examples of block images in the present embodiment.
FIG. 3 is a diagram showing an example of the functional configuration of the image processing apparatus.
FIG. 4 is a flowchart showing an example of the multi-scale image piece word generation process.
FIG. 5 is a diagram showing the flow of image piece word generation.
FIG. 6 is a flowchart showing an example of the multi-scale image piece word histogram generation process.
FIG. 7 is a diagram showing the flow of generating the multi-scale image piece word histogram.
FIG. 8 is a flowchart showing an example of the search process.
FIG. 9 is a diagram showing an example of calculating the distance D_i.
FIG. 10 is a diagram showing the 12 types of scenes used as queries and the correct video content set for each.
FIG. 11 is a diagram showing a schematic example of calculating the degree of relevance.
FIG. 12 is a diagram showing an example of search results in the present embodiment.
FIG. 13 is a diagram showing an example of the comparison method.
FIG. 14 is a diagram showing a comparison of experimental results.
FIG. 15 is a diagram showing an example of an accuracy comparison.

<About the present invention>
The present invention acquires feature information for a video (for example, for each scene) using a plurality of frame images included in the video. Specifically, each scene is classified using feature information based on a histogram of image piece words having one or more different image sizes (hereinafter, “multi-scale”) for each frame image (multi-scale image piece word histogram, Histogram of Multi-scale Image Piece Word; hereinafter referred to as “H-MIPW” where appropriate).

An image piece is, for example, one of the block images obtained by dividing a single frame image into blocks of a predetermined image size. The image size (scale) may be, for example, square or any other shape. A word is, for example, predetermined feature information such as a reference vector, but is not limited thereto. H-MIPW is based on, for example, a still-image classification method that uses the types and appearance ratios (frequencies) of block images, with the block size extended to multiple scales and the method extended to a moving-image feature.

FIG. 1 is a conceptual diagram of the multi-scale image piece word histogram, and FIG. 2 shows examples of block images in the present embodiment. In the example of FIG. 1, frame images sampled from a given scene are divided into blocks at each of one or more image sizes, a multi-scale image piece word histogram (H-MIPW) based on feature information is generated for the resulting image pieces, and the histogram indicates which types of block images exist in the scene and in what proportion (appearance ratio).

The type of a block image is strongly related to the content (subject) shown. For example, as shown in FIG. 2, video content such as “sky”, “mountain, forest”, and “sunset” can be obtained from the blocks into which a frame image is divided. The H-MIPW described above can therefore be regarded as one feature that comprehensively represents the content of a scene. In the present embodiment, feature information of a scene including, for example, a plurality of frame images is acquired based on H-MIPW.

In the present embodiment, each scene is represented by a single histogram, so similarity can be calculated at high speed with appropriate scene classification, and the acquired feature information enables scene search based on the similarity of video content. Embodiments of the image processing apparatus and the image processing program are described in detail below with reference to the drawings.

<Example of functional configuration of image processing apparatus>
FIG. 3 is a diagram showing an example of the functional configuration of the image processing apparatus. The image processing apparatus 10 shown in FIG. 3 broadly comprises a feature extraction device 20 and a scene search device 30. The image processing apparatus 10 according to the present embodiment may also be configured with only one of the feature extraction device 20 and the scene search device 30.

The feature extraction device 20 receives a preset set of preparation (sample) frame images and generates image piece words. The feature extraction device 20 also receives scenes (each a plurality of frame images) obtained by dividing a video at predetermined intervals (for example, fixed intervals or video boundaries) and calculates the image piece word histogram (H-MIPW) described above for each scene. The scene search device 30 searches the video information stored in advance for scenes corresponding to a scene requested by a user or the like, based on the similarity of the H-MIPWs obtained by the feature extraction device 20. The feature extraction device 20 and the scene search device 30 are described in detail below.

The feature extraction device 20 includes sampling acquisition means 21, divided block setting means 22, image piece word generation means (block feature information generation means) 23, scene division means 24, and histogram generation means 25. The scene search device 30 includes histogram generation means 31 and search means 32.

The sampling acquisition means 21 samples frame images at a predetermined interval (for example, every T_1 frames) from a preparation (sample) video set 41 stored in advance, and outputs a preparation frame image set 42 (P_1, ..., P_{N_P}). The predetermined interval (T_1) is, for example, a preset fixed frame interval, but is not limited thereto; it may be a fixed time interval, or the sampled frames may be the first image of each shot (for example, each camera change) making up a scene.
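The fixed-interval case can be sketched in a few lines of Python. This is only an illustration: the function name `sample_frames` and the use of a plain list as a stand-in for decoded frames are assumptions, not part of the disclosure.

```python
def sample_frames(frames, interval):
    """Return every `interval`-th frame, starting with the first one.

    `frames` is any sliceable sequence of decoded frames; `interval`
    plays the role of T_1 (hypothetical helper, not from the patent).
    """
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    return frames[::interval]


# Stand-in for a decoded video: frame indices 0..9, sampled with T_1 = 3
video = list(range(10))
print(sample_frames(video, 3))
```

Sampling the first image of each shot instead would require a shot-boundary detector, which is outside the scope of this sketch.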

The divided block setting means 22 sets at least one of the size (scale, image size), type, number, and the like of the one or more image pieces (image blocks) generated by the image piece word generation means 23. For example, the divided block setting means 22 can set three image piece sizes, 4 × 4 pixels, 8 × 8 pixels, and 16 × 32 pixels, but the sizes and number of image pieces are not limited to these. The image block settings may be specified by the user in advance, or may be set automatically according to the resolution of the input video, the genre of the video (for example, news program, sports, drama), and so on. Furthermore, the divided block setting means 22 may set at least one of the size, type, number, and the like of the image pieces based on a saliency map that represents the “conspicuous regions” of the whole image. A saliency map extracts regions whose properties differ from those of their surroundings as “highly salient (attention-drawing) regions”.

The image piece word generation means 23 generates image piece words 43 (W) from the preparation frame image set 42 based on, for example, the conditions set by the divided block setting means 22. A specific method for generating the image piece words is described later.

The scene division means 24 automatically divides the search target videos 44 (V_1, ..., V_{N_T}), which are specified by a user or the like through input means or the like, into scenes at a predetermined interval (for example, every T_2 frames), and generates search target scenes 45 (S_1, ..., S_{N_S}).

The search target video 44 is the video from which feature information for each scene is extracted; in the present embodiment it is, as an example, the video searched by the scene search device 30 described later. The predetermined interval (T_2) is, for example, a preset fixed frame interval, but is not limited thereto; it may be a fixed time interval, or the interval between the first frames of successive video segments. The predetermined interval (T_2) may be the same as or different from the predetermined interval (T_1) described above.

The histogram generation means 25 receives the search target scenes 45 (S_1, ..., S_{N_S}), which are scenes obtained by dividing the video at fixed intervals, and outputs an image piece word histogram 46 (H_1, ..., H_{N_S}) for each scene. A specific example of how the histogram generation means 25 generates the image piece word histogram 46 is described later.

By using features in units of image pieces (block regions) in this way, the feature extraction device 20 can extract highly accurate image feature information that leads, for example, to improved search accuracy.

The preparation video set 41, preparation frame image set 42, image piece words 43, search target videos 44, search target scenes 45, and image piece word histograms 46 described above may be stored in storage means or the like provided in the image processing apparatus 10, or may be managed by an external device (for example, a database server). When they are managed by an external device, the image processing apparatus 10 is connected to the external device via a communication network such as the Internet or a LAN (Local Area Network) so that data can be exchanged, and can read data stored in the external device and write data to it.

In the scene search device 30, the histogram generation means 31 generates a histogram for a requested scene input by a user or the like, in the same way as the histogram generation means 25 of the feature extraction device 20 described above. In the example of FIG. 3, a histogram is generated for the requested scene 51 (V_Q) specified by the user or the like, and the image piece word histogram 52 (H_Q) of the requested scene is output.

Based on the image piece word histogram 52 of the requested scene, the search means 32 searches for similar scenes by referring to the image piece word histograms 46 of the scenes acquired by the feature extraction device 20 described above, and outputs a search result 53. The search result 53 may consist of, for example, the scenes whose image piece similarity is equal to or greater than a preset threshold, but is not limited thereto; for example, a predetermined number of scenes may be output in descending order of similarity. The requested scene 51, the image piece word histogram 52 of the requested scene, and the search result 53 may be stored in, for example, preset storage means, or may be managed in an external database or the like.

As described above, image pieces in the present embodiment are considered to correlate strongly with the content of the image, so H-MIPW can serve as an effective moving-image feature for scene search based on the similarity of video content. A highly accurate search can therefore be performed for the requested scene, and highly similar scenes can be obtained.

Next, the multi-scale image piece word (hereinafter referred to as “MIPWord” where appropriate), which represents the types of block images described above, and an example of calculating H-MIPW for a given scene are described in detail.

<Example of multi-scale image piece word (MIPWord)>
A method by which the image piece word generation means 23 described above generates the multi-scale image piece word (MIPWord) is described below. The MIPWord is generated using, for example, preparation videos selected at random from the search target videos. FIG. 4 is a flowchart showing an example of the multi-scale image piece word generation process, and FIG. 5 shows the flow of image piece word generation.

In the example of FIG. 4, the image piece word generation process samples predetermined frame images from the preparation video set (S01). Sampling may, for example, take frame images at fixed intervals, or take frame images based on video boundaries or the like. Next, the process divides each sampled frame image into blocks at one or more scales (S02). In S02, for example, each frame image is divided into blocks at each of a plurality of scales, from scale 1 (n_W1 × n_H1 blocks) to scale N_d (n_WNd × n_HNd blocks).
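The block division in S02 can be illustrated with a minimal Python sketch. It assumes a frame is a list of pixel rows and that the frame width and height are exact multiples of the block grid; the names `split_into_blocks` and `multiscale_blocks` are hypothetical and chosen only for this example.

```python
def split_into_blocks(frame, n_w, n_h):
    """Split a frame (list of pixel rows) into n_w x n_h blocks.

    Assumes the frame width and height are exact multiples of n_w and n_h.
    Blocks are returned in row-major order.
    """
    height, width = len(frame), len(frame[0])
    if width % n_w or height % n_h:
        raise ValueError("frame size must be divisible by the block grid")
    bw, bh = width // n_w, height // n_h
    blocks = []
    for by in range(n_h):
        for bx in range(n_w):
            block = [row[bx * bw:(bx + 1) * bw]
                     for row in frame[by * bh:(by + 1) * bh]]
            blocks.append(block)
    return blocks


def multiscale_blocks(frame, scales):
    """Return {scale index i: blocks} for each (n_w, n_h) grid in `scales`."""
    return {i: split_into_blocks(frame, n_w, n_h)
            for i, (n_w, n_h) in enumerate(scales, start=1)}


# An 8 x 4 pixel toy frame divided at two scales (N_d = 2)
frame = [[(x, y, 0) for x in range(8)] for y in range(4)]
blocks = multiscale_blocks(frame, [(2, 2), (4, 2)])
print(len(blocks[1]), len(blocks[2]))
```

Real frames rarely divide evenly, so a practical version would have to crop, pad, or resize first; the text does not say which.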

Next, the process calculates a predetermined feature vector (feature information) for each divided block image (S03). Examples of the feature vector include color features and texture features, but it is not limited to these; other features may be used, and several kinds of feature information may be combined. Color features include, for example, the RGB mean value vector and the hue histogram. Texture features include, for example, fractal sequences, edge direction histograms, and CS-LBP (Center Symmetric - Local Binary Pattern) features.
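Of the features listed, the RGB mean value vector is the simplest to show. The sketch below assumes a block is a list of rows of (R, G, B) tuples; the function name `rgb_mean_vector` is an assumption made for illustration.

```python
def rgb_mean_vector(block):
    """Return the per-channel mean (R, G, B) over all pixels of a block.

    `block` is a list of rows, each a list of (R, G, B) tuples.
    """
    pixels = [pixel for row in block for pixel in row]
    n = len(pixels)
    return tuple(sum(pixel[c] for pixel in pixels) / n for c in range(3))


block = [[(0, 0, 0), (255, 0, 0)],
         [(0, 255, 0), (1, 1, 0)]]
print(rgb_mean_vector(block))  # (64.0, 64.0, 0.0)
```

Texture features such as CS-LBP would produce longer vectors, and several features could be concatenated, as the text allows.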

Next, at each scale i (i = 1, ..., N_d), the process clusters (groups) the set of block images based on the similarity of their feature vectors (S04). The clustering in S04 can use, for example, a partitioning optimization method such as the K-Means method, but is not limited to this. The K_i clusters generated at each scale i by S04 are denoted C[i, 1], ..., C[i, K_i].

Next, the process generates, as the multi-scale image piece word (MIPWord), the image piece word W = {w[1, 1], ..., w[i, k], ..., w[N_d, K_Nd]} whose elements are, for example, the center vectors w[i, k] of the clusters C[i, k] (S05). The generated MIPWord is then stored in storage means or the like (for example, as the image piece word 43) (S06).
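Steps S04 and S05 can be sketched with a small pure-Python K-Means, one run per scale. The details here are assumptions for illustration only: squared Euclidean distance as the similarity measure, random initial centers drawn from the data, a fixed iteration count, and the names `kmeans` and `build_mipword`.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain K-Means; returns the k cluster center vectors."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)  # initial centers drawn from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k),
                          key=lambda j: squared_distance(v, centers[j]))
            clusters[nearest].append(v)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                dim = len(members[0])
                centers[j] = tuple(sum(m[d] for m in members) / len(members)
                                   for d in range(dim))
    return centers


def build_mipword(features_by_scale, k_by_scale):
    """S04-S05: cluster each scale separately; centers become the words w[i, k]."""
    return {i: kmeans(features, k_by_scale[i])
            for i, features in features_by_scale.items()}


features = {1: [(0.0,), (1.0,), (9.0,), (10.0,)]}
print(sorted(build_mipword(features, {1: 2})[1]))  # [(0.5,), (9.5,)]
```

Clustering each scale independently keeps words of different block sizes from being mixed, which matches the per-scale clusters C[i, 1], ..., C[i, K_i] in the text.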

FIG. 5 shows the flow of MIPWord generation in the process of FIG. 4 when the number of block division scales is N_d = 2. As shown in FIG. 5, the same preparation (sample) video is divided into blocks at a plurality of scales (image sizes), and the image pieces at each scale are clustered based on their feature vectors to generate the image piece words.

When the genre of the search target videos or requested scenes (for example, news or a particular sport such as soccer or baseball) is known in advance, the preparation videos are preferably of the same genre, but this is not a requirement. The scales are set to an arbitrary size, type, and number by, for example, the divided block setting means 22 described above, and may also be set as desired according to the resolution of the input video or the like.

<Example of calculating the multi-scale image piece word histogram (H-MIPW) per scene>
Next, an example of calculating H-MIPW for a given scene is described with reference to the drawings. In the present embodiment, the H-MIPW of each scene of the search target video is calculated based on the multi-scale image piece word (MIPWord).

FIG. 6 is a flowchart showing an example of the multi-scale image piece word histogram generation process, and FIG. 7 shows the flow of generating the multi-scale image piece word histogram.

In FIG. 6, the multi-scale image piece word histogram generation process prepares a histogram H = {h[1, 1], ..., h[i, k], ..., h[N_d, K_Nd]} with as many elements as there are vectors w[i, k] in the MIPWord (W) generated from the scales (S11), and sets the initial value of each element to 0 (S12).

Next, the process samples frame images from each shot of the scene S at a predetermined interval (for example, every T frames) (S13), and divides each sampled frame image into blocks at one or more scales (S14). The scales used here may be the same as those in S02 described above (scale 1 (n_W1 × n_H1 blocks), ..., scale N_d (n_WNd × n_HNd blocks)), or may be a predetermined number (for example, three) of the plurality of scales (for example, five) used in S02.

Next, the process calculates a feature vector for each block image obtained in S14, in the same way as in S03 described above (S15). Then, at each scale i (i = 1, ..., N_d), the elements of the histogram H are incremented for all block images (S16). Specifically, in S16, among the vectors w[i, k] (k = 1, ..., K_i) of the MIPWord (W), the one most similar to the feature vector of the block image is taken as w[i, k'], and 1 is added to the element h[i, k'] of the histogram H that corresponds to it.

The process then divides each element of the histogram H by the total number of sampled frame images (S18), takes the resulting histogram H = {h[1, 1], ..., h[i, k], ..., h[N_d, K_Nd]} as the H-MIPW of the scene S, and stores it in storage means or the like (for example, as the image piece word histogram 46) (S19).
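The voting and normalization in S11 to S18 can be sketched as follows. This is a minimal illustration under stated assumptions: block feature vectors are already computed per frame and per scale, “most similar” is taken to mean smallest squared Euclidean distance, and the name `h_mipw` and the dictionary layout are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def h_mipw(frames, mipword):
    """Compute a scene histogram keyed by (scale i, word k), both 1-based.

    frames:  one dict per sampled frame, {scale: [block feature vectors]}
    mipword: {scale: [word vectors w[i, 1..K_i]]}
    """
    if not frames:
        raise ValueError("at least one sampled frame is required")
    # S11-S12: one zero-initialised bin per word
    hist = {(i, k): 0.0
            for i, words in mipword.items()
            for k in range(1, len(words) + 1)}
    for frame in frames:
        for i, features in frame.items():
            words = mipword[i]
            for f in features:
                # S16: vote for the nearest word (assumed similarity measure)
                best = min(range(len(words)),
                           key=lambda k: squared_distance(f, words[k]))
                hist[(i, best + 1)] += 1.0
    # S18: divide every bin by the number of sampled frames
    return {key: count / len(frames) for key, count in hist.items()}


mipword = {1: [(0.0,), (10.0,)]}
frames = [{1: [(1.0,), (9.0,)]}, {1: [(0.5,), (0.2,)]}]
print(h_mipw(frames, mipword))  # {(1, 1): 1.5, (1, 2): 0.5}
```

Because each bin is divided by the frame count rather than the block count, a bin holds the average number of blocks per frame assigned to that word, so scenes of different lengths remain comparable.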

FIG. 7 shows the flow of the multi-scale image piece word histogram generation process of FIG. 6 when the number of block division scales is N_d = 2. In the example of FIG. 7, frame images are sampled at a predetermined interval (T) from each scene S (consisting of multiple shots) in the search target video and divided into blocks at a plurality of scales.

Also in the example of FIG. 7, based on the feature vector of each divided block, the word w[j, k] of the previously generated MIPWord set (W) that is closest to that feature vector is found, and the corresponding element h[j, k] is incremented. In this way, the H-MIPW of the scene S can be obtained, as shown in FIG. 7. Accordingly, in the present embodiment, feature information can be extracted for each scene, and image classification can be performed quickly and appropriately.

<Scene Search Using the Multiscale Image Fragment Word Histogram (H-MIPW)>
Next, an example of scene search using the multiscale image fragment word histogram (H-MIPW) in the scene search device 30 will be described.

FIG. 8 is a flowchart illustrating an example of the search process. In the example of FIG. 8, the search process sets the variable that identifies each element to its initial value i = 1 (S21), and calculates the distance D_i between the image fragment word histogram H_i of the search target scene S_i and the image fragment word histogram H_Q of the request scene (S22). FIG. 9 is a diagram illustrating an example of calculating the distance D_i. In the present embodiment, as shown in FIG. 9, a similarity-based search is performed by obtaining, for each element, the distance D_i between the vectors of the image fragment word histograms H_Q and H_i of the request scene and the search target scene S_i.

That is, the search process sets i = i + 1 (S23) and calculates the vector distance D_i for the next element in turn. Here, it is determined, for example, whether i is greater than N_S (the last element) (S24). If i is not greater than N_S (NO in S24), the process returns to S22. If i is greater than N_S (YES in S24), the top N_HIT scenes, a number set in advance, are output as search results in ascending order of distance D_i, because a smaller distance D_i means a higher similarity (S25). In other words, S25 is equivalent to outputting the top N_HIT scene search results in descending order of similarity.
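The loop of S21 to S25 can be sketched as follows. The function name `search_scenes` is hypothetical, and the L1 (sum of absolute differences) distance is an assumption made only for illustration, because the distance calculation itself is given in FIG. 9 and not in the text.

```python
import numpy as np

def search_scenes(h_query, h_targets, n_hit):
    """Return the n_hit nearest search target scenes as (index, distance) pairs."""
    h_q = np.asarray(h_query, dtype=float)
    distances = []
    # S21, S23, S24: visit every search target scene S_i in turn.
    for i, h_i in enumerate(h_targets):
        # S22: distance D_i between H_Q and H_i (assumed L1 distance).
        d_i = float(np.abs(h_q - np.asarray(h_i, dtype=float)).sum())
        distances.append((d_i, i))
    # S25: a smaller distance means a higher similarity, so sort ascending
    # and keep the top N_HIT scenes.
    distances.sort()
    return [(i, d) for d, i in distances[:n_hit]]
```

Ties in distance are resolved by the lower scene index, which is a side effect of sorting (distance, index) tuples rather than anything stated in the source.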

As a result, the image processing apparatus 10 can set appropriate feature information for the scenes included in a video and can use that feature information to realize highly accurate scene search.

<Experimental results>
Next, to clarify the effect of the present embodiment, a scene search experiment based on the similarity of the H-MIPW of each scene, using actual program video, is described as an example, and the performance of H-MIPW is verified from the viewpoint of "searching for scenes with similar video content".

<Experimental conditions>
As experimental conditions, 254 nature-related broadcast program videos are used as an example of the video material, and 100 preparation videos are used for MIPWord generation. 254 videos are also used as the search target videos. Scenes are delimited with a fixed number of shots per scene, with every five shots forming one scene. The total number of scenes is about 7300. The normalized frame image size is 320 × 180, and block division uses N_d = 2 scales: scale 1 (16 × 16 pixels) and scale 2 (8 × 8 pixels). The number of MIPWords is 750 for both scale 1 and scale 2.

FIG. 10 is a diagram showing the 12 scenes used as queries and the correct video content set for each. Each image is the first image of a shot constituting the scene. The query scenes (Q1 to Q12) shown in FIG. 10 were selected at random from the search target videos, taking into account the ease of setting correct answers. The correct video content is set from the viewpoint of "a subject that appears in one of the shots and is considered reasonably important to the content", but is not limited to this. For example, the correct video content of query scene Q1 in FIG. 10 is {mountain, sky and mountain (sky + mountain), flower, branch, bird}, and so on. The correct video content of query scene Q2 is {distant view of a building, close view of a building, distant view of a town}, and so on.

In the present embodiment, the similarity between the H-MIPW of each query scene described above and the H-MIPW of every search target scene is calculated by histogram intersection, and the search results are obtained by sorting the search target scenes in descending order of similarity.
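A minimal sketch of ranking by histogram intersection is shown below. It uses the common unnormalized form (the sum of element-wise minima); whether the experiment applied any further normalization is not stated, so that choice and the function names are assumptions.

```python
import numpy as np

def histogram_intersection(h_a, h_b):
    """Similarity of two histograms as the sum of element-wise minima."""
    return float(np.minimum(h_a, h_b).sum())

def rank_by_intersection(h_query, h_targets):
    """Return (index, similarity) pairs sorted by descending similarity."""
    sims = [histogram_intersection(h_query, h) for h in h_targets]
    order = sorted(range(len(sims)), key=lambda i: -sims[i])
    return [(i, sims[i]) for i in order]
```

Unlike a distance, a larger intersection value means the two scenes share more of the same image fragment words, so the list is sorted in descending order.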

<Measure for accuracy evaluation>
Here, the measure (relevance) used to evaluate the accuracy of the search results is described. The relevance can be defined by considering both whether each shot of a scene is related to the correct video content and how completely the scene covers the correct video content; however, it is not limited to this and may, for example, be based on only one of these. The relevance R between a scene and the correct video content can be set as "R = 2RsRc / (Rs + Rc) ... (1)". Here, Rs is "the ratio of shots in the scene that include any of the correct video content", and Rc is "the ratio of the correct video content that is included in any of the shots".

FIG. 11 is a diagram illustrating a schematic example of calculating the relevance. Of the five shots that make up the scene shown in FIG. 11, the shots that show any of the correct video content of this scene, {moon, mountain, sea, fish}, are the three marked with □ in FIG. 11 (frame images 1, 2, and 4). The shot ratio Rs described above is therefore 3/5 = 60%.

Meanwhile, of the four items of correct video content {moon, mountain, sea, fish}, those that appear in any shot of the scene are the three marked with ○ (moon, mountain, sea). The correct video content ratio Rc described above is therefore 3/4 = 75%. From equation (1) above, the relevance R between this scene and the correct video content is 66.7%.
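The worked example above can be checked with a short sketch of equation (1). The function name `relevance` is hypothetical, and the assignment of subjects to individual shots below is invented for illustration, since FIG. 11 is not reproduced here; only the totals (three of five shots, three of four contents) come from the text.

```python
def relevance(shots, correct_contents):
    """Relevance R = 2*Rs*Rc / (Rs + Rc) between a scene and its correct content.

    shots: list of sets, each holding the subjects visible in one shot.
    correct_contents: iterable of the correct video content labels.
    """
    correct = set(correct_contents)
    # Rs: ratio of shots that show at least one correct content.
    hit_shots = sum(1 for s in shots if correct & set(s))
    rs = hit_shots / len(shots)
    # Rc: ratio of correct contents that appear in at least one shot.
    seen = set().union(*[set(s) for s in shots])
    rc = len(seen & correct) / len(correct)
    if rs + rc == 0:
        return 0.0
    return 2 * rs * rc / (rs + rc)

# Hypothetical shot contents: shots 1, 2 and 4 show correct content.
shots = [{"moon"}, {"mountain", "sea"}, set(), {"moon", "sea"}, {"tree"}]
r = relevance(shots, ["moon", "mountain", "sea", "fish"])
print(f"{r:.1%}")  # Rs = 3/5 and Rc = 3/4 give 66.7%
```

Equation (1) is the harmonic mean of Rs and Rc, so a scene scores highly only when both the shot ratio and the content coverage are high.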

<Scene search result example>
Next, examples of scene search results using H-MIPW are described. FIG. 12 is a diagram illustrating examples of search results in the present embodiment. FIGS. 12(a) to 12(c) show search result examples 1 to 3. Specifically, FIG. 12(a) shows the top 20 scenes of the search results for query scene Q1 shown in FIG. 10 described above. Similarly, FIG. 12(b) shows the top 20 scenes for query scene Q3, and FIG. 12(c) shows those for query scene Q9.

Each image is the first image of a shot constituting the scene. A ○ mark at the upper left of an image indicates "a shot that includes any of the correct video content" in the relevance calculation. A ○ mark in the table to the right of each search result scene indicates "correct video content included in any of the shots". Calculating the relevance for each of FIGS. 12(a) to 12(c), the average relevance between the top 20 scenes of the search results and the correct video content was 52.1% for FIG. 12(a), 68.9% for FIG. 12(b), and 68.9% for FIG. 12(c). Although the genre of the program video in this example is limited to nature programs, the results can be considered highly accurate given that the search targets a wide variety of scenes that is, in a sense, "miscellaneous".

<Accuracy comparison results with related methods>
Next, to objectively demonstrate the effectiveness of the present embodiment, the results of an accuracy comparison with related methods are described. First, to verify the effect of using multiple scales for block division, two methods are used for comparison. FIG. 13 is a diagram illustrating examples of the comparison methods. FIG. 13(a) shows, as comparison method 1, a conventional Bag of Visual Words method using local features, included to demonstrate the advantage of generating words from image fragments (block images). Representative local features that can be used include the luminance-gradient-based SIFT (Scale Invariant Feature Transform) and SURF (Speeded Up Robust Features) features, and the Color-SIFT feature, which extends the SIFT feature to color images. In this comparison, the Color-SIFT feature is used, and the number of Visual Words is 1000.

FIG. 13(b) shows, as comparison method 2, a method using an image fragment word histogram according to the present embodiment in which block division uses only a single scale (scale 1 (16 × 16 pixels)).

The Bag of Words method with local features (comparison method 1) performs reasonably well at identifying the subject of a single image. However, unlike the block images of comparison method 2, the association between each word and the subject is weak. Consequently, when multiple frames are integrated and processed, as in scene search, cases in which two scenes with different video content end up with similar Bag of Visual Words are likely to occur.

FIG. 14 is a diagram illustrating a comparison of experimental results. FIG. 14 shows the top 20 scenes of the search results obtained by each method for query scene Q12. FIG. 14(a) shows the search results of comparison method 1 described above, and FIG. 14(b) shows those of comparison method 2 (a single division scale). FIG. 14(c) shows, as comparison method 3, the method of the present embodiment, which uses an image fragment word histogram with block division at multiple scales. The images in FIG. 14 and the ○ marks in the figure indicate the correct video content included in any of the shots.

Comparing FIGS. 14(a) to 14(c), FIG. 14(c) has the most ○ marks. In scene search according to the present embodiment, at most about 20 search results are displayed on the first page. Accuracy can therefore be evaluated by the average relevance R between the top 20 scenes of the search results and the correct video content.

The accuracy calculated for each of FIGS. 14(a) to 14(c) is 43.2% for FIG. 14(a), 63.5% for FIG. 14(b), and 73.2% for FIG. 14(c), which shows that comparison method 3, an example of the present embodiment, has the highest search accuracy.

FIG. 15 is a diagram illustrating an example of the accuracy comparison. FIG. 15(a) shows, as evaluation results for each query, the accuracy of each of the methods of FIGS. 14(a) to 14(c) described above. FIG. 15(b) shows the overall (Total) accuracy obtained by averaging the accuracy over all queries shown in FIG. 15(a).

Referring to FIG. 15(a), comparison method 3 is more accurate than comparison method 1 in eight query scenes (Q1 to Q4, Q7, Q8, Q10, and Q12). In addition, as shown in FIG. 15(b), an overall accuracy improvement of 13% was obtained. Compared with comparison method 2, the present embodiment is more accurate in nine query scenes (Q2 to Q5, Q7, and Q9 to Q12), and the overall accuracy improved by 4%.

These results demonstrate the advantage of the present embodiment, which uses words based on image fragments (block images) rather than local features (Color-SIFT). They also show that it is preferable to generate image fragments at multiple scales. According to the present embodiment, using H-MIPW makes it possible to realize highly accurate scene search based on the similarity of video content.

<Execution program>
The image processing apparatus 10 described above can be configured by a computer that includes, for example, a CPU (Central Processing Unit), a volatile storage device such as a RAM (Random Access Memory), a nonvolatile storage device such as a ROM (Read Only Memory), input devices such as a mouse, keyboard, and pointing device, a display device that displays images and data, and an interface device for communicating with external equipment.

Each of the functions of the image processing apparatus 10 described above can therefore be realized by causing the CPU to execute a program that describes these functions. These programs can also be stored and distributed on recording media such as magnetic disks (floppy (registered trademark) disks, hard disks, etc.), optical disks (CD-ROM, DVD, etc.), and semiconductor memory.

In other words, image processing can be realized by generating an execution program (image processing program) that causes a computer to execute the processing of each configuration described above and installing that program on, for example, a general-purpose personal computer or server. The processing performed by the execution program in the present embodiment can realize, for example, each of the processes described above.

As described above, according to the present embodiment, the feature information of each scene included in a video can be acquired appropriately. In addition, similarity can be obtained quickly through appropriate scene classification based on the feature information. Highly accurate scene search based on the similarity of the image features of the entire scene can therefore be realized.

For example, highly accurate image classification can be performed by using, as a video feature for searching scenes that consist of multiple cuts, a multiscale image fragment word histogram based on the types and appearance ratios of multiscale block images. Applying the present embodiment also enables, for example, scene search on broadcast program video and scene search based on the similarity of video content. Consequently, highly accurate scene search can be realized based on the similarity of the image features of the entire scene, rather than on a representative thumbnail image of the scene as in the conventional approach.

Although preferred embodiments have been described in detail above, the disclosed technique is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the disclosed technique as set forth in the claims.

DESCRIPTION OF SYMBOLS
10 Image processing apparatus
20 Feature extraction device
21 Sampling acquisition means
22 Divided block setting means
23 Image fragment word generation means (block feature information generation means)
24 Scene division means
25, 31 Histogram generation means
30 Scene search device
32 Search means
41 Preparation video set
42 Preparation frame image set
43 Image fragment word
44 Search target video
45 Search target scene
46, 52 Image fragment word histogram
51 Request scene
53 Search result

Claims

An image processing apparatus that extracts feature information of each scene included in a video, comprising:
sampling acquisition means for sampling predetermined frame images from a sample video;
block feature information generation means for dividing each frame image obtained by the sampling acquisition means into blocks at each of one or a plurality of scales and generating feature information for each divided block;
scene division means for dividing the target video, for which the feature information is to be generated, into scenes; and
histogram generation means for generating, for each scene divided by the scene division means, a histogram based on the appearance ratio of each block, using the blocks obtained by the block feature information generation means.

The image processing apparatus according to claim 1, wherein the histogram generation means generates a block-based histogram for a search request scene from a user,
the apparatus further comprising search means for searching for a corresponding scene by referring to the histograms generated by the histogram generation means, using the histogram generated for the search request scene.

The image processing apparatus further comprising divided block setting means for setting at least one of the size, type, and number of the one or plurality of blocks generated by the block feature information generation means.

The image processing apparatus according to claim 1, wherein the feature information is a color feature or a texture feature.

An image processing program for causing a computer to execute image processing that extracts feature information of each scene included in a video, the program causing the computer to function as:
sampling acquisition means for sampling predetermined frame images from a sample video;
block feature information generation means for dividing each frame image obtained by the sampling acquisition means into blocks at each of one or a plurality of scales and generating feature information for each divided block;
scene division means for dividing the target video, for which the feature information is to be generated, into scenes; and
histogram generation means for generating, for each scene divided by the scene division means, a histogram based on the appearance ratio of each block, using the blocks obtained by the block feature information generation means.