JP3573493B2

JP3573493B2 - Video search system and video search data extraction method

Info

Publication number: JP3573493B2
Application number: JP14490794A
Authority: JP
Inventors: 健治川崎; 義章森本; 哲雄田中; 晶田中
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1994-06-27
Filing date: 1994-06-27
Publication date: 2004-10-06
Anticipated expiration: 2019-10-06
Also published as: JPH0816610A

Description

【０００１】
【産業上の利用分野】
本発明は、映像を記憶している媒体から利用者が所定の場面についての映像を検索する、動画検索システムおよびその方法に関するものである。
【０００２】
【従来の技術】
近年、マルチメディアシステムの普及に伴い、ＡＶデータを扱うシステムが次々と開発されはじめている。ここでＡＶデータとは、動画像を表わす動画データと音声データを合わせ持ったデータを指す。ＡＶデータの形式としては、動画データと音声データを時間を基準として並列管理している形式が主流をなしている。例えばＡｐｐｌｅＣｏｍｐｕｔｅｒ社のＱｕｉｃｋＴｉｍｅでは、動画データや音声データを別々のトラックに入れ、それらのトラックを同時にアクセスしながら、音声を伴った動画を、再生している。
【０００３】
まず、動画データの符号化について説明する。動画データは、音声データと比較すると、単位時間当りの情報量がきわめて多い。しかし、一般に連続した動画では、前後の画像と注目している画像とはよく似ているので、データとして見ると冗長性が非常に多い。そこで、ディジタル的に高能率符号化を行い、必要な情報量を大きく削減することができる。現在、高能率符号化の手法としては、ＩＳＯ（国際標準化機構）／ＩＥＣ（国際電気標準会議）のＭＰＥＧ（ＭｏｖｉｎｇＰｉｃｔｕｒｅＣｏｄｉｎｇＥｘｐｅｒｔｓＧｒｏｕｐ）によって定められた国際標準草案ＭＰＥＧ１が主流である。本特許はＡＶデータの検索を行うためのものなので、膨大な量のＡＶデータを扱うことが考えられる。必然的にそのＡＶデータに高能率符号化が施されている可能性も十分有り得る。そこで、まずＭＰＥＧ１の符号化方式について、図１０を用いて説明する。
【０００４】
先に述べたように、一般的に連続した動画では、前後の画像と注目している画像とはよく似ている。そこでＭＰＥＧ１では、例えば画像１００４０を符号化しようとする時には、時間的に前方の画像１００１０との差分を取り、その差分値を符号化してＰピクチャ（前方予測符号化画像）１００８０とする。また、画像１００２０を符号化しようとする時には、時間的に前方の画像１００１０か、後方の画像１００４０、もしくは前方と後方から作られた補間画像との差分を取り、その差分値を符号化してＢピクチャ（両方向予測符号化画像）１００６０とする。同様に画像１００３０を符号化しようとする時にも、画像１００１０と１００４０からＢピクチャ１００７０をつくる。このように符号化することにより、時間軸方向の冗長度を減らして、情報量を少なくしている。また、常に差分を符号化するのではなく、例えば入力画像１００１０をそのまま符号化して、Ｉピクチャ（フレーム内符号化画像）１００５０とすることもある。
【０００５】
また、ＭＰＥＧ１では単に同じ位置の差をとるだけでなく、動き補償を使用する。これは、マクロブロック単位で、符号化しようとする画像の前画像の中で、符号化しようとするブロックと一番差分が少ないブロックを探索し、それとの差分をとることにより、さらに送らなければならないデータを削減する手法をいう。ここで、マクロブロックとは１６×１６画素のブロックである。実際には、Ｐピクチャでは動き補償後の予測画との差分をとったものと、差分をとらないものの二者のうちデータ量の少ないものをマクロブロック単位に選択して符号化する。しかしこれでもまだ、物体の動いたうしろから出てきた部分に関しては、多くのデータを送らなければならない。そこでＢピクチャでは、すでに復号化された動き補償後の時間的に前方だけでなく、後方またはその両者から作った補間画像との差分をとったものと、差分をとらないものの四者のうち一番データ量の少ないものを同じくマクロブロック単位に選択して符号化する。このようにすれば、ほとんどのデータは送らなくても済む。
【０００６】
ＭＰＥＧ１も、動画データや音声データを別々のトラックに入れ、それらのトラックを同時にアクセスして、音声を伴った動画を再生している。従って、例えば、ある時間の音声データに対応する動画データを抽出することも可能である。
【０００７】
ところで、このようなＡＶデータを用いたシステム、特にテキスト、グラフィック、サウンド等、各種データを利用してプレゼンテーション用などのアプリケーションを作成するオーサリングシステムにおいては、ＡＶデータの検索とその表示が重要な問題となる。なぜなら、ＡＶデータは他のデータに比べ、再生するためのデータ読み込み処理に必要とされる時間が長い上に、ＡＶデータそのものの再生時間も長い可能性があるからである。
【０００８】
従来、ＡＶデータの検索を行うシステムにおいて、希望するデータを取り出す機構の一つとしては、予め登録しておいたブラウジング（拾い読み）用のデータを、使用者の指示により個々にまたは複数個一覧表示するブラウジング検索があった。この検索方式では、ブラウジング用のデータを見ることにより短時間に動画の内容が把握できるため、ＡＶデータの検索方式としては優れた方式である。
【０００９】
ところで、ブラウジング検索方式の問題点としては、予め検索用のデータを設定しておく必要があるので、その設定作業に非常に手間がかかるという点が挙げられる。この問題点を解決するために、場面切替の発生部分を検出し、その前後いずれかのフレームを検索用の画像として検索リストに抽出、検索時にはそれらを一覧表示する、特開平３−２９２５７２号公報に記載の技術がある。ここでフレームとは、動画データを構成する最小単位の要素の静止画像である。場面（シーン）とは、被写体が連続的に撮影されているフレームの集合体であると定義する。場面切替の発生部分とは、異なる２場面の接続する部分を指すものとする。例えば一般的なテレビ動画では、１秒間は３０フレームの静止画像で構成されている。この方式では、次に検出される場面切替部分までの時間が一定時間以下の場面切替部分を除いて抽出することも可能である。この方式を用いれば、１場面につき１フレームの検索画像を抽出し、検索用のデータとして利用することができる。
【００１０】
【発明が解決しようとする課題】
しかしながら、上記公報に記載の技術においては、ＡＶデータが沢山の場面数を持つ場合には、検索対象となる画像数もそれだけ多くなるので、検索に時間を要するという問題がある。つまり、単に場面切替の生じたフレームを検索用の画像とするだけでは、動画検索用システムおよびその方法としては不十分である。
【００１１】
この発明は、重要な場面については、一場面につき、複数フレームの検索画像を抽出し、きめ細かな検索データの抽出を行うことを目的とする。
【００１２】
また、この発明は、重要でない場面については、場面から検索画像を抽出せずに、検索用データ数の削減を行うことを目的とする。
【００１３】
【課題を解決するための手段】
上記課題を解決するために、本発明によれば、
複数のシーケンシャルな画像フレームより構成される動画像を表わす動画データと、前記各フレームのそれぞれに対応づけられるシーケンシャルな音声データとを入力し、入力した動画データの中から、前記動画像に含まれる場面毎に検索用データを抽出する動画検索システムにおいて、
入力する前記動画データの内容に基づいて、前記各場面毎に、当該場面を構成するフレームを検出する場面切替検出手段と、
入力する前記音声データの音量を、対応するフレーム毎に検出する音量検出手段と、
前記場面切替検出手段により検出された各場面について、前記音量検出手段により予め定められた値以上の音量が検出されたフレームを含む動画データを、当該各場面についての前記検索用データとして当該場面から抽出する検索用データ抽出手段と、
を備え、
前記検索用データ抽出手段は、
前記検索用データを抽出する場面の最大数として指示された数以下に、前記検索用データを抽出する場面の数を制限することを特徴とする動画検索システムを提供する。
【００１４】
また、
複数のシーケンシャルな画像フレームより構成される動画像を表わす動画データと、前記各フレームのそれぞれに対応づけられるシーケンシャルな音声データとを入力し、入力した動画データの中から、前記動画像に含まれる場面毎に検索用データを抽出する動画検索システムにおいて、
入力する前記動画データの内容に基づいて、前記各場面毎に、当該場面を構成するフレームを検出する場面切替検出手段と、
入力する前記音声データの音量を、対応するフレーム毎に検出する音量検出手段と、
前記場面切替検出手段により検出された各場面について、前記音量検出手段が検出した各フレーム毎の音量に基づいて、予め定められた間隔離れたフレームに対する音量変化が予め定められた値以上であるフレームを含む動画データを、当該各場面についての前記検索用データとして当該場面から抽出する検索用データ抽出手段と、
を備え、
前記検索用データ抽出手段は、
前記検索用データを抽出する場面の最大数として指示された数以下に、前記検索用データを抽出する場面の数を制限することを特徴とする動画検索システムを提供する。
【００１５】
【作用】
まず、ＡＶデータについて説明する。ＡＶデータとは、動画像を表わす動画データと、対応する音声を表わす音声データとを合わせ持ったデータである。
【００１６】
ＡＶデータに対し、場面切替の発生部分と、各フレームの音量とをそれぞれ検出する。これらの検出した情報を元に、ＡＶデータから必要なフレームを抽出し、これから検索用データを作成して表示する。
【００１７】
即ち、動画検索用データの生成の為の基準として、場面切替だけでなく音声データについても着目する。なぜなら、音量の大きい部分や音量変化の激しい部分は、ＡＶデータにとって、より特徴的な部分であると考えられるからである。動画検索用データは、フレーム中から検索の為に必要な部分を選択し、それを結合することにより生成する。
【００１８】
ＡＶデータは、通常複数のシーンから構成されており、それぞれのシーンの再生時間に応じて、その内容の重要度も異なる。そこで、各シーンの再生時間に応じて、以下のように動画検索用データを抽出する。
【００１９】
再生時間の短いシーンからは、基本的には意味を持たないと判断して検索用データを抽出しない。ただし、該シーンの平均音量が一定値以上のものについては、何らかの意味を持つものとして、該シーン全てについて検索用データを抽出する。
【００２０】
再生時間の長いシーンは、何らかの意味を持つものと判断して、特徴部分を１つ抽出する。ただし、シーンが非常に長い場合にも特徴部分が１つしか抽出されないといった事態を避けたい場合には、検索者の指示により、同一シーン内から複数の特徴部分を抽出できるようにする。さらにこの場合、以下の２つの機能の内のいづれかを選択できるようにする。
【００２１】
（１）複数の特徴部分を抽出したい場合には、一定音量以上の全てのフレームを抽出する。
【００２２】
（２）特徴部分を１つだけ抽出したい場合には、該シーン中の最大音量を示すフレームを抽出する。
【００２３】
再生時間が中間的な長さのシーンは、平均音量が一定値以上であれば、何らかの意味を持つものとして、該シーン中の最大音量を示す部分を抽出する。平均音量が一定値未満であっても、該シーン中に一定音量差以上の音量変化部分が存在すれば、何らかの意味を持つものと判断して、該シーン中の最大の音量変化を示す部分を抽出する。それ以外のシーンは、特に意味の無いものと判断して、抽出しない。
【００２４】
ＡＶデータは、複数のフレームによって構成されており、それぞれのフレームごとにある大きさの最大，最小，平均等の音量的特徴を持っている。シーン中の音量変化を調べる場合は、あるフレームの音量を、そのフレームとは異なるフレームの音量と比較しなければならない。そこで、シーン中の音量変化を調べる場合に、以下のように行なう。
【００２５】
まず、音量変化を調べる場合に、すぐ前のフレームを差分検出の対象としたのでは、ノイズの影響により、音量差分が最大の部分を正確に判定できない場合がある。そこで、何フレーム前の音量を差分検出時の比較対象とするかを検索者が設定できるようにする。
【００２６】
さらに、音量変化の最大部分の検出は、音量の増加が最大の部分を検出したい場合と、音量の変化が最も大きい部分、すなわち音量差分の絶対値が最大の部分を検出したい場合の２通りがある。これら２通りの内のどちらを選択するかは、検索者が設定できるようにする。
【００２７】
また、検索用データは複数の特徴場面から抽出されたフレームを結合することにより構成される。しかし、検索用データ生成時に検索用データを大きくとり過ぎると、検索者が参照しなければならない画像数が増えるので、検索のために必要な時間も多くなる。そこで、検索用データを必要以上に抽出し過ぎないようにするために、以下のようにする。
【００２８】
まず、検索用データを抽出する各場面から抽出されるを構成するフレームの数は、検索者が決められるように、抽出時に特徴部分の前後何フレームを抽出するかを設定できるようにする。
【００２９】
また、検索用データが大き過ぎる場合、検索者が参照しなければならない画像数が増えるので、検索のために必要な時間も多くなる。そこで、大き過ぎる検索用データを減らすために、以下のようにする。
【００３０】
まず、検索用データの抽出を許容しうる最大の場面数を、検索者が設定できるようにしておく。設定場面数よりも検索用データを抽出する場面数が大きい場合には、検索用データを抽出する場面数を減らして、設定場面数以下にする。この場合に、最大音量が最小の場面と音量差分が最小の場面のどちらを先に削除するかを、検索者が設定できるようにする。
【００３１】
また、検索用データとして許容しうる最大の総フレーム数を、検索者が設定できるようにする。設定フレーム数よりも検索用データの総フレーム数が大きい場合には、検索用データのフレーム数を減らして、設定フレーム数以下にする。
【００３２】
上記の、検索用データのフレーム数を設定フレーム数以下にする機能において、逆に検索用データが小さくなり過ぎた場合、検索時にその内容が理解できなくなる恐れがある。そこで、これを回避するため、１シーン当りのフレーム数を小さくし過ぎてその内容が理解できなくなるようなことがないように、１シーンに最低限必要なフレーム数も、検索者が設定できるようにする。
【００３３】
【実施例】
図１は本発明の一実施例における動画検索用データ生成方式のブロック図である。図１中の１０１０は本実施例の動画検索用データ生成方式全体を制御する主制御部、１０２０は使用者がキーボードなどから動画ファイルの検索指示や、検索用データ抽出時に必要とされるパラメータの入力を行う入力部である。１０３０は動画ファイルの記録、読み出しを行うデータ記憶管理部、１０４０は動画ファイル中の特徴部分を抽出する検索用データ生成部、１０５０は検索用データ生成部１０４０で管理される検索データを表示する表示部、１０６０は上記各部間のデータ交換を行う通信部である。
【００３４】
上記構成は、例えば、主制御部１０１０をＣＰＵで、入力部１０２０をキーボードで、データ記憶管理部１０３０をハードディスクで、表示部１０５０をＣＲＴディスプレイで、検索用データ生成部１０４０をＲＯＭ、ＲＡＭ等に格納されＣＰＵで実行されるプログラムで、通信部１０６０をバスを用いて実現することにより、従来からよく知られた装置で構成可能である。
【００３５】
図１を実現可能な装置構成図を、図１１に示す。図１１において、１１０１０はＣＰＵ（中央処理装置）、１１０２０はキーボード、１１０３０はハードディスク、１１０４０はＲＯＭ（リードオンリメモリ）、１１０５０はディスプレイ、１１０６０はバスを示す。
【００３６】
図１に戻り１０３１、１０３２はデータ記憶管理部１０３０を構成する。１０３１はＡＶデータを記憶するデータ記憶部、１０３２は主制御部１０１０または検索用データ生成部１０４０からの制御でデータ記憶部１０３１のデータを読み出すデータ読み出し部である。１０４１、１０４２、１０４３は検索用データ生成部１０４０を構成する。１０４１はデータ読み出し部１０３２から読み出される動画ファイルの場面切替を検出する場面切替検出部である。１０４２はデータ読み出し部１０３２から読み出される動画ファイルのフレームごとの音量を検出する音量検出部である。１０４３は、場面切替検出部１０４１と音量検出部１０４２の情報を元に所定の処理を行い、その結果を管理する検索用データ選択管理部で、当該フレーム番号または当該フレームデータを保持するものとする。
【００３７】
まず、場面切替の検出処理について説明する。場面切替の検出処理は、例えば、時間的に連続する２フレーム間の相関および画像全体の並行移動を示す動きベクトルの情報を用いて行える。詳しくは、まず１つのフレーム内で２次元的な小ブロックを定め、時間的に連続する２フレーム間において、各ブロックの相関をとり（例えば誤差の２乗）、相関があるか否かを判断する（誤差の２乗の大きさ）。相関がある場合（誤差の２乗が小さい）は、連続する場面（シーン）と考えられ、場面切替は発生していない。しかし、このフレーム相関だけではパン（左右移動）やチルト（上下移動）等のカメラの並行移動に起因するシーンを場面切替と判断してしまうことになるため、更に画面の並行移動量、すなわち動きベクトルを検出し、この動きベクトルに基づいて場面切替か否かを判別する（動きベクトルが検出された場合は連続シーン）。この動きベクトルの検出は、例えばフレーム画像の空間的勾配と、画像間差の関係から求まる。動きベクトルの検出の方法に関しては、文献「画像のデジタル信号処理：吹抜敬彦著、日刊工業新聞社」に詳しい。
【００３８】
ここで、動きベクトルの検出方法について説明する。代表的な動きベクトルの検出方法としては、ブロックマッチング法がある。以下、ブロックマッチング法について、図１２を用いて説明する。ブロックマッチング法は、符号化対象画像の１マクロブロック（１６×１６画素のブロック）と、前画像の全てのマクロブロックを比較し、画像の差分が最も小さいマクロブロックへのベクトルを動きベクトルとして採用する方法である。例えばマクロブロック１２０１０の動きベクトルを求めたい場合、前画像から読み出すマクロブロック１２０２０〜１２０７０の位置を少しずつずらしていく（この図ではｘ軸方向に２ずつずらしている）。読み出したそれぞれのブロックをマクロブロック１２０１０と比較し、最も画像の差分の小さいブロック（この図ではブロック１２０４０）を求める。ブロックマッチングは、候補となりうるブロックの数だけ行う。この図ではベクトル（６，０）が動きベクトルとして採用される。
【００３９】
したがって、例えばＭＰＥＧ１において場面切替を検出する場合には、Ｐピクチャにおいて、動き補償後の予測画との差分をとったブロックと、差分をとらないブロックの数を調べ、差分をとったブロックの数が一定値以上であれば、動きベクトルは検出されたと考えて、連続する場面と判断しても差し支えない。
【００４０】
従って、場面切替が発生したと判断されるのは、フレーム間相関が無く、しかも動きベクトルが認められなかった場合となる。
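The decision rule in the paragraphs above (a scene change is declared only when consecutive frames show neither inter-frame correlation nor a motion vector) can be sketched as follows. This is a simplified illustration, not the embodiment itself: it assumes frames are 2D lists of grayscale values, compares whole frames instead of small blocks, and approximates "a motion vector was detected" by testing whether some small global shift makes the frames match. The names `frame_error`, `is_scene_change`, `threshold`, and `max_shift` are hypothetical.

```python
def frame_error(prev, cur, shift_x=0, shift_y=0):
    """Mean squared error between `cur` and `prev` displaced by the given shift,
    computed over the area where the two frames overlap."""
    height, width = len(cur), len(cur[0])
    total, count = 0, 0
    for y in range(height):
        for x in range(width):
            sx, sy = x + shift_x, y + shift_y
            if 0 <= sx < width and 0 <= sy < height:
                diff = cur[y][x] - prev[sy][sx]
                total += diff * diff
                count += 1
    return total / count if count else float("inf")


def is_scene_change(prev, cur, threshold, max_shift=2):
    """True only if the frames neither correlate nor match under a global shift."""
    if frame_error(prev, cur) <= threshold:
        return False  # small squared error: the frames correlate, same scene
    for sy in range(-max_shift, max_shift + 1):
        for sx in range(-max_shift, max_shift + 1):
            if (sx, sy) != (0, 0) and frame_error(prev, cur, sx, sy) <= threshold:
                return False  # a pan or tilt explains the difference, same scene
    return True  # no correlation and no motion vector: scene change
```

With a frame containing one bright vertical stripe, shifting the stripe by one pixel is treated as a pan, while replacing the frame with a uniformly bright one is treated as a scene change.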
【００４１】
ここで音量検出処理について説明する。音量検出処理において、各フレームごとの音量値をそのままの値で用いると、各フレームのノイズの影響を大きく受けてしまう。そこで、フレームごとのノイズを除去するために、各フレームごとの音量値は、例えば、音量を求めたいフレームと時間的に連続する前後３フレーム、つまり音量を求めたいフレームを含めて合計７フレームの平均値をとり、それを該フレームの音量値として定義する。もちろん、この方法以外のノイズ除去方法を用いることも可能である。
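The noise reduction described above (each frame's volume is the average over that frame and the three frames on either side, seven frames in total) might be sketched like this. The function name `smoothed_volumes` and the `radius` parameter are hypothetical, and shortening the window at the start and end of the sequence is an assumption, since the text does not say how the edges are handled.

```python
def smoothed_volumes(volumes, radius=3):
    """Replace each frame volume by the mean of a window of up to 2*radius+1 frames."""
    smoothed = []
    for i in range(len(volumes)):
        lo = max(0, i - radius)                  # window is cut short at the start
        hi = min(len(volumes), i + radius + 1)   # and at the end (assumption)
        window = volumes[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

A single spike of 70 surrounded by silence becomes 10.0 at its own position, which illustrates how an isolated noisy frame is damped.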
【００４２】
以下、各フレームの音量を検出する方法を説明する。通常、音響波形をディジタル化するためには、図１３に示すように、標本化、量子化、符号化を順番に行う必要がある。標本化では、時間を細かく区切って、その単位時間での波形の高さを見る。量子化では、その波形の高さを、ある桁数の２進数の細かさに区切って読む。符号化では、音響波形を、量子化によって得られた値のディジタル信号に変換する。一般の音楽用コンパクトディスクでは、標本化周波数（標本化する際、１秒間に刻む時間の数）が４４．１ｋＨｚ、量子化ビット数（量子化する際、音の強弱を区切る細かさ）が１６ビットである。
【００４３】
したがって各フレームの音量は、例えば図１４に示すように、１フレーム時間内の符号化数値群の中で、例えば最大値１４０１０をそのフレームの音量として定義することにより求めることができる。もちろん、１フレームに対応する時間内で、符号化された値の平均値や最小値１４０２０をそのフレームの音量として定義しても構わない。
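As a rough illustration of the per-frame volume defined above: if audio sampled at 44.1 kHz accompanies video at 30 frames per second (combining these two figures from the text is an assumption), each frame spans 44100 / 30 = 1470 samples, and the frame volume is the maximum, minimum, or average of those coded values. The sketch below assumes the samples are already non-negative magnitudes; `frame_volumes` and its parameters are hypothetical names.

```python
def frame_volumes(samples, sample_rate=44100, fps=30, mode="max"):
    """Split `samples` into per-frame chunks and reduce each chunk to one volume."""
    per_frame = sample_rate // fps  # 44100 // 30 == 1470 samples per video frame
    reducers = {
        "max": max,
        "min": min,
        "mean": lambda chunk: sum(chunk) / len(chunk),
    }
    reduce_chunk = reducers[mode]
    volumes = []
    for start in range(0, len(samples), per_frame):
        chunk = samples[start:start + per_frame]  # the last chunk may be shorter
        volumes.append(reduce_chunk(chunk))
    return volumes
```

Choosing `mode="max"` corresponds to using value 14010 in FIG. 14, and `mode="min"` to value 14020.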
【００４４】
本実施例では場面切替検出方法、音量検出方法として、上記したフレーム間相関と動きベクトル、フレーム音量定義を用いる。また、図８は検索用データの生成過程で必要となる各シーンごとの情報を格納するシーン特徴テーブルである。テーブルの各行にはそれぞれの場面のシーン終了フレーム番号、シーン長、平均音量、最大音量フレームアドレス、最大音量、最大音量差分フレームアドレス、最大音量差分を格納する。テーブルをリストによって実現することにより、テーブルのサイズは任意に変化させることが可能である。
【００４５】
以下、図１のブロック図の動作例を図２のフローチャートを使って説明する。図２は動画検索用データ生成方式の動作を示すフローチャートである。まず、動画ファイルの検索処理が入力部１０２０の指示によって開始されると、検索者は検索したい動画ファイルを入力部１０２０で選択する（ステップ２０１０）。主制御部１０１０は、指示された動画ファイルを検索するために、検索用データ生成部１０４０に対し当該動画ファイルの検索用データの作成指示を行う。検索用データ生成部１０４０では、検索用データ選択管理部１０４３がデータ読み出し部１０３２に当該動画ファイルの読み出し指示を行う（ステップ２０２０）。検索用データ選択管理部１０４３は、当該動画ファイルの特徴部分を明らかにするために、場面切替検出部１０４１と音量検出部１０４２に対し場面切替発生情報と各フレームの音量情報を要求し、図８に示すシーン特徴テーブルを作成する（ステップ２０３０）。検索用データ選択管理部１０４３は、シーン特徴テーブルから得られる情報を元に、当該動画ファイル中の特徴部分を図９の抽出テーブルに抽出する（ステップ２０４０）。抽出テーブルの各行には検索データを構成するそれぞれの場面の最大音量、最大音量差分、場面の開始フレーム番号、終了フレーム番号を格納する。テーブルをリストによって実現することにより、テーブルのサイズは任意に変化させることが可能である。検索用データ選択管理部１０４３は、入力部１０２０によって与えられる条件に基づいて、抽出テーブルから検索用のデータを選択する（ステップ２０５０）。最後に、主制御部１０１０は表示部１０５０に検索用データの一覧表示を指示し、表示部１０５０は抽出テーブルに保存される各シーンの開始フレーム番号から終了フレーム番号までのフレームを一覧表示する（ステップ２０６０）。なお、これらの各部間のデータ交換は、通信部１０６０を介して行われる。
【００４６】
図２のフローチャート中のシーン特徴テーブル作成処理、特徴部分抽出処理、検索用データ選択処理については、図３，図５，図７にそれぞれの処理内容例を表すフローチャートを示す。さらに図３のフローチャート中の場面切替検出処理については図４に、図５のフローチャート中の同一場面内複数特徴部分抽出処理については図６に、それぞれの処理内容例を表すフローチャートを示す。以下、図３から図７までのフローチャートについて、その処理内容例を順番に説明する。
【００４７】
図３に示すシーン特徴テーブル作成処理（ステップ２０３０）の動作例を説明する。まず、２フレーム間の音量差分を検出する場合に、任意の相対フレームとの音量差分が検出できるように、何フレーム前の音量との差分をとるかを指定する音量差分検出対象フレーム数αを、入力部１０２０により指定する。また、αフレーム前のフレームとの音量差分が最大となるフレームを検出する場合に、音量の増加が最大となるフレームをとるか、それとも音量の変化、すなわち音量差分の絶対値が最大となるフレームをとるかを、入力部１０２０により指定する。（ステップ３０１０）。動画ファイルの最初のフレームからのフレーム数をカウントする現フレーム番号に０をセットする（ステップ３０１５）。指定ファイルの１フレーム目を読み込み（ステップ３０２０）、初期設定値として、１シーンのフレーム数をカウントするフレームカウンタに１を、１シーン中の各フレームの音量の合計を表す変数総音量に０を、１シーン中で音量が最大となるフレームがどこにあるかを表す変数最大音量フレーム番号に現フレーム番号を、そのフレームの音量を表す変数最大音量に０を、１シーン中でαフレーム前との音量差が最大となるフレームがどこにあるかを表す変数最大音量差分フレーム番号に現フレーム番号を、そのフレームの音量とαフレーム前のフレームの音量の差を表す変数最大音量差分に０をセットする（ステップ３０３０）。現フレーム番号に１を加える（ステップ３０３５）。現在読み込んでいるフレームの音量が最大音量よりも大きいかどうかを調べ（ステップ３０４０）、大きい場合には、最大音量フレーム番号を現フレーム番号に、最大音量を現フレームの音量値に置き換える（ステップ３０５０）。
【００４８】
次に、αフレーム前のフレームとの音量差分をとる前に、αフレーム前のフレームが同一シーン内に存在するかどうかを見る必要があるので、フレームカウンタの値がα＋１以上あるかどうかを調べる（ステップ３０６０）。α＋１以上ある場合には、αフレーム前の音量と比較することができるので、αフレーム前の音量を検出した後（ステップ３０７０）、音量の増加が最大となるフレームを最大音量差分フレームとしてとる場合には、最大音量差分が”（現フレーム音量）−（αフレーム前の音量）”よりも小さいかどうかを調べる（ステップ３０８０，３０９０）。音量差分の絶対値が最大となるフレームを最大音量差分フレームとしてとる場合には、”（現フレーム音量）−（αフレーム前の音量）”の絶対値よりも小さいかどうかを調べる（ステップ３０８０，３１００）。小さい場合には、現音量差分が最大音量差分となるので、最大音量差分フレーム番号に現フレーム番号を、最大音量差分に”（現フレーム音量）−（αフレーム前の音量）”の絶対値を、それぞれ代入する（ステップ３１１０）。さらに、最大音量差分の値が０よりも小さければ（ステップ３１２０）、最大音量差分フレーム番号からαを引く（ステップ３１３０）。これは、音量減少方向の差分をとる場合に、その下がり始めのフレームを音量減少のフレームとして抽出したいためである。
【００４９】
次に、総音量に現フレーム音量を加え（ステップ３１４０）、次フレームが存在するか否かを調べる。次フレームが存在する場合には、次フレームを読み込み（ステップ３１６０）、場面切替検出処理（ステップ３１７０）を経て、場面切替フラグが立っているかどうかを見る（ステップ３１８０）。フラグが立っていなければ、フレームカウンタに１を加えた後（ステップ３２１０）、再び現フレーム番号に１を加える（ステップ３０３５）。フラグが立っている場合は、図８に示すシーン特徴テーブル８０１０の新しい行に、シーン終了フレーム番号８０１５として現フレーム番号の値を、シーン長８０２０としてフレームカウンタの値を、平均音量８０３０として”（総音量）÷（フレームカウンタ）”の値を、最大音量フレーム番号８０４０として変数最大音量フレーム番号の値を、最大音量８０５０として変数最大音量の値を、最大音量差分フレーム番号８０６０として変数最大音量差分フレーム番号の値を、最大音量差分８０７０として変数最大音量差分の値をそれぞれ登録し（ステップ３１９０）、終了フラグが立っていればシーン特徴テーブル作成処理を終了、立っていなければ、再び初期設定（ステップ３０３０）に戻り、処理を継続する。以上、図３のシーン特徴テーブル作成処理（ステップ２０３０）の動作例を説明した。
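The scene feature table of FIG. 8 could be built along the following lines. This is a reduced sketch under several assumptions: frame numbers are 0-based, the scene change detector has already produced a list `is_cut` whose entry is true for the first frame of each new scene, and the two columns for the maximum volume difference are left out for brevity. All names are hypothetical.

```python
def build_scene_table(volumes, is_cut):
    """volumes[i]: volume of frame i; is_cut[i]: True if frame i starts a new scene.
    Returns one dict per scene, in playback order."""
    table = []
    start = 0
    for i in range(1, len(volumes) + 1):
        if i == len(volumes) or is_cut[i]:      # frame i - 1 closes the current scene
            scene = volumes[start:i]
            peak = max(scene)
            table.append({
                "end_frame": i - 1,
                "length": len(scene),
                "avg_volume": sum(scene) / len(scene),
                "max_volume_frame": start + scene.index(peak),  # first peak wins
                "max_volume": peak,
            })
            start = i
    return table
```

Storing the table as a Python list mirrors the remark in the text that implementing the table as a list lets its size vary freely.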
【００５０】
続いて、図３に示す場面切替検出処理（ステップ３１７０）の動作例を、図４を用いて説明する。まず、前フレームと現フレームの間に相関があるか（ステップ４０１０）、動きベクトルがあるかどうかを調べ（ステップ４０２０）、前フレームと現フレームの間に相関も動きベクトルもない場合には、場面切替が発生したものと判断して場面切替フラグを立てて（ステップ４０３０）、場面切替検出処理を終了する。相関あるいは動きベクトルの内、少なくとも一つ以上が検出された場合には、前フレームと現フレームは同一シーン内にあると判断して、場面切替フラグを降ろして（ステップ４０４０）、場面切替検出処理を終了する。以上、図４の場面切替検出処理（ステップ３１７０）の動作例を説明した。
【００５１】
続いて、図２に示す特徴部分抽出処理（ステップ２０４０）の動作例を図５を用いて説明する。まず、特徴部分を選別するためのパラメータとして、シーン長ａ，ｂ、音量ｃ，ｄ，ｅ、最大音量差分ｆを、入力部１０２０により設定する（ステップ５０１０）。加えて、入力部１０２０により、特徴フレームの前後何フレームを特徴部分として抽出するかを変数ｘで指定する。また、シーン長がｂ以上の場合に複数の特徴部分を抽出するか否かを入力部１０２０により指定する（ステップ５０２０）。シーン特徴テーブル８０１０の１番目の行を読み込み（ステップ５０３０）、シーンカウンタの値を０に設定する（ステップ５０４０）。
【００５２】
この後、シーンの長さによる場合分けを行う。まず、対象となるシーンがある程度以下の短いシーンであるかどうかを見るため、シーン長８０２０がａ以下であるかどうかを調べ（ステップ５０５０）、シーン長８０２０がａ以下でなければ、今度は対象となるシーンがある程度以上の長いシーンであるかどうかを見るため、シーン長８０２０がｂ以上であるかどうかを調べる（ステップ５０８０）。シーン長８０２０がａ以下である場合には、さらに平均音量８０３０がｄ以上であるかどうかを調べる（ステップ５０６０）。平均音量８０３０がｄ以上である場合には、シーン長８０２０が短くても特徴的な部分であると判断して、そのシーンの全フレームを抽出テーブルに抽出するために、最大音量９０２０に最大音量８０５０の値を、最大音量差分９０３０に最大音量差分８０７０の値を、開始フレーム番号９０４０に”（シーン終了フレーム番号）−（シーン長）＋１”を、終了フレーム番号９０５０にシーン終了フレーム番号８０１５を、それぞれ登録（ステップ５０７０）した後、シーンカウンタに１を加える（ステップ５１７０）。平均音量８０３０がｄよりも小さい場合には、その部分は特徴的ではないと判断して、何も行わない。
【００５３】
シーン長８０２０がｂ以上でない場合には、中間的なシーン長のシーンであると判断して、平均音量８０３０がｅ以上であるかどうかを調べる（ステップ５１３０）。平均音量８０３０がｅ以上である場合には、音量が最大となる部分がシーン中最も特徴的な部分であると判断して、最大音量フレーム番号のフレームとその前後ｘフレームを抽出テーブルに抽出するために、最大音量９０２０に最大音量８０５０の値を、最大音量差分９０３０に最大音量差分８０７０の値を、開始フレーム番号９０４０に”（最大音量フレーム番号）−ｘ”を、終了フレーム番号９０５０に”（最大音量フレーム番号）＋ｘ”を、それぞれ登録し（ステップ５１４０）、シーンカウンタに１を加える（ステップ５１７０）。平均音量８０３０がｅよりも小さい場合には、今度は最大音量差分８０７０がｆ以上であるかどうかを調べる（ステップ５１５０）。最大音量差分８０７０がｆ以上である場合には、音量の差分が最大となる部分がシーン中最も特徴的な部分であると判断して、最大音量差分フレーム番号のフレームとその前後ｘフレームを抽出テーブルに抽出するために、最大音量９０２０に最大音量８０５０の値を、最大音量差分９０３０に最大音量差分８０７０の値を、開始フレーム番号９０４０に”（最大音量差分フレーム番号）−ｘ”を、終了フレーム番号９０５０に”（最大音量差分フレーム番号）＋ｘ”を、それぞれ登録し（ステップ５１６０）、シーンカウンタに１を加える（ステップ５１７０）。最大音量差分がｆよりも小さい場合には、そのシーンは特徴的ではないと判断して、何も行わない。
【００５４】
一方、シーン長がｂ以上である場合には、ある程度以上長いシーンであると判断して、まず、そのシーンから複数の特徴部分を抽出するかどうかを調べる（ステップ５０９０）。複数の特徴部分を抽出する場合には、それらの特徴部分の数をカウントするシーンサブカウンタに０をセットし（ステップ５１００）、同一場面内複数特徴部分抽出処理（ステップ５１１０）によりそのシーンの特徴部分を抽出し、シーンカウンタに”（シーンサブカウンタ）−１”を加えた（ステップ５１２０）後、シーンカウンタに１を加える（ステップ５１７０）。つまり、ステップ５１２０とステップ５１７０を合わせて考えると、シーンカウンタにシーンサブカウンタの値を加えることに等しい。同一シーン内から複数の特徴部分を抽出しない場合には、音量が最大となる部分がシーン中最も特徴的な部分であると判断して、最大音量フレーム番号のフレームとその前後ｘフレームを抽出テーブルに抽出するために、最大音量９０２０に最大音量８０５０の値を、最大音量差分９０３０に最大音量差分８０７０の値を、開始フレーム番号９０４０に”（最大音量フレーム番号）−ｘ”を、終了フレーム番号９０５０に”（最大音量フレーム番号）＋ｘ”を、それぞれ登録（ステップ５１４０）、シーンカウンタに１を加える（ステップ５１７０）。
【００５５】
これらのステップのいずれかを経た後、シーン特徴テーブル８０１０に次の行が存在するかどうかを調べ（ステップ５１８０）、存在する場合はその行を読んで、再びシーン長がａ以下であるかどうか調べ（ステップ５０５０）、処理を継続する。存在しない場合には、特徴部分抽出処理を終了する。以上、図５の特徴部分抽出処理（ステップ２０４０）の動作例を説明した。
【００５６】
続いて、図５に示す同一場面内複数特徴部分抽出処理（ステップ５１１０）の動作例を、図６を用いて説明する。まず、複数の特徴部分を抽出するシーンのフレーム数と同数のフラグを用意して、”（シーン終了フレーム番号）−（シーン長）＋１”からシーン終了フレーム番号までの番号を割り当て、各フレームと１対１に対応付ける（ステップ６０１０）。抽出するシーンの１番目のフレームを読み込み（ステップ６０２０）、現フレームの音量が音量ｃ以上であるかどうかを調べる（ステップ６０３０）。音量ｃ以上であれば、現フレームとその前後のｘフレームに対応するフラグを立てる。その後、次フレームが存在するかどうかを調べ、存在する場合にはそのフレームを読み込んで（ステップ６０６０）、フレーム音量が音量ｃ以上であるかどうかを再び調べる（ステップ６０３０）。次フレームが存在しない場合には、時間的に連続するフラグの立っている部分が合計幾つあるかを調べてシーンサブカウンタにその値をセットし（ステップ６０７０）、最後にフラグに対応するフレームを抽出テーブルに抽出するために、最大音量９０２０に最大音量８０５０の値を、最大音量差分９０３０に最大音量差分８０７０の値を、開始フレーム番号９０４０に連続部分の開始フラグ番号を、終了フレーム番号９０５０に連続部分の終了番号を、全ての連続部分について順次登録し（ステップ６０８０）、同一場面内複数特徴部分抽出処理を終了する。以上、図６の同一場面内複数特徴部分抽出処理（ステップ５１１０）の動作例を説明した。
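The flag-based procedure of FIG. 6 described above can be sketched as follows: every frame whose volume is at least c raises the flags of itself and of the x frames on either side, and each run of consecutive raised flags becomes one extracted portion. The function name `loud_segments` and the argument `first_frame` (the frame number of the scene's first frame) are hypothetical, and clamping the flags to the scene boundaries is an assumption drawn from the fact that flags exist only for frames of the scene.

```python
def loud_segments(volumes, first_frame, c, x):
    """Return (start_frame, end_frame) pairs for each run of flagged frames."""
    n = len(volumes)
    flags = [False] * n
    for i, volume in enumerate(volumes):
        if volume >= c:
            # flag the loud frame and x frames before and after it
            for j in range(max(0, i - x), min(n, i + x + 1)):
                flags[j] = True
    segments = []
    start = None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i                               # a new run begins
        elif not flagged and start is not None:
            segments.append((first_frame + start, first_frame + i - 1))
            start = None
    if start is not None:                           # run reaches the end of the scene
        segments.append((first_frame + start, first_frame + n - 1))
    return segments
```

The length of the returned list plays the role of the scene sub-counter, and each pair supplies the start and end frame numbers registered in the extraction table.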
【００５７】
続いて、図２に示す検索用データ選択処理（ステップ２０５０）の動作例を、図７を用いて説明する。まず、検索用データの最大場面数を指定する最大許容限度場面数と、最大フレーム数を指定する最大許容限度フレーム数を指定する。また、１つの特徴部分を構成するフレームの数が小さくなり過ぎて内容が理解できなくなることを防止するために、１シーンを構成するフレーム数の下限を指定する最小必要限度フレーム数を指定する。さらに、最大音量の大きな特徴部分と最大音量差分の大きな特徴部分のどちらを検索用のデータとして優先するかを指定する（ステップ７０１０）。次に、シーンカウンタの値が最大許容限度場面数よりも多いかどうかを調べる（ステップ７０２０）。シーンカウンタの値が最大許容限度場面数よりも多い場合には、抽出場面数を減らす必要があるので、最大音量の大きな特徴シーンを最大音量差分の大きな特徴シーンよりも検索用のデータとして優先するかどうかを調べ（ステップ７０３０）、優先する場合には最大音量差分が一番小さいシーンを、優先しない場合には最大音量が一番小さいシーンを抽出テーブルからそれぞれ削除（ステップ７０４０）（ステップ７０５０）した後、シーンカウンタから１を引いて（ステップ７０５５）、再びステップ７０２０に戻り、シーンカウンタの値が最大許容限度場面数以下になるまでこれを繰り返す。この際、最大音量または最大音量差分が一番小さいシーンが複数ある場合には、終了フレーム番号が一番小さいシーンを削除する。
【００５８】
抽出場面数が最大許容限度場面数以下の場合には、まず、抽出テーブルの各行の”（終了フレーム番号）−（開始フレーム番号）＋１”の総和である抽出フレーム数を計算して（ステップ７０５８）、削除禁止シーンカウンタに０をセットする（ステップ７０６０）。削除禁止シーンカウンタは、１シーンを構成するフレームの数が、最低必要限度フレーム数よりも小さくなったシーンの数をカウントするカウンタである。次に、削除禁止シーンカウンタの値がシーンカウンタの値に等しいかどうかを調べる（ステップ７０７０）。等しい場合には、削除できるシーンが存在しないことを意味するので、検索用データ選択処理を終了する。等しくない場合には、抽出フレーム数が最大許容限度フレーム数よりも大きいかどうかを調べる（ステップ７０８０）。抽出フレーム数が最大許容限度フレーム数以下の場合には、フレーム削除の必要はないので、検索用データ選択処理を終了する。
【００５９】
抽出フレーム数が最大許容限度フレーム数よりも大きい場合には、抽出フレーム数を削減しなければならないので、まず、１つ目のシーンのシーン長を”（終了フレーム番号）−（開始フレーム番号）＋１”より計算して（ステップ７０９０）、シーン長が最低必要限度フレーム数よりも大きいかどうかを調べる（ステップ７１００）。シーン長が最低必要限度フレーム数よりも大きい場合には、そのシーンからのフレーム削減が可能なので、シーンの最前後１フレームずつを削除するために、シーンの開始フレーム番号に１を加え、終了フレーム番号から１を引き（ステップ７１１０）、抽出フレーム数から２を引く（ステップ７１１５）。シーン長が最低必要限度フレームよりも大きくない場合には、それ以上フレーム数を削減するとそのシーンの内容が理解できなくなるので、そのシーンからのフレーム削除は不可能と見なして、削除禁止カウンタに１を加える（ステップ７１２０）。次に、抽出テーブルに次シーンが存在するかどうかを調べる（ステップ７１３０）。存在する場合には、次シーンのシーン長を計算して（ステップ７１４０）、ステップ７１００から再び、次シーンが存在しなくなるまでこれを繰り返す。次シーンが存在しなくなった場合には、ステップ７０７０に戻り、削除禁止シーンカウンタがシーンカウンタに等しく、かつ抽出フレーム数が最大許容限度フレーム数以下になるまで、この処理を繰り返す。以上、図７の検索用データ選択処理（ステップ２０５０）の動作例を説明した。
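The frame reduction loop above can be illustrated with the sketch below, which removes one frame from each end of every extracted portion per pass until the total fits the limit or nothing more can be removed. It deliberately departs from the text in two small ways, both assumptions made to keep the example well behaved: a portion is trimmed only if it would still have at least `min_frames` frames afterwards, and the count of untrimmable portions is restarted on every pass. Rows are plain dicts with hypothetical keys.

```python
def trim_frames(rows, max_frames, min_frames):
    """rows: dicts with 'start_frame' and 'end_frame' (inclusive). Returns trimmed copies."""
    rows = [dict(r) for r in rows]
    total = sum(r["end_frame"] - r["start_frame"] + 1 for r in rows)
    while total > max_frames:
        locked = 0
        for r in rows:
            length = r["end_frame"] - r["start_frame"] + 1
            if length - 2 >= min_frames:
                r["start_frame"] += 1   # drop the first frame of the portion
                r["end_frame"] -= 1     # drop the last frame of the portion
                total -= 2
            else:
                locked += 1             # portion is already at its minimum size
        if locked == len(rows):
            break                       # nothing left to trim
    return rows
```

For example, portions of 10 and 4 frames with a limit of 8 frames and a minimum of 4 end up as 4 and 4 frames: only the longer portion is shortened.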
【００６０】
上記実施例では、検索用データ生成処理を検索時に毎回行うものとしているが、図１の検索用データ選択管理部１０４３内に検索後も抽出テーブルを保持する機能を持たせれば、毎回行う必要はない。この際には、主制御部１０１０がブラウズ指示を受けた動画ファイルに関し検索用データ選択管理部１０４３をチェックし、既に検索用データが生成されている場合には、その検索用データに基づいて直接表示する。生成されていない場合にのみ検索用データ生成部１０４０で、検索用データの作成を行う。
【００６１】
また、上記実施例では、検索用データ生成の為の各種パラメータ設定は、使用者が全て図１の入力部１０２０により行うものとしているが、検索用データ選択管理部１０４３内で保持されるシーン特徴テーブルの情報をパラメータとして利用することも可能である。例えば、図５のｃに、そのシーンの平均音量の１．５倍の値を使用することもできる。これは、ｃをステップ５０１０で設定せず、同一場面内複数特徴部分抽出処理５１１０の開始時に、ｃ＝（平均音量）×１．５を設定すればよい。また、ステップ５０１０で、シーン長ｂとして動画ファイル全体のシーン長の１０％の値を用いたり、音量ｅとして動画ファイル全体の平均音量の２倍の値を用いることも可能である。これらの場合は、ｂ＝（シーン特徴テーブルのシーン長総和）×０．１、およびｅ＝〔｛（シーン長×平均音量）の総和｝÷（シーン長の総和）〕×２を、図５のステップ５０１０で設定すれば良い。
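The derived parameter values given above can be worked through in a short sketch. It assumes each scene is represented as a `(scene_length, average_volume)` pair taken from the scene feature table; the function names are hypothetical, while the factors 0.1, 2, and 1.5 are the examples stated in the text.

```python
def derived_thresholds(scenes):
    """scenes: list of (scene_length, average_volume) pairs.
    Returns (b, e) computed as in the examples in the text."""
    total_length = sum(length for length, _ in scenes)
    # average volume of the whole file, weighting each scene by its length
    overall_avg = sum(length * avg for length, avg in scenes) / total_length
    b = total_length * 0.1   # 10% of the total scene length
    e = overall_avg * 2      # twice the overall average volume
    return b, e


def per_scene_c(average_volume):
    """Volume threshold c for one scene: 1.5 times that scene's average volume."""
    return average_volume * 1.5
```

For two scenes of 100 frames at average volume 2.0 and 300 frames at 4.0, the overall average is (200 + 1200) / 400 = 3.5, so b is 40 frames and e is 7.0.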
【００６２】
また、現在ＡＶデータを扱う場合、映像を蓄積する媒体としてはＶＴＲ等のテープが主流である。記憶媒体としてのテープは、シーケンシャルアクセスで、低速なマスストレージといえる。このような媒体を利用したシステムでは、検索の効果を高めるために、検索用データ選択管理部１０４３において、検索用データを高速ランダムアクセスが可能な別の記憶媒体（ハードディスク等）に蓄積することが考えられる。この方法によれば、動画ファイルの高速検索が可能になり、使用者は必要シーンのみをテープから再生することができる。
【００６３】
また、デジタル映像編集装置等では、アナログテープから装置内のデジタルメディア（ハードディスク等）にＡ／Ｄ変換して格納させる（ダウンロード）作業があり、本実施例での検索用データ生成処理をこの作業時に行うものとしてもよい。そうすればダウンロード終了後、直ちに動画ファイルの高速検索が可能となる。
【００６４】
以上説明したように、本実施例によれば、場面切り替えの生じたフレームと、フレームの音声情報とに基づいて、検索用の画像（動画）を抽出し、任意のＡＶデータ（ファイル）の検索が実現可能になる。
【００６５】
これにより、従来のように、検索を行う前にあらかじめ使用者が検索したい特徴部分を指定して検索用データを作成しておく必要がなくなり、使用者の作業量を低減できるという効果が得られる。
【００６６】
【発明の効果】
上記のように、本発明によれば、重要な（特徴的な）場面については、一場面につき複数フレームを検出用データとして抽出し、検出用データをきめ細かく抽出することができる。
【００６７】
また、本発明によれば、重要でない場面については、該場面から検索用データを抽出せずに、検索用データ数の削減を行うことができる。
【図面の簡単な説明】
【図１】図１は、本発明の実施例に係る動画検索用データ生成方式の全体的な構成を示すブロック図である。
【図２】図２は、図１の実施例の動作を示すフローチャートである。
【図３】図３は、図２のフローチャート中のシーン特徴テーブル作成処理の動作を示すフローチャートである。
【図４】図４は、図３のフローチャート中の場面切替検出処理の動作を示すフローチャートである。
【図５】図５は、図２のフローチャート中の特徴シーン抽出処理の動作を示すフローチャートである。
【図６】図６は、図５のフローチャート中の同一場面内複数特徴部分抽出処理の動作を示すフローチャートである。
【図７】図７は、図２のフローチャート中の検索用データ選択処理の動作を示すフローチャートである。
【図８】図８は、検索用データ生成の過程で作成されるシーン特徴テーブルである。
【図９】図９は、検索用データ生成の過程で作成される抽出テーブルである。
【図１０】図１０は、ＭＰＥＧ１の符号化方式に関する説明図である。
【図１１】図１１は、本発明の実現可能な装置構成図である。
【図１２】図１２は、ブロックマッチング法に関する説明図である。
【図１３】図１３は、各フレームの音量を検出する方法に関する説明図である。
【図１４】図１４は、各フレームの音量を検出する方法に関する説明図である。
【符号の説明】
１０１０…主制御部
１０２０…入力部
１０３０…データ記憶管理部
１０３１…データ記憶部
１０３２…データ読み出し部
１０４０…検索用データ生成部
１０４１…場面切替検出部
１０４２…音量検出部
１０４３…検索用データ選択管理部
１０５０…表示部
１０６０…通信部
[0001]
[Industrial applications]
The present invention relates to a moving image search system and a moving image search method in which a user searches for a video of a predetermined scene from a medium storing the video.
[0002]
[Prior art]
In recent years, with the spread of multimedia systems, systems that handle AV data have begun to be developed one after another. Here, AV data refers to data that combines moving image data representing a moving image with audio data. The mainstream format for AV data manages moving image data and audio data in parallel on a common time base. For example, in QuickTime from Apple Computer, moving image data and audio data are placed on separate tracks, and a moving image accompanied by audio is reproduced while these tracks are accessed simultaneously.
[0003]
First, the encoding of moving image data will be described. Moving image data carries far more information per unit time than audio data. In a continuous moving image, however, the images before and after the image of interest generally resemble it closely, so the data is highly redundant. High-efficiency digital coding can therefore greatly reduce the amount of information required. At present, the mainstream high-efficiency coding method is MPEG1, a draft international standard defined by MPEG (Moving Picture Coding Experts Group) of ISO (International Organization for Standardization) / IEC (International Electrotechnical Commission). Because this patent concerns the search of AV data, it can be expected to handle enormous amounts of AV data, and it is quite likely that such AV data has been subjected to high-efficiency coding. The MPEG1 coding method is therefore described first with reference to FIG. 10.
[0004]
As described above, in a continuous moving image the images before and after the image of interest generally resemble it closely. In MPEG1, therefore, when for example image 10040 is to be encoded, the difference from the temporally preceding image 10010 is taken and that difference is encoded to give P picture (forward predictive coded image) 10080. When image 10020 is to be encoded, the difference is taken from the temporally preceding image 10010, from the following image 10040, or from an interpolated image created from both, and that difference is encoded to give B picture (bidirectionally predictive coded image) 10060. Similarly, when image 10030 is to be encoded, B picture 10070 is created from images 10010 and 10040. Encoding in this way reduces redundancy along the time axis and so reduces the amount of information. The difference is not always encoded: for example, input image 10010 may be encoded as it is to give I picture (intra-frame coded image) 10050.
[0005]
MPEG1 does not simply take the difference at the same position; it also uses motion compensation. For each macroblock, the previous image of the image being encoded is searched for the block that differs least from the block being encoded, and the difference from that block is taken, which further reduces the data that must be sent. Here, a macroblock is a block of 16 × 16 pixels. In practice, for a P picture, each macroblock is encoded using whichever of two options yields less data: the difference from the motion-compensated prediction, or no difference at all. Even so, a large amount of data must still be sent for areas that are uncovered behind a moving object. For a B picture, therefore, each macroblock is encoded using whichever of four options yields the least data: the difference from the already decoded, motion-compensated image that is temporally forward, backward, or interpolated from both, or no difference at all. In this way, most of the data need not be sent.
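The search for the least-different block can be illustrated with a small exhaustive block-matching sketch. It is only an illustration under stated assumptions: frames are plain 2D lists of grayscale values, the matching cost is the sum of absolute differences (the text only speaks of the smallest difference), and the names `sad`, `best_motion_vector`, `block`, and `search` are hypothetical.

```python
def sad(cur, prev, cx, cy, px, py, block):
    """Sum of absolute differences between the block of `cur` at (cx, cy)
    and the block of `prev` at (px, py)."""
    total = 0
    for dy in range(block):
        for dx in range(block):
            total += abs(cur[cy + dy][cx + dx] - prev[py + dy][px + dx])
    return total


def best_motion_vector(cur, prev, cx, cy, block=16, search=8):
    """Return ((vx, vy), cost) for the candidate block in `prev` that differs least
    from the block of `cur` whose top-left corner is (cx, cy)."""
    height, width = len(prev), len(prev[0])
    best = None
    for vy in range(-search, search + 1):
        for vx in range(-search, search + 1):
            px, py = cx + vx, cy + vy
            if px < 0 or py < 0 or px + block > width or py + block > height:
                continue  # candidate block would fall outside the previous frame
            cost = sad(cur, prev, cx, cy, px, py, block)
            if best is None or cost < best[1]:
                best = ((vx, vy), cost)
    return best
```

An encoder would then code the difference between the current block and the block the vector points to; when the cost is zero, as for a patch that has merely moved, almost nothing remains to be sent for that macroblock.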
[0006]
MPEG1 also stores moving image data and audio data in separate tracks, accesses those tracks simultaneously, and reproduces moving images with sound. Therefore, for example, it is also possible to extract moving image data corresponding to audio data at a certain time.
[0007]
In systems that use such AV data, and especially in authoring systems that create applications such as presentations from various kinds of data including text, graphics, and sound, the retrieval and display of AV data is an important problem. This is because, compared with other data, AV data takes longer to read in for reproduction, and the playback time of the AV data itself may also be long.
[0008]
Conventionally, one mechanism for retrieving desired data in a system that searches AV data has been browsing search, in which browsing (skimming) data registered in advance is displayed individually or as a list of several items in response to the user's instruction. Because the content of a moving image can be grasped in a short time by looking at the browsing data, this is an excellent search method for AV data.
[0009]
One problem of the browsing search method is that the search data must be set in advance, and this setting work is very laborious. To solve this problem, Japanese Patent Laid-Open Publication No. 3-292572 describes a technique that detects the points where scene changes occur, extracts the frame immediately before or after each such point into a search list as a search image, and displays these images as a list at search time. Here, a frame is a still image, the smallest unit that makes up moving image data. A scene is defined as a set of frames in which a subject is filmed continuously. A scene change point is the point at which two different scenes are joined. For example, in ordinary television video, one second consists of 30 still-image frames. In this technique, it is also possible to exclude from extraction any scene change point that is followed by the next detected scene change point within a predetermined time. With this technique, one search image frame can be extracted per scene and used as search data.
[0010]
[Problems to be solved by the invention]
However, the technique described in the above publication has a problem in that when the AV data has a large number of scenes, the number of images to be searched also increases accordingly, so that it takes a long time to search. In other words, simply using a frame in which scene switching has occurred as a search image is not sufficient as a moving image search system and method.
[0011]
An object of the present invention is to extract search images of a plurality of frames per scene for important scenes, so that search data is extracted at a fine granularity.
[0012]
Another object of the present invention is to reduce the amount of search data by extracting no search image from unimportant scenes.
[0013]
[Means for Solving the Problems]
In order to solve the above problems, according to the present invention,
a moving image search system is provided that receives moving image data representing a moving image composed of a plurality of sequential image frames, together with sequential audio data associated with each of the frames, and that extracts search data from the input moving image data for each scene included in the moving image, the system comprising:
scene switching detection means for detecting, for each scene, the frames constituting that scene based on the content of the input moving image data;
volume detection means for detecting the volume of the input audio data for each corresponding frame; and
search data extraction means for extracting from each scene detected by the scene switching detection means, as the search data for that scene, moving image data including a frame for which the volume detection means detected a volume equal to or greater than a predetermined value,
wherein the search data extraction means limits the number of scenes from which the search data is extracted to no more than the number specified as the maximum number of scenes from which the search data is extracted.
[0014]
Also,
a moving image search system is provided that receives moving image data representing a moving image composed of a plurality of sequential image frames, together with sequential audio data associated with each of the frames, and that extracts search data from the input moving image data for each scene included in the moving image, the system comprising:
scene switching detection means for detecting, for each scene, the frames constituting that scene based on the content of the input moving image data;
volume detection means for detecting the volume of the input audio data for each corresponding frame; and
search data extraction means for extracting from each scene detected by the scene switching detection means, as the search data for that scene, moving image data including a frame whose change in volume relative to a frame a predetermined interval away, determined from the per-frame volumes detected by the volume detection means, is equal to or greater than a predetermined value,
wherein the search data extraction means limits the number of scenes from which the search data is extracted to no more than the number specified as the maximum number of scenes from which the search data is extracted.
[0015]
[Action]
First, AV data will be described. AV data is data that combines moving image data representing a moving image with audio data representing the corresponding sound.
[0016]
With respect to the AV data, a portion where a scene change occurs and a volume of each frame are detected. Based on the detected information, a necessary frame is extracted from the AV data, and search data is created and displayed based on the extracted frame.
[0017]
That is, attention is paid not only to scene switching but also to audio data as a criterion for generating moving image search data. This is because a part with a large volume or a part with a large volume change is considered to be a more characteristic part for the AV data. The moving image search data is generated by selecting a portion necessary for a search from the frames and combining them.
[0018]
The AV data is usually composed of a plurality of scenes, and the importance of the contents differs depending on the reproduction time of each scene. Therefore, moving image search data is extracted as follows according to the reproduction time of each scene.
[0019]
A scene with a short playback time is basically judged to carry no meaning, and no search data is extracted from it. However, if the average volume of such a scene is at or above a certain value, the scene is regarded as having some meaning and the entire scene is extracted as search data.
[0020]
A scene with a long playback time is judged to have some meaning, and one characteristic portion is extracted from it. However, to avoid the situation in which only one characteristic portion is extracted even from a very long scene, a plurality of characteristic portions can be extracted from the same scene at the searcher's instruction. In this case, one of the following two functions can be selected.
[0021]
(1) When it is desired to extract a plurality of characteristic portions, all frames having a certain volume or higher are extracted.
[0022]
(2) When it is desired to extract only one characteristic portion, a frame indicating the maximum volume in the scene is extracted.
[0023]
For a scene whose playback time is of intermediate length, if the average volume is at or above a certain value, the scene is regarded as having some meaning and the portion with the maximum volume in the scene is extracted. Even if the average volume is below that value, if the scene contains a volume change of at least a certain volume difference, the scene is judged to have some meaning and the portion showing the largest volume change in the scene is extracted. Any other scene is judged to have no particular meaning and is not extracted.
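The three cases above (short, long, and intermediate scenes) amount to a small decision procedure, sketched below. The threshold names follow the parameters used later in the embodiment (scene lengths a and b, volumes d and e, maximum volume difference f); the `Scene` record, the function name, and the returned labels are hypothetical, and the sketch assumes a is smaller than b.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    length: int        # number of frames in the scene
    avg_volume: float  # average volume over the scene
    max_diff: float    # largest volume change found in the scene


def choose_extraction(scene, a, b, d, e, f, multiple=False):
    """Return a label describing what to extract from `scene`."""
    if scene.length <= a:                      # short scene
        return "whole_scene" if scene.avg_volume >= d else "skip"
    if scene.length >= b:                      # long scene
        return "all_loud_frames" if multiple else "max_volume_frame"
    if scene.avg_volume >= e:                  # intermediate scene, loud on average
        return "max_volume_frame"
    if scene.max_diff >= f:                    # intermediate scene, sharp volume change
        return "max_diff_frame"
    return "skip"
```

A short but loud scene is therefore kept in full, while an intermediate scene that is quiet and has no marked volume change contributes nothing to the search data.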
[0024]
AV data is composed of a plurality of frames, and each frame has volume characteristics of some magnitude, such as its maximum, minimum, and average volume. To examine volume changes within a scene, the volume of one frame must be compared with the volume of a different frame. Volume changes within a scene are therefore examined as follows.
[0025]
First, if the immediately preceding frame is always used as the comparison target when examining a change in volume, noise may make it impossible to accurately determine the portion having the largest volume difference. Therefore, the searcher can set how many frames back the comparison frame is when the difference is detected.
[0026]
Further, the portion of maximum volume change can be detected in two ways: one detects the portion where the volume increase is the largest, and the other detects the portion where the volume change is the largest, that is, the portion where the absolute value of the volume difference is the largest. The searcher sets which of these two to use.
[0027]
The search data is configured by combining frames extracted from a plurality of characteristic scenes. However, if the search data is too large when the search data is generated, the number of images that the searcher must refer to increases, and the time required for the search also increases. Therefore, in order to prevent search data from being extracted more than necessary, the following is performed.
[0028]
First, the searcher can set how many frames before and after the characteristic portion are to be extracted, which determines the number of frames extracted from each scene.
[0029]
Further, if the search data is too large, the number of images that the searcher must refer to increases, so that the time required for the search also increases. Therefore, in order to reduce search data that is too large, the following is performed.
[0030]
First, the searcher can set the maximum number of scenes at which extraction of search data is allowable. When the number of scenes for extracting the search data is larger than the set number of scenes, the number of scenes for extracting the search data is reduced to be equal to or less than the set number of scenes. In this case, the searcher can set which of the scene with the smallest maximum volume and the scene with the smallest volume difference is deleted first.
[0031]
In addition, the maximum total number of frames allowable as search data is set by the searcher. When the total number of frames of the search data is larger than the set number of frames, the number of frames of the search data is reduced to be equal to or less than the set number of frames.
[0032]
When the number of frames of the search data is reduced to the set number or less, the contents may become impossible to understand during the search if too few frames remain for a scene. To avoid this, the searcher can set the minimum number of frames necessary for one scene, so that the number of frames per scene does not become too small.
[0033]
【Example】
FIG. 1 is a block diagram of a moving image search data generation method according to an embodiment of the present invention. In FIG. 1, reference numeral 1010 denotes a main control unit that controls the entire moving image search data generation method according to the present embodiment. Reference numeral 1020 denotes an input unit, such as a keyboard, through which the user inputs an instruction to search for a moving image file and the parameters required when extracting the search data. Reference numeral 1030 denotes a data storage management unit that records and reads moving image files; 1040, a search data generation unit that extracts characteristic parts of a moving image file; 1050, a display unit that displays the search data managed by the search data generation unit 1040; and 1060, a communication unit that performs data exchange between the above units.
[0034]
In the above configuration, each unit can be built from conventionally well-known devices. For example, the main control unit 1010 is a CPU, the input unit 1020 is a keyboard, the data storage management unit 1030 is a hard disk, the display unit 1050 is a CRT display, the search data generation unit 1040 is a program stored in a ROM, RAM, or the like and executed by the CPU, and the communication unit 1060 is a bus.
[0035]
FIG. 11 shows a device configuration capable of realizing FIG. 1. In FIG. 11, reference numeral 11010 denotes a CPU (central processing unit), 11020 denotes a keyboard, 11030 denotes a hard disk, 11040 denotes a ROM (read only memory), 11050 denotes a display, and 11060 denotes a bus.
[0036]
Returning to FIG. 1, reference numerals 1031 and 1032 constitute the data storage management unit 1030. Reference numeral 1031 denotes a data storage unit that stores AV data, and 1032 denotes a data reading unit that reads data from the data storage unit 1031 under the control of the main control unit 1010 or the search data generation unit 1040. Reference numerals 1041, 1042, and 1043 constitute the search data generation unit 1040. Reference numeral 1041 denotes a scene switching detection unit that detects scene switching in the moving image file read by the data reading unit 1032. Reference numeral 1042 denotes a volume detection unit that detects the volume of each frame of the moving image file read by the data reading unit 1032. Reference numeral 1043 denotes a search data selection management unit that performs predetermined processing based on the information from the scene switching detection unit 1041 and the volume detection unit 1042 and manages the result. The search data selection management unit 1043 holds frame numbers or frame data.
[0037]
First, the scene switching detection process will be described. A scene change can be detected using, for example, the correlation between two temporally consecutive frames and the motion vector indicating a parallel movement of the entire image. Specifically, two-dimensional small blocks are first defined in a frame, and the correlation of each block between two temporally consecutive frames (for example, the squared error) is calculated to determine whether the frames are correlated (that is, whether the squared error is small or large). If they are correlated (the squared error is small), the two frames are regarded as belonging to a continuous scene, and no scene switching has occurred. However, with this frame correlation alone, a scene produced by a parallel movement of the camera, such as a pan (lateral movement) or tilt (vertical movement), would be judged to be a scene switch. Therefore, a motion vector is also detected, and whether scene switching has occurred is determined based on it (when a motion vector is detected, the scene is continuous). The motion vector is obtained, for example, from the relationship between the spatial gradient of the frame image and the inter-frame difference. Methods of detecting a motion vector are described in detail in "Digital Signal Processing of Images" by Takahiko Fukinuki, Nikkan Kogyo Shimbun.
[0038]
Here, a method of detecting a motion vector will be described. A typical method is block matching, which is described below with reference to FIG. 12. Block matching compares one macroblock (a block of 16 × 16 pixels) of the image to be encoded with all macroblocks of the previous image, and adopts the vector to the macroblock having the smallest image difference as the motion vector. For example, to obtain the motion vector of macroblock 12010, macroblocks 12020 to 12070 are read from the previous image while the read position is shifted slightly each time (in this figure, by 2 in the x-axis direction). Each of the read blocks is compared with macroblock 12010, and the block having the smallest image difference (block 12040 in this figure) is found. Block matching is performed for every block that can be a candidate. In this figure, the vector (6, 0) is adopted as the motion vector.
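The block matching search described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names, the use of the sum of absolute differences as the "image difference", and the square search window are all assumptions made for the example.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def crop(img, x, y, size):
    """Cut a size x size block whose top-left corner is (x, y)."""
    return [row[x:x + size] for row in img[y:y + size]]


def block_match(cur, prev, bx, by, size=16, search=8, step=1):
    """Return ((dx, dy), cost) for the block of `cur` at (bx, by).

    (dx, dy) is the displacement to the best matching block in `prev`.
    `search` and `step` (hypothetical parameters) bound and space the
    candidate positions; candidates outside the image are skipped.
    """
    h, w = len(prev), len(prev[0])
    target = crop(cur, bx, by, size)
    best, best_cost = (0, 0), None
    for dy in range(-search, search + 1, step):
        for dx in range(-search, search + 1, step):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + size > w or y + size > h:
                continue
            cost = sad(target, crop(prev, x, y, size))
            if best_cost is None or cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best, best_cost
```

With `step=2` the candidates are spaced two pixels apart, as in the figure described above.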
[0039]
Therefore, when a scene change is detected in MPEG1, for example, the number of blocks in a P picture that are coded as a difference from the motion-compensated predicted image and the number of blocks coded without taking a difference are counted. If the number of blocks coded as a difference is equal to or greater than a certain value, a motion vector may be regarded as detected, and the scene may be determined to be continuous.
[0040]
Therefore, it is determined that scene switching has occurred when there is no inter-frame correlation and no motion vector is recognized.
[0041]
Here, the volume detection processing will be described. If the volume value of each frame is used as is, the result is strongly affected by the noise of individual frames. Therefore, to remove this noise, the volume value of a frame is defined, for example, as the average over the frame itself and the three temporally consecutive frames before and after it, that is, over a total of seven frames. Of course, a noise removal method other than this one may also be used.
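The seven-frame averaging can be sketched as below. The function name is hypothetical, and the handling of frames near the start and end of the file (averaging over whatever neighbours exist) is an assumption, since the text does not say how the boundaries are treated.

```python
def smooth_volumes(volumes, radius=3):
    """Average each frame's volume with up to `radius` frames on each side.

    radius=3 gives the seven-frame window mentioned in the text. Near the
    boundaries the window is truncated (an assumption of this sketch).
    """
    n = len(volumes)
    smoothed = []
    for i in range(n):
        window = volumes[max(0, i - radius):min(n, i + radius + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

A single noisy spike of 7 surrounded by silent frames is reduced to 1.0 at its own position, which illustrates why the smoothed value is less sensitive to per-frame noise.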
[0042]
Hereinafter, a method of detecting the volume of each frame will be described. Normally, to digitize an acoustic waveform, sampling, quantization, and encoding must be performed in that order, as shown in the figure. In sampling, time is divided into fine steps and the height of the waveform in each unit time is observed. In quantization, the height of the waveform is read off as a binary number with a fixed number of digits. In encoding, the acoustic waveform is converted into a digital signal whose values are those obtained by quantization. On a general music compact disc, the sampling frequency (the number of time steps per second used in sampling) is 44.1 kHz, and the number of quantization bits (the resolution with which the sound intensity is divided during quantization) is 16.
[0043]
Therefore, the volume of each frame can be obtained by defining, for example, the maximum value 14010 of the group of encoded values within one frame time as the volume of the frame, as shown in FIG. 14. Of course, the average value or the minimum value 14020 of the encoded values within the time corresponding to one frame may instead be defined as the volume of the frame.
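As a rough sketch of this definition, the encoded samples can be cut into groups of one frame time each and reduced to a single value per group. The function and parameter names are hypothetical; the number of samples per frame depends on the sampling frequency and the frame rate (for example 44100 ÷ 30 = 1470 at an assumed 30 frames per second), and keeping a shorter final group is a choice of this example.

```python
def frame_volumes(samples, samples_per_frame, mode="max"):
    """Reduce encoded audio samples to one volume value per video frame.

    mode selects the definition used for the frame volume:
    "max" (value 14010 in the text), "min" (value 14020), or "mean".
    """
    reducers = {
        "max": max,
        "min": min,
        "mean": lambda xs: sum(xs) / len(xs),
    }
    reduce_group = reducers[mode]
    return [
        reduce_group(samples[i:i + samples_per_frame])
        for i in range(0, len(samples), samples_per_frame)
    ]
```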
[0044]
In the present embodiment, the above-described inter-frame correlation, motion vector, and frame volume definition are used as a scene switching detection method and a volume detection method. FIG. 8 is a scene feature table for storing information for each scene required in the process of generating search data. Each row of the table stores the scene end frame number, scene length, average volume, maximum volume frame address, maximum volume, maximum volume difference frame address, and maximum volume difference of each scene. By realizing a table by a list, the size of the table can be changed arbitrarily.
[0045]
Hereinafter, an operation example of the block diagram of FIG. 1 will be described with reference to the flowchart of FIG. 2, which shows the operation of the moving image search data generation method. First, when search processing of a moving image file is started by an instruction from the input unit 1020, the searcher selects the moving image file to be searched using the input unit 1020 (step 2010). To search the specified moving image file, the main control unit 1010 instructs the search data generation unit 1040 to create search data for that file. In the search data generation unit 1040, the search data selection management unit 1043 instructs the data reading unit 1032 to read the moving image file (step 2020). To identify the characteristic portions of the moving image file, the search data selection management unit 1043 requests scene switching information and the volume information of each frame from the scene switching detection unit 1041 and the volume detection unit 1042 (step 2030). Based on the information obtained from the scene feature table, the search data selection management unit 1043 extracts the characteristic portions of the moving image file into the extraction table of FIG. 9 (step 2040). Each row of the extraction table stores the maximum volume, the maximum volume difference, the start frame number, and the end frame number of each scene constituting the search data. By implementing the table as a list, the size of the table can be changed arbitrarily. The search data selection management unit 1043 then selects search data from the extraction table based on the conditions given through the input unit 1020 (step 2050). Finally, the main control unit 1010 instructs the display unit 1050 to display a list of the search data, and the display unit 1050 displays the frames from the start frame number to the end frame number of each scene stored in the extraction table (step 2060). The data exchange between these units is performed via the communication unit 1060.
[0046]
FIG. 3, FIG. 5, and FIG. 7 are flowcharts showing examples of the scene feature table creation processing, the characteristic part extraction processing, and the search data selection processing in the flowchart of FIG. 2, respectively. FIG. 4 is a flowchart showing an example of the scene switching detection processing in the flowchart of FIG. 3, and FIG. 6 is a flowchart showing an example of the processing for extracting a plurality of characteristic portions from the same scene in the flowchart of FIG. 5. Hereinafter, the processing examples of the flowcharts in FIGS. 3 to 7 will be described in order.
[0047]
An operation example of the scene feature table creation processing (step 2030) shown in FIG. 3 will be described. First, so that the volume difference can be taken against a frame at any relative position, the number of frames α for volume difference detection, which specifies how many frames back the comparison frame is, is set through the input unit 1020. In addition, the input unit 1020 specifies whether the frame with the largest volume difference from the frame α frames earlier is to be the frame with the largest volume increase or the frame with the largest volume change, that is, the largest absolute value of the volume difference (step 3010). The current frame number, which counts frames from the first frame of the moving image file, is set to 0 (step 3015). The first frame of the specified file is read (step 3020), and the following initial values are set (step 3030): 1 for the frame counter, which counts the frames of one scene; 0 for the variable total volume, which holds the sum of the volumes of the frames in one scene; the current frame number for the variable maximum volume frame number, which indicates where the loudest frame of the scene is located, and 0 for the variable maximum volume, which holds the volume of that frame; the current frame number for the variable maximum volume difference frame number, which indicates where in the scene the frame with the largest volume difference from the frame α frames earlier is located, and 0 for the variable maximum volume difference, which holds the difference between the volume of that frame and the volume of the frame α frames earlier. Then 1 is added to the current frame number (step 3035). It is determined whether the volume of the currently read frame is higher than the maximum volume (step 3040). If it is, the maximum volume frame number is replaced with the current frame number, and the maximum volume is replaced with the volume of the current frame (step 3050).
[0048]
Next, before taking the volume difference from the frame α frames earlier, it is necessary to check whether that frame exists in the same scene. Therefore, it is checked whether the value of the frame counter is α + 1 or more (step 3060). If it is, the volume can be compared with that of the frame α frames earlier, so that volume is obtained (step 3070). When the frame with the largest volume increase is taken as the maximum volume difference frame, it is checked whether the maximum volume difference is smaller than "(current frame volume) − (volume α frames earlier)" (steps 3080, 3090). When the frame with the largest absolute volume difference is taken as the maximum volume difference frame, it is checked whether the maximum volume difference is smaller than the absolute value of "(current frame volume) − (volume α frames earlier)" (steps 3080, 3100). If it is smaller, the current volume difference becomes the new maximum, so the current frame number is stored as the maximum volume difference frame number, and the absolute value of "(current frame volume) − (volume α frames earlier)" is stored as the maximum volume difference (step 3110). Further, if the value of the maximum volume difference is smaller than 0 (step 3120), α is subtracted from the maximum volume difference frame number (step 3130). This is because, when the difference is taken in the direction of decreasing volume, it is desirable to extract the frame at the beginning of the decrease as the frame representing the volume decrease.
[0049]
Next, the current frame volume is added to the total volume (step 3140), and it is checked whether the next frame exists. If it does, the next frame is read (step 3160), the scene switching detection processing is performed (step 3170), and it is checked whether the scene switching flag is set (step 3180). If the flag is not set, 1 is added to the frame counter (step 3210), and the process returns to adding 1 to the current frame number (step 3035). If the flag is set, the following values are registered in a new row of the scene feature table 8010 shown in FIG. 8 (step 3190): the current frame number as the scene end frame number 8015, the value of the frame counter as the scene length 8020, "(total volume) ÷ (frame counter)" as the average volume 8030, the value of the variable maximum volume frame number as the maximum volume frame number 8040, the value of the variable maximum volume as the maximum volume 8050, the value of the variable maximum volume difference frame number as the maximum volume difference frame number 8060, and the value of the variable maximum volume difference as the maximum volume difference 8070. If the end flag is set, the scene feature table creation processing ends; otherwise, the process returns to step 3030 and continues. The operation example of the scene feature table creation processing (step 2030) in FIG. 3 has been described above.
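The table creation loop can be condensed into the following sketch, which works on a list of per-frame volumes and a list of scene-change flags instead of reading frames one by one. The function names and dictionary keys are hypothetical. Two points are assumptions of this example rather than statements of the text: the last scene of the file is also registered, and the volume difference is stored with its sign so that the check for a negative value (step 3120) is meaningful in absolute-value mode.

```python
def scene_row(volumes, start, end, alpha, use_abs):
    """Summarize frames start+1 .. end (1-based frame numbers) as one row."""
    scene = volumes[start:end]
    # max() keeps the first loudest frame, matching the strict comparison
    # "higher than the maximum volume" in step 3040.
    max_k = max(range(len(scene)), key=lambda k: scene[k])
    best_diff, best_k = 0, 0  # initial values of step 3030
    for k in range(alpha, len(scene)):  # frame counter >= alpha + 1
        diff = scene[k] - scene[k - alpha]
        better = abs(diff) > abs(best_diff) if use_abs else diff > best_diff
        if better:
            best_diff, best_k = diff, k
    if best_diff < 0:
        best_k -= alpha  # point at the frame where the decrease begins
    return {
        "end_frame": end,
        "length": len(scene),
        "avg_volume": sum(scene) / len(scene),
        "max_volume_frame": start + max_k + 1,
        "max_volume": scene[max_k],
        "max_diff_frame": start + best_k + 1,
        "max_diff": best_diff,
    }


def build_scene_table(volumes, is_cut, alpha=1, use_abs=False):
    """is_cut[i] is True when frame i+1 is the first frame of a new scene."""
    table, start = [], 0
    for i in range(1, len(volumes) + 1):
        if i == len(volumes) or is_cut[i]:
            table.append(scene_row(volumes, start, i, alpha, use_abs))
            start = i
    return table
```

For the volumes `[1, 2, 8, 3, 5, 5, 1]` with a scene change at the fifth frame, the first row covers frames 1 to 4 with an average volume of 3.5, a maximum volume of 8 at frame 3, and a maximum increase of 6 at frame 3.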
[0050]
Next, an operation example of the scene switching detection process (step 3170) shown in FIG. 3 will be described with reference to FIG. 4. First, it is checked whether there is a correlation between the previous frame and the current frame (step 4010) and whether there is a motion vector (step 4020). If neither a correlation nor a motion vector is found between the previous frame and the current frame, it is determined that a scene change has occurred, the scene switching flag is set (step 4030), and the scene switching detection process ends. If at least one of them is detected, it is determined that the previous frame and the current frame are in the same scene, the scene switching flag is cleared (step 4040), and the scene switching detection process ends. The operation example of the scene switching detection processing (step 3170) in FIG. 4 has been described above.
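The decision of FIG. 4 amounts to a two-condition test, sketched below. This is a simplification under stated assumptions: the correlation is measured as a mean squared error over the whole frame rather than per small block, the threshold value is arbitrary, and the motion vector check is passed in as a boolean because the text leaves its criterion open. All names are hypothetical.

```python
def mean_squared_error(a, b):
    """Mean squared pixel error between two equally sized grayscale frames."""
    count = sum(len(row) for row in a)
    total = sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return total / count


def scene_change_flag(prev, cur, motion_detected, mse_threshold=100.0):
    """Return True when a scene change is judged to have occurred.

    A change is reported only if the frames are uncorrelated (step 4010)
    and no motion vector was detected (step 4020).
    """
    correlated = mean_squared_error(prev, cur) <= mse_threshold
    return not correlated and not motion_detected
```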
[0051]
Next, an operation example of the characteristic portion extraction process (step 2040) shown in FIG. 2 will be described with reference to FIG. 5. First, scene lengths a and b, volumes c, d, and e, and a maximum volume difference f are set through the input unit 1020 as parameters for selecting characteristic portions (step 5010). In addition, the input unit 1020 specifies, as a variable x, the number of frames before and after the feature frame that are to be extracted as a characteristic portion. The input unit 1020 also specifies whether a plurality of characteristic portions are to be extracted when the scene length is b or longer (step 5020). The first row of the scene feature table 8010 is read (step 5030), and the value of the scene counter is set to 0 (step 5040).
[0052]
After that, the cases are classified according to the length of the scene. First, to see whether the target scene is a short scene of a certain length or less, it is checked whether the scene length 8020 is a or less (step 5050). If it is not, to see whether the scene is a long scene of a certain length or more, it is checked whether the scene length 8020 is b or more (step 5080). If the scene length 8020 is a or less, it is further checked whether the average volume 8030 is d or more (step 5060). If the average volume 8030 is d or more, the scene is judged to be a characteristic portion even though it is short. To extract all frames of the scene into the extraction table, the value of the maximum volume 8050 is registered as the maximum volume 9020, the value of the maximum volume difference 8070 as the maximum volume difference 9030, "(scene end frame number) − (scene length) + 1" as the start frame number 9040, and the scene end frame number 8015 as the end frame number 9050 (step 5070), and then 1 is added to the scene counter (step 5170). If the average volume 8030 is lower than d, the scene is judged not to be characteristic, and nothing is done.
[0053]
If the scene length 8020 is less than b, the scene is judged to be of intermediate length, and it is checked whether the average volume 8030 is e or more (step 5130). If it is, the portion with the highest volume is judged to be the most characteristic portion of the scene. To extract the frame with the maximum volume frame number and the x frames before and after it into the extraction table, the value of the maximum volume 8050 is registered as the maximum volume 9020, the value of the maximum volume difference 8070 as the maximum volume difference 9030, "(maximum volume frame number) − x" as the start frame number 9040, and "(maximum volume frame number) + x" as the end frame number 9050 (step 5140), and 1 is added to the scene counter (step 5170). If the average volume 8030 is lower than e, it is checked whether the maximum volume difference 8070 is f or more (step 5150). If it is, the portion with the maximum volume difference is judged to be the most characteristic portion of the scene. To extract the frame with the maximum volume difference frame number and the x frames before and after it into the extraction table, the value of the maximum volume 8050 is registered as the maximum volume 9020, the value of the maximum volume difference 8070 as the maximum volume difference 9030, "(maximum volume difference frame number) − x" as the start frame number 9040, and "(maximum volume difference frame number) + x" as the end frame number 9050 (step 5160), and 1 is added to the scene counter (step 5170). If the maximum volume difference is smaller than f, the scene is judged not to be characteristic, and nothing is done.
[0054]
On the other hand, if the scene length is b or more, the scene is judged to be a long scene, and it is first checked whether a plurality of characteristic portions are to be extracted from the scene (step 5090). If so, 0 is set in a scene sub-counter that counts the characteristic portions (step 5100), the characteristic portions of the scene are extracted by the processing for extracting a plurality of characteristic portions from the same scene (step 5110), "(scene sub-counter) − 1" is added to the scene counter (step 5120), and then 1 is added to the scene counter (step 5170). In other words, steps 5120 and 5170 together are equivalent to adding the value of the scene sub-counter to the scene counter. If a plurality of characteristic portions are not to be extracted from the same scene, the portion with the highest volume is judged to be the most characteristic portion of the scene. To extract the frame with the maximum volume frame number and the x frames before and after it into the extraction table, the value of the maximum volume 8050 is registered as the maximum volume 9020, the value of the maximum volume difference 8070 as the maximum volume difference 9030, "(maximum volume frame number) − x" as the start frame number 9040, and "(maximum volume frame number) + x" as the end frame number 9050 (step 5140), and 1 is added to the scene counter (step 5170).
[0055]
After any of these steps, it is checked whether the next row exists in the scene feature table 8010 (step 5180). If it does, the row is read, and the processing continues from the check of whether the scene length is a or less (step 5050). If it does not, the characteristic part extraction processing ends. The operation example of the characteristic portion extraction processing (step 2040) in FIG. 5 has been described above.
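The case analysis of FIG. 5 can be summarized in the following sketch. The function names and dictionary keys are hypothetical, and the extraction of several portions from a long scene is delegated to an optional callback `multi` so that the example stays short. As in the text, the start and end frame numbers are not clamped to the scene boundaries. The scene counter corresponds to the length of the returned list.

```python
def extraction_row(scene, start, end):
    """Build one row of the extraction table (FIG. 9)."""
    return {"max_volume": scene["max_volume"], "max_diff": scene["max_diff"],
            "start_frame": start, "end_frame": end}


def extract_features(scene_table, a, b, d, e, f, x, multi=None):
    """Classify each scene by length and extract its characteristic part.

    a, b: scene length thresholds; d, e: volume thresholds;
    f: volume difference threshold; x: frames kept on each side.
    """
    rows = []
    for s in scene_table:
        if s["length"] <= a:                         # short scene
            if s["avg_volume"] >= d:
                first = s["end_frame"] - s["length"] + 1
                rows.append(extraction_row(s, first, s["end_frame"]))
        elif s["length"] >= b:                       # long scene
            if multi is not None:
                rows.extend(multi(s))
            else:
                m = s["max_volume_frame"]
                rows.append(extraction_row(s, m - x, m + x))
        elif s["avg_volume"] >= e:                   # intermediate, loud
            m = s["max_volume_frame"]
            rows.append(extraction_row(s, m - x, m + x))
        elif s["max_diff"] >= f:                     # intermediate, big change
            m = s["max_diff_frame"]
            rows.append(extraction_row(s, m - x, m + x))
    return rows
```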
[0056]
Next, an operation example of the processing for extracting a plurality of characteristic portions from the same scene (step 5110) shown in FIG. 5 will be described with reference to FIG. 6. First, as many flags as there are frames in the scene are prepared and put in one-to-one correspondence with the frame numbers from "(scene end frame number) − (scene length) + 1" to the scene end frame number (step 6010). The first frame of the scene is read (step 6020), and it is checked whether the volume of the current frame is equal to or higher than the volume c (step 6030). If it is, the flags corresponding to the current frame and the x frames before and after it are set. Thereafter, it is checked whether the next frame exists. If it does, that frame is read (step 6060), and it is again checked whether its volume is equal to or higher than c (step 6030). If the next frame does not exist, the number of temporally continuous runs of set flags is counted and stored in the scene sub-counter (step 6070). Finally, to extract the frames whose flags are set, for every continuous run the value of the maximum volume 8050 is registered as the maximum volume 9020, the value of the maximum volume difference 8070 as the maximum volume difference 9030, the first frame number of the run as the start frame number 9040, and the last frame number of the run as the end frame number 9050 (step 6080), and the processing ends. The operation example of the processing for extracting a plurality of characteristic portions from the same scene (step 5110) in FIG. 6 has been described above.
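The flag-based procedure of FIG. 6 can be sketched as follows. The function name and arguments are hypothetical; the sketch returns only the start and end frame numbers of each continuous run, and the number of returned runs corresponds to the scene sub-counter. Flags exist only for frames of the scene, so the x-frame margin is cut off at the scene boundaries.

```python
def loud_runs(volumes, first_frame, c, x):
    """Return (start_frame, end_frame) for each run of flagged frames.

    volumes: per-frame volumes of one scene; first_frame: frame number of
    the first frame of the scene; c: volume threshold; x: margin in frames.
    """
    n = len(volumes)
    flags = [False] * n
    for i, volume in enumerate(volumes):
        if volume >= c:
            # Flag the loud frame and up to x frames on each side.
            for j in range(max(0, i - x), min(n, i + x + 1)):
                flags[j] = True
    runs, start = [], None
    # The appended False closes a run that reaches the end of the scene.
    for i, flag in enumerate(flags + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append((first_frame + start, first_frame + i - 1))
            start = None
    return runs
```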
[0057]
Next, an operation example of the search data selection process (step 2050) shown in FIG. 2 will be described with reference to FIG. 7. First, the maximum allowable limit number of scenes, which specifies the maximum number of scenes in the search data, and the maximum allowable limit number of frames, which specifies the maximum number of frames, are specified. In addition, to prevent a characteristic portion from becoming incomprehensible because too few frames remain, the minimum necessary limit number of frames, which is the lower limit on the number of frames constituting one scene, is specified. It is also specified which is to be given priority as search data: characteristic portions with a large volume or characteristic portions with a large volume difference (step 7010). Next, it is checked whether the value of the scene counter is larger than the maximum allowable limit number of scenes (step 7020). If it is, the number of extracted scenes must be reduced, so it is checked whether characteristic scenes with a large maximum volume are given priority as search data over characteristic scenes with a large maximum volume difference (step 7030). If they are, the scene with the smallest maximum volume difference is deleted from the extraction table (step 7040); if not, the scene with the smallest maximum volume is deleted (step 7050). After that, 1 is subtracted from the scene counter (step 7055), and the process returns to step 7020; this is repeated until the value of the scene counter is equal to or less than the maximum allowable limit number of scenes. If several scenes share the smallest maximum volume or maximum volume difference, the one with the smallest end frame number is deleted.
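A minimal sketch of the scene count reduction (steps 7020 to 7055) is shown below. The function name, the flag `prefer_volume`, and the dictionary keys are hypothetical. The choice of which value is used for deletion follows the wording of steps 7040 and 7050 as read here, and ties are broken by the smallest end frame number.

```python
def limit_scene_count(rows, max_scenes, prefer_volume=True):
    """Delete extraction-table rows until at most max_scenes remain.

    prefer_volume=True: delete the row with the smallest maximum volume
    difference (step 7040); otherwise delete the row with the smallest
    maximum volume (step 7050).
    """
    rows = list(rows)  # work on a copy so the caller's table is untouched
    key = "max_diff" if prefer_volume else "max_volume"
    while len(rows) > max_scenes:
        victim = min(rows, key=lambda r: (r[key], r["end_frame"]))
        rows.remove(victim)
    return rows
```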
[0058]
If the number of extracted scenes is equal to or less than the maximum allowable limit number of scenes, the number of extracted frames, which is the sum of "(end frame number) − (start frame number) + 1" over the rows of the extraction table, is first calculated (step 7058), and 0 is set in a deletion prohibition scene counter (step 7060). The deletion prohibition scene counter counts the scenes whose number of frames is not larger than the minimum necessary limit number of frames. Next, it is checked whether the value of the deletion prohibition scene counter is equal to the value of the scene counter (step 7070). If they are equal, there is no scene from which frames can be deleted, and the search data selection process ends. If not, it is checked whether the number of extracted frames is larger than the maximum allowable limit number of frames (step 7080). If the number of extracted frames is equal to or less than the maximum allowable limit number of frames, there is no need to delete frames, and the search data selection process ends.
[0059]
If the number of extracted frames is larger than the maximum allowable limit number of frames, the number of extracted frames must be reduced. First, the scene length of the first scene is calculated as "(end frame number) − (start frame number) + 1" (step 7090), and it is checked whether the scene length is larger than the minimum necessary limit number of frames (step 7100). If it is, frames can be removed from that scene, so 1 is added to the start frame number of the scene, 1 is subtracted from its end frame number (step 7110), and 2 is subtracted from the number of extracted frames (step 7115). If it is not, removing more frames would make the contents of the scene impossible to understand, so frames cannot be deleted from the scene, and 1 is added to the deletion prohibition scene counter (step 7120). Next, it is checked whether the next scene exists in the extraction table (step 7130). If it does, the scene length of the next scene is calculated (step 7140), and the processing is repeated from step 7100 until there is no next scene. When there is no next scene, the process returns to step 7070, and the processing is repeated until the deletion prohibition scene counter equals the scene counter or the number of extracted frames is equal to or less than the maximum allowable limit number of frames. The operation example of the search data selection process (step 2050) in FIG. 7 has been described above.
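The frame trimming loop (steps 7058 to 7140) can be sketched as follows. The names are hypothetical. One deliberate deviation is an assumption of this sketch: instead of a deletion prohibition scene counter, the loop stops when a full pass over the table trims no scene, which is the same stopping condition if the counter is recounted on every pass.

```python
def limit_total_frames(rows, max_frames, min_frames):
    """Trim one frame from both ends of each scene, pass by pass.

    Returns the trimmed rows and the resulting total number of frames.
    Scenes whose length is not larger than min_frames are left untouched.
    """
    rows = [dict(r) for r in rows]  # copy so the caller's table is untouched
    total = sum(r["end_frame"] - r["start_frame"] + 1 for r in rows)
    while total > max_frames:
        trimmed_any = False
        for r in rows:
            length = r["end_frame"] - r["start_frame"] + 1
            if length > min_frames:
                r["start_frame"] += 1   # drop the first frame
                r["end_frame"] -= 1     # drop the last frame
                total -= 2
                trimmed_any = True
        if not trimmed_any:
            break  # every scene is already at its minimum length
    return rows, total
```

For example, with scenes covering frames 1 to 10 and 21 to 24, a limit of 8 frames, and a minimum of 4 frames per scene, three passes shrink the first scene to frames 4 to 7 and leave the second scene unchanged.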
[0060]
In the above embodiment, the search data generation process is performed every time a search is performed. However, if the search data selection management unit 1043 in FIG. 1 has a function of retaining the extraction table after a search, the process need not be performed each time. In that case, the main control unit 1010 queries the search data selection management unit 1043 about the moving image file for which the browse instruction has been received, and if the search data has already been generated, the display is performed directly based on that search data. Only when it has not been generated does the search data generation unit 1040 create the search data.
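The retention of extraction tables described above amounts to a simple lookup before generation. The following Python sketch shows one way this could look. The class name SearchDataCache, the method get and the generate callback are hypothetical and stand in for the search data selection management unit 1043 and the search data generation unit 1040; how the embodiment actually stores the tables is not specified here.

```python
class SearchDataCache:
    """Keeps one extraction table per moving image file (illustrative only)."""

    def __init__(self, generate):
        # generate: callable that builds an extraction table for a file
        # (plays the role of the search data generation unit; an assumption)
        self._generate = generate
        self._tables = {}

    def get(self, file_name):
        # Generate search data only when none has been retained for this file.
        if file_name not in self._tables:
            self._tables[file_name] = self._generate(file_name)
        return self._tables[file_name]
```

With such a structure, a second browse request for the same file returns the retained table without running the generation process again.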
[0061]
Further, in the above embodiment, the various parameters for generating search data are set by the user using the input unit 1020 of FIG. 1. However, the information in the scene feature table held in the search data selection management unit 1043 can also be used to derive the parameters. For example, a value 1.5 times the average volume of the scene can be used. This is achieved by setting c = (average volume) × 1.5 at the start of the same-scene multiple feature portion extraction process 5110, instead of setting c in step 5010. It is also possible to use 10% of the total scene length of the entire moving image file as the scene length b, or twice the average volume of the entire moving image file as the volume e. In these cases, b = (sum of scene lengths in the scene feature table) × 0.1 and e = [{sum of (scene length × average volume)} ÷ (sum of scene lengths)] × 2 may be set in step 5010.
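The formulas for b, c and e above can be written out as a short Python sketch. It assumes, purely for illustration, that each row of the scene feature table is available as a (scene length, average volume) pair; the function names derive_parameters and per_scene_threshold are hypothetical.

```python
def derive_parameters(scene_table):
    """scene_table: list of (scene_length_in_frames, average_volume) rows (assumed layout)."""
    total_length = sum(length for length, _ in scene_table)
    # b = (sum of scene lengths in the scene feature table) x 0.1
    b = total_length * 0.1
    # e = [{sum of (scene length x average volume)} / (sum of scene lengths)] x 2
    weighted_volume = sum(length * volume for length, volume in scene_table)
    e = (weighted_volume / total_length) * 2
    return b, e


def per_scene_threshold(average_volume):
    # c = (average volume of the scene) x 1.5
    return average_volume * 1.5
```

Note that e uses a length-weighted average, so a long quiet scene lowers the file-wide average more than a short loud one raises it.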
[0062]
At present, when AV data is handled, tape such as that of a VTR is mainly used as the medium for storing video. Tape as a storage medium is a sequential-access, low-speed, large-capacity medium. In a system using such a medium, to make the search more effective, the search data selection management unit 1043 may store the search data in another storage medium capable of high-speed random access (such as a hard disk). With this method, a moving image file can be searched at high speed, and the user can reproduce only the necessary scenes from the tape.
[0063]
Also, in a digital video editing device or the like, there is an operation of A/D-converting data from an analog tape and storing (downloading) it on digital media (a hard disk or the like) in the device. The search data generation process may be performed at that time. Then, once the download is completed, a high-speed search of the moving image file can be performed immediately.
[0064]
As described above, according to the present embodiment, search images (moving images) are extracted based on the frames at which scene switching occurs and on the audio information of the frames, which makes it possible to search arbitrary AV data (files).
[0065]
As a result, unlike the related art, the user does not need to specify the characteristic portions to be searched and create the search data before performing a search, which reduces the user's workload.
[0066]
[Effects of the Invention]
As described above, according to the present invention, for an important (characteristic) scene, a plurality of frames can be extracted as search data for each scene, so that the search data can be extracted in fine detail.
[0067]
Further, according to the present invention, the amount of search data can be reduced by not extracting search data from unimportant scenes.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating an overall configuration of a moving image search data generation method according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the embodiment of FIG. 1.
FIG. 3 is a flowchart illustrating an operation of a scene feature table creation process in the flowchart of FIG. 2;
FIG. 4 is a flowchart illustrating an operation of a scene switching detection process in the flowchart of FIG. 3;
FIG. 5 is a flowchart showing an operation of a feature scene extraction process in the flowchart of FIG. 2.
FIG. 6 is a flowchart showing an operation of a process of extracting a plurality of characteristic portions in the same scene in the flowchart of FIG. 5.
FIG. 7 is a flowchart illustrating an operation of a search data selection process in the flowchart of FIG. 2;
FIG. 8 is a scene feature table created in the process of generating search data.
FIG. 9 is an extraction table created in the process of generating search data.
FIG. 10 is an explanatory diagram related to the MPEG1 encoding method.
FIG. 11 is a diagram showing a device configuration that can realize the present invention.
FIG. 12 is an explanatory diagram relating to a block matching method.
FIG. 13 is an explanatory diagram relating to a method of detecting a volume of each frame.
FIG. 14 is an explanatory diagram relating to a method of detecting a volume of each frame.
[Explanation of symbols]
1010: Main control unit
1020: Input unit
1030: Data storage management unit
1031: Data storage unit
1032: Data reading unit
1040: Search data generation unit
1041: Scene change detection unit
1042: Volume detector
1043: Search data selection management unit
1050: Display unit
1060: Communication unit

Claims

A moving image search system that receives moving image data representing a moving image composed of a plurality of sequential image frames and sequential audio data associated with each of the frames, and extracts, from the input moving image data, search data for each scene included in the moving image, the system comprising:
scene switching detection means for detecting, for each of the scenes, the frames constituting the scene based on the content of the input moving image data;
volume detection means for detecting the volume of the input audio data for each corresponding frame; and
search data extraction means for extracting, from each scene detected by the scene switching detection means, moving image data including a frame for which a volume equal to or higher than a predetermined value is detected by the volume detection means, as the search data for that scene,
wherein the search data extraction means limits the number of scenes from which the search data is extracted to a number specified as the maximum number of scenes from which the search data is extracted.

A moving image search system that receives moving image data representing a moving image composed of a plurality of sequential image frames and sequential audio data associated with each of the frames, and extracts, from the input moving image data, search data for each scene included in the moving image, the system comprising:
scene switching detection means for detecting, for each of the scenes, the frames constituting the scene based on the content of the input moving image data;
volume detection means for detecting the volume of the input audio data for each corresponding frame; and
search data extraction means for extracting, from each scene detected by the scene switching detection means, moving image data including a frame whose volume change relative to a frame separated from it by a predetermined interval is equal to or more than a predetermined value, based on the volume of each frame detected by the volume detection means, as the search data for that scene,
wherein the search data extraction means limits the number of scenes from which the search data is extracted to a number specified as the maximum number of scenes from which the search data is extracted.

The moving image search system according to claim 1 or 2,
wherein the search data extraction means comprises calculation means for calculating, for each scene detected by the scene switching detection means, an average volume in the scene based on the volume detected by the volume detection means, and
extracts search data only for scenes whose average volume calculated by the calculation means is equal to or higher than a predetermined volume.

The moving image search system according to claim 1,
wherein the search data extraction means comprises detection means for detecting, for each scene detected by the scene switching detection means, a frame having a maximum volume, and
extracts moving image data including the frame detected by the detection means as the search data for that scene.

The moving image search system according to claim 2,
wherein the search data extraction means extracts, for each scene detected by the scene switching detection means, moving image data including a frame having a maximum volume change relative to a frame separated from it by a predetermined interval, as the search data for that scene.

The moving image search system according to claim 1 or 2,
wherein the search data extraction means limits the number of frames extracted from each scene detected by the scene switching detection means to a number specified as the maximum number of frames to be extracted from each scene.

A moving image search system that receives moving image data representing a moving image composed of a plurality of sequential image frames and sequential audio data associated with each of the frames, and extracts, from the input moving image data, search data for each scene included in the moving image, the system comprising:
scene switching detection means for detecting, for each of the scenes, the frames constituting the scene based on the content of the input moving image data;
volume detection means for detecting the volume of the input audio data for each corresponding frame;
first search data extraction means for extracting, from each scene detected by the scene switching detection means, moving image data including a frame for which a volume equal to or higher than a predetermined value is detected by the volume detection means, as the search data for that scene;
second search data extraction means for extracting, from each scene detected by the scene switching detection means, moving image data including a frame whose volume change relative to a frame separated from it by a predetermined interval is equal to or more than a predetermined value, based on the volume of each frame detected by the volume detection means, as the search data for that scene; and
means for switching between the first search data extraction means and the second search data extraction means and enabling one of them in accordance with an external instruction,
wherein the first and second search data extraction means limit the number of scenes from which the search data is extracted to a number specified as the maximum number of scenes from which the search data is extracted.