JP3642019B2

JP3642019B2 - AV content automatic summarization system and AV content automatic summarization method

Info

Publication number: JP3642019B2
Application number: JP2000339805A
Authority: JP
Inventors: 実黒岩
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2000-11-08
Filing date: 2000-11-08
Publication date: 2005-04-27
Anticipated expiration: 2020-11-08
Also published as: JP2002149672A

Description

【０００１】
【発明の属する技術分野】
本発明はＡＶコンテンツ自動要約システム及びＡＶコンテンツ自動要約方法に関し、特にＡＶ（ＡｕｄｉｏＶｉｓｕａｌ）コンテンツの要約を生成する方法に関する。
【０００２】
【従来の技術】
従来、ＡＶコンテンツの自動要約システムとしては、映像フレームの中から複数の代表画像を選択し、それらを順次表示したり、縮小画像の一覧で表示するものがある。
【０００３】
この場合、上記の自動要約システムでは映像フレームから一定周期で取出した映像や、映像の特徴量の変化点を自動検出してその変化点直後の映像を代表画像として選択している。
【０００４】
また、ＡＶコンテンツの自動要約の別の方式として、映像や音声の特徴量の変化点付近の映像と音声とを同時に再生するシステムがある。このシステムについては、特開平１１−８８８０７号公報に開示されている。
【０００５】
【発明が解決しようとする課題】
しかしながら、上述した従来のＡＶコンテンツの自動要約システムでは、映像のみを利用しているため、音声による情報が欠落し、また代表映像が必ずしもＡＶコンテンツの概要を的確に表しているものではないことが多いので、ＡＶコンテンツの概要をうまく把握することが困難であるという問題がある。
【０００６】
上記の公報記載のシステムでは、ＡＶコンテンツに含まれるひとつの話題に、現場の様子や解説者の話、テロップによる説明等の数多くのシーンが含まれているため、それらを音声付きの映像で再生する場合に、音声が自然に聞けるようにひとつのシーン毎の再生時間を数秒以上再生する必要があり、かつそれら多くのシーンの全てが対応する話題の概要を的確に表現するものでない。
【０００７】
また、ＡＶコンテンツの内容を端的に表現する映像と、ＡＶコンテンツの内容を端的に表現する音声とが別のシーンに存在することが多いため、ＡＶコンテンツの一部分を再生する方式で、それらの映像と音声との両方を再生しようとすると必然的に時間が長くなる。したがって、上記の公報記載のシステムには、ＡＶコンテンツの概要をうまく把握するのに、ある程度長いＡＶ要約を生成する必要があるという問題がある。
【０００８】
そこで、本発明の目的は上記の問題点を解消し、より内容を把握しやすいＡＶ要約を生成することができるＡＶコンテンツ自動要約システム及びＡＶコンテンツ自動要約方法を提供することにある。
【０００９】
【課題を解決するための手段】
本発明によるＡＶコンテンツ自動要約システムは、少なくとも映像及び音声を含むＡＶ（ＡｕｄｉｏＶｉｓｕａｌ）コンテンツからそれらの映像及び音声の中から部分的に選択して編集するＡＶコンテンツ自動要約システムであって、前記ＡＶコンテンツの中から前記音声とは独立して前記映像を部分的に取出す手段と、前記ＡＶコンテンツの中から前記映像とは独立して前記音声を部分的に取出す手段と、それら個別に取り出した映像及び音声を合成して出力する手段とを備えている。
【００１０】
本発明による他のＡＶコンテンツ自動要約システムは、少なくとも報道番組でアナウンサが次のニュースの概要を説明するシーンを示す概要説明シーンを検出する検出手段と、前記検出手段で検出された概要説明シーンに続く詳細シーンの要約映像を生成する生成手段と、前記検出手段で検出された概要説明シーンの音声のみを抽出する抽出手段と、前記生成手段で要約映像と前記抽出手段で抽出された概要説明音声とを合成して出力する出力手段とを備えている。
【００１１】
本発明による別のＡＶコンテンツ自動要約システムは、少なくとも報道番組でアナウンサが次のニュースの概要を説明するシーンを示す概要説明シーンを含むコンテンツからＡＶ（ＡｕｄｉｏＶｉｓｕａｌ）要約を生成するＡＶコンテンツ自動要約システムであって、前記コンテンツから前記概要説明シーンを検出しかつその概要説明シーンの開始フレーム番号及び終了フレーム番号の集合を前記概要説明シーンとともに記録する概要説明シーン検出手段と、前記概要説明シーンに続く詳細シーンの要約映像を生成する映像要約手段と、前記概要説明シーンの音声を概要説明音声として切出す音声抽出手段と、前記音声抽出手段が生成した概要説明音声とその概要説明音声に対応する前記映像要約手段が生成した詳細シーンの要約映像との同期をとって前記ＡＶ要約として再生出力するＡＶ要約出力手段とを備えている。
【００１２】
本発明によるＡＶコンテンツ自動要約方法は、少なくとも映像及び音声を含むＡＶ（ＡｕｄｉｏＶｉｓｕａｌ）コンテンツからそれらの映像及び音声の中から部分的に選択して編集するＡＶコンテンツ自動要約方法であって、前記ＡＶコンテンツの中から前記音声とは独立して前記映像を部分的に取出すステップと、前記ＡＶコンテンツの中から前記映像とは独立して前記音声を部分的に取出すステップと、それら個別に取り出した映像及び音声を合成して出力するステップとを備え、これら各ステップをコンピュータが実行している。
【００１３】
本発明による他のＡＶコンテンツ自動要約方法は、少なくとも報道番組でアナウンサが次のニュースの概要を説明するシーンを示す概要説明シーンを検出するステップと、検出された概要説明シーンに続く詳細シーンの要約映像を生成するステップと、検出された概要説明シーンの音声のみを抽出するステップと、前記要約映像と前記概要説明音声とを合成して出力するステップとを備え、これら各ステップをコンピュータが実行している。
【００１４】
本発明による別のＡＶコンテンツ自動要約方法は、少なくとも報道番組でアナウンサが次のニュースの概要を説明するシーンを示す概要説明シーンを含むコンテンツからＡＶ（ＡｕｄｉｏＶｉｓｕａｌ）要約を生成するＡＶコンテンツ自動要約方法であって、前記コンテンツから前記概要説明シーンを検出しかつその概要説明シーンの開始フレーム番号及び終了フレーム番号の集合を前記概要説明シーンとともに記録するステップと、前記概要説明シーンに続く詳細シーンの要約映像を生成するステップと、前記概要説明シーンの音声を概要説明音声として切出すステップと、前記概要説明音声とその概要説明音声に対応する前記詳細シーンの要約映像との同期をとって前記ＡＶ要約として再生出力するステップとを備え、これら各ステップをコンピュータが実行している。
【００１５】
すなわち、本発明のＡＶコンテンツ自動要約方式は、映像と音声とが多重化されたＡＶコンテンツの内容を短時間で把握するためのＡＶ要約を自動生成する方式において、報道番組でアナウンサが次のニュースの概要を説明するシーン等の概要説明シーンを自動検出し、概要説明シーンに続く詳細シーンの要約映像と、概要説明シーンの音声のみを取出した概要説明音声とを合成することで、ＡＶ要約を生成する方式である。
【００１６】
より具体的に、本発明のＡＶコンテンツ自動要約システムは、既存の人物検出、テロップ検出、人声検出、類似画像検出等の技術を利用して概要説明シーンを検出し、概要説明シーンの開始フレーム番号と終了フレーム番号の集合とを記録する概要説明シーン検出手段と、既存の映像要約技術を利用して概要説明シーンに続く詳細シーンの要約映像を生成する映像要約手段と、概要説明シーンの音声を概要説明音声として切り出す音声抽出手段と、音声抽出手段が生成した概要説明音声とその概要説明音声に対応する映像要約手段が生成した詳細シーンの要約映像との同期をとってＡＶ要約として再生もしくは記録媒体に出力するＡＶ要約出力手段とを有している。
【００１７】
上記のような構成とすることで、要約映像と概要説明音声とを個別に生成してから合成するため、ＡＶコンテンツの一部を切り出してＡＶ要約とする方法に比べて、より内容を把握しやすいＡＶ要約の生成を可能にする。また、アナウンサ等が概要を説明する部分の音声をそのまま利用するので、音声認識やテキスト要約を利用する方法に比べて音声が自然で、要約処理時間も少ないという効果がある。
【００１８】
【発明の実施の形態】
次に、本発明の実施例について図面を参照して説明する。図１は本発明の一実施例によるＡＶコンテンツ自動要約システムの構成を示すブロック図である。図１において、本発明の一実施例によるＡＶコンテンツ自動要約システムはＡＶデータ入力手段１と、概要説明シーン検出手段２と、映像要約手段３と、音声抽出手段４と、ＡＶ要約出力手段５とから構成されている。
【００１９】
ＡＶデータ入力手段１は放送電波を受信し、その信号に含まれる映像情報と音声情報とを抽出する。この場合、映像情報は輝度情報と色情報とからなるＹＵＶ［Ｙ（輝度信号）、Ｕ，Ｖ（色差信号成分）］データに変換され、音声情報はＰＣＭ（ＰｕｌｓｅＣｏｄｅＭｏｄｕｌａｔｉｏｎ）データに変換されてメモリ（図示せず）上に記録される。
【００２０】
ＹＵＶデータは映像のフレーム単位で取出すことができる。また、ＰＣＭデータはサンプル単位で取出すことができる。ＡＶデータ入力手段１は市販のＰＣ（パーソナルコンピュータ）用ＴＶチューナボードと付属プログラム、及びＰＣ用のオペレーティングシステムが提供する機能を用いる等によって容易に実現することができる。
【００２１】
概要説明シーン検出手段２はＡＶデータ入力手段１からＹＵＶデータとＰＣＭデータとを受取り、それらのデータを解析することによって、アナウンサが次のニュースの概要を説明するシーン等の概要説明シーンを検出し、概要説明シーンの開始フレーム番号と終了フレーム番号とを概要説明シーンの通し番号に関連付けて記録する。
【００２２】
概要説明シーンの通し番号は、後述する要約映像と概要説明音声との対応付けを行うことが目的であり、ある番組の要約を生成する場合には対象番組先頭からの通し番号を付加すればよく、ある開始時刻からある終了時刻までの要約を生成する場合にはその開始時刻からの通し番号を付加すればよい。
【００２３】
映像要約手段３はＡＶデータ入力手段１からＹＵＶデータを受取り、概要説明シーン検出手段２が記録した概要説明シーンのフレーム区間を参照して、概要説明シーンに続く現場シーンや解説シーン等の詳細シーンの要約映像を生成し、対応する概要説明シーンの通し番号に関連付けてその要約映像を記録する。
【００２４】
ここで、要約映像とは受信したＡＶコンテンツの内容をおおまかに把握可能な元映像よりも短い映像のことである。例えば、元映像から３０秒周期で２秒間の映像を抜き出し、それら２秒間の映像を連結して得られる元の映像の１５分の１の長さの映像は要約映像といえる。
【００２５】
音声抽出手段４はＡＶデータ入力手段１からＰＣＭデータを受取り、概要説明シーン検出手段２が記録した概要説明シーンのフレーム区間のＰＣＭデータを抜き出し、対応する概要説明シーンの通し番号に関連付けて概要説明音声として記録する。
【００２６】
ＡＶ要約出力手段５は映像要約手段３が記録した要約映像と、音声抽出手段４が記録した概要説明音声とを受取り、同じ通し番号が割り当てられている要約映像と概要説明音声とを同期させて合成し、ＡＶ要約としてメモリや磁気記録装置等に出力する。
【００２７】
図２は図１の概要説明シーン検出手段２の詳細な構成を示すブロック図である。図２において、概要説明シーン検出手段２は人物検出手段２１と、テロップ検出手段２２と、人声検出手段２３と、概要説明シーン判定手段２４とから構成されている。
【００２８】
人物検出手段２１はＡＶデータ入力手段１からＹＵＶデータを受取り、映像の各フレーム毎に画像中央部分に人の顔が存在しているかどうかを判断して記録する。
【００２９】
テロップ検出手段２２はＡＶデータ入力手段１からＹＵＶデータを受取り、映像の各フレーム毎に画像下部にテロップ文字が存在しているかどうかを判断して記録する。
【００３０】
人声検出手段２３はＡＶデータ入力手段１からＰＣＭデータを受取り、映像の各フレームに対応する音声データに、人の声が存在しているかどうかを判断して記録する。
【００３１】
概要説明シーン判定手段２４は人物検出手段２１の検出結果と、テロップ検出手段２２の検出結果と、人声検出手段２３の検出結果とを参照して、概要説明シーンのフレーム区間を判定し、その開始フレーム番号と終了フレーム番号とを概要説明シーンの通し番号に関連付けて記録する。
【００３２】
図３は本発明の一実施例によるＡＶコンテンツ自動要約システムの動作を示すフローチャートである。これら図１及び図３を参照して本発明の一実施例によるＡＶコンテンツ自動要約システムの全体の動作について説明する。
【００３３】
概要説明シーン検出手段２はＡＶデータ入力手段１からＹＵＶデータとＰＣＭデータとを受取り、そのデータを解析して概要説明シーンを特定し、概要説明シーンの通し番号を要素番号とし、開始フレーム番号と終了フレーム番号との組を要素とする配列として記録する（図３ステップＳ１）。
【００３４】
映像要約手段３はＡＶデータ入力手段１からＹＵＶデータを受取り、概要説明シーン検出手段２が記録した概要説明シーンのフレーム区間を参照し、概要説明シーンの終了フレーム直後から次の概要説明シーンの開始フレーム直前まで、あるいは次の概要説明シーンが存在しない場合に概要説明シーンの終了フレーム直後から最終フレームまでの詳細シーンに対して、予め定められた周期で、予め定められた時間分のＹＵＶデータを切り出し、それらの周期的な部分映像を連結したものを要約映像として記録する（図３ステップＳ２）。
【００３５】
要約映像の記録方法においては要約映像のＹＵＶデータを記録する必要はなく、各概要説明シーンの通し番号毎に、概要説明シーンに対応する要約映像に含まれるフレーム区間のリストを記録すればよい。
【００３６】
音声抽出手段４はＡＶデータ入力手段１からＰＣＭデータを受取り、概要説明シーン検出手段２が記録した概要説明シーンのフレーム区間に対応するＰＣＭデータを切り出し、概要説明音声として記録する（図３ステップＳ３）。
【００３７】
その際、概要説明シーンの区間は映像のフレーム番号で記録されているので、
Ｐ＝Ｆ÷Ｒｆ×Ｒｐ
の算出式に基づいてＰＣＭデータのサンプル番号に変換する。
【００３８】
また、概要説明音声の記録方法においては、概要説明音声のＰＣＭデータそのものを記録する必要はなく、概要説明シーンの通し番号を要素番号とし、概要説明音声の開始サンプル番号と終了サンプル番号との組を要素とする配列として記録すればよい。
【００３９】
ＡＶ要約出力手段５は概要説明シーンの通し番号毎に、映像要約手段３が記録した詳細シーンの要約映像と、音声抽出手段４が記録した概要説明音声の長さとを合わせて合成し、概要説明シーンの通し番号の順に連結して、ＡＶ要約として記録媒体に出力する（図３ステップＳ４）。
【００４０】
各通し番号毎の合成処理において、要約映像が概要説明音声よりも長い場合には、概要説明音声の後ろに無音信号を付加することで長さを合わせればよい。要約映像が概要説明音声よりも短い場合には、概要説明音声と同じ長さになるまで、要約映像を繰り返せばよい。尚、出力するＡＶ要約の形式はＹＵＶデータとＰＣＭデータとを多重化した形式、ＹＵＶデータをＲＧＢ［Ｒ（赤），Ｇ（緑），Ｂ（青）］データに変換してＰＣＭデータと多重化した形式、ＹＵＶデータ、ＲＧＢデータ、ＰＣＭデータを圧縮して多重化したＭＰＥＧ（ＭｏｖｉｎｇＰｉｃｔｕｒｅＥｘｐｅｒｔｓＧｒｏｕｐ）等の圧縮形式等の様々な形式が利用可能である。
【００４１】
図４は図２に示す概要説明シーン検出手段２の動作を示すフローチャートである。これら図２及び図４を参照して、概要説明シーン検出手段２の動作について説明する。
【００４２】
人物検出手段２１はＡＶデータ入力手段１からＹＵＶデータを受取ると、各フレーム画像を３×３の小画像にほぼ等分に９分割し、それぞれの小画像毎に各ピクセルの輝度値のヒストグラムを生成する。
【００４３】
次に、人物検出手段２１はフレーム中央部の小画像の輝度ヒストグラムの各レベルの値を８倍したヒストグラムと、フレーム周辺部の８個の小画像のヒストグラムの各レベルの値をそれぞれ加算したヒストグラムとの差分値を計算し、その差分値が予め定められた閾値よりも大きい場合に対象フレーム画像の中央部に人の顔が検出されたことを記録する（図４ステップＳ１１）。ここで、ヒストグラムの差分値とは２つのヒストグラムの各レベル毎の値の差分の絶対値を、全てのレベルについて合計した値のことである。
【００４４】
テロップ検出手段２２はＡＶデータ入力手段１からＹＵＶデータを受取ると、各フレーム画像の下３分の１の領域について、予め定められた閾値Ａと閾値Ｂ（Ａ＞Ｂ）とを用いて、輝度値が閾値Ａ以上、もしくは輝度値が閾値Ｂ以下であるピクセルの個数をカウントし、そのピクセル個数が別の閾値Ｃ以上である場合に対象フレーム画像の下部にテロップが検出されたことを記録する（図４ステップＳ１２）。
【００４５】
人声検出手段２３はＡＶデータ入力手段１からＰＣＭデータを受取ると、映像の各フレームに対応する区間毎に、人声に対応する予め定められた周波数帯域の平均パワーを求め、それが予め定められた閾値以上である場合、対応するフレームに人声が検出されたことを記録する（図４ステップＳ１３）。ここで、特定の周波数帯域の信号を抽出するバンドパスフィルタ（図示せず）には既存の音声信号処理手法を適用すればよい。
【００４６】
概要説明シーン判定手段２４は、まず人物、テロップ、人声の全てが検出されているフレームを概要説明シーンの検出フレーム候補として記録する（図４ステップＳ１４）。続いて、概要説明シーン判定手段２４は概要説明シーンの検出フレーム候補に対して、非検出フレームの連続数が予め定められた閾値よりも短い場合に、その非検出フレームを検出フレームへと変更する（図４ステップＳ１５）。これはフラッシュ等によって瞬間的に人物が検出されなかった場合や、人声が息継ぎなどによって瞬間的に検出されなかった場合に、概要説明シーンが分断されないようにするためである。
【００４７】
最後に、概要説明シーン判定手段２４は概要説明シーンの検出フレーム候補に対して、予め定められた時間以下の連続した検出フレームを非検出フレームへと変更し、残った連続する検出フレームを概要説明シーンとして記録する（図４ステップＳ１６）。この処理は概要説明シーンが一般的に数秒間連続するものであるから、それ以下の短い検出フレーム区間は誤検出として排除するためである。
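Steps S14 to S16 described in paragraphs [0046] and [0047] amount to a logical AND of the three detection results followed by two run-length clean-up passes. The following Python sketch illustrates that reading under stated assumptions: the function names, the list-of-booleans representation, and the parameters `max_gap` and `min_run` (both counted in frames) are hypothetical and do not appear in the specification. It also assumes that only gaps lying between two detected runs are filled, since the stated purpose of S15 is to keep one scene from being split.

```python
def runs(flags):
    """Yield (value, start, end) for each maximal run of equal values (end inclusive)."""
    start = 0
    for i in range(1, len(flags) + 1):
        if i == len(flags) or flags[i] != flags[start]:
            yield flags[start], start, i - 1
            start = i


def determine_scenes(person, telop, voice, max_gap, min_run):
    """Return outline explanation scenes as (start_frame, end_frame) pairs."""
    # S14: a frame is a candidate only if person, telop and voice were all detected.
    cand = [p and t and v for p, t, v in zip(person, telop, voice)]

    # S15: turn non-detected runs shorter than max_gap into detected frames.
    # Assumption: gaps touching the first or last frame are left unchanged.
    filled = list(cand)
    for value, s, e in runs(cand):
        interior = s > 0 and e < len(cand) - 1
        if not value and interior and (e - s + 1) < max_gap:
            filled[s:e + 1] = [True] * (e - s + 1)

    # S16: discard detected runs of min_run frames or fewer; the rest are scenes.
    return [(s, e) for value, s, e in runs(filled)
            if value and (e - s + 1) > min_run]
```

The position of each pair in the returned list can serve as the serial number of the scene, which matches the array layout described for step S1.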
【００４８】
図５〜図９は本発明の一実施例によるＡＶコンテンツ自動要約システムの具体的な動作例を示す図である。これら図１と図５〜図９とを参照して本発明の一実施例によるＡＶコンテンツ自動要約システムの具体的な動作について説明する。
【００４９】
要約対象となる放送番組は、図５に示すように、１０分、１０分、５分、５分の長さの四つの個別ニュースから構成される３０分の報道番組であるとし、それぞれの個別ニュースの冒頭の１０秒でアナウンサによる概要説明がなされるとともに、個別ニュースのタイトルがテロップ文字として画面下部に表示されるものとする。
【００５０】
ＡＶデータ入力手段１は受信した信号を、映像を毎秒１０フレームのＹＵＶデータ、音声を毎秒１００００サンプルのＰＣＭデータにそれぞれ変換して記録する。
【００５１】
概要説明シーン検出手段２は、図６に示すように、第０フレームから第９９フレーム、第６０００フレームから第６０９９フレーム、第１２０００フレームから第１２０９９フレーム、第１５０００フレームから第１５０９９フレームの４区間を概要説明シーンのフレーム区間であると判断し、４要素の配列として記録する。
【００５２】
映像要約手段３は概要説明シーンに続く詳細シーンから２分周期で３秒間の映像を切り出して要約映像を生成するものとすると、図７に示すように、最初のニュースに対しては第１００フレームから第１２９フレーム、第１３００フレームから第１３２９フレーム、第２５００フレームから第２５２９フレーム、第３７００フレームから第３７２９フレーム、第４９００フレームから第４９２９フレームが要約映像に使われる区間として記録される。
【００５３】
２番目、３番目、４番目のニュースに対しても、上記と同様にして、要約映像に使われる区間が記録される。つまり、２番目のニュースに対しては第６１００フレームから第６１２９フレーム、第７３００フレームから第７３２９フレーム、第８５００フレームから第８５２９フレーム、第９７００フレームから第９７２９フレーム、第１０９００フレームから第１０９２９フレームが要約映像に使われる区間として記録される。
【００５４】
３番目のニュースに対しては第１２１００フレームから第１２１２９フレーム、第１３３００フレームから第１３３２９フレーム、第１４５００フレームから第１４５２９フレームが要約映像に使われる区間として記録される。
【００５５】
４番目のニュースに対しては第１５１００フレームから第１５１２９フレーム、第１６３００フレームから第１６３２９フレーム、第１７５００フレームから第１７５２９フレームが要約映像に使われる区間として記録される。
【００５６】
音声抽出手段４は概要説明シーン検出手段２が記録した概要説明シーンのフレーム区間に相当するＰＣＭデータのサンプル番号を、上述した式、
Ｐ＝Ｆ÷Ｒｆ×Ｒｐ
の式から算出する。
【００５７】
この場合、Ｒｆ＝１０、Ｒｐ＝１００００なので、概要説明音声のサンプル区間は、図８に示すように、第０サンプルから第９９９９９サンプル、第６００００００サンプルから第６０９９９９９サンプル、第１２００００００サンプルから第１２０９９９９９サンプル、第１５００００００サンプルから１５０９９９９９サンプルの４区間となり、それらが配列として記録される。
【００５８】
ＡＶ要約出力手段５は四つの個別ニュース毎に、映像要約手段３が生成した映像要約と音声抽出手段４が生成した概要説明音声とをその長さを合わせて合成し、それを通し番号順に連結する。図９に示すように、最初のニュースと２番目のニュースとでは要約映像が１５秒なのに対して概要説明音声が１０秒であるから、概要説明音声の終了後に５秒間の無音データを付加してから合成する。
【００５９】
それに対して３番目のニュースと４番目のニュースとでは、要約映像が９秒なのに対して概要説明音声が１０秒であるから、９秒の要約映像の後に再び先頭から１秒後までの映像を付加してから合成する。それらを通し番号順に連結すると、最終的に５０秒のＡＶ要約が生成される。
【００６０】
このように、要約映像と概要説明音声とを別々に生成した後にそれらを合成することによって、映像と音声とのそれぞれがニュース概要を把握するのに適した内容になっているので、視聴者がＡＶ要約を視聴した時によりニュースの概要を把握することが容易となる。
【００６１】
また、高速なＣＰＵ（中央処理装置）や大量のメモリを必要とする音声認識処理や自然言語理解等の高度な技術を使用せずに概要説明音声を生成することによって、概要説明音声の抽出処理の実現コストが小さくかつ高速なので、メモリ容量が小さいＰＣ（パーソナルコンピュータ）やＣＰＵ性能が高くないＰＣでも実現することができる。
【００６２】
さらに、概要説明音声としてアナウンサが実際に喋っている言葉をそのまま利用することによって、概要説明音声を自然で理解しやすい音声にすることができる。
【００６３】
図１０は本発明の他の実施例による概要説明シーン検出手段の詳細な構成を示すブロック図である。図１０において、概要説明シーン検出手段６は類似画像検索手段６１と、概要説明シーンデータベース（ＤＢ）６２と、概要説明シーン判定手段６３とから構成されている。
【００６４】
概要説明シーンデータベース６２は放送番組で用いられる概要説明シーンの映像のフレームサンプルを複数記録しており、サンプル毎にＹＵＶデータとして取出すことができる。
【００６５】
類似画像検索手段６１は複数のＡＶコンテンツ入力手段１から渡されるＹＵＶデータと、概要説明シーンデータベース６２が記録している概要説明シーンのサンプルとを比較し、概要説明シーンデータベース６２が記録する概要説明シーンのサンプルのどれかと類似性が高い場合に、そのフレームを概要説明シーンの候補として記録する。
【００６６】
上記の類似画像検索手段６１における類似画像検索手法としては、公知の様々な方法を適用することができる。例えば、フレームを構成するピクセル毎の色情報の差分をとり、その総和が閾値を超えるかどうかで判断する方法がある。また、フレームの輝度データ、色データ、それらを周波数変換した後の周波数成分等から生成されかつ元映像データよりサイズの小さい検索キー同士を比較する方法もあり、その場合にはデータベースの容量と処理時間とを短縮することができる。
【００６７】
概要説明シーン判定手段６３は、図４に示す本発明の一実施例の動作と比べて、概要説明シーンの候補フレームを類似画像検索手段６１によって検出することが異なる。候補フレームを検出した後、短い非検出区間を検出区間へ変更し（図４ステップＳ１５）、短い検出区間を非検出区間に変更して概要説明シーンを決定する（図４ステップＳ１６）。
【００６８】
本実施例は要約対象となるＡＶコンテンツにおける概要説明シーンがある程度固定されており、かつ概要説明シーンのサンプルが予め入手可能な場合に、より高い精度で概要説明シーンを検出することができる。よって、最終的に出力されるＡＶ要約も、より内容を把握しやすいものになる。
【００６９】
例えば、報道番組におけるアナウンサによる概要説明シーンの構図は、数ヶ月以上にわたって固定である場合が多いため、本実施例によって高精度のＡＶ要約を生成することができる。
【００７０】
尚、上述した実施例では、ＡＶコンテンツ入力手段１として放送を受信する例について述べたが、放送以外の記録メディアに蓄積されたＡＶコンテンツ、あるいはインタネット等を介して送られてくるＡＶコンテンツでも、上記の実施例と同様に、ＡＶ要約を生成することができる。
【００７１】
また、ＡＶコンテンツ入力手段１が記録するフォーマットとしてＹＵＶデータとＰＣＭデータとを例示したが、もちろん、他の様々なフォーマットでも、上記の実施例と同様に、ＡＶ要約を生成することができる。
【００７２】
一方、上述した実施例では概要説明シーン検出手段２，６として、人物検出とテロップ検出と人声検出とを組合わせる方法と、類似画像検索による方法とを例示したが、その他の方法を用いてもかまわない。例えば、放送電波に現在のシーンを特定する信号が重畳されており、概要説明シーンであることをその信号から判定することができる場合にはその信号を利用すればよい。
【００７３】
また、人物検出、テロップ検出、人声検出、類似画像検索の各手法の任意の組合わせでも実現することができる。さらに、話者識別技術によって概要説明を行う話者を検出する方法、「次のニュースです」等の話題区切りを音声認識によって認識し、それに続くシーンを概要説明シーンだと判断する方法等が考えられる。
【００７４】
上述した実施例では、人物検出手段２１として、画面中央部及び周辺部の輝度ヒストグラムを比較する方法を例示しているが、もちろん、その他の人物検出手法を適用することができる。例えば、その方法としては画面中央の９等分割画像に限らないことはもちろん、色情報の分布を調べる方法、目、鼻、口といった顔を構成する要素候補を検出してその位置関係及びその時間方向での動き量から人の顔を検出する方法等が考えられる。
【００７５】
また、テロップ検出手段２２として、輝度の高いピクセルと低いピクセルとの数をカウントする方法を例示しているが、もちろん、その他のテロップ検出手法を適用することができる。例えば、その方法としてはエッジの個数で判断する方法、エッジ点での輝度変化量が連続するエッジで対称になっているかどうかで判断する方法、エッジ分布密度が高い領域の形状で判断する方法等が考えられる。
【００７６】
さらに、人声検出手段２３として、バンドパスフィルタで特定周波数領域を取出す方法を例示しているが、もちろん、その他の人声検出方法を用いても構わない。例えば、その方法としては人声の各種特徴量の時間方向の変化パターンが予め登録しておいたパターンと類似しているかどうかで判断する方法、周波数スペクトルの分布形状が予め登録しておいたパターンと類似しているかどうかで判断する方法等が考えられる。
【００７７】
また、概要説明シーン判定手段２４で、概要説明シーン間の時間条件を設けて概要説明シーン間が閾値よりも短い場合には、どちらかの候補をキャンセルする方法や、番組中に比較的均等に分布するように選択する方法も考えられる。
【００７８】
上述した実施例では、映像要約手段３が概要説明シーンの後に続く映像を要約する例を示しているが、概要説明シーンのテロップ文字を映像として表示することはひとつの有効な要約手段であり、もちろん要約映像に概要説明シーンが含まれても構わない。
【００７９】
また、映像要約手段３として、一定周期毎に一定時間の映像を抜き出す方法を例示しているが、その他の映像要約手法を適用することができることはいうまでもない。例えば、その方法としては一定周期毎にフレームを抜き出してそのフレームを静止画として一定時間表示する方法、抜き出すフレーム周期や表示時間を内容に応じて変化させる方法、抜き出したフレームを縮小画像の一覧で表示する方法、映像の特徴量の変化点をシーンチェンジとして検出してその直後の映像を抜き出す方法、映像の時間方向での変化量に応じて映像の重要度を計算して重要度の高い映像を抜き出す方法等が考えられる。
【００８０】
要約ＡＶ出力手段５としては要約映像と概要説明音声とを多重化して記録媒体に記録する方法を例示しているが、その他にも、要約映像をディスプレイ上に表示すると同時に概要説明音声をスピーカ等の音声出力装置から再生する方法、要約映像と概要説明音声とを多重化して伝送路上に送信する方法等もある。
【００８１】
上述した実施例の動作では、概要説明シーン検出手段２、映像要約手段３、音声抽出手段４、ＡＶ要約出力手段５が逐次的に動作する場合を例示しているが、それらの手段の全てが、あるいは一部が平行して動作する場合も当然含まれる。
【００８２】
【発明の効果】
以上説明したように本発明によれば、少なくとも映像及び音声を含むＡＶコンテンツからそれらの映像及び音声の中の代表的な部分を選択して表示するＡＶコンテンツ自動要約システムにおいて、ＡＶコンテンツの中から代表的な部分の映像及び音声を別々に取出し、それらの映像及び音声を合成して出力することによって、より内容を把握しやすいＡＶ要約を生成することができるという効果がある。
【図面の簡単な説明】
【図１】本発明の一実施例によるＡＶコンテンツ自動要約システムの構成を示すブロック図である。
【図２】図１の概要説明シーン検出手段の詳細な構成を示すブロック図である。
【図３】本発明の一実施例によるＡＶコンテンツ自動要約システムの動作を示すフローチャートである。
【図４】図２に示す概要説明シーン検出手段の動作を示すフローチャートである。
【図５】本発明の一実施例によるＡＶコンテンツ自動要約システムの具体的な動作例を示す図である。
【図６】本発明の一実施例によるＡＶコンテンツ自動要約システムの具体的な動作例を示す図である。
【図７】本発明の一実施例によるＡＶコンテンツ自動要約システムの具体的な動作例を示す図である。
【図８】本発明の一実施例によるＡＶコンテンツ自動要約システムの具体的な動作例を示す図である。
【図９】本発明の一実施例によるＡＶコンテンツ自動要約システムの具体的な動作例を示す図である。
【図１０】本発明の他の実施例による概要説明シーン検出手段の詳細な構成を示すブロック図である。
【符号の説明】
１ＡＶデータ入力手段
２，６概要説明シーン検出手段
３映像要約手段
４音声抽出手段
５ＡＶ要約出力手段
２１人物検出手段
２２テロップ検出手段
２３人声検出手段
２４，６３概要説明シーン判定手段
６１類似画像検索手段
６２概要説明シーンデータベース
[0001]
[Technical field of the invention]
The present invention relates to an AV content automatic summarization system and an AV content automatic summarization method, and in particular to a method for generating a summary of AV (Audio Visual) content.
[0002]
[Prior art]
Conventional automatic summarization systems for AV content include ones that select a plurality of representative images from the video frames and display them one after another or as a list of reduced-size images.
[0003]
In this case, the automatic summarization system described above selects as representative images either images taken from the video frames at a fixed period, or the images immediately following change points that it detects automatically in the feature values of the video.
[0004]
As another method for automatic summarization of AV content, there is a system that simultaneously reproduces video and audio near the change point of the feature amount of video and audio. This system is disclosed in Japanese Patent Application Laid-Open No. 11-88807.
[0005]
[Problems to be solved by the invention]
However, because the conventional automatic summarization systems for AV content described above use only video, the information carried by the audio is lost, and the representative images often do not accurately convey the outline of the AV content. It is therefore difficult to grasp the outline of the AV content well.
[0006]
In the system described in the above publication, a single topic in the AV content contains many scenes, such as footage from the location, a commentator's remarks, and explanations by telop (on-screen captions). When these are played back as video with audio, each scene must be played for several seconds or more so that the audio sounds natural, and not all of those many scenes accurately express the outline of the corresponding topic.
[0007]
In addition, the video that most directly expresses the content of the AV content and the audio that most directly expresses it are often found in different scenes, so a method that plays back portions of the AV content inevitably takes longer if it tries to play both that video and that audio. The system described in the above publication therefore has the problem that a fairly long AV summary must be generated before the outline of the AV content can be grasped well.
[0008]
Accordingly, an object of the present invention is to solve the above problems and to provide an AV content automatic summarization system and an AV content automatic summarization method capable of generating an AV summary whose content is easier to grasp.
[0009]
[Means for Solving the Problems]
The AV content automatic summarization system according to the present invention is an AV content automatic summarization system that partially selects and edits video and audio from AV (Audio Visual) content including at least the video and the audio, and comprises: means for partially extracting the video from the AV content independently of the audio; means for partially extracting the audio from the AV content independently of the video; and means for combining and outputting the individually extracted video and audio.
[0010]
Another AV content automatic summarization system according to the present invention comprises: detection means for detecting an outline explanation scene, that is, at least a scene in a news program in which an announcer explains the outline of the next news item; generation means for generating a summary video of the detailed scene following the outline explanation scene detected by the detection means; extraction means for extracting only the audio of the outline explanation scene detected by the detection means; and output means for combining and outputting the summary video generated by the generation means and the outline explanation audio extracted by the extraction means.
[0011]
Yet another AV content automatic summarization system according to the present invention generates an AV (Audio Visual) summary from content that includes an outline explanation scene, that is, at least a scene in a news program in which an announcer explains the outline of the next news item, and comprises: outline explanation scene detection means for detecting the outline explanation scene in the content and recording a set of start frame numbers and end frame numbers of the outline explanation scene together with the outline explanation scene; video summarization means for generating a summary video of the detailed scene following the outline explanation scene; audio extraction means for cutting out the audio of the outline explanation scene as outline explanation audio; and AV summary output means for synchronizing the outline explanation audio generated by the audio extraction means with the corresponding summary video of the detailed scene generated by the video summarization means, and playing back and outputting them as the AV summary.
[0012]
The AV content automatic summarization method according to the present invention partially selects and edits video and audio from AV (Audio Visual) content including at least the video and the audio, and comprises the steps of: partially extracting the video from the AV content independently of the audio; partially extracting the audio from the AV content independently of the video; and combining and outputting the individually extracted video and audio, each of these steps being executed by a computer.
[0013]
Another AV content automatic summarization method according to the present invention comprises the steps of: detecting an outline explanation scene, that is, at least a scene in a news program in which an announcer explains the outline of the next news item; generating a summary video of the detailed scene following the detected outline explanation scene; extracting only the audio of the detected outline explanation scene; and combining and outputting the summary video and the outline explanation audio, each of these steps being executed by a computer.
[0014]
Yet another AV content automatic summarization method according to the present invention generates an AV (Audio Visual) summary from content that includes an outline explanation scene, that is, at least a scene in a news program in which an announcer explains the outline of the next news item, and comprises the steps of: detecting the outline explanation scene in the content and recording a set of start frame numbers and end frame numbers of the outline explanation scene together with the outline explanation scene; generating a summary video of the detailed scene following the outline explanation scene; cutting out the audio of the outline explanation scene as outline explanation audio; and synchronizing the outline explanation audio with the corresponding summary video of the detailed scene and playing back and outputting them as the AV summary, each of these steps being executed by a computer.
[0015]
That is, the AV content automatic summarization scheme of the present invention automatically generates an AV summary for grasping, in a short time, the content of AV content in which video and audio are multiplexed. It automatically detects outline explanation scenes, such as a scene in a news program in which an announcer explains the outline of the next news item, and generates the AV summary by combining a summary video of the detailed scene that follows the outline explanation scene with outline explanation audio obtained by extracting only the audio of the outline explanation scene.
[0016]
More specifically, the AV content automatic summarization system of the present invention has: outline explanation scene detection means that detects outline explanation scenes using existing techniques such as person detection, telop detection, human voice detection, and similar image detection, and records a set of start frame numbers and end frame numbers of the outline explanation scenes; video summarization means that generates a summary video of the detailed scene following each outline explanation scene using existing video summarization techniques; audio extraction means that cuts out the audio of each outline explanation scene as outline explanation audio; and AV summary output means that synchronizes the outline explanation audio generated by the audio extraction means with the corresponding summary video of the detailed scene generated by the video summarization means, and either plays them back as an AV summary or outputs them to a recording medium.
[0017]
With the configuration described above, the summary video and the outline explanation audio are generated separately and then combined, which makes it possible to generate an AV summary whose content is easier to grasp than with a method that cuts out part of the AV content and uses it as the AV summary. In addition, because the audio of the portion in which an announcer or the like explains the outline is used as is, the audio is more natural and the summarization takes less processing time than with methods that use speech recognition or text summarization.
[0018]
[Embodiments of the invention]
Next, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing the configuration of an AV content automatic summarization system according to an embodiment of the present invention. In FIG. 1, the AV content automatic summarization system according to this embodiment comprises AV data input means 1, outline explanation scene detection means 2, video summarization means 3, audio extraction means 4, and AV summary output means 5.
[0019]
The AV data input means 1 receives broadcast radio waves and extracts the video information and audio information contained in the signal. The video information is converted into YUV data [Y (luminance signal); U, V (color-difference signal components)] consisting of luminance information and color information, the audio information is converted into PCM (Pulse Code Modulation) data, and both are recorded in a memory (not shown).
[0020]
The YUV data can be retrieved in units of video frames, and the PCM data can be retrieved in units of samples. The AV data input means 1 can be realized easily, for example by using a commercially available TV tuner board for a PC (personal computer) together with its bundled program and the functions provided by the PC's operating system.
[0021]
The outline explanation scene detection means 2 receives the YUV data and PCM data from the AV data input means 1, analyzes them to detect outline explanation scenes, such as a scene in which an announcer explains the outline of the next news item, and records the start frame number and end frame number of each outline explanation scene in association with the serial number of that scene.
[0022]
The purpose of the serial number of an outline explanation scene is to associate the summary video with the outline explanation audio, as described later. When a summary of a particular program is generated, serial numbers counted from the beginning of that program may be assigned; when a summary covering the span from a given start time to a given end time is generated, serial numbers counted from that start time may be assigned.
[0023]
The video summarization means 3 receives the YUV data from the AV data input means 1, refers to the frame sections of the outline explanation scenes recorded by the outline explanation scene detection means 2, generates a summary video of the detailed scene (such as footage from the location or a commentary scene) that follows each outline explanation scene, and records that summary video in association with the serial number of the corresponding outline explanation scene.
[0024]
Here, a summary video is a video that is shorter than the original and from which the content of the received AV content can be roughly grasped. For example, extracting 2 seconds of video every 30 seconds from the original video and concatenating those 2-second clips yields a video one fifteenth the length of the original, which can be called a summary video.
[0025]
The audio extraction means 4 receives the PCM data from the AV data input means 1, extracts the PCM data for the frame sections of the outline explanation scenes recorded by the outline explanation scene detection means 2, and records it as outline explanation audio in association with the serial number of the corresponding outline explanation scene.
[0026]
The AV summary output means 5 receives the summary videos recorded by the video summarization means 3 and the outline explanation audio recorded by the audio extraction means 4, synchronizes and combines the summary video and the outline explanation audio that have been assigned the same serial number, and outputs the result as an AV summary to a memory, a magnetic recording device, or the like.
[0027]
FIG. 2 is a block diagram showing a detailed configuration of the outline explanation scene detection means 2 of FIG. 1. In FIG. 2, the outline explanation scene detection means 2 comprises person detection means 21, telop detection means 22, human voice detection means 23, and outline explanation scene determination means 24.
[0028]
The person detection means 21 receives the YUV data from the AV data input means 1, determines for each frame of the video whether a human face is present in the central part of the image, and records the result.
[0029]
The telop detection means 22 receives the YUV data from the AV data input means 1 and determines whether or not a telop character exists at the bottom of the image for each frame of the video and records it.
[0030]
The human voice detection means 23 receives the PCM data from the AV data input means 1, and determines whether or not a human voice exists in the audio data corresponding to each frame of the video and records it.
[0031]
The outline explanation scene determination means 24 refers to the detection results of the person detection means 21, the telop detection means 22, and the human voice detection means 23 to determine the frame sections of the outline explanation scenes, and records the start frame number and end frame number of each section in association with the serial number of the outline explanation scene.
[0032]
FIG. 3 is a flowchart showing the operation of the AV content automatic summarization system according to an embodiment of the present invention. The overall operation of the AV content automatic summarization system according to this embodiment will be described with reference to FIGS. 1 and 3.
[0033]
The outline explanation scene detection means 2 receives the YUV data and PCM data from the AV data input means 1, analyzes the data to identify the outline explanation scenes, and records them as an array whose element index is the serial number of the outline explanation scene and whose elements are pairs of a start frame number and an end frame number (step S1 in FIG. 3).
[0034]
The video summarization means 3 receives the YUV data from the AV data input means 1 and refers to the frame sections of the outline explanation scenes recorded by the outline explanation scene detection means 2. For the detailed scene that runs from immediately after the end frame of an outline explanation scene to immediately before the start frame of the next outline explanation scene (or, if there is no next outline explanation scene, from immediately after the end frame to the last frame), it cuts out YUV data of a predetermined duration at a predetermined period, and records the concatenation of those periodic partial videos as the summary video (step S2 in FIG. 3).
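Step S2 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, frame numbers are treated as inclusive integers, a clip that would run past the end of its detailed scene is truncated (an assumption the text does not address), and only frame sections are produced, not YUV data.

```python
def detail_ranges(outline_scenes, last_frame):
    """Detailed scene for each outline scene: from just after its end frame to just
    before the next outline scene starts, or to the last frame if there is none."""
    out = []
    for i, (_, end) in enumerate(outline_scenes):
        if i + 1 < len(outline_scenes):
            stop = outline_scenes[i + 1][0] - 1
        else:
            stop = last_frame
        out.append((end + 1, stop))
    return out


def periodic_clips(first, last, period, length):
    """Sections of `length` frames taken every `period` frames within [first, last]."""
    return [(s, min(s + length - 1, last)) for s in range(first, last + 1, period)]


def summarize(outline_scenes, last_frame, period, length):
    """One list of frame sections per serial number (the list index)."""
    return [periodic_clips(a, b, period, length)
            for a, b in detail_ranges(outline_scenes, last_frame)]
```

With the values of the worked example in paragraphs [0049] to [0055] (10 frames per second, a 2-minute period of 1200 frames, 3-second clips of 30 frames, and a 30-minute program whose last frame is therefore 17999), `summarize([(0, 99), (6000, 6099), (12000, 12099), (15000, 15099)], 17999, 1200, 30)` gives, for the first news item, the sections 100 to 129, 1300 to 1329, 2500 to 2529, 3700 to 3729, and 4900 to 4929.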
[0035]
When recording the summary video, it is not necessary to record the YUV data of the summary video itself; it suffices to record, for each serial number of an outline explanation scene, a list of the frame sections included in the corresponding summary video.
[0036]
The audio extraction means 4 receives the PCM data from the AV data input means 1, cuts out the PCM data corresponding to the frame sections of the outline explanation scenes recorded by the outline explanation scene detection means 2, and records it as outline explanation audio (step S3 in FIG. 3).
[0037]
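At that time, because the sections of the outline explanation scenes are recorded as video frame numbers, they are converted into PCM data sample numbers based on the formula
P = F ÷ Rf × Rp
where, judging from the values used in paragraph [0057], P is the PCM sample number, F is the frame number, Rf is the number of video frames per second, and Rp is the number of PCM samples per second.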
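The conversion formula P = F ÷ Rf × Rp, restated in paragraph [0056], can be sketched in Python as below. The function name is hypothetical, and the handling of the end of a section is an assumption: the last frame is taken to be included in full, which reproduces the sample sections listed in paragraph [0057].

```python
def frame_section_to_samples(start_frame, end_frame, rf, rp):
    """Map an inclusive frame section to an inclusive PCM sample section.

    rf: video frames per second, rp: PCM samples per second.
    Multiplying before dividing keeps the arithmetic in integers.
    """
    first = start_frame * rp // rf
    # Assumption: the end frame is covered in full, so the section stops one
    # sample before the first sample of the following frame.
    last = (end_frame + 1) * rp // rf - 1
    return first, last
```

With Rf = 10 and Rp = 10000, `frame_section_to_samples(6000, 6099, 10, 10000)` returns `(6000000, 6099999)`, the second section given in the worked example.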
[0038]
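Likewise, when recording the outline explanation audio, it is not necessary to record the PCM data of the outline explanation audio itself; it suffices to record an array whose element index is the serial number of the outline explanation scene and whose elements are pairs of the start sample number and end sample number of the outline explanation audio.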
[0039]
For each serial number of an outline explanation scene, the AV summary output means 5 matches the lengths of the summary video of the detailed scene recorded by the video summarization means 3 and the outline explanation audio recorded by the audio extraction means 4 and combines them, then concatenates the results in serial-number order and outputs them to a recording medium as the AV summary (step S4 in FIG. 3).
[0040]
In the combining process for each serial number, if the summary video is longer than the outline explanation audio, the lengths can be matched by appending a silent signal after the outline explanation audio. If the summary video is shorter than the outline explanation audio, the summary video can be repeated until it is as long as the outline explanation audio. Various formats can be used for the output AV summary, such as a format in which YUV data and PCM data are multiplexed, a format in which the YUV data is converted into RGB [R (red), G (green), B (blue)] data and multiplexed with the PCM data, or a compressed format such as MPEG (Moving Picture Experts Group) in which the YUV data, RGB data, and PCM data are compressed and multiplexed.
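The length-matching rule for one serial number can be sketched in Python as follows. The function name and the list-based representation are hypothetical, silence is represented by zero-valued samples, and the final padding step (for audio that does not end on a frame boundary) is an assumption beyond what the text states.

```python
def match_lengths(video_frames, audio_samples, rf, rp):
    """Return (frames, samples) of equal duration for one serial number.

    video_frames: frame numbers of the summary video, in playback order.
    audio_samples: PCM sample values of the outline explanation audio.
    rf: video frames per second, rp: PCM samples per second.
    """
    if not video_frames:
        raise ValueError("summary video is empty")
    frames = list(video_frames)
    # Audio duration expressed in frames, rounded up (ceiling division).
    needed = -(-len(audio_samples) * rf // rp)
    # Video shorter than audio: repeat the video from its first frame.
    i = 0
    while len(frames) < needed:
        frames.append(video_frames[i % len(video_frames)])
        i += 1
    # Video longer than audio: append silence after the audio.
    pad = len(frames) * rp // rf - len(audio_samples)
    return frames, list(audio_samples) + [0] * pad
```

In the worked example of paragraphs [0058] and [0059], a 15-second summary video (150 frames) paired with 10 seconds of audio (100000 samples) receives 50000 samples of silence, while a 9-second summary video (90 frames) is extended by its own first 10 frames. Concatenating the four results gives 15 + 15 + 10 + 10 = 50 seconds.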
[0041]
FIG. 4 is a flowchart showing the operation of the outline explanation scene detection means 2 shown in FIG. 2. The operation of the outline explanation scene detection means 2 will be described with reference to FIGS. 2 and 4.
[0042]
When the person detection means 21 receives the YUV data from the AV data input means 1, it divides each frame image into nine equal parts (3 × 3) and generates a histogram of the pixel luminance values for each small image.
[0043]
Next, the person detection means 21 compares the histogram obtained by multiplying the value of each level of the luminance histogram of the small image at the center of the frame by 8 with the histogram obtained by adding up, level by level, the values of the histograms of the eight small images at the periphery of the frame. When the difference value is larger than a predetermined threshold value, it records that a human face has been detected at the center of the target frame image (step S11 in FIG. 4). Here, the difference value of the histograms is the sum, over all levels, of the absolute values of the differences between the values of the two histograms at each level.
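The comparison in step S11 can be sketched as follows. This is a minimal illustration, assuming a frame is given as a list of rows of integer luminance values with 256 levels; the function names and the threshold are hypothetical and do not come from the specification.

```python
def luminance_histogram(pixels, levels=256):
    """Count how many pixels fall on each luminance level."""
    hist = [0] * levels
    for y in pixels:
        hist[y] += 1
    return hist


def split_3x3(frame):
    """Split a frame (list of rows) into nine blocks, returned in row-major order."""
    h, w = len(frame), len(frame[0])
    blocks = []
    for by in range(3):
        rows = frame[by * h // 3:(by + 1) * h // 3]
        for bx in range(3):
            blocks.append([v for row in rows
                           for v in row[bx * w // 3:(bx + 1) * w // 3]])
    return blocks


def face_at_center(frame, threshold, levels=256):
    """Return True when the center block differs strongly from the periphery."""
    hists = [luminance_histogram(b, levels) for b in split_3x3(frame)]
    # Center histogram scaled by 8 so it is comparable with eight peripheral blocks.
    center = [8 * v for v in hists[4]]
    # Level-by-level sum of the eight peripheral histograms.
    periphery = [sum(hists[i][lv] for i in range(9) if i != 4)
                 for lv in range(levels)]
    # Difference value: sum over all levels of the absolute differences.
    diff = sum(abs(c - p) for c, p in zip(center, periphery))
    return diff > threshold
```

Scaling the center histogram by 8 keeps the two histograms on the same pixel count, so a uniform frame yields a difference value of zero.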
[0044]
When the telop detection means 22 receives the YUV data from the AV data input means 1, it counts, within the lower third of each frame image, the number of pixels whose luminance value is greater than or equal to a predetermined threshold A or less than or equal to a predetermined threshold B (A > B). When that count is greater than or equal to another threshold C, it records that a telop has been detected at the bottom of the target frame image (step S12 in FIG. 4).
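Step S12 amounts to a pixel count over the lower third of the frame. The sketch below assumes the same frame representation as a list of rows of luminance values; the function name and the concrete threshold values are hypothetical.

```python
def telop_at_bottom(frame, thr_a, thr_b, thr_c):
    """Return True when enough very bright or very dark pixels lie in the lower third.

    thr_a and thr_b are the luminance thresholds A and B (A > B);
    thr_c is the pixel-count threshold C.
    """
    h = len(frame)
    lower_third = frame[2 * h // 3:]
    count = sum(1 for row in lower_third for y in row
                if y >= thr_a or y <= thr_b)
    return count >= thr_c
```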
[0045]
When the human voice detection means 23 receives the PCM data from the AV data input means 1, it obtains, for each section corresponding to a video frame, the average power in a predetermined frequency band corresponding to the human voice. If that power is greater than or equal to a predetermined threshold value, it records that a human voice has been detected in the corresponding frame (step S13 in FIG. 4). Here, an existing audio signal processing technique may be used for the bandpass filter (not shown) that extracts the signal in the specific frequency band.
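As a rough illustration of step S13, the sketch below measures band-limited power with a naive discrete Fourier transform instead of the bandpass filter named in the text; this substitution, the normalization, the function names, and the band limits used in the test are all assumptions made for the example.

```python
import cmath
import math


def band_power(samples, sample_rate, f_lo, f_hi):
    """Sum of |X[k]|^2 / N^2 over the DFT bins between f_lo and f_hi (in Hz).

    f_hi is assumed to be at most half the sampling rate.
    """
    n = len(samples)
    k_lo = math.ceil(f_lo * n / sample_rate)
    k_hi = math.floor(f_hi * n / sample_rate)
    total = 0.0
    for k in range(k_lo, k_hi + 1):
        x_k = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        total += abs(x_k) ** 2
    return total / (n * n)


def voice_in_frame(samples, sample_rate, f_lo, f_hi, threshold):
    """Return True when the band power of one frame's samples reaches the threshold."""
    return band_power(samples, sample_rate, f_lo, f_hi) >= threshold
```

A practical implementation would more likely use a filter or an FFT library; the direct sum is used here only to keep the example free of dependencies.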
[0046]
First, the outline explanation scene determination means 24 records frames in which a person, a telop, and a human voice are all detected as detection frame candidates of the outline explanation scene (step S14 in FIG. 4). Next, for these candidates, when the number of consecutive non-detection frames is smaller than a predetermined threshold, the outline explanation scene determination means 24 changes those non-detection frames to detection frames (step S15 in FIG. 4). This prevents an outline explanation scene from being split when the person momentarily goes undetected because of a camera flash or the like, or when the human voice momentarily goes undetected because of a breath or the like.
[0047]
Finally, for the detection frame candidates of the outline explanation scene, the outline explanation scene determination means 24 changes runs of consecutive detection frames shorter than a predetermined time to non-detection frames, and records the remaining detection frames as outline explanation scenes (step S16 in FIG. 4). Because an outline explanation scene generally continues for several seconds or more, this process eliminates detection frame sections shorter than that.
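Steps S14 to S16 can be sketched as run-length processing over per-frame flags. The code below is an illustration under two assumptions not stated in the text: only gaps that lie between two detection runs are filled, and the "predetermined time" is expressed as a frame count. All names are hypothetical.

```python
from itertools import groupby


def run_lengths(flags):
    """Return (value, start, length) for each maximal run of equal flags."""
    out, pos = [], 0
    for value, group in groupby(flags):
        length = len(list(group))
        out.append((value, pos, length))
        pos += length
    return out


def determine_scenes(person, telop, voice, gap_threshold, min_length):
    """Return a list of (start_frame, end_frame) outline explanation scenes."""
    # S14: a frame is a candidate only when all three detectors fired.
    flags = [bool(p and t and v) for p, t, v in zip(person, telop, voice)]
    # S15: fill short non-detection gaps that lie between two detection runs.
    runs = run_lengths(flags)
    for idx, (value, start, length) in enumerate(runs):
        interior = 0 < idx < len(runs) - 1
        if not value and interior and length < gap_threshold:
            flags[start:start + length] = [True] * length
    # S16: keep only detection runs that are long enough.
    return [(start, start + length - 1)
            for value, start, length in run_lengths(flags)
            if value and length >= min_length]
```

Returning start and end frame numbers matches the way the sections are recorded as an array in the specific operation example.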
[0048]
FIGS. 5 to 9 are diagrams showing specific operation examples of the AV content automatic summarizing system according to the embodiment of the present invention. A specific operation of the system will be described with reference to FIG. 1 and FIGS. 5 to 9.
[0049]
As shown in FIG. 5, the broadcast program to be summarized is a 30-minute news program composed of four individual news items with lengths of 10 minutes, 10 minutes, 5 minutes, and 5 minutes, respectively. It is assumed that the announcer gives an outline explanation in the first 10 seconds of each news item, and that the title of the individual news item is displayed at the bottom of the screen as telop characters.
[0050]
The AV data input means 1 records the received signal after converting the video into YUV data of 10 frames per second and the audio into PCM data of 10,000 samples per second.
[0051]
As shown in FIG. 6, the outline explanation scene detection means 2 determines that four sections, namely frames 0 to 99, frames 6000 to 6099, frames 12000 to 12099, and frames 15000 to 15099, are frame sections of outline explanation scenes, and records them as an array of four elements.
[0052]
Assume that the video summarizing means 3 generates a summary video by cutting out 3 seconds of video at a cycle of 2 minutes from the detailed scene following the outline explanation scene. Then, as shown in FIG. 7, for the first news item, frames 100 to 129, frames 1300 to 1329, frames 2500 to 2529, frames 3700 to 3729, and frames 4900 to 4929 are recorded as the sections used for the summary video.
[0053]
The sections used for the summary video are recorded in the same manner for the second, third, and fourth news items. That is, for the second news item, frames 6100 to 6129, frames 7300 to 7329, frames 8500 to 8529, frames 9700 to 9729, and frames 10900 to 10929 are recorded as the sections used for the summary video.
[0054]
For the third news, the 12100 to 12129 frames, the 13300 to 13329 frames, and the 14500 to 14529 frames are recorded as sections used for the summary video.
[0055]
For the fourth news, frames 15100 to 15129, frames 16300 to 16329, and frames 17500 to 17529 are recorded as sections used for the summary video.
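The section lists above follow from simple arithmetic at 10 frames per second: a 2-minute cycle is 1200 frames and a 3-second clip is 30 frames. The sketch below reproduces them; the function name is hypothetical, and the assumption that a clip is used only when it fits entirely inside the detailed scene is ours.

```python
def summary_sections(detail_start, detail_end, period_frames, clip_frames):
    """Return (first_frame, last_frame) clips taken every period_frames frames."""
    sections = []
    start = detail_start
    # A clip is kept only if it ends inside the detailed scene.
    while start + clip_frames - 1 <= detail_end:
        sections.append((start, start + clip_frames - 1))
        start += period_frames
    return sections


FPS = 10
PERIOD = 2 * 60 * FPS   # 2-minute cycle in frames
CLIP = 3 * FPS          # 3-second clip in frames

first_news = summary_sections(100, 5999, PERIOD, CLIP)
third_news = summary_sections(12100, 14999, PERIOD, CLIP)
```

Under these assumptions the 10-minute items yield five clips (15 seconds of summary video) and the 5-minute items yield three clips (9 seconds).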
[0056]
The voice extraction means 4 calculates the sample numbers of the PCM data corresponding to the frame sections of the outline explanation scenes recorded by the outline explanation scene detection means 2, using the above-described formula:
P = F ÷ Rf × Rp
[0057]
In this case, since Rf = 10 and Rp = 10000, as shown in FIG. 8, the sample sections of the outline explanation voice are four sections, namely samples 0 to 99999, samples 6000000 to 6099999, samples 12000000 to 12099999, and samples 15000000 to 15099999, and they are recorded as an array.
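The conversion can be checked with a few lines of code. This sketch applies P = F ÷ Rf × Rp to the first frame of a section and, as an assumption inferred from the figures above, takes the end of the section to be the last sample before the next frame begins. The function name is hypothetical.

```python
def frame_section_to_samples(first_frame, last_frame, rf, rp):
    """Convert an inclusive frame section into an inclusive PCM sample section.

    rf is the video frame rate and rp is the PCM sampling rate.
    """
    start = first_frame * rp // rf
    # Last sample belonging to last_frame, i.e. one before the next frame starts.
    end = (last_frame + 1) * rp // rf - 1
    return start, end


scene_frames = [(0, 99), (6000, 6099), (12000, 12099), (15000, 15099)]
scene_samples = [frame_section_to_samples(a, b, 10, 10000) for a, b in scene_frames]
```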
[0058]
For each of the four individual news items, the AV summary output means 5 adjusts the summary video generated by the video summarizing means 3 and the outline explanation voice generated by the voice extraction means 4 to the same length, synthesizes them, and connects the results in order of serial number. As shown in FIG. 9, for the first and second news items, the summary video is 15 seconds long while the outline explanation voice is 10 seconds long, so 5 seconds of silence data is added after the end of the outline explanation voice before synthesis.
[0059]
On the other hand, for the third and fourth news items, the summary video is 9 seconds long while the outline explanation voice is 10 seconds long, so the first 1 second of the summary video is appended again after the 9-second summary video before synthesis. Connecting these in order of serial number finally yields a 50-second AV summary.
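The length adjustment in this example reduces to the following arithmetic. The sketch only plans the padding in seconds and does not touch actual video or audio data; the function name and the dictionary keys are hypothetical.

```python
def plan_segment(video_seconds, audio_seconds):
    """Return the segment length plus the silence and repeated video needed to match."""
    length = max(video_seconds, audio_seconds)
    return {
        "length": length,
        "silence_added": length - audio_seconds,   # appended after the voice
        "video_repeated": length - video_seconds,  # replayed from the start of the summary video
    }


# (summary video seconds, outline explanation voice seconds) for the four news items
news = [(15, 10), (15, 10), (9, 10), (9, 10)]
plans = [plan_segment(v, a) for v, a in news]
total_seconds = sum(p["length"] for p in plans)
```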
[0060]
In this way, by generating the summary video and the outline explanation voice separately and then synthesizing them, both the video and the voice are well suited to grasping the outline of the news, so it becomes easier to grasp the outline of the news when viewing the AV summary.
[0061]
Also, because the outline explanation voice is generated by extracting audio, without using voice recognition processing, which requires a high-speed CPU (central processing unit) and a large amount of memory, or advanced technology such as natural language understanding, the implementation cost is low and the processing is fast. The system can therefore be realized even on a PC (personal computer) with a small memory capacity or low CPU performance.
[0062]
Furthermore, by using the words actually spoken by the announcer as they are as the outline explanation voice, the outline explanation voice can be made natural and easy to understand.
[0063]
FIG. 10 is a block diagram showing the detailed configuration of the outline explanation scene detecting means according to another embodiment of the present invention. In FIG. 10, the outline explanation scene detection means 6 includes a similar image search means 61, an outline explanation scene database (DB) 62, and an outline explanation scene determination means 63.
[0064]
The outline explanation scene database 62 records a plurality of frame samples of the video of the outline explanation scene used in the broadcast program, and each sample can be taken out as YUV data.
[0065]
The similar image search means 61 compares the YUV data passed from the AV data input means 1 with the plurality of outline explanation scene samples recorded in the outline explanation scene database 62. If the similarity to any one of the recorded samples is high, the frame is recorded as an outline explanation scene candidate.
[0066]
Various known methods can be applied as the similar image search method in the similar image search means 61. For example, there is a method of taking the difference in color information for each pixel constituting a frame and determining whether the sum exceeds a threshold value. There is also a method of comparing search keys that are generated from the frame's luminance data, color data, or their frequency components after frequency conversion, and that are smaller in size than the original video data; in that case, the database capacity and the processing time for comparison can be reduced.
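The first method mentioned above, summing per-pixel differences and comparing the sum with a threshold, can be sketched as follows. The representation of a frame as a list of (Y, U, V) tuples, the use of absolute differences, and the function names are assumptions for illustration.

```python
def frame_difference(frame_a, frame_b):
    """Sum of absolute per-component differences between two equally sized frames."""
    return sum(abs(a - b)
               for pixel_a, pixel_b in zip(frame_a, frame_b)
               for a, b in zip(pixel_a, pixel_b))


def matches_any_sample(frame, samples, threshold):
    """Return True when the frame is close enough to at least one stored sample."""
    return any(frame_difference(frame, s) <= threshold for s in samples)
```

A frame for which `matches_any_sample` returns True would be treated as an outline explanation scene candidate.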
[0067]
The outline explanation scene determination means 63 differs from the operation of the embodiment shown in FIG. 4 in that the candidate frames of the outline explanation scene are detected by the similar image search means 61. After the candidate frames are detected, as in FIG. 4, short non-detection sections are changed to detection sections (step S15 in FIG. 4), and short detection sections are changed to non-detection sections to determine the outline explanation scenes (step S16 in FIG. 4).
[0068]
In this embodiment, when the composition of the outline explanation scene in the AV content to be summarized is fixed to some extent and samples of the outline explanation scene are available in advance, the outline explanation scene can be detected with higher accuracy. As a result, the contents of the AV summary that is finally output also become easier to grasp.
[0069]
For example, since the composition of an outline explanation scene by an announcer in a news program is often fixed for several months or more, a highly accurate AV summary can be generated according to this embodiment.
[0070]
In the above-described embodiment, an example in which the AV data input means 1 receives a broadcast has been described. However, an AV summary can be generated in the same manner for AV content stored on a recording medium or AV content sent via the Internet or the like, rather than broadcast.
[0071]
In addition, although YUV data and PCM data are exemplified as formats to be recorded by the AV content input unit 1, it is needless to say that AV summaries can be generated in various other formats as in the above-described embodiment.
[0072]
On the other hand, in the above-described embodiments, the outline explanation scene detection means 2 and 6 are exemplified by a method combining person detection, telop detection, and human voice detection and by a method using similar image search, but other methods may be used. For example, when a signal specifying the current scene is superimposed on the broadcast radio wave and it can be determined from that signal that the scene is an outline explanation scene, the signal may be used.
[0073]
The detection can also be realized by any combination of person detection, telop detection, human voice detection, and similar image search. Other conceivable methods include detecting the speaker who gives the outline explanation by speaker identification technology, and recognizing a topic break such as "Next News" by voice recognition and judging the scene that follows to be an outline explanation scene.
[0074]
In the above-described embodiment, the person detection means 21 is exemplified by a method of comparing the luminance histograms at the center and the periphery of the screen, but other person detection methods can be applied. Conceivable examples, not limited to the center of a screen divided into nine parts, include a method of examining the distribution of color information, a method using candidates for the elements that constitute a face, such as the eyes, nose, and mouth, and a method of detecting a human face from the amount of movement in the time direction.
[0075]
Further, the telop detection means 22 is exemplified by a method of counting the number of high-luminance and low-luminance pixels, but other telop detection methods can of course be applied. Conceivable examples include a method of judging by the number of edges, a method of judging whether the luminance change amounts at edge points are symmetric across successive edges, and a method of judging by the shape of a region with a high edge distribution density.
[0076]
Furthermore, the human voice detection means 23 is exemplified by a method of extracting a specific frequency band with a bandpass filter, but other human voice detection methods may of course be used. Conceivable examples include a method of judging whether the time-direction change pattern of various features of the human voice is similar to a previously registered pattern, and a method of judging whether the shape of the frequency spectrum distribution is similar to a previously registered pattern.
[0077]
In addition, for the outline explanation scene determination means 24, methods are also conceivable such as setting a time condition on the interval between outline explanation scenes and canceling one of the candidates when the interval is shorter than a threshold value, or selecting candidates according to their distribution.
[0078]
In the embodiment described above, the video summarizing means 3 is shown summarizing the video that follows the outline explanation scene. However, displaying the telop characters of the outline explanation scene as video is also an effective means of summarization, so the outline explanation scene may of course be included in the summary video.
[0079]
Further, the video summarizing means 3 is exemplified by a method of extracting video for a fixed time at a fixed cycle, but it goes without saying that other video summarization techniques can be applied. Conceivable examples include a method of extracting a frame at a fixed cycle and displaying it as a still image for a certain period of time, a method of changing the extraction cycle or the display time according to the contents, a method of displaying the extracted frames as a list of reduced images, a method of detecting a change point in the video features as a scene change and extracting the video immediately after it, and a method of calculating video importance according to the amount of video change in the time direction and extracting video with high importance.
[0080]
The AV summary output means 5 is exemplified by a method of multiplexing the summary video and the outline explanation voice and recording them on a recording medium. Other methods include displaying the summary video on a display while reproducing the outline explanation voice from an audio output device such as a speaker, and multiplexing the summary video and the outline explanation voice and sending them over a transmission line.
[0081]
In the operation of the above-described embodiment, the case where the outline explanation scene detection means 2, the video summarization means 3, the audio extraction means 4, and the AV summary output means 5 operate sequentially is illustrated. Of course, the case where a part of them operates in parallel is also included.
[0082]
[Effect of the invention]
As described above, according to the present invention, in an AV content automatic summarization system that selects and displays representative portions of video and audio from AV content including at least video and audio, the video and audio of the representative portions are taken out of the AV content separately and then synthesized and output, which has the effect of generating an AV summary whose contents are easier to grasp.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an AV content automatic summarization system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a detailed configuration of an outline explanation scene detection unit in FIG. 1.
FIG. 3 is a flowchart showing the operation of an AV content automatic summarization system according to an embodiment of the present invention.
FIG. 4 is a flowchart showing the operation of the outline explanation scene detection means shown in FIG. 2.
FIG. 5 is a diagram illustrating a specific operation example of the AV content automatic summarizing system according to the embodiment of the present invention.
FIG. 6 is a diagram illustrating a specific operation example of the AV content automatic summarizing system according to the embodiment of the present invention.
FIG. 7 is a diagram illustrating a specific operation example of the AV content automatic summarizing system according to the embodiment of the present invention.
FIG. 8 is a diagram illustrating a specific operation example of the AV content automatic summarizing system according to the embodiment of the present invention.
FIG. 9 is a diagram illustrating a specific operation example of the AV content automatic summarizing system according to the embodiment of the present invention.
FIG. 10 is a block diagram showing a detailed configuration of an outline explanation scene detection unit according to another embodiment of the present invention.
[Explanation of symbols]
1 AV data input means
2,6 Outline explanation scene detection means
3 Video summarization means
4 Voice extraction means
5 AV summary output means
21 Person detection means
22 Telop detection means
23 Human voice detection means
24, 63 Outline explanation scene determination means
61 Similar image search means
62 Outline explanation scene database

Claims

1. An AV content automatic summarization system comprising:
detection means for detecting an outline explanation scene, which is at least a scene in which an announcer explains an outline of the next news item in a news program;
generation means for generating a summary video of a detailed scene following the outline explanation scene detected by the detection means;
extraction means for extracting only the audio of the outline explanation scene detected by the detection means; and
output means for synthesizing and outputting the summary video generated by the generation means and the outline explanation voice extracted by the extraction means.

2. The AV content automatic summarization system according to claim 1, wherein the extraction means extracts the voice of the outline explanation scene at the beginning of each topic and uses it as it is.

3. The AV content automatic summarization system according to claim 1, wherein the extraction means extracts the voice of the outline explanation scene by the announcer at the beginning of each individual news item of the news program and uses it as it is.

4. The AV content automatic summarization system according to any one of claims 1 to 3, wherein the detection means detects the outline explanation scene by combining detection of a person in video information, detection of a telop in the video information, and detection of a human voice in audio information accompanying the video information.

5. The AV content automatic summarization system according to any one of claims 1 to 3, wherein the detection means searches for the outline explanation scene using a similar image search that detects similarity to a sample of the outline explanation scene recorded in advance.

6. An AV content automatic summarization system for generating an AV (Audio Visual) summary from content including an outline explanation scene, which is at least a scene in which an announcer explains an outline of the next news item in a news program, the system comprising:
outline explanation scene detection means for detecting the outline explanation scene from the content and recording a set of a start frame number and an end frame number of the outline explanation scene together with the outline explanation scene;
video summarizing means for generating a summary video of a detailed scene following the outline explanation scene;
voice extraction means for cutting out the voice of the outline explanation scene as outline explanation voice; and
AV summary output means for synchronizing the outline explanation voice generated by the voice extraction means with the summary video of the detailed scene generated by the video summarizing means that corresponds to that outline explanation voice, and reproducing and outputting them as the AV summary.

7. The AV content automatic summarization system according to claim 6, wherein the outline explanation scene detection means detects the outline explanation scene by performing person detection, telop detection, and human voice detection on the content.

8. The AV content automatic summarization system according to claim 6, wherein the outline explanation scene detection means detects the outline explanation scene by performing, on the content, a similar image search that detects similarity to a sample of the outline explanation scene recorded in advance.

9. The AV content automatic summarization system according to any one of claims 6 to 8, wherein the voice extraction means extracts the voice of the outline explanation scene at the beginning of each topic and uses it as it is.

10. The AV content automatic summarization system according to any one of claims 6 to 8, wherein the voice extraction means extracts the voice of the outline explanation scene by the announcer at the beginning of each individual news item of the news program and uses it as it is.