JP2004015523A

JP2004015523A - Video-related content generation apparatus, video-related content generation method, and video-related content generation program

Info

Publication number: JP2004015523A
Application number: JP2002167419A
Authority: JP
Inventors: Toshihiko Misu; 三須　俊彦; Masahide Naemura; 苗村　昌秀; Bunto Tei; 鄭　文濤
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2002-06-07
Filing date: 2002-06-07
Publication date: 2004-01-15
Anticipated expiration: 2022-06-07
Also published as: JP4003940B2

Abstract

【課題】映像を、その映像の内容に関連する座標情報、音声情報、画像情報等の異種情報に変換した映像関連コンテンツを生成することを可能にする映像関連コンテンツ生成装置、映像関連コンテンツ生成方法及び映像関連コンテンツ生成プログラムを提供する。
【解決手段】映像関連コンテンツ生成装置１は、入力された映像信号の映像シーンを解析することで、その映像シーンの映像特徴量を抽出する映像シーン解析手段２と、その映像シーン解析手段２で抽出した映像特徴量に基づいて、映像内容に関連する情報をコンテンツ記述言語で記述したコンテンツ、画像ファイル又は映像ストリーム、音声ファイル又は音声ストリームを映像関連コンテンツとして生成するコンテンツ生成手段３を備えたことを特徴とする。
【選択図】　図１
[Problem] To provide a video-related content generation apparatus, a video-related content generation method, and a video-related content generation program capable of generating video-related content by converting a video into heterogeneous information related to its content, such as coordinate information, audio information, and image information.
[Solution] A video-related content generation apparatus 1 comprises video scene analysis means 2, which analyzes a video scene of an input video signal to extract video feature amounts of that scene, and content generation means 3, which, based on the video feature amounts extracted by the video scene analysis means 2, generates as video-related content: content in which information related to the video content is described in a content description language, an image file or video stream, or an audio file or audio stream.
[Selected drawing] Fig. 1

Description

【０００１】
【発明の属する技術分野】
本発明は、映像を映像メディア以外のメディアに供給することができる映像関連コンテンツを生成する映像関連コンテンツ生成装置、映像関連コンテンツ生成方法及び映像関連コンテンツ生成プログラムに関する。
【０００２】
【従来の技術】
従来、映像（映像コンテンツ）から、映像コンテンツをそのまま提示することができる映像メディア（テレビ放送等）以外の、例えばデータ放送、ＷＷＷ（Ｗｏｒｌｄ　Ｗｉｄｅ　Ｗｅｂ）、携帯端末等の他のメディアで提示することができるコンテンツを制作し、配信する場合、元となる映像から特定の大きさを切り出したり、伝送フレームレート等を変換することで、他のメディア用の映像コンテンツを制作し、配信を行っている。この映像コンテンツを、映像メディア以外のメディア用コンテンツへ変換する方法は、解像度を除いては基本的に同種（映像）のコンテンツに変換することしか行われていない。
【０００３】
また、従来、音声（音声コンテンツ）から、文字放送等で提示するコンテンツを制作する場合、元となる音声を音声認識によって文字情報に変換して、文字コンテンツとする方法がある。このように、音声コンテンツでは音声コンテンツから文字情報という異種のコンテンツに変換することが行われている。
【０００４】
【発明が解決しようとする課題】
しかし、前記従来の技術では、異なるメディア用のコンテンツに変換する場合、映像から映像への解像度変換が主流であり、その変換の前後では解像度の違いを除いては基本的には同一の映像コンテンツである。また、異種コンテンツへの変換は、音声認識に基づく音声コンテンツから文字コンテンツへの変換が主流である。すなわち、映像から映像コンテンツ以外のコンテンツに変換する手法は考えられていない。
【０００５】
このため、映像コンテンツに関連した、ＷＷＷ等で使用されるコンテンツ記述言語で記述されたテキストベースのコンテンツ、音声コンテンツ、あるいは、映像コンテンツと関連しているが異なる画像を有する画像コンテンツ等を制作する場合、映像コンテンツを利用することができず、最初から制作を行わなければならないという問題があった。
【０００６】
本発明は、以上のような問題点に鑑みてなされたものであり、映像（映像コンテンツ）を、その映像の内容に関連する座標情報、音声情報、画像情報等の異種情報に変換した映像関連コンテンツを生成することを可能にする映像関連コンテンツ生成装置、映像関連コンテンツ生成方法及び映像関連コンテンツ生成プログラムを提供することを目的とする。
【０００７】
【課題を解決するための手段】
本発明は、前記目的を達成するために創案されたものであり、まず、請求項１に記載の映像関連コンテンツ生成装置は、映像信号から、その映像信号の映像内容に関連する情報を映像関連コンテンツとして生成する映像関連コンテンツ生成装置であって、映像信号を解析して、映像特徴量を抽出する映像シーン解析手段と、この映像シーン解析手段で抽出された映像特徴量を、テキストデータ、画像データ及び音声データの少なくとも１つに変換して映像関連コンテンツを生成するコンテンツ生成手段と、を備える構成とした。
【０００８】
かかる構成によれば、映像関連コンテンツ生成装置は、映像シーン解析手段によって、映像信号から映像特徴量を抽出する。そして、コンテンツ生成手段によって、映像特徴量をテキストデータ、画像データ及び音声データの少なくとも１つに変換して映像関連コンテンツとして生成する。
【０００９】
ここで、映像特徴量とは、映像シーンを構成するフレームを特徴付ける数量のことで、例えば、明るさ（輝度値）、色味（色特徴量）、動き（動きベクトル量）、テクスチャ、映像オブジェクトの位置座標、映像オブジェクト数等を数値化したもの、あるいはその統計量である。
【００１０】
また、映像特徴量をテキストデータに変換する場合、テキストベースのコンテンツ記述言語に変換すると、そのコンテンツ記述言語の再生装置によって、コンテンツを再生することが可能になり都合がよい。このコンテンツ記述言語には、例えば、ＨＴＭＬ（ＨｙｐｅｒＴｅｘｔ　Ｍａｒｋｕｐ　Ｌａｎｇｕａｇｅ）、ＶＲＭＬ（Ｖｉｒｔｕａｌ　Ｒｅａｌｉｔｙ　Ｍｏｄｅｌｉｎｇ　Ｌａｎｇｕａｇｅ）、ＢＭＬ（Ｂｒｏａｄｃａｓｔ　Ｍａｒｋｕｐ　Ｌａｎｇｕａｇｅ）、ＲｅａｌＡｕｄｉｏメタファイル等がある。
【００１１】
また、請求項２に記載の映像関連コンテンツ生成装置は、請求項１に記載の映像関連コンテンツ生成装置において、映像シーン解析手段が、映像シーンに含まれる映像オブジェクトの位置座標を、映像特徴量として検出する映像オブジェクト位置検出手段を備える構成とした。
【００１２】
かかる構成によれば、映像関連コンテンツ生成装置は、映像オブジェクト位置検出手段によって、映像シーンに含まれる映像オブジェクトの位置座標を、映像特徴量として検出する。この位置座標は、映像オブジェクトの特定の位置（例えば、左上座標、中心座標等）でもよいし、映像オブジェクトの重心座標としてもよい。
【００１３】
さらに、請求項３に記載の映像関連コンテンツ生成装置は、請求項１又は請求項２に記載の映像関連コンテンツ生成装置において、映像シーン解析手段が、映像シーンに含まれる映像オブジェクトを特徴付ける特徴量を、映像特徴量として抽出する映像オブジェクト特徴量抽出手段を備える構成とした。
【００１４】
かかる構成によれば、映像関連コンテンツ生成装置は、映像オブジェクト特徴量抽出手段によって、映像シーンに含まれる映像オブジェクトを特徴付ける映像特徴量を抽出する。この映像特徴量（映像オブジェクト特徴量）は、明るさ（輝度値）、色味（色特徴量）、動き（動きベクトル量）、テクスチャ等の映像オブジェクト毎の特徴量である。
【００１５】
また、請求項４に記載の映像関連コンテンツ生成装置は、請求項１乃至請求項３のいずれか１項に記載の映像関連コンテンツ生成装置において、コンテンツ生成手段が、映像特徴量と、その映像特徴量を文字列として表現した特徴量文字列とを対応付けて蓄積した文字列変換データベースと、映像特徴量に基づいて、特徴量文字列を埋め込むテキスト領域をテンプレート化したコンテンツ記述言語のテキスト領域に、特徴量文字列を埋め込む文字列埋め込み手段と、を備える構成とした。
【００１６】
かかる構成によれば、映像関連コンテンツ生成装置は、文字列埋め込み手段によって、文字列変換データベースを参照して、特徴量文字列を埋め込むテキスト領域をテンプレート化したコンテンツ記述言語のテキスト領域に、映像特徴量に対応する特徴量文字列を埋め込む。
このコンテンツ記述言語のテキスト領域には、映像特徴量に対応する予め定められた置換対象文字列を記述しておき、文字列変換データベースには、映像特徴量の種類とその映像特徴量の値毎に、置換対象文字列と特徴量文字列（置換文字列）とを対応付けておくことで、コンテンツ記述言語のテキスト領域である置換対象文字列を容易に特徴量文字列に置き換えることができる。
【００１７】
さらに、請求項５に記載の映像関連コンテンツ生成装置は、請求項２に記載の映像関連コンテンツ生成装置において、コンテンツ生成手段が、映像オブジェクト位置検出手段で検出された映像オブジェクトの位置座標に、映像オブジェクトに関連する画像データを合成する画像合成手段を備える構成とした。
【００１８】
かかる構成によれば、映像関連コンテンツ生成装置は、画像合成手段によって、映像オブジェクト位置検出手段で検出された映像オブジェクトの位置座標に、画像データを合成することで、映像オブジェクトの位置座標を可視化したコンテンツを生成する。
【００１９】
さらにまた、請求項６に記載の映像関連コンテンツ生成装置は、請求項１乃至請求項５のいずれか１項に記載の映像関連コンテンツ生成装置において、コンテンツ生成手段が、映像特徴量に対応付けて、複数の音声データを蓄積した音声データ蓄積手段と、映像特徴量に基づいて、音声データ蓄積手段に蓄積されている音声データを選択する音声選択手段と、この音声選択手段で選択された音声データを出力する音声出力手段と、を備える構成とした。
【００２０】
かかる構成によれば、映像関連コンテンツ生成装置は、映像特徴量に対応付けて、複数の音声データを蓄積した音声データ蓄積手段から、音声選択手段が、映像特徴量に基づいて音声データを選択する。
ここで、音声データ蓄積手段に蓄積されている音声データは、映像特徴量の値に対応付けて、例えば、輝度値等による映像の明るさを映像特徴量とする場合は、明るい映像に対して、楽しい音楽を対応付ける。あるいは、映像オブジェクトの移動量による動きの激しさを映像特徴量とする場合は、映像オブジェクトの動きの激しい映像に対しては、テンポの速い音楽を対応付けることも可能である。
【００２１】
また、請求項７に記載の映像関連コンテンツ生成方法は、映像信号から、その映像信号の映像内容に関連する情報を映像関連コンテンツとして生成する映像関連コンテンツ生成方法であって、映像信号の映像シーンを解析して、映像特徴量を抽出する映像シーン解析ステップと、映像特徴量とその映像特徴量を文字列として表現した特徴量文字列とを対応付けて蓄積した文字列変換データベースから、映像シーン解析ステップで抽出した映像特徴量に対応する特徴量文字列を検索する文字列検索ステップと、特徴量文字列を埋め込むテキスト領域をテンプレート化した、コンテンツ記述言語を入力するコンテンツ記述言語入力ステップと、映像特徴量に基づいて、コンテンツ記述言語のテキスト領域に文字列検索ステップで検索した特徴量文字列を埋め込む文字列埋め込みステップと、を含んでいることを特徴とする。
【００２２】
かかる方法によれば、映像関連コンテンツ生成方法は、映像シーン解析ステップによって、映像シーンを構成するフレームを特徴付ける数量である映像特徴量を抽出し、文字列検索ステップによって、映像特徴量と映像特徴量を文字列として表現した特徴量文字列とを対応付けて蓄積した文字列変換データベースから、映像シーン解析ステップで抽出した映像特徴量に対応する特徴量文字列を検索する。
そして、コンテンツ記述言語入力ステップによって、特徴量文字列を埋め込むテキスト領域をテンプレート化したコンテンツ記述言語を入力し、文字列埋め込みステップによって、コンテンツ記述言語のテキスト領域に特徴量文字列を埋め込んで映像関連コンテンツを生成する。
【００２３】
さらに、請求項８に記載の映像関連コンテンツ生成プログラムは、映像信号から、その映像信号の映像内容に関連する情報を映像関連コンテンツとして生成するために、コンピュータを、映像信号の映像シーンを解析して、映像特徴量を抽出する映像シーン解析手段、この映像シーン解析手段で抽出された映像特徴量を、テキストデータ、画像データ及び音声データの少なくとも１つに変換して映像関連コンテンツを生成するコンテンツ生成手段、として機能させる構成とした。
【００２４】
かかる構成によれば、映像関連コンテンツ生成プログラムは、映像シーン解析手段によって、映像シーンを構成するフレームを特徴付ける数量である映像特徴量を抽出し、コンテンツ生成手段によって、映像特徴量をテキストデータ、画像データ及び音声データの少なくとも１つに変換して映像関連コンテンツとして生成する。
【００２５】
【発明の実施の形態】
以下、本発明の実施の形態について図面を参照して説明する。
（映像関連コンテンツ生成装置の構成）
図１は、本発明における映像関連コンテンツ生成装置の構成を示したブロック図である。図１に示すように映像関連コンテンツ生成装置１は、入力された映像（映像信号）の映像シーンを解析することで、その映像シーンの特徴量（映像特徴量）を抽出し、その抽出した映像特徴量に基づいて、映像内容に関連する情報をコンテンツ記述言語で記述したコンテンツ、画像ファイル又は映像ストリーム、音声ファイル又は音声ストリームを映像関連コンテンツとして生成するものである。
【００２６】
この映像関連コンテンツ生成装置１は、映像シーン解析手段２と、コンテンツ記述言語生成手段４、画像合成手段５及び音声合成手段６を含んだコンテンツ生成手段３と、を備える構成とした。
【００２７】
映像シーン解析手段２は、映像オブジェクト位置検出部２１と、映像オブジェクト特徴量抽出部２２と、映像シーン特徴量抽出部２３とを備え、入力された映像信号から、映像信号の解析を行い、映像特徴量を抽出するものである。
【００２８】
映像オブジェクト位置検出部（映像オブジェクト位置検出手段）２１は、映像シーンから、その映像シーンに含まれる映像オブジェクトを検出するものである。ここでは、この映像オブジェクト位置検出部２１は、映像オブジェクトのフレーム内における重心の位置座標を検出すると同時に、個々の映像オブジェクトに固有の識別子を割り当て、この位置座標、識別子、並びに映像オブジェクトを検出した時刻を映像特徴量としてコンテンツ生成手段３へ出力するものとする。
【００２９】
なお、映像オブジェクトに識別子を割り当てるのは、映像オブジェクトの位置座標に基づいて、例えば、画面上で左から表示される順番に連番を付けることも可能である。あるいは、映像オブジェクトが人物の場合、一般的な顔認識の技術によって、連番の代わりに人物名を識別子として用いることも可能である。
【００３０】
映像オブジェクト特徴量抽出部（映像オブジェクト特徴量抽出手段）２２は、映像シーンから、映像オブジェクト位置検出部２１で検出された映像オブジェクトの映像オブジェクト特徴量を抽出するものである。この映像オブジェクト特徴量は、明るさ（輝度値）、色味（色特徴量）、動き（動きベクトル量）、テクスチャ、形状パラメータ等の映像オブジェクト毎の特徴量である。この映像オブジェクト特徴量抽出部２２は、この映像オブジェクト特徴量とその映像オブジェクト固有の識別子を映像特徴量としてコンテンツ生成手段３へ出力するものである。
なお、映像オブジェクトの検出や特徴量の抽出は、本願出願人において「動画像のオブジェクト抽出装置（特開２００１−３０７１０４）」又は「映像オブジェクト検出・追跡装置（特願２００１−１６６５２５）」として開示されている技術を用いて実現することができる。
【００３１】
映像シーン特徴量抽出部２３は、映像シーンから、フレーム毎の映像特徴量を抽出するものである。この映像シーン特徴量抽出部２３は、フレーム全体の特徴や、映像オブジェクト特徴量抽出部２２で抽出した個々の映像オブジェクト特徴量を統計した情報をフレームの映像特徴量としてコンテンツ生成手段３へ出力するものである。
例えば、映像シーン特徴量抽出部２３は、フレームの各画素の輝度値を、フレーム全体に渡って平均をとったフレームの平均輝度値や、フレーム内の映像オブジェクトの数等を映像特徴量として出力する。
【００３２】
コンテンツ生成手段３は、コンテンツ記述言語生成手段４、画像合成手段５及び音声合成手段６を備え、映像シーン解析手段２から入力される映像特徴量から、映像シーンに関連する情報を映像関連コンテンツとして出力するものである。
【００３３】
コンテンツ記述言語生成手段４は、文字列変換データベース４１と、特徴量文字列変換部４２と、テンプレート文字列置換部４３とを備え、外部から入力されるコンテンツ記述言語のテンプレート（コンテンツ記述言語テンプレート４４ａ）の置換対象文字列を、映像シーン解析手段２で解析され、抽出された映像特徴量に対応する置換文字列に置換して、映像シーンに関連するコンテンツ記述言語で書かれたコンテンツを生成するものである。
【００３４】
このコンテンツ記述言語には、例えば、ＨＴＭＬ、ＶＲＭＬ、ＢＭＬ、ＲｅａｌＡｕｄｉｏメタファイル等がある。ここでは、ＨＴＭＬを代表して説明を行うが、他のコンテンツ記述言語においても同様の構成で実現することが可能である。
【００３５】
文字列変換データベース４１は、映像特徴量を文字列として表現するための変換ルール（文字列変換ルール４１ａ）を蓄積したデータベースで、映像特徴量の種類及びその数値と、コンテンツ記述言語テンプレート４４ａに記述された置換対象文字列と、その置換対象文字列を置換する置換文字列とを対応付けて蓄積したものである。
【００３６】
ここで、図２及び図３を参照して、コンテンツ記述言語テンプレート４４ａ及び文字列変換ルール４１ａについて説明する。図２は、コンテンツ記述言語テンプレート４４ａの一例を示すＨＴＭＬで記述したテンプレート（雛型）であり、図３は、文字列変換ルール４１ａの内容の一例を示す図である。
【００３７】
図２に示すように、コンテンツ記述言語テンプレート４４ａは、コンテンツ記述言語（ここではＨＴＭＬ）で記述したテキストファイルであり、入力される映像シーンの内容に関連する部分を置換対象文字列として記述しておき、あとからその置換対象文字列を置換することができるテンプレートである。ここでは、「部屋の情景」を説明するＨＴＭＬのテンプレートを例としており、置換対象文字列４４ｂとして、「＜！−−ｂｒｉｇｈｔｎｅｓｓ−−＞」を用い、映像シーンの明るさに関する映像特徴量に基づいて文字列を置換する領域を示している。また、置換対象文字列４４ｃとして、「＜！−−ｎｕｍｂｅｒ−−＞」を用い、映像シーン内の数に関する映像特徴量に基づいて文字列を置換する領域を示している。
【００３８】
図３の文字列変換ルール４１ａでは、映像特徴量の種類として、フレームの各画素の輝度値を、フレーム全体に渡って平均をとったフレームの平均輝度値（輝度値の平均値）と、フレーム内の映像オブジェクトの数（オブジェクトの個数）を用い、その映像特徴量の値に置換対象文字列と置換文字列（特徴量文字列）とを対応付けている。
【００３９】
図３（ａ）では、例えば、映像を構成する画素の輝度値を０から２５５の２５６値で表したとき、映像特徴量の値である輝度値の平均値が、「９０未満」の場合は、置換対象文字列が「＜！−−ｂｒｉｇｈｔｎｅｓｓ−−＞」、置換文字列が「暗い」であることを示している。これによって、輝度値の平均値が、「９０未満」の場合は、コンテンツ記述言語テンプレート４４ａ（図２）の置換対象文字列４４ｂが「暗い」に置換される。また、オブジェクトの個数が「２以上」の場合は、コンテンツ記述言語テンプレート４４ａの置換対象文字列４４ｃは、「たくさんあります」に置換される。
【００４０】
図３（ｂ）では、図３（ａ）の置換文字列を「暗い」、「たくさんあります」等の日本語文字列で表すのではなく、「＜ｉｍｇ　ｓｒｃ＝”１．ｐｎｇ”＞」等のＨＴＭＬの埋め込み画像として指定する場合の例を示している。このように、置換文字列は、日本語文字列だけではなく画像ファイル、音声ファイル、スクリプト等のファイル名をコンテンツ記述言語に埋め込む置換文字列として記述することとしてもよい。なお、このスクリプトには、ＪａｖａＳｃｒｉｐｔ（登録商標）、ＥＣＭＡＳｃｒｉｐｔ等がある。
図１に戻って説明を続ける。
【００４１】
特徴量文字列変換部（文字列埋め込み手段）４２は、映像シーン解析手段２から入力された映像特徴量に基づいて、文字列変換データベース４１の文字列変換ルール４１ａを参照し、その映像特徴量に対応する置換対象文字列と、置換文字列とをテンプレート文字列置換部４３へ通知するものである。
【００４２】
テンプレート文字列置換部（文字列埋め込み手段）４３は、外部から入力されるコンテンツ記述言語テンプレート４４ａと、特徴量文字列変換部４２から通知される置換対象文字列及び置換文字列とに基づいて、コンテンツ記述言語テンプレート４４ａに記述されている置換対象文字列を置換文字列に置換することで、ＨＴＭＬファイル等のコンテンツ記述言語を生成するものである。
【００４３】
なお、置換文字列で映像オブジェクトの位置を表す場合には、その位置座標の時刻毎の位置座標リストを置換対象文字列としたＶＲＭＬのＰｏｓｉｔｉｏｎＩｎｔｅｒｐｏｌａｔｏｒノードを用いて記述することも可能である。
【００４４】
画像合成手段５は、位置提示画像合成部５１と、画像出力部５２とを備え、映像シーン解析手段２で解析され、抽出された映像特徴量に関連する画像を合成して画像ファイル又は映像ストリームとして出力するものである。
【００４５】
位置提示画像合成部５１は、映像シーン解析手段２の映像オブジェクト位置検出部２１から、映像オブジェクトの位置座標、識別子並びに検出時刻を映像特徴量として入力し、その識別子で区別された映像オブジェクトがある時刻においてどの位置に存在していたかを提示する位置提示画像を合成するものである。ここで合成された画像は画像出力部５２へ出力される。
【００４６】
例えば、予めアイコン画像を蓄積した画像蓄積手段（図示せず）から、アイコン画像を読み込んで、無地の画像上の位置座標で示される位置にアイコン画像を合成する。また、例えば、映像シーン解析手段２の映像オブジェクト特徴量抽出部２２で抽出される映像オブジェクトの画像をそのままアイコン画像として合成することとしてもよい。
【００４７】
画像出力部５２は、位置提示画像合成部５１で合成された画像を画像ファイルとして出力するものである。なお、画像出力部５２は、位置提示画像合成部５１から画像が時系列に入力される場合は、その時系列画像を映像オブジェクトが時刻によって変化する映像ストリームとして出力する。
【００４８】
音声合成手段６は、音声データ蓄積部６１と、音声選択部６２と、音声出力部６３とを備え、映像シーン解析手段２で解析され、抽出された映像特徴量に関連する音声を音声ファイル又は音声ストリームとして出力するものである。
【００４９】
音声データ蓄積部（音声データ蓄積手段）６１は、予め映像シーンに関連する音声データ６１ａを識別番号に対応付けて蓄積しておくものであり、ハードディスク等で構成されるものである。この音声データ６１ａは、映像シーンに関連して映像シーンを表現するための音声データであり、例えば、ＢＧＭ（Ｂａｃｋ　Ｇｒｏｕｎｄ　Ｍｕｓｉｃ）、効果音、人の声等である。
また、この音声データ蓄積部６１は、映像シーンの映像特徴量に基づいた音声データ６１ａを複数保持している。例えば、輝度値に対応付けて「明るさ」のレベルを表現する音声データ６１ａを音声ファイルとして保持している。
【００５０】
音声選択部（音声選択手段）６２は、映像シーン解析手段２から入力される映像シーンの映像特徴量に基づいて、音声データ蓄積部６１に蓄積されている音声データ６１ａを選択して、音声出力部６３へ音声データ６１ａの識別番号を通知するものである。
この音声選択部６２は、映像シーンの映像特徴量（例えば輝度値の平均値）から、映像シーンを表現する音声データ蓄積部６１に蓄積されている音声データ６１ａの識別番号を音声出力部６３へ通知する。例えば、音声選択部６２は、輝度値の平均値に基づいて、「明るさ」のレベルを判定して、その「明るさ」に対応する音声データ６１ａを選択する。あるいは、映像オブジェクトの位置座標に基づいて、予め設定された領域に映像オブジェクトが入ったときに、特定の音声データ６１ａを選択することとしてもよい。
【００５１】
音声出力部（音声出力手段）６３は、音声選択部６２で選択され、識別番号で通知された音声データ蓄積部６１内の音声データ６１ａを読み込んで、音声ファイル又は音声ストリームとして出力するものである。
このように、コンテンツ記述言語生成手段４から出力されるコンテンツ記述言語（ＨＴＭＬファイル等）、画像合成手段５から出力される画像ファイル（又は映像ストリーム）、音声合成手段６から出力される音声ファイル（又は音声ストリーム）は、個々に出力する形態であっても構わないし、複数の出力を映像関連コンテンツとして出力する形態であっても構わない。
【００５２】
以上、一実施形態に基づいて、映像関連コンテンツ生成装置１の構成について説明したが、映像関連コンテンツ生成装置１は、コンピュータにおいて各手段を各機能プログラムとして実現することも可能であり、各機能プログラムを結合して映像関連コンテンツ生成プログラムとして動作させることも可能である。
【００５３】
（映像関連コンテンツ生成装置の動作：コンテンツ記述言語生成例）
次に、映像関連コンテンツ生成装置１の動作について説明する。
まず、図１及び図７を参照して、映像関連コンテンツ生成装置１がコンテンツ記述言語を生成する動作例について説明する。図７は、映像関連コンテンツ生成装置１がコンテンツ記述言語を生成する動作を示すフローチャートである。
【００５４】
映像関連コンテンツ生成装置１は、入力された映像信号から、映像シーン解析手段２が、映像信号を解析し、映像特徴量を抽出する（ステップＳ１１）。
【００５５】
そして、コンテンツ記述言語生成手段４において、特徴量文字列変換部４２が文字列変換データベース４１内の文字列変換ルール４１ａに基づいて、ステップＳ１１で抽出した映像特徴量に対応する置換対象文字列及び置換文字列を検索して、テンプレート文字列置換部４３に通知する（ステップＳ１２）。
【００５６】
置換対象文字列及び置換文字列を通知されたテンプレート文字列置換部４３は、外部からコンテンツ記述言語テンプレート４４ａを読み込み（ステップＳ１３）、コンテンツ記述言語テンプレート４４ａ内の置換対象文字列を検索する（ステップＳ１４）。
そして、置換対象文字列が存在するかどうかを判定し（ステップＳ１５）、存在する場合（Ｙｅｓ）は、置換対象文字列を置換文字列に置き換えて（ステップＳ１６）、ステップＳ１４に戻ってさらに置換対象文字列を検索する。
【００５７】
一方、置換対象文字列が存在しない場合（ステップＳ１５でＮｏ）は、置換対象文字列をすべて置換文字列に置き換えたものとして、その置換文字列に置き換えたコンテンツ記述言語（ＨＴＭＬファイル等）を出力して（ステップＳ１７）、動作を終了する。
以上のステップによって、映像信号からその映像内容に関連する情報を、コンテンツ記述言語で記述したテキストベースのコンテンツを生成することができる。
【００５８】
次に、図４を参照して、コンテンツ記述言語生成手段４を中心にして、コンテンツ記述言語生成の具体的な動作について説明する。図４は、映像特徴量からＨＴＭＬファイルを生成する例を示す概念図である。
【００５９】
コンテンツ記述言語生成手段４は、まず、映像特徴量２ａを入力する。ここで、映像特徴量２ａとして、輝度値の平均値が「１００」、オブジェクトの個数が「１」であったとすると、コンテンツ記述言語生成手段４は、文字列変換ルール４１ａ（図３（ａ））を参照して、輝度値の平均値「１００」及びオブジェクトの個数「１」に対応する置換対象文字列及び置換文字列を検索し、輝度値の平均値「１００」に対応する置換対象文字列「＜！−−ｂｒｉｇｈｔｎｅｓｓ−−＞」並びに置換文字列「薄暗い」と、オブジェクトの個数「１」に対応する置換対象文字列「＜！−−ｎｕｍｂｅｒ−−＞」並びに置換文字列「一つあります」とを得る。
【００６０】
そして、コンテンツ記述言語生成手段４は、外部から入力されるコンテンツ記述言語テンプレート４４ａ（図２）の置換対象文字列「＜！−−ｂｒｉｇｈｔｎｅｓｓ−−＞」並びに「＜！−−ｎｕｍｂｅｒ−−＞」をそれぞれ「薄暗い」並びに「一つあります」に変換することにより、「部屋の情景」を説明するＨＴＭＬファイル４ａを生成する。
【００６１】
（映像関連コンテンツ生成装置の動作：画像合成例）
次に、図１及び図８を参照して、映像関連コンテンツ生成装置１が映像信号に関連する画像を合成して出力する動作例について説明する。図８は、映像関連コンテンツ生成装置１が合成画像を時系列化した映像ストリームを生成する動作を示すフローチャートである。
【００６２】
まず、映像関連コンテンツ生成装置１は、入力された映像信号に基づいて、映像シーン解析手段２が、映像信号を解析し、映像特徴量である映像シーンの時刻、映像オブジェクトの位置座標を抽出する（ステップＳ２１）。
そして、画像合成手段５において、位置提示画像合成部５１が映像シーンのある時刻における映像オブジェクトの位置座標にアイコン画像を合成した合成画像を生成する（ステップＳ２２）。
【００６３】
映像シーンの全ての時刻における合成画像の生成を完了したかどうかを判定し（ステップＳ２３）、まだ、完了していない場合（Ｎｏ）は、ステップＳ２２へ戻って、次の時刻の合成画像を生成する。一方、すべての時刻における合成画像の生成を完了した場合（Ｙｅｓ）は、映像シーンの時刻毎に生成した合成画像を映像ストリームとして出力して（ステップＳ２４）、動作を終了する。
以上のステップによって、映像信号から、映像オブジェクトの位置のみを視覚化した映像ストリームとして生成することができる。
【００６４】
次に、図５及び図６を参照して、画像合成手段５を中心にして、映像信号から画像ファイル又は映像ストリームを生成する具体的な動作について説明する。図５は、映像特徴量から同一画像上に時系列に変化する映像オブジェクトの位置を視覚化した、画像ファイルを生成する例を示す概念図である。図６は、映像特徴量から時系列に変化する映像オブジェクトの位置を別々の画像ファイルとして生成する、又は、映像ストリームとして生成する例を示す概念図である。
【００６５】
図５に示すように、画像合成手段５は、まず、映像特徴量２ａを入力する。ここで、映像特徴量２ａとして、時刻が「１」、「２」、「３」及び「４」、その時刻に対応する映像オブジェクトの位置座標が「（１，１）」、「（１，２）」、「（２，３）」及び「（３，２）」であったとする。
【００６６】
画像合成手段５は、まず、無地画像の位置座標「（１，１）」、「（１，２）」、「（２，３）」及び「（３，２）」にアイコン画像Ｃ１、Ｃ２、Ｃ３及びＣ４（Ｃ４のみ異なるアイコン画像を使用）を合成し、各アイコン画像（Ｃ１〜Ｃ４）間を直線で結んだ画像ファイル５ａを生成する。これによって、映像オブジェクトがＣ１の位置座標からＣ４の位置座標へ移動したことを表現することができる。
【００６７】
また、図６に示した例では、図５と同様の映像特徴量２ａを入力しているが、画像合成手段５の出力が、一枚の画像ファイルではなく、複数の画像ファイルあるいは映像ストリームとしているところが異なっている。
図６において、画像合成手段５は、時刻「１」から「４」に対応する無地画像の位置座標「（１，１）」、「（１，２）」、「（２，３）」及び「（３，２）」にアイコン画像Ｃ４を合成し、４枚の画像を生成する。なお、画像合成手段５は、この４枚の画像を個々の画像ファイル５ａとして出力することも可能であるし、個々の画像を連続したストリームデータとした映像ストリーム５ｂとして出力することも可能である。
【００６８】
（映像関連コンテンツ生成装置の動作：音声合成例）
次に、図１及び図９を参照して、映像関連コンテンツ生成装置１が映像信号に関連する音声を合成して出力する動作例について説明する。図９は、映像関連コンテンツ生成装置１が音声データを出力する動作を示すフローチャートである。
【００６９】
映像関連コンテンツ生成装置１は、一定の時間間隔又はカット点検出技術により得られるカット点のタイミングに基づいて、映像シーン解析手段２が、映像信号を解析し、映像特徴量を抽出する（ステップＳ３１）。
【００７０】
そして、音声合成手段６において、音声選択部６２が映像特徴量の値に応じて、映像シーンを表現する音声データ蓄積部６１に蓄積されている音声データ６１ａを選択し、その識別番号を音声出力部６３へ通知する（ステップＳ３２）。
その識別番号を通知された音声出力部６３は、識別番号に基づいて選択された音声データ６１ａを音声データ蓄積部６１から読み込んで、音声ファイル又は音声ストリームとして出力する（ステップＳ３３）。
以上のステップによって、映像信号からその映像内容に関連する音声データを、音声ファイル又は音声ストリームとして出力することができる。
【００７１】
【発明の効果】
以上説明したとおり、本発明に係る映像関連コンテンツ生成装置、映像関連コンテンツ生成方法及び映像関連コンテンツ生成プログラムでは、以下に示す優れた効果を奏する。
【００７２】
請求項１、請求項７又は請求項８に記載の発明によれば、入力された映像信号から、映像シーンの解析を行い、その映像の内容に関連する座標情報、音声情報、画像情報等の異種情報に変換した映像関連コンテンツを生成することができる。これによって、今まで音声認識による字幕作成データに限られていた自然入力データに基づく自動コンテンツ制作を、映像入力においても適用することが可能になる。
【００７３】
請求項２又は請求項３に記載の発明によれば、映像内の映像オブジェクトを検出し、映像特徴量として抽出することができる。そして、映像オブジェクト毎の位置情報や特徴量を、視覚化したテキスト、音声、画像等によって表現することができるため、映像のデータ量を削減したコンテンツを生成することが可能になる。また、ＷＷＷや携帯端末で使用可能なコンテンツを生成することができ、データのアクセシビリティを向上させることができる。
【００７４】
請求項４に記載の発明によれば、テンプレート化したＨＴＭＬ等のコンテンツ記述言語から、映像シーンの内容に関連した情報をコンテンツ記述言語として生成することができ、定型化したコンテンツをテンプレート化して準備しておくことで、コンテンツ制作の制作時間の短縮を行うことが可能になる。
【００７５】
請求項５に記載の発明によれば、映像オブジェクト毎の位置を、他の画像等によって表現することができるため、映像のデータ量を削減したコンテンツを生成することが可能になる。また、ＷＷＷや携帯端末で使用可能なコンテンツを生成することができ、データのアクセシビリティを向上させることができる。
【００７６】
請求項６に記載の発明によれば、映像シーンに適する音声を適宜出力することができるため、映像だけでは表現できない効果を演出することが可能になる。これによって、コンテンツ制作にかける労力を低減させることができる。
【図面の簡単な説明】
【図１】本発明の実施の形態に係る映像関連コンテンツ生成装置の全体構成を示すブロック図である。
【図２】コンテンツ記述言語のテンプレートの例を説明する説明図である。
【図３】文字列変換ルールの例を説明するための説明図である。
【図４】本発明の実施の形態に係るコンテンツ記述言語生成手段の動作例を説明するための説明図である。
【図５】本発明の実施の形態に係る画像合成手段の動作を模式的に示した模式図（その１）である。
【図６】本発明の実施の形態に係る画像合成手段の動作を模式的に示した模式図（その２）である。
【図７】本発明の実施の形態に係る映像からコンテンツ記述言語を生成する動作を示すフローチャートである。
【図８】本発明の実施の形態に係る映像から合成画像を生成する動作を示すフローチャートである。
【図９】本発明の実施の形態に係る映像から合成音声を生成する動作を示すフローチャートである。
【符号の説明】
１……映像関連コンテンツ生成装置
２……映像シーン解析手段
２１……映像オブジェクト位置検出部（映像オブジェクト位置検出手段）
２２……映像オブジェクト特徴量抽出部（映像オブジェクト特徴量抽出手段）
２３……映像シーン特徴量抽出部
３……コンテンツ生成手段
４……コンテンツ記述言語生成手段
４１……文字列変換データベース
４２……特徴量文字列変換部（文字列埋め込み手段）
４３……テンプレート文字列置換部（文字列埋め込み手段）
５……画像合成手段
５１……位置提示画像合成部
５２……画像出力部
６……音声合成手段
６１……音声データ蓄積部（音声データ蓄積手段）
６２……音声選択部（音声選択手段）
６３……音声出力部（音声出力手段）
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a video-related content generation device, a video-related content generation method, and a video-related content generation program that generate video-related content through which video can be supplied to media other than video media.
[0002]
[Prior art]
Conventionally, when content is produced from video (video content) and distributed for presentation on media other than the video media (television broadcasting and the like) that can present video content as it is, for example data broadcasting, the WWW (World Wide Web), or portable terminals, the video content for those other media is produced by cropping a region of a specific size from the original video or by converting the transmission frame rate or the like. With this approach, converting video content into content for non-video media amounts to converting it into content of the same kind (video), differing essentially only in resolution.
[0003]
Also, when content to be presented by teletext or the like is produced from audio (audio content), there is a conventional method of converting the original audio into character information by speech recognition to obtain text content. Thus, for audio content, conversion into a different kind of content, namely character information, is already performed.
[0004]
[Problems to be solved by the invention]
However, in the conventional techniques, conversion into content for different media is mainly video-to-video resolution conversion, and the content before and after the conversion is basically the same video content apart from the difference in resolution. Conversion into heterogeneous content, on the other hand, is mainly conversion from audio content into text content based on speech recognition. In other words, no method has been considered for converting video into content other than video content.
[0005]
For this reason, when producing content related to a piece of video content, such as text-based content described in a content description language used on the WWW or the like, audio content, or image content that is related to the video content but has different images, the video content cannot be reused and production has to start from scratch.
[0006]
The present invention has been made in view of the above problems, and its object is to provide a video-related content generation device, a video-related content generation method, and a video-related content generation program that make it possible to generate video-related content by converting a video (video content) into heterogeneous information related to the content of that video, such as coordinate information, audio information, and image information.
[0007]
[Means for Solving the Problems]
The present invention has been devised to achieve the above object. First, the video-related content generation device according to claim 1 is a device that generates, from a video signal, information related to the video content of that signal as video-related content, and comprises: video scene analysis means that analyzes the video signal to extract a video feature amount; and content generation means that converts the video feature amount extracted by the video scene analysis means into at least one of text data, image data, and audio data to generate the video-related content.
[0008]
According to this configuration, the video-related content generation device extracts the video feature amount from the video signal by the video scene analysis unit. Then, the content generation unit converts the video feature amount into at least one of text data, image data, and audio data to generate video-related content.
[0009]
Here, a video feature amount is a quantity that characterizes the frames constituting a video scene: for example, a numerical value of brightness (luminance value), color (color feature amount), motion (motion vector amount), texture, the position coordinates of a video object, or the number of video objects, or a statistic of such values.
[0010]
When the video feature amount is converted into text data, it is convenient to convert it into a text-based content description language, because the content can then be played back by a playback device for that language. Examples of such content description languages include HTML (HyperText Markup Language), VRML (Virtual Reality Modeling Language), BML (Broadcast Markup Language), and RealAudio metafiles.
[0011]
The video-related content generation device according to claim 2 is the device according to claim 1, wherein the video scene analysis means includes video object position detection means that detects, as a video feature amount, the position coordinates of a video object included in the video scene.
[0012]
According to this configuration, the video-related content generation device detects the position coordinates of the video object included in the video scene as the video feature amount by the video object position detection unit. The position coordinates may be a specific position (for example, upper left coordinates, center coordinates, and the like) of the video object, or may be barycenter coordinates of the video object.
[0013]
Further, the video-related content generation device according to claim 3 is the device according to claim 1 or 2, wherein the video scene analysis means includes video object feature amount extraction means that extracts, as a video feature amount, a feature amount characterizing a video object included in the video scene.
[0014]
According to this configuration, the video-related content generation device extracts the video feature amount characterizing the video object included in the video scene by the video object feature amount extraction unit. The video feature amount (video object feature amount) is a feature amount for each video object such as brightness (brightness value), color (color feature amount), motion (motion vector amount), texture, and the like.
[0015]
The video-related content generation device according to claim 4 is the device according to any one of claims 1 to 3, wherein the content generation means comprises: a character string conversion database that stores video feature amounts in association with feature amount character strings expressing those video feature amounts as character strings; and character string embedding means that, based on the video feature amount, embeds the feature amount character string into a text area of a content description language in which the text area for embedding the feature amount character string has been templated.
[0016]
According to this configuration, the character string embedding means refers to the character string conversion database and embeds the feature amount character string corresponding to the video feature amount into the text area of the templated content description language.
A predetermined replacement target character string corresponding to the video feature amount is written in the text area of the content description language, and the character string conversion database associates, for each type of video feature amount and each of its values, the replacement target character string with a feature amount character string (replacement character string). The replacement target character string in the text area can then easily be replaced with the feature amount character string.
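The placeholder mechanism described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the placeholders `<!--brightness-->` and `<!--number-->`, the "below 90" and "2 or more" rules, and the results for a mean luminance of 100 and an object count of 1 follow the embodiment (paragraphs [0037] to [0039] and [0059]), while the English replacement strings, the upper luminance boundary of 160, the "bright" and "there are none" rules, the `Rule` type, and the template text are assumptions made for the example.

```python
from typing import Callable, Dict, List, NamedTuple


class Rule(NamedTuple):
    feature: str                      # kind of video feature amount
    matches: Callable[[float], bool]  # condition on the feature value
    placeholder: str                  # replacement target character string
    replacement: str                  # feature amount (replacement) character string


# Hypothetical conversion rules standing in for the character string conversion database.
RULES: List[Rule] = [
    Rule("mean_luminance", lambda v: v < 90, "<!--brightness-->", "dark"),
    Rule("mean_luminance", lambda v: 90 <= v < 160, "<!--brightness-->", "dim"),    # 160 is assumed
    Rule("mean_luminance", lambda v: v >= 160, "<!--brightness-->", "bright"),      # assumed rule
    Rule("object_count", lambda v: v == 0, "<!--number-->", "there are none"),      # assumed rule
    Rule("object_count", lambda v: v == 1, "<!--number-->", "there is one"),
    Rule("object_count", lambda v: v >= 2, "<!--number-->", "there are many"),
]


def lookup(features: Dict[str, float]) -> Dict[str, str]:
    """Return {placeholder: replacement} for the first matching rule of each feature."""
    pairs: Dict[str, str] = {}
    for name, value in features.items():
        for rule in RULES:
            if rule.feature == name and rule.matches(value):
                pairs[rule.placeholder] = rule.replacement
                break
    return pairs


def fill_template(template: str, features: Dict[str, float]) -> str:
    """Replace every placeholder in the template with its replacement string."""
    out = template
    for placeholder, replacement in lookup(features).items():
        out = out.replace(placeholder, replacement)
    return out


# Hypothetical HTML template describing a room scene.
TEMPLATE = (
    "<html><body><p>The room is <!--brightness-->. "
    "Objects: <!--number-->.</p></body></html>"
)
```

Because the placeholders are HTML comments, a template that is left unfilled still renders as valid HTML, which may be one reason the embodiment chose this form.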
[0017]
Further, the video-related content generation device according to claim 5 is the device according to claim 2, wherein the content generation means includes image synthesis means that composites image data related to the video object at the position coordinates of the video object detected by the video object position detection means.
[0018]
According to this configuration, the image synthesis means composites image data at the position coordinates of the video object detected by the video object position detection means, so that the device generates content in which the position coordinates of the video object are visualized.
[0019]
Furthermore, the video-related content generation device according to claim 6 is the device according to any one of claims 1 to 5, wherein the content generation means comprises: audio data storage means that stores a plurality of audio data in association with video feature amounts; audio selection means that selects, based on the video feature amount, audio data stored in the audio data storage means; and audio output means that outputs the audio data selected by the audio selection means.
[0020]
According to this configuration, the audio selection means selects audio data, based on the video feature amount, from the audio data storage means in which a plurality of audio data are stored in association with video feature amounts.
Here, the audio data stored in the audio data storage means are associated with values of the video feature amount. For example, when the brightness of the video, given by luminance values or the like, is used as the video feature amount, cheerful music is associated with bright video. Alternatively, when the intensity of motion, given by the amount of movement of a video object, is used as the video feature amount, fast-tempo music can be associated with video in which the video object moves vigorously.
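As a rough sketch of this compositing step, the Python fragment below stamps an icon onto a plain canvas at each detected position. It is only an illustration under stated assumptions: the track of times 1 to 4 with positions (1,1), (1,2), (2,3), and (3,2) is taken from the Fig. 5 example in paragraph [0065], whereas the character-grid "image", the 5 by 5 canvas size, the icon characters, and the function names are hypothetical. The straight lines that the embodiment draws between icons are omitted for brevity.

```python
from typing import List, Tuple

Track = List[Tuple[int, Tuple[int, int]]]  # (time, (x, y)) entries


def blank_canvas(width: int, height: int, fill: str = ".") -> List[List[str]]:
    """Create a plain (empty) image as a grid of characters."""
    return [[fill] * width for _ in range(height)]


def composite_track(track: Track, width: int = 5, height: int = 5) -> List[str]:
    """Stamp every position onto one canvas; the latest position gets a distinct icon."""
    canvas = blank_canvas(width, height)
    ordered = sorted(track)  # order by time
    for i, (_time, (x, y)) in enumerate(ordered):
        canvas[y][x] = "@" if i == len(ordered) - 1 else "o"
    return ["".join(row) for row in canvas]


def composite_frames(track: Track, width: int = 5, height: int = 5) -> List[List[str]]:
    """Produce one canvas per time step, as for separate image files or a stream."""
    frames = []
    for _time, (x, y) in sorted(track):
        canvas = blank_canvas(width, height)
        canvas[y][x] = "@"
        frames.append(["".join(row) for row in canvas])
    return frames


# Track from the Fig. 5 example: times 1..4 and the corresponding positions.
TRACK: Track = [(1, (1, 1)), (2, (1, 2)), (3, (2, 3)), (4, (3, 2))]
```

`composite_track` corresponds to the single-image output of Fig. 5, and `composite_frames` to the per-time images of Fig. 6 that can be emitted as separate files or as a video stream.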
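The association between feature values and audio data could look like the following Python sketch. The two pairings (bright video with cheerful music, vigorous motion with fast-tempo music) come from the paragraph above; the file names, identification numbers, threshold values, and the rule that motion takes priority over brightness are assumptions for illustration only.

```python
from typing import Dict

# Hypothetical audio data store: identification number -> audio file name.
AUDIO_STORE: Dict[int, str] = {
    1: "calm.wav",
    2: "cheerful.wav",
    3: "fast_tempo.wav",
}


def select_audio_id(
    mean_luminance: float,
    motion_magnitude: float,
    bright_threshold: float = 160.0,  # assumed threshold on a 0..255 luminance scale
    motion_threshold: float = 8.0,    # assumed threshold on motion vector magnitude
) -> int:
    """Pick an identification number from the video feature amounts."""
    if motion_magnitude >= motion_threshold:
        return 3  # vigorous motion -> fast-tempo music
    if mean_luminance >= bright_threshold:
        return 2  # bright scene -> cheerful music
    return 1      # otherwise a neutral default (assumption)


def output_audio(audio_id: int) -> str:
    """Resolve an identification number to the stored audio data (here, a file name)."""
    return AUDIO_STORE[audio_id]
```

Passing only an identification number between selection and output mirrors the division of work between the audio selection unit and the audio output unit in the embodiment.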
[0021]
The video-related content generation method according to claim 7 is a method for generating, from a video signal, information related to the video content of that signal as video-related content, and comprises: a video scene analysis step of analyzing a video scene of the video signal to extract a video feature amount; a character string search step of searching a character string conversion database, which stores video feature amounts in association with feature amount character strings expressing them as character strings, for the feature amount character string corresponding to the video feature amount extracted in the video scene analysis step; a content description language input step of inputting a content description language in which the text area for embedding the feature amount character string has been templated; and a character string embedding step of embedding, based on the video feature amount, the feature amount character string found in the character string search step into the text area of the content description language.
[0022]
According to this method, the video scene analysis step extracts a video feature amount, which is a quantity characterizing the frames constituting the video scene, and the character string search step searches the character string conversion database, which stores video feature amounts in association with feature amount character strings expressing them as character strings, for the feature amount character string corresponding to the extracted video feature amount.
Then, the content description language input step inputs a content description language in which the text area for embedding the feature amount character string has been templated, and the character string embedding step embeds the feature amount character string into that text area to generate the video-related content.
[0023]
Further, the video-related content generation program according to claim 8 causes a computer, in order to generate from a video signal information related to the video content of that signal as video-related content, to function as: video scene analysis means that analyzes a video scene of the video signal to extract a video feature amount; and content generation means that converts the video feature amount extracted by the video scene analysis means into at least one of text data, image data, and audio data to generate the video-related content.
[0024]
According to this configuration, the video-related content generation program extracts, by the video scene analysis means, a video feature amount, which is a quantity characterizing the frames constituting the video scene, and converts, by the content generation means, the video feature amount into at least one of text data, image data, and audio data to generate the video-related content.
[0025]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
(Configuration of video-related content generation device)
FIG. 1 is a block diagram showing the configuration of a video-related content generation device according to the present invention. As shown in FIG. 1, the video-related content generation device 1 analyzes a video scene of an input video (video signal) to extract feature amounts of that scene (video feature amounts) and, based on the extracted video feature amounts, generates as video-related content: content in which information related to the video content is described in a content description language, an image file or video stream, or an audio file or audio stream.
[0026]
The video-related content generation device 1 comprises video scene analysis means 2 and content generation means 3, the latter including content description language generation means 4, image synthesis means 5, and audio synthesis means 6.
[0027]
The video scene analysis means 2 includes a video object position detection unit 21, a video object feature amount extraction unit 22, and a video scene feature amount extraction unit 23. It analyzes the input video signal and extracts video feature amounts.
[0028]
The video object position detection unit (video object position detection means) 21 detects the video objects included in a video scene. Here, the video object position detection unit 21 detects the position coordinates of the centroid of each video object within the frame and, at the same time, assigns a unique identifier to each video object. It outputs the position coordinates, the identifier, and the time at which the video object was detected to the content generation means 3 as video feature amounts.
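To make the notion of a centroid position concrete, the following Python sketch computes the centroid of a video object from a binary mask of one frame. How the mask is obtained is outside this sketch (the specification refers to separate object extraction techniques for that), and the mask representation and function name are assumptions for illustration.

```python
from typing import List, Optional, Tuple


def centroid(mask: List[List[int]]) -> Optional[Tuple[float, float]]:
    """Return the (x, y) centroid of the non-zero pixels of a binary object mask.

    Returns None when the mask contains no object pixels.
    """
    sum_x = 0
    sum_y = 0
    count = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None
    return (sum_x / count, sum_y / count)
```

The result, together with an identifier and the detection time, would form the record that the position detection unit passes on as a video feature amount.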
[0029]
It should be noted that the identifier may be assigned to the video object based on the position coordinates of the video object, for example, by assigning a serial number in the order displayed from the left on the screen. Alternatively, when the video object is a person, a person's name can be used as an identifier instead of a serial number by a general face recognition technique.
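The left-to-right numbering mentioned above can be sketched as follows. This is one possible reading, assuming that "from the left" means ascending x coordinate of each object's position; the function name and the 1-based serial numbers are choices made for the example.

```python
from typing import List, Tuple


def assign_serial_ids(positions: List[Tuple[float, float]]) -> List[int]:
    """Assign serial numbers 1, 2, 3, ... to objects in left-to-right order.

    The returned list is aligned with the input: ids[i] belongs to positions[i].
    """
    order = sorted(range(len(positions)), key=lambda i: positions[i][0])
    ids = [0] * len(positions)
    for serial, index in enumerate(order, start=1):
        ids[index] = serial
    return ids
```

Numbering that depends only on the current frame's positions can change when objects cross each other, which is presumably why the text also offers face recognition as a way to obtain stable identifiers for people.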
[0030]
The video object feature amount extraction unit (video object feature amount extraction means) 22 extracts a video object feature amount of the video object detected by the video object position detection unit 21 from the video scene. The video object feature amount is a feature amount for each video object such as brightness (brightness value), color (color feature amount), motion (motion vector amount), texture, shape parameter, and the like. The video object feature extraction unit 22 outputs the video object feature and an identifier unique to the video object to the content generation unit 3 as a video feature.
Note that the detection of a video object and the extraction of its feature amount can be realized using the technology disclosed by the present applicant as the "moving image object extraction device" (JP-A-2001-307104) or the "video object detection/tracking device" (Japanese Patent Application No. 2001-166525).
[0031]
The video scene feature extraction unit 23 extracts a video feature amount for each frame of the video scene. As the video feature amount of the frame, it outputs to the content generation unit 3 the features of the entire frame and information obtained by statistically analyzing the individual video object feature amounts extracted by the video object feature extraction unit 22.
For example, the video scene feature amount extraction unit 23 outputs, as video feature amounts, the average luminance value of the frame, obtained by averaging the luminance value of each pixel over the entire frame, the number of video objects in the frame, and the like.
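A minimal sketch of these two frame-level feature amounts is given below. It assumes the frame is available as rows of 8-bit luminance values and that the detected objects are given as a list of identifiers; the function and key names are hypothetical.

```python
from typing import Dict, List, Sequence


def frame_features(luma: Sequence[Sequence[int]], object_ids: List[int]) -> Dict[str, float]:
    """Return the average luminance of the frame and the number of video objects."""
    pixel_count = sum(len(row) for row in luma)
    if pixel_count == 0:
        raise ValueError("frame has no pixels")
    total = sum(sum(row) for row in luma)
    return {
        "average_luminance": total / pixel_count,  # mean over the entire frame
        "object_count": len(object_ids),           # number of detected video objects
    }


frame = [[90, 110], [100, 100]]  # toy 2x2 frame, luminance range 0..255
features = frame_features(frame, object_ids=[1])
```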
[0032]
The content generating means 3 includes a content description language generating means 4, an image synthesizing means 5, and an audio synthesizing means 6, and outputs information related to the video scene as video-related content, based on the video feature amount input from the video scene analyzing means 2.
[0033]
The content description language generation unit 4 includes a character string conversion database 41, a feature amount character string conversion unit 42, and a template character string replacement unit 43. It replaces the replacement target character strings in a content description language template (content description language template 44a) input from the outside with replacement character strings corresponding to the video feature amount extracted by the video scene analysis means 2, thereby generating content written in a content description language and related to the video scene.
[0034]
Examples of the content description language include HTML, VRML, BML, and RealAudio metafiles. HTML is used here as a representative example, but other content description languages can be handled with a similar configuration.
[0035]
The character string conversion database 41 is a database that stores conversion rules (character string conversion rules 41a) for expressing video feature amounts as character strings. It stores, in association with one another, the type and value of each video feature amount, the replacement target character string described in the content description language template 44a, and the replacement character string that replaces that replacement target character string.
[0036]
Here, the content description language template 44a and the character string conversion rule 41a will be described with reference to FIGS. 2 and 3. FIG. 2 shows a template (model) described in HTML as an example of the content description language template 44a, and FIG. 3 shows an example of the contents of the character string conversion rule 41a.
[0037]
As shown in FIG. 2, the content description language template 44a is a text file described in a content description language (here, HTML). The portions that depend on the contents of the input video scene are written as replacement target character strings, so that they can be replaced later. Here, an HTML template describing a "room scene" is used as an example. "<!-- brightness -->" is used as the replacement target character string 44b and marks the area where a character string is substituted based on the video feature amount related to the brightness of the video scene. Likewise, "<!-- number -->" is used as the replacement target character string 44c and marks the area where a character string is substituted based on the video feature amount related to the number of people in the video scene.
[0038]
In the character string conversion rule 41a of FIG. 3, two types of video feature amount are used: the average luminance value of the frame, obtained by averaging the luminance value of each pixel over the entire frame, and the number of video objects in the frame (number of objects). A replacement target character string and a replacement character string (feature amount character string) are associated with each value range of the video feature amount.
[0039]
In FIG. 3A, for example, when the luminance values of the pixels constituting the video are represented by 256 levels from 0 to 255 and the average luminance value, which is the value of the video feature amount, is "less than 90", the replacement target character string is "<!-- brightness -->" and the replacement character string is "dark". Thus, when the average luminance value is "less than 90", the replacement target character string 44b of the content description language template 44a (FIG. 2) is replaced with "dark". Similarly, when the number of objects is "2 or more", the replacement target character string 44c of the content description language template 44a is replaced with "many".
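A conversion rule table of this kind can be sketched as a list of (feature type, value condition, replacement target, replacement) entries. Only the "less than 90" and "2 or more" rows come from the description of FIG. 3A. The "dim" and "one" rows borrow the values used in the FIG. 4 example, and their exact ranges (in particular the upper bound of 170) are assumptions, as are all names in the code. The replacement targets are written as ordinary HTML comments.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    feature: str                      # type of video feature amount
    matches: Callable[[float], bool]  # condition on the feature value
    target: str                       # replacement target character string
    replacement: str                  # replacement (feature amount) character string


RULES: List[Rule] = [
    Rule("average_luminance", lambda v: v < 90, "<!-- brightness -->", "dark"),         # from the text
    Rule("average_luminance", lambda v: 90 <= v < 170, "<!-- brightness -->", "dim"),   # range assumed
    Rule("object_count", lambda v: v >= 2, "<!-- number -->", "many"),                   # from the text
    Rule("object_count", lambda v: v == 1, "<!-- number -->", "one"),                    # range assumed
]


def lookup(feature: str, value: float) -> Optional[Tuple[str, str]]:
    """Return (replacement target, replacement) for a feature value, or None if no rule matches."""
    for rule in RULES:
        if rule.feature == feature and rule.matches(value):
            return rule.target, rule.replacement
    return None
```

Keeping the conditions as data rather than code branches mirrors the idea of a database of rules that can be edited without touching the conversion logic.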
[0040]
FIG. 3B shows an example in which the replacement character string of FIG. 3A is specified not as a Japanese character string such as "dark" or "many" but as an HTML embedded image such as "<img src="1.png">". In this manner, in addition to a Japanese character string, the replacement character string may be a character string that embeds the file name of an image file, an audio file, a script, or the like in the content description language. Such scripts include JavaScript (registered trademark), ECMAScript, and the like.
Returning to FIG. 1, the description will be continued.
[0041]
The feature amount character string conversion unit (character string embedding unit) 42 refers to the character string conversion rule 41a of the character string conversion database 41 based on the video feature amount input from the video scene analysis unit 2, and notifies the template character string replacement unit 43 of the replacement target character string and the replacement character string corresponding to that video feature amount.
[0042]
The template character string replacement unit (character string embedding means) 43 receives the content description language template 44a input from the outside and the replacement target character string and replacement character string notified from the feature amount character string conversion unit 42. By replacing each replacement target character string described in the content description language template 44a with the corresponding replacement character string, it generates content in a content description language, such as an HTML file.
[0043]
When the position of the video object is to be expressed by the replacement character string, it can also be described using a VRML PositionInterpolator node in which the list of position coordinates at each time is set as the replacement character string.
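As a rough illustration, a time-stamped list of position coordinates could be turned into the text of a VRML PositionInterpolator node as follows. The `key` and `keyValue` fields are those of the VRML97 node. Normalising the times to the range 0 to 1, placing the 2D coordinates on the z = 0 plane, and the names `position_interpolator` and `ObjectPath` are assumptions made for this sketch.

```python
from typing import List, Tuple


def position_interpolator(name: str, samples: List[Tuple[float, float, float]]) -> str:
    """Build PositionInterpolator text from (time, x, y) samples sorted by time."""
    if len(samples) < 2 or samples[-1][0] <= samples[0][0]:
        raise ValueError("need at least two samples with increasing time")
    t0, t1 = samples[0][0], samples[-1][0]
    # Assumption: map the detection times linearly onto key fractions 0..1.
    key_text = ", ".join(f"{(t - t0) / (t1 - t0):.3f}" for t, _, _ in samples)
    # Assumption: the 2D position is placed on the z = 0 plane.
    value_text = ", ".join(f"{x:g} {y:g} 0" for _, x, y in samples)
    lines = [
        f"DEF {name} PositionInterpolator {{",
        f"  key [ {key_text} ]",
        f"  keyValue [ {value_text} ]",
        "}",
    ]
    return "\n".join(lines)


node = position_interpolator(
    "ObjectPath", [(1.0, 1.0, 1.0), (2.0, 1.0, 2.0), (3.0, 2.0, 3.0), (4.0, 3.0, 2.0)]
)
```

The resulting text could then be substituted into a VRML template in the same way as any other replacement character string.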
[0044]
The image synthesizing unit 5 includes a position presentation image synthesizing unit 51 and an image output unit 52. It synthesizes an image related to the video feature amount extracted by the video scene analyzing unit 2 and outputs the image as an image file or a video stream.
[0045]
The position presentation image synthesizing unit 51 receives, as the video feature amount, the position coordinates, the identifier, and the detection time of the video object from the video object position detection unit 21 of the video scene analysis unit 2, and synthesizes a position presentation image indicating where the video object identified by the identifier was located at that time. The synthesized image is output to the image output unit 52.
[0046]
For example, an icon image is read from an image storage unit (not shown) in which the icon image is stored in advance, and the icon image is synthesized at a position indicated by position coordinates on a plain image. Further, for example, the image of the video object extracted by the video object feature amount extraction unit 22 of the video scene analysis means 2 may be directly synthesized as an icon image.
[0047]
The image output unit 52 outputs the image synthesized by the position presentation image synthesizing unit 51 as an image file. When the images are input in time series from the position presentation image synthesizing section 51, the image output section 52 outputs the time series images as a video stream in which the video object changes with time.
[0048]
The audio synthesizing unit 6 includes an audio data storage unit 61, an audio selection unit 62, and an audio output unit 63. It outputs audio related to the video feature amount extracted by the video scene analysis unit 2 as an audio file or an audio stream.
[0049]
The audio data storage unit (audio data storage unit) 61 stores audio data 61a related to a video scene in advance in association with an identification number, and is configured by a hard disk or the like. The audio data 61a is audio data for expressing the video scene in relation to the video scene, and includes, for example, BGM (Back Ground Music), sound effects, and human voice.
Further, the audio data storage unit 61 holds a plurality of audio data 61a based on the video feature amount of the video scene. For example, audio data 61a expressing the level of “brightness” in association with a luminance value is stored as an audio file.
[0050]
The audio selection unit (audio selection means) 62 selects audio data 61a stored in the audio data storage unit 61 based on the video feature amount of the video scene input from the video scene analysis unit 2, and notifies the audio output unit 63 of the identification number of the selected audio data 61a.
Specifically, from the video feature amount of the video scene (for example, the average luminance value), the audio selection unit 62 determines which audio data 61a stored in the audio data storage unit 61 represents the video scene and notifies the audio output unit 63 of its identification number. For example, the audio selection unit 62 determines the level of "brightness" from the average luminance value and selects the audio data 61a corresponding to that "brightness". Alternatively, specific audio data 61a may be selected when, based on the position coordinates of the video object, the video object enters a preset area.
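The two selection strategies described above can be sketched as follows. The identification numbers, file names, the upper threshold of 170, and the rectangle representation of the preset area are all hypothetical; only the idea of mapping a brightness level or an area condition to an identification number comes from the text. The area check below only tests whether the object is currently inside the area, which is a simplification of detecting the moment of entry.

```python
from typing import Dict, Optional, Tuple

# Hypothetical audio data store: identification number -> audio file name.
AUDIO_DATA: Dict[int, str] = {1: "dark.wav", 2: "dim.wav", 3: "bright.wav", 10: "chime.wav"}

Region = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def select_by_brightness(average_luminance: float) -> int:
    """Map the brightness level to an identification number (thresholds assumed)."""
    if average_luminance < 90:
        return 1
    if average_luminance < 170:
        return 2
    return 3


def select_by_region(position: Tuple[float, float], region: Region, audio_id: int) -> Optional[int]:
    """Return audio_id while the object lies inside the preset area, else None."""
    x, y = position
    x_min, y_min, x_max, y_max = region
    inside = x_min <= x <= x_max and y_min <= y <= y_max
    return audio_id if inside else None


selected_file = AUDIO_DATA[select_by_brightness(100)]
```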
[0051]
The audio output unit (audio output means) 63 reads, from the audio data storage unit 61, the audio data 61a selected by the audio selection unit 62 and notified by its identification number, and outputs the read audio data as an audio file or an audio stream.
As described above, the content description language output (an HTML file or the like) from the content description language generation unit 4, the image file (or video stream) output from the image synthesis unit 5, and the audio file (or audio stream) output from the audio synthesis unit 6 may each be output individually, or several of them may be output together, as video-related content.
[0052]
The configuration of the video-related content generation device 1 has been described above based on one embodiment. Each means of the video-related content generation device 1 can also be realized as a functional program on a computer, and these functional programs can be combined to operate as a video-related content generation program.
[0053]
(Operation of the video-related content generation device: content description language generation example)
Next, the operation of the video-related content generation device 1 will be described.
First, an operation example in which the video-related content generation device 1 generates a content description language will be described with reference to FIGS. FIG. 7 is a flowchart showing an operation in which the video-related content generation device 1 generates a content description language.
[0054]
In the video-related content generation device 1, the video scene analysis unit 2 analyzes the input video signal and extracts a video feature amount (step S11).
[0055]
Then, in the content description language generating means 4, the feature amount character string conversion unit 42 searches the character string conversion rule 41a in the character string conversion database 41 for the replacement target character string and the replacement character string corresponding to the video feature amount extracted in step S11, and notifies the template character string replacement unit 43 of them (step S12).
[0056]
The template character string replacement unit 43, having been notified of the replacement target character string and the replacement character string, reads the content description language template 44a from the outside (step S13) and searches the content description language template 44a for the replacement target character string (step S14).
It then determines whether the replacement target character string exists (step S15). If it exists (Yes), the replacement target character string is replaced with the replacement character string (step S16), and the process returns to step S14 to search for further replacement target character strings.
[0057]
On the other hand, if no replacement target character string exists (No in step S15), all replacement target character strings are regarded as having been replaced; the content description language (HTML file or the like) with the replacement character strings substituted is output (step S17), and the operation ends.
Through the above steps, it is possible to generate a text-based content in which information related to the video content is described in a content description language from the video signal.
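The search-and-replace loop of steps S14 to S17 can be sketched as follows. The template text is a hypothetical stand-in, since the contents of FIG. 2 are not reproduced here, and the function name is illustrative. Resuming the search after each inserted replacement is a design choice of this sketch so that a replacement string containing its own target cannot cause an endless loop.

```python
from typing import List, Tuple


def fill_template(template: str, pairs: List[Tuple[str, str]]) -> str:
    """Replace every replacement target string in the template with its replacement."""
    content = template
    for target, replacement in pairs:
        if not target:
            continue  # an empty target would match everywhere
        index = content.find(target)  # S14: search for the replacement target
        while index != -1:            # S15: does the target exist?
            # S16: replace this occurrence, then search again after it.
            content = content[:index] + replacement + content[index + len(target):]
            index = content.find(target, index + len(replacement))
    return content                    # S17: output the generated content


# Hypothetical template; the markers follow the replacement targets named in the text.
TEMPLATE = "<html><body><p>Brightness: <!-- brightness -->. People: <!-- number -->.</p></body></html>"
PAIRS = [("<!-- brightness -->", "dim"), ("<!-- number -->", "one")]
html = fill_template(TEMPLATE, PAIRS)
```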
[0058]
Next, with reference to FIG. 4, a specific operation of generating the content description language will be described, centering on the content description language generating means 4. FIG. 4 is a conceptual diagram showing an example of generating an HTML file from the video feature amount.
[0059]
First, the content description language generating means 4 receives the video feature amount 2a. Here, assume that the video feature amount 2a consists of an average luminance value of "100" and a number of objects of "1". The content description language generating means 4 searches the character string conversion rule 41a (FIG. 3A) for the replacement target character string and the replacement character string corresponding to each value. For the average luminance value "100", it obtains the replacement target character string "<!-- brightness -->" and the replacement character string "dim"; for the number of objects "1", it obtains the replacement target character string "<!-- number -->" and the replacement character string "one".
[0060]
Then, the content description language generating unit 4 replaces the replacement target character strings "<!-- brightness -->" and "<!-- number -->" of the content description language template 44a (FIG. 2) input from the outside with "dim" and "there is one", respectively, thereby generating an HTML file 4a that describes the "scene of the room".
[0061]
(Operation of video-related content generation device: example of image composition)
Next, an operation example in which the video-related content generation device 1 combines and outputs an image related to a video signal will be described with reference to FIGS. 1 and 8. FIG. 8 is a flowchart showing an operation in which the video-related content generation device 1 generates a video stream in which a synthesized image is time-series.
[0062]
First, in the video-related content generation device 1, the video scene analysis unit 2 analyzes the input video signal and extracts the time of the video scene and the position coordinates of the video object as video feature amounts (step S21).
Then, in the image synthesizing unit 5, the position presentation image synthesizing unit 51 generates a synthesized image obtained by synthesizing the icon image with the position coordinates of the video object at a certain time in the video scene (step S22).
[0063]
It is then determined whether composite images have been generated for all times of the video scene (step S23). If generation is not yet complete (No), the process returns to step S22 to generate the composite image for the next time. When composite images have been generated for all times (Yes), the composite images generated for each time of the video scene are output as a video stream (step S24), and the operation ends.
Through the above steps, a video stream in which only the position of the video object is visualized can be generated from the video signal.
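The loop of steps S22 to S24 can be sketched as follows, using a small text grid in place of a real image so that the example stays self-contained. The grid size, the "@" icon, and the function names are assumptions; the sample track reuses the times and position coordinates of the FIG. 5 example.

```python
from typing import Dict, Iterator, List, Tuple


def compose_frame(position: Tuple[int, int], width: int, height: int, icon: str = "@") -> List[str]:
    """S22: place an icon at the object's position on a plain image (text grid)."""
    x, y = position
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("position lies outside the image")
    rows = [["."] * width for _ in range(height)]
    rows[y][x] = icon
    return ["".join(row) for row in rows]


def compose_stream(track: Dict[int, Tuple[int, int]], width: int, height: int) -> Iterator[List[str]]:
    """S23/S24: generate one composite image per time, in time order."""
    for time in sorted(track):
        yield compose_frame(track[time], width, height)


track = {1: (1, 1), 2: (1, 2), 3: (2, 3), 4: (3, 2)}  # time -> (x, y)
stream = list(compose_stream(track, width=5, height=5))
```

In a real implementation each frame would be a raster image and the sequence would be encoded as a video stream, but the control flow would be the same.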
[0064]
Next, a specific operation of generating an image file or a video stream from a video signal will be described with reference to FIGS. 5 and 6. FIG. 5 is a conceptual diagram showing an example in which the position of a video object that changes over time is visualized on a single image from the video feature amount to generate an image file. FIG. 6 is a conceptual diagram showing an example in which the position of a video object that changes over time is output from the video feature amount as separate image files or as a video stream.
[0065]
As shown in FIG. 5, the image synthesizing unit 5 first receives the video feature amount 2a. Here, the times are "1", "2", "3", and "4", and the position coordinates of the video object at those times are "(1, 1)", "(1, 2)", "(2, 3)", and "(3, 2)", respectively.
[0066]
The image synthesizing unit 5 composites the icon images C1, C2, C3, and C4 (only C4 uses a different icon image) at the position coordinates "(1, 1)", "(1, 2)", "(2, 3)", and "(3, 2)" of a plain image, respectively, and generates an image file 5a in which the icon images (C1 to C4) are connected by straight lines. This expresses that the video object has moved from the position coordinates of C1 to the position coordinates of C4.
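One way to sketch such a single trajectory image is to emit it as SVG text, with a polyline for the connecting straight lines and one circle per icon. Using SVG, the scale factor, the colours, and the function name are all assumptions of this sketch; the specification does not name an image format. Note also that SVG's y axis points downward, which may differ from the coordinate convention of the figure.

```python
from typing import List, Tuple


def trajectory_svg(points: List[Tuple[float, float]], scale: float = 40.0, size: int = 200) -> str:
    """Draw icons at each position and connect them with straight lines."""
    if not points:
        raise ValueError("at least one position is required")
    scaled = [(x * scale, y * scale) for x, y in points]
    path = " ".join(f"{x:g},{y:g}" for x, y in scaled)
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">',
        f'<rect width="{size}" height="{size}" fill="white"/>',       # plain background image
        f'<polyline points="{path}" fill="none" stroke="black"/>',    # straight connecting lines
    ]
    for i, (x, y) in enumerate(scaled):
        # Only the last position uses a different icon, here a different colour.
        fill = "red" if i == len(scaled) - 1 else "gray"
        parts.append(f'<circle cx="{x:g}" cy="{y:g}" r="6" fill="{fill}"/>')
    parts.append("</svg>")
    return "\n".join(parts)


svg = trajectory_svg([(1, 1), (1, 2), (2, 3), (3, 2)])
```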
[0067]
In the example shown in FIG. 6, the same video feature amount 2a as in FIG. 5 is input. The difference is that the output of the image synthesizing means 5 is not a single image file but a plurality of image files or a video stream.
In FIG. 6, the image synthesizing unit 5 composites the icon image C4 at each of the position coordinates "(1, 1)", "(1, 2)", "(2, 3)", and "(3, 2)" to generate four images. The image synthesizing means 5 can output these four images as individual image files 5a, or can output them as a video stream 5b in the form of continuous stream data.
[0068]
(Operation of the video-related content generation device: voice synthesis example)
Next, an operation example in which the video-related content generation device 1 synthesizes and outputs audio related to a video signal will be described with reference to FIGS. 1 and 9. FIG. 9 is a flowchart showing an operation in which the video-related content generation device 1 outputs audio data.
[0069]
In the video-related content generation device 1, the video scene analysis means 2 analyzes a video signal and extracts a video feature amount at fixed time intervals or at the timing of cut points obtained by a cut point detection technique (step S31).
[0070]
Then, in the voice synthesizing unit 6, the voice selecting unit 62 selects, according to the value of the video feature amount, the voice data 61a stored in the voice data storage unit 61 that represents the video scene, and notifies the voice output unit 63 of its identification number (step S32).
The audio output unit 63 notified of the identification number reads the audio data 61a selected based on the identification number from the audio data storage unit 61 and outputs it as an audio file or an audio stream (step S33).
Through the above steps, audio data relating to the video content can be output from the video signal as an audio file or an audio stream.
[0071]
[Effect of the invention]
As described above, the video-related content generation device, the video-related content generation method, and the video-related content generation program according to the present invention have the following excellent effects.
[0072]
According to the first, seventh, or eighth aspect of the present invention, a video scene is analyzed from an input video signal, and video-related content can be generated by converting the video into heterogeneous information such as coordinate information, audio information, and image information related to the content of the video. As a result, automatic content creation from natural input data, which has so far been limited to the creation of caption data by speech recognition, can be extended to video input.
[0073]
According to the second or third aspect, a video object in a video can be detected and extracted as a video feature amount. Then, since the position information and the feature amount of each video object can be expressed by visualized text, sound, image, and the like, it is possible to generate a content with a reduced video data amount. Further, it is possible to generate contents usable on the WWW and the portable terminal, and it is possible to improve data accessibility.
[0074]
According to the fourth aspect of the present invention, information related to the contents of a video scene can be generated as content in a content description language from a templated content description language such as HTML. Preparing templated content in advance therefore makes it possible to shorten content production time.
[0075]
According to the fifth aspect of the present invention, since the position of each video object can be represented by another image or the like, it is possible to generate content with a reduced video data amount. Further, it is possible to generate contents usable on the WWW and the portable terminal, and it is possible to improve data accessibility.
[0076]
According to the sixth aspect of the present invention, it is possible to appropriately output a sound suitable for a video scene, so that it is possible to produce an effect that cannot be expressed only with a video. This can reduce the labor required for content production.
[Brief description of the drawings]
FIG. 1 is a block diagram showing an overall configuration of a video-related content generation device according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram illustrating an example of a template of a content description language.
FIG. 3 is an explanatory diagram for describing an example of a character string conversion rule.
FIG. 4 is an explanatory diagram illustrating an operation example of a content description language generating unit according to the embodiment of the present invention.
FIG. 5 is a schematic diagram (part 1) schematically showing an operation of the image combining means according to the embodiment of the present invention.
FIG. 6 is a schematic diagram (part 2) schematically showing the operation of the image combining means according to the embodiment of the present invention.
FIG. 7 is a flowchart showing an operation of generating a content description language from a video according to the embodiment of the present invention.
FIG. 8 is a flowchart showing an operation of generating a composite image from a video according to the embodiment of the present invention.
FIG. 9 is a flowchart illustrating an operation of generating a synthesized voice from a video according to the embodiment of the present invention.
[Explanation of symbols]
1 ... Video-related content generation device
2 ... Video scene analysis means
21 ... Video object position detection unit (video object position detection means)
22 ... Video object feature amount extraction unit (video object feature amount extraction means)
23 ... Video scene feature amount extraction unit
3 ... Content generation means
4 ... Content description language generation means
41 ... Character string conversion database
42 ... Feature amount character string conversion unit (character string embedding means)
43 ... Template character string replacement unit (character string embedding means)
5 ... Image synthesizing means
51 ... Position presentation image synthesizing unit
52 ... Image output unit
6 ... Audio synthesizing means
61 ... Audio data storage unit (audio data storage means)
62 ... Audio selection unit (audio selection means)
63 ... Audio output unit (audio output means)

Claims

A video-related content generation device that generates, from a video signal, information related to the video content of the video signal as video-related content,
A video scene analysis unit that analyzes the video signal and extracts a video feature amount;
Content generation means for converting the video feature amount extracted by the video scene analysis means into at least one of text data, image data, and audio data to generate the video-related content;
A video-related content generation device, comprising:

The video scene analysis means,
Video object position detection means for detecting the position coordinates of the video object included in the video scene as the video feature value;
The video-related content generation device according to claim 1, further comprising:

The video scene analysis means,
A video object feature amount extraction unit for extracting a feature amount characterizing a video object included in the video scene as the video feature amount;
The video-related content generation device according to claim 1 or 2, further comprising:

The content generation means,
A character string conversion database that stores the video feature amount in association with a feature amount character string expressing the video feature amount as a character string,
Character string embedding means for embedding, based on the video feature amount, the feature amount character string in a text area of a content description language in which the text area for embedding the feature amount character string is templated;
The video-related content generation device according to any one of claims 1 to 3, further comprising:

The content generation means,
Image synthesizing means for synthesizing image data related to the video object with position coordinates of the video object detected by the video object position detection means,
The video-related content generation apparatus according to claim 2, comprising:

The content generation means,
Audio data storage means for storing a plurality of audio data in association with the video feature amount;
Audio selection means for selecting the audio data stored in the audio data storage means, based on the video feature quantity;
Audio output means for outputting the audio data selected by the audio selection means,
The video-related content generation device according to any one of claims 1 to 5, further comprising:

A video-related content generation method for generating, from a video signal, information related to the video content of the video signal as video-related content,
A video scene analysis step of analyzing a video scene of the video signal and extracting a video feature amount,
A character string search step of searching a character string conversion database, which stores the video feature amount in association with a feature amount character string expressing the video feature amount as a character string, for the feature amount character string corresponding to the video feature amount extracted in the video scene analysis step;
A content description language input step of inputting a content description language, in which a text area in which the feature amount character string is embedded is templated,
A character string embedding step of embedding the feature amount character string searched in the character string search step in the text area of the content description language based on the video feature amount;
A video-related content generation method, comprising:

A video-related content generation program for generating, from a video signal, information related to the video content of the video signal as video-related content, the program causing a computer to function as:
A video scene analyzing unit that analyzes a video scene of the video signal and extracts a video feature amount; and
Content generation means for converting the video feature amount extracted by the video scene analysis means into at least one of text data, image data, and audio data to generate the video-related content.