JP2004152063A

JP2004152063A - Structuring method, structuring device and structuring program of multimedia contents, and providing method thereof

Info

Publication number: JP2004152063A
Application number: JP2002317244A
Authority: JP
Inventors: Makoto Iwata; 真琴岩田; Naohiro Takeda; 直博竹田; Satoshi Nakazawa; 聡中澤; Riyouma Ooami; 亮磨大網
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2002-10-31
Filing date: 2002-10-31
Publication date: 2004-05-27

Abstract

PROBLEM TO BE SOLVED: To provide a means that makes related image media easy to retrieve and display by automatically structuring the image media in correspondence with text information.

SOLUTION: A voice recognition means 8 divides the voice information in an image media into sentences, performs voice recognition processing, and generates voice recognition text information. A time code generation means 9 generates time codes, which give the start and end times of each sentence. A mapping means 10 divides the text information into sentences based on the dividing positions of the voice recognition text information, and associates the voice recognition text information with the text information sentence by sentence. A structuring means 11 associates the time codes with the text information sentence by sentence. A data storage means 12 stores the image media, the time codes, and the text information. A retrieval control means 17 fetches the image media and text information that match a retrieval condition entered by a user 4, and outputs them to an input and output means 16 so that the related parts can be displayed in conjunction with each other by use of the time codes.

COPYRIGHT: (C)2004, JPO

Description

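[0001]
[Technical Field of the Invention]
The present invention relates to a multimedia content structuring method, a multimedia content structuring apparatus, and a multimedia content structuring program for associating related video information and text information, and to a multimedia content providing method for retrieving and outputting the structured, stored content in response to a user's request.
[0002]
[Prior Art]
In recent years, information has increasingly taken multimedia form, and the volume of multimedia content containing video information, audio information, text information, and the like has grown rapidly. Such information can be used more effectively by storing it and recalling it later as needed. To make effective use of several types of stored information, the pieces of information must be associated with one another.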
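[0003]
One example of a system that associates several types of information is an electronic text creation system that generates an optimal third text from a first text, obtained by character recognition of a manuscript, and a second text, obtained by speech recognition of audio information, and thereby obtains text information as digital data (see, for example, Patent Document 1).
[0004]
There is also a system that performs speech recognition on the audio information attached to video information and structures the resulting text information together with the video information (see, for example, Patent Document 2).
[0005]
Further, there is a multimedia information processing apparatus that digitizes video information, audio information, text information, and the like, and stores that information together with time information indicating the temporal relationship among them (see, for example, Patent Document 3).
[0006]
[Patent Document 1]
JP 2001-282779 A (pages 3-8, FIG. 1)
[Patent Document 2]
JP 2002-189728 A (page 3)
[Patent Document 3]
JP H08-253209 A (pages 3-8, FIG. 5)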
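[0007]
[Problems to Be Solved by the Invention]
The system described in Patent Document 1 matches the manuscript against the text produced by speech recognition, so it can automatically create more accurate text information. However, Patent Document 1 discloses nothing about association with other information such as video information. Applying the technique of Patent Document 1 to the structuring of video information, audio information, and text information therefore requires the structuring to be performed manually.
[0008]
The system described in Patent Document 2 can structure video information and the audio information attached to it, but Patent Document 2 discloses nothing about additionally structuring text information.
[0009]
The multimedia information processing apparatus described in Patent Document 3 structures video information, audio information, text information, and the like using time information recorded at the time of input, but that time information must be added when the video, audio, and text information are created. When no time information has been added, the text information, video information, and audio information cannot be associated automatically, and the association requires manual work.
[0010]
Accordingly, an object of the present invention is to provide a multimedia content structuring method, a multimedia content structuring apparatus, and a multimedia content structuring program that can automatically associate related video information and text information, and a multimedia content providing method that can readily provide desired information in response to a user's request.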
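[0011]
[Means for Solving the Problems]
A multimedia content structuring method according to the present invention receives text information together with audio information and video information corresponding to that text information, and associates the text information with the video information. The method is characterized by: performing speech recognition on the audio information corresponding to the text information to generate speech recognition text information; comparing the speech recognition text information with the text information; dividing the speech recognition text information into divided parts, such as paragraphs, based on delimiter positions in the text information given by, for example, information indicating paragraphs; generating a time code indicating the start time and end time of each divided part; and structuring the generated time codes and the text information by associating them for each predetermined divided part. Because the audio information and the video information are related, associating the text information with time codes generated from the audio information (specifically, from the speech recognition text information) in effect associates the text information with the video information.
[0012]
The multimedia content structuring method may be configured to store the text information and the video information in storage means and to store the time code of each divided part in time code storage means. With this configuration, desired video information can be retrieved using the text information and the time codes stored in the time code storage means.
[0013]
The multimedia content structuring method may be configured to store the video information in storage means and to describe the time codes, the text information, and information indicating the storage location of the video information in XML. With this configuration, the video information and the text information can be structured in XML.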
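[0014]
A multimedia content structuring apparatus according to the present invention receives text information together with audio information and video information corresponding to that text information, and associates the text information with the video information. The apparatus is characterized by comprising: speech recognition means that performs speech recognition on the audio information corresponding to the text information to generate speech recognition text information; time code generation means that generates information for generating time codes indicating the start time and end time of each divided part obtained when the speech recognition text information is divided into predetermined divided parts; mapping means that compares the speech recognition text information with the text information, divides the speech recognition text information into divided parts based on the delimiter positions of the text information, and associates each divided part delimited in the text information with the corresponding divided part of the speech recognition text information; and structuring means that associates, for each predetermined divided part, the time code corresponding to the divided part of the speech recognition text information with the text information.
[0015]
The multimedia content structuring apparatus may comprise storage means that stores the text information and the video information, and time code storage means that stores the time code of each divided part. With this configuration, desired video information can be retrieved using the text information and the time codes stored in the time code storage means.
[0016]
The multimedia content structuring apparatus may comprise storage means that stores the video information, with the structuring means configured to describe the time codes, the text information, and information indicating the storage location of the video information in XML and to store the description in XML file storage means. With this configuration, the video information and the text information can be structured in XML.
[0017]
The multimedia content structuring apparatus may comprise search control means that identifies a divided part of the text information matching a search condition entered by a user and extracts the time code corresponding to the identified divided part, and synchronization means that identifies the portion of the video information corresponding to the extracted time code and provides the identified video information to the user together with the text information matching the search condition. With this configuration, the video information the user wants can be provided in response to the user's request.
[0018]
The synchronization means may be configured to provide the text information corresponding to the video information to the user in step with the video information. With this configuration, the text information and video information the user wants can be provided in an easily viewable form in response to the user's request.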
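[0019]
A multimedia content structuring program according to the present invention causes a computer to execute: a process of performing speech recognition on the audio information corresponding to text information to generate speech recognition text information; a process of comparing the speech recognition text information with the text information and dividing the speech recognition text information into divided parts based on the delimiter positions of the text information; a process of generating a time code indicating the start time and end time of each divided part; and a process of structuring the generated time codes and the text information by associating them for each predetermined divided part.
[0020]
A multimedia content providing method according to the present invention provides a user with the text information the user requests and with the video information corresponding to that text information. The method is characterized by: performing speech recognition on the audio information corresponding to the text information to generate speech recognition text information; comparing the speech recognition text information with the text information; dividing the speech recognition text information into divided parts based on the delimiter positions of the text information; generating a time code indicating the start time and end time of each divided part; structuring the generated time codes and the text information by associating them for each predetermined divided part; storing the text information, the video information, and the time codes; identifying the divided part of the text information that matches a search condition entered by the user; extracting the time code corresponding to the identified divided part; identifying the portion of the video information corresponding to the extracted time code; and providing the identified video information to the user together with the text information matching the search condition.
[0021]
The multimedia content providing method according to the present invention may be configured so that, when video information is provided to the user, the text information corresponding to the video information is provided to the user in step with the video information. With this configuration, the text information and video information the user wants can be provided in an easily viewable form in response to the user's request.
[0022]
A multimedia content providing program according to the present invention causes a computer to execute: a process of performing speech recognition on the audio information corresponding to text information to generate speech recognition text information; a process of comparing the speech recognition text information with the text information and dividing the speech recognition text information into divided parts based on the delimiter positions of the text information; a process of generating a time code indicating the start time and end time of each divided part; a process of structuring the generated time codes and the text information by associating them for each predetermined divided part; a process of storing the text information, the video information, and the time codes; a process of identifying the divided part of the text information that matches a search condition entered by a user; a process of extracting the time code corresponding to the identified divided part; and a process of identifying the portion of the video information corresponding to the extracted time code and providing the identified video information to the user together with the text information matching the search condition.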
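[0023]
[Embodiments of the Invention]
Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing a multimedia content structuring system according to a first embodiment of the present invention. In the system shown in FIG. 1, a content holder 1 owns video media on which video information is recorded and text media containing a document related to the video information. Both video information and audio information are recorded on the video media.
[0024]
A news program in television broadcasting is one example of multimedia content that includes text information, audio information, and video information. For a news program, the news manuscript is the text media, the speech read aloud by the announcer is the audio information, and the video information is footage related to that speech. An operator 2 receives the video media and the text media from the content holder 1 and inputs them into the multimedia content apparatus.
[0025]
The multimedia content apparatus includes synchronized data generation means 3 and synchronized data utilization means 5. The synchronized data generation means 3 comprises video input means 6, text input means 7, speech recognition means 8, time code generation means 9, mapping means 10, structuring means 11, and data storage means 12. The data storage means 12 comprises video media storage means 13, text media storage means 14, and time code storage means 15. The synchronized data utilization means 5 comprises input/output means 16, search control means 17, and synchronization means 18. The input/output means 16 shown in FIG. 1 need not be part of the multimedia content apparatus; it may be, for example, input/output means such as the keyboard of a personal computer owned by a user 4.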
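[0026]
When video media is input by the operator 2 and the video media is a digital medium, the video input means 6 outputs the video information and audio information recorded on it to the speech recognition means 8. When the video media is an analog medium, the video input means 6 used is one that includes, for example, a video capture device and can convert analog video information and analog audio information into digital information in AVI or MPEG format. In that case, the video input means 6 digitizes the video information and audio information recorded on the video media and outputs them to the speech recognition means 8. The speech recognition means 8 performs speech recognition on the input audio information and generates speech recognition text information, which is the text produced by the recognition. The speech recognition means 8 also divides the audio information into individual words.
[0027]
The time code generation means 9 outputs word time codes, which give the start time and end time of each word in the audio information, to the speech recognition means 8. The speech recognition means 8 attaches the word time codes to the speech recognition text information and outputs it to the mapping means 10 together with the video information. In this embodiment, therefore, the time code generation means 9 generates word time codes as the information for generating time codes that indicate the start time and end time of each divided part when the speech recognition text information is divided into a plurality of predetermined divided parts (for example, a plurality of paragraphs).
[0028]
The text input means 7 receives the text media from the operator 2. When the text media is a digital medium, the text input means 7 outputs the text information recorded on it to the mapping means 10. When the text media is an analog medium, the text input means 7 is configured to include, for example, an OCR (optical character reader) device. It then digitizes the text information recorded on the text media and outputs the digitized text information to the mapping means 10.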
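[0029]
The mapping means 10 divides the text information into divided parts at suitable intervals. Here, for example, it detects paragraphs, which are groups of sentences, by detecting line breaks, indentation, and the like, and divides the text information using the line-break positions and similar points as delimiter positions. The mapping means 10 then compares the text information with the speech recognition text information, divides the speech recognition text information based on the delimiter positions in the text information, and generates paragraph speech recognition text information. It further generates correspondence information, which indicates the one-to-one correspondence between each paragraph of the text information (the delimited text information) and each piece of paragraph speech recognition text information. It then outputs the correspondence information to the structuring means 11 together with the paragraph speech recognition text information and the text information.
[0030]
The structuring means 11 calculates each time code, that is, the start time and end time of each paragraph, from the word time codes attached to the words in the paragraph speech recognition text information. Based on the time codes and the correspondence information, the structuring means 11 also generates structured information, which indicates the one-to-one correspondence between each paragraph of the text information and each time code.
[0031]
Based on the structured information, the structuring means 11 generates a text media file that holds the text information and a time code file that holds the time codes, and also generates a video media file that holds the video information.
[0032]
For example, the text media file stores the text information of each paragraph in paragraph order, and the time code file stores the time code for each paragraph in the same order as the paragraphs of the text information. Alternatively, a plurality of text media files and time code files, each corresponding to one paragraph, may be generated.
[0033]
Next, the structuring means 11 outputs the video media file, the text media file, and the time code file to the data storage means 12. Within the data storage means 12, the video media storage means 13 stores the video media file, the text media storage means 14 stores the text media file, and the time code storage means 15 stores the time code file. In this example, the storage means that stores the text information and the video information corresponds to the video media storage means 13 and the text media storage means 14, and the time code storage means that stores the time code of each divided part corresponds to the time code storage means 15.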
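[0034]
The video information may be supplied from the video input means 6 to the data storage means 12 via the speech recognition means 8, the mapping means 10, and the structuring means 11, or it may be supplied directly from the video input means 6 to the data storage means 12.
[0035]
The structuring means 11 may also generate, as management information, the start and end addresses of each paragraph of text information in the text media file and the start and end addresses of each paragraph's time code in the time code file. In that case, as shown in FIG. 2, the data storage means 12 of the multimedia content apparatus includes management file storage means 23 for storing the management information. The structuring means 11 outputs the management information to the management file storage means 23, which stores it.
[0036]
The structuring means 11 may also generate structured text media containing time codes, in which the text information and the time codes are combined as shown in FIG. 5. In that case, the multimedia content apparatus has no time code storage means 15, and the data storage means 12 has the configuration shown in FIG. 3.
[0037]
Furthermore, the structuring means 11 may generate an XML file containing a structural description in MPEG-7 (Moving Picture Experts Group 7) format written in XML (Extensible Markup Language), as shown in FIG. 6. When an XML file is generated, the data storage means 12 does not include the text media storage means 14 or the time code storage means 15 and instead includes XML file storage means 24, as shown in FIG. 4. The structuring means 11 outputs the XML file to the XML file storage means 24, which stores it. In this example, the storage means in which the video information is stored corresponds to the video media storage means 13, and the XML file storage means, which stores the XML describing the time codes, the text information, and the storage location of the video information, corresponds to the XML file storage means 24.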
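[0038]
When requesting desired video information and text information, the user 4 enters a keyword phrase into the synchronized data utilization means 5. The input/output means 16 then outputs the phrase entered by the user 4 to the search control means 17. The search control means 17 searches the text media storage means 14 (in the configurations shown in FIGS. 2 and 3) or the XML file storage means 24 (in the configuration shown in FIG. 4) for paragraphs of text information containing the phrase, and outputs the matching paragraphs to the input/output means 16. When the user 4 then selects a paragraph of text information, the input/output means 16 requests the search control means 17 to output the video information synchronized with the selected text information.
[0039]
The search control means 17 retrieves the video information from the video media storage means 13, extracts the time code of the paragraph selected by the user 4 from the time code storage means 15 (in the configuration shown in FIG. 2), the text media storage means 14 (in the configuration shown in FIG. 3), or the XML file storage means 24 (in the configuration shown in FIG. 4), and outputs them to the synchronization means 18. The synchronization means 18 outputs the video information to the input/output means 16, using the start time indicated by the time code as the first time of the output and the end time indicated by the time code as the last time. The synchronization means 18 also processes the text information based on the time code and outputs it to the input/output means 16. One example of such processing is scrolling the text information.
[0040]
The synchronized data generation means 3 and the synchronized data utilization means 5 can be implemented by a computer system. The input/output means 16, however, corresponds to input/output means such as the keyboard and display of a microcomputer or the like on the user side. When the synchronized data generation means 3 and the synchronized data utilization means 5 (excluding the input/output means) are implemented by a computer system, the speech recognition means 8, the time code generation means 9, the mapping means 10, the structuring means 11, the search control means 17, and the synchronization means 18 are implemented in software. The data storage means 12 is implemented by a storage medium such as a magnetic disk in the computer system.
[0041]
Specifically, the software installed on the computer system includes a program that executes: a process of performing speech recognition on the audio information corresponding to the text information to generate speech recognition text information; a process of comparing the speech recognition text information with the text information and dividing the speech recognition text information into divided parts based on the delimiter positions of the text information; a process of generating a time code indicating the start time and end time of each divided part; a process of structuring the text information and the video information by associating them, through the generated time codes, for each predetermined divided part; and a process of storing the text information, the video information, and the time codes. The program also executes: a process of identifying the divided part of the text information that matches a search condition entered by the user; a process of extracting the time code corresponding to the identified divided part; and a process of identifying the video information corresponding to the extracted time code and providing the identified video information to the user together with the text information matching the search condition.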
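[0042]
Next, the operation of this embodiment is described with reference to the drawings. FIGS. 7, 9, and 10 are flowcharts explaining the operation of this embodiment. FIGS. 11 and 12 are explanatory diagrams showing display examples when the input/output means 16 inputs and outputs information.
[0043]
Taking a television news program as an example, the content holder 1 corresponds to the production company of the news program, and the operator 2 corresponds to an administrative department within the production company or to a management company commissioned to distribute the news. The user 4 requests information from the synchronized data utilization means 5 via, for example, a Web browser on a personal computer and the Internet, and receives the desired information from the synchronized data utilization means 5 via the Internet and the Web browser.
[0044]
First, the steps up to storing the video media file, the time code file, the text media file, the management file, or the XML file in the data storage means 12 are described with reference to FIG. 7. Suppose that video media containing the video information (including audio information) of a news program, and text media containing the manuscript read by the announcer in that program, have been sent from the content holder 1 to the operator 2. The audio information recorded on the video media is the voice of the announcer reading the manuscript. In other words, the text information recorded on the text media and the audio information recorded on the video media have the same content. The operator 2 inputs the video media and the text media into the synchronized data generation means 3.
[0045]
The synchronized data generation means 3 receives the video media and the text media from the operator 2 (step S101). The text media is received by the text input means 7, which, when the text media is a digital medium, outputs the text information recorded on it to the mapping means 10. When the text media is not a digital medium, the text input means 7 digitizes the text information recorded on it using OCR or the like (steps S102, S103) and outputs it to the mapping means 10. The mapping means 10 stores the text information temporarily. Specifically, the text information is held in RAM or the like in the computer system.
[0046]
The video media is input to the video input means 6. When the video media is a digital medium, the video input means 6 outputs the video information (including audio information) recorded on it to the speech recognition means 8. When it is not a digital medium, the video input means 6 digitizes the video information (including audio information) recorded on it (steps S104, S105) and outputs it to the speech recognition means 8.
[0047]
The speech recognition means 8 performs speech recognition on the audio information recorded on the audio track of the video media and generates speech recognition text information (step S106). It then performs word segmentation on the speech recognition text information and divides the audio information into individual words. The speech recognition means 8 outputs the video information to the time code generation means 9, together with information indicating the division positions of the audio information. Based on the playback start timing of the video information and audio information, the elapsed time from that start timing, and the division-position information from the speech recognition means 8, the time code generation means 9 generates, for each word, a word time code giving the start time and end time of that word's audio, and outputs it to the speech recognition means 8. The speech recognition means 8 assigns each word time code to its word. It then attaches the word time codes to the speech recognition text information and outputs it to the mapping means 10 together with the video information (steps S107, S108).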
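[0048]
The mapping means 10 detects paragraphs, which are groups of sentences, in the text information by means of, for example, line breaks and indentation, and divides the text information into paragraphs. It also compares the text information with the speech recognition text information. Then, based on the paragraph delimiter positions in the text information, it divides the speech recognition text information into paragraphs and generates paragraph speech recognition text information. It further generates correspondence information, which indicates the one-to-one correspondence between each paragraph of the text information and the paragraph speech recognition text information (step S109), and outputs the correspondence information to the structuring means 11 together with the paragraph speech recognition text information and the text information.
[0049]
FIG. 8 shows an example of the association performed by the mapping means 10. In FIG. 8, the words "こんばんは" (good evening), "７時" (seven o'clock), "の", "ニュース" (news), and "です" in the news manuscript 500, which is the text information, are compared with the words "こんばんは", "一次", "の", "ニュース", and "で" in the speech recognition text information within the speech recognition result 510. Words are compared one after another in this way. By using DP matching as the comparison technique, the delimiter positions in the speech recognition result 510 that correspond to the delimiter positions in the news manuscript 500 can be detected from the degree to which the words match, even where individual words have been misrecognized.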
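The DP matching step can be illustrated with a minimal sketch. It assumes that "DP matching" is realized as a word-level edit-distance alignment with unit costs, which the description does not specify; the function names `align_words` and `split_recognized` are hypothetical. The recognized words are split wherever the aligned manuscript paragraph ends.

```python
def align_words(manuscript, recognized):
    """Edit-distance DP alignment of two word lists.

    Returns a list giving, for each manuscript word, the index of the
    recognized word aligned to it, or None if the word was skipped.
    Unit costs are an assumption made for illustration.
    """
    n, m = len(manuscript), len(recognized)
    # cost[i][j]: minimum cost of aligning manuscript[:i] with recognized[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if manuscript[i - 1] == recognized[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,   # match or substitution
                             cost[i - 1][j] + 1,         # manuscript word not recognized
                             cost[i][j - 1] + 1)         # extra recognized word
    # Trace back from the end to recover which words were paired.
    mapping = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        sub = 0 if manuscript[i - 1] == recognized[j - 1] else 1
        if cost[i][j] == cost[i - 1][j - 1] + sub:
            mapping[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return mapping


def split_recognized(paragraphs, recognized):
    """Split recognized words at the positions matching manuscript paragraph ends.

    paragraphs: list of word lists, one per manuscript paragraph.
    """
    manuscript = [w for p in paragraphs for w in p]
    mapping = align_words(manuscript, recognized)
    result, start, pos = [], 0, 0
    for k, para in enumerate(paragraphs):
        pos += len(para)
        if k == len(paragraphs) - 1:
            end = len(recognized)  # last paragraph takes everything that remains
        else:
            aligned = [mapping[i] for i in range(pos - len(para), pos)
                       if mapping[i] is not None]
            end = aligned[-1] + 1 if aligned else start
        end = max(end, start)
        result.append(recognized[start:end])
        start = end
    return result


# First paragraph uses the words quoted from FIG. 8; the second is made up.
paragraphs = [["こんばんは", "７時", "の", "ニュース", "です"], ["最初", "の", "話題"]]
recognized = ["こんばんは", "一次", "の", "ニュース", "で", "最初", "の", "話題"]
print(split_recognized(paragraphs, recognized))
```

Because misrecognized words such as "一次" are still paired with their manuscript counterparts as substitutions, the boundary lands in the right place despite recognition errors.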
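[0050]
The structuring means 11 calculates the time code of each paragraph, that is, its start time and end time, from the word time codes attached to the paragraph speech recognition text information. The start time and end time of each paragraph can be calculated from the start time of its first word and the end time of its last word. From the time codes and the correspondence information, the structuring means 11 further generates structured information, which indicates the one-to-one correspondence between each paragraph of the text information and a time code (step S110).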
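As a minimal sketch of this calculation, a paragraph's time code can be taken from the first and last word time codes. The `TimedWord` type, the use of seconds as the unit, and the sample times are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TimedWord:
    text: str
    start: float  # seconds from the start of the media (assumed unit)
    end: float


def paragraph_time_code(words):
    """Return (start, end) of a paragraph from its recognized words, in order."""
    if not words:
        raise ValueError("paragraph has no recognized words")
    # Start of the first word and end of the last word bound the paragraph.
    return words[0].start, words[-1].end


words = [TimedWord("こんばんは", 0.0, 0.8),
         TimedWord("一次", 0.9, 1.4),
         TimedWord("の", 1.4, 1.5),
         TimedWord("ニュース", 1.5, 2.1),
         TimedWord("で", 2.1, 2.3)]
print(paragraph_time_code(words))
```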
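[0051]
Based on the structured information, the structuring means 11 generates a text media file that stores the text information and a time code file that stores the time codes, associated paragraph by paragraph. The structuring means 11 outputs the generated files to the data storage means 12. The text media storage means 14 stores the text media file, the time code storage means 15 stores the time code file, and the video media storage means 13 stores the video media file (step S111).
[0052]
The structuring methods by which the structuring means 11 associates the text information with the time codes and generates the text media file and the time code file are described here. First, a method that generates a text media file and a time code file for each paragraph is described. For example, let the text media file for the first paragraph be named "原稿1.txt" ("manuscript 1") and the time code file for the first paragraph be named "時間1.txt" ("time 1"). The files for the first paragraph are then associated by the fact that their file names, excluding the extension, both end in "1". Likewise, naming the second-paragraph files "原稿2.txt" and "時間2.txt" associates the files for the second paragraph. The third, fourth, and subsequent paragraphs can be associated in the same way.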
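The per-paragraph naming scheme can be sketched as follows. This is an illustration under assumptions: the ASCII names `manuscript{n}.txt` and `time{n}.txt` stand in for the Japanese file names in the description, and the tab-separated seconds format of the time code file is hypothetical.

```python
from pathlib import Path


def write_paragraph_files(out_dir, paragraphs, time_codes):
    """Write one text file and one time code file per paragraph.

    Files for paragraph n share the numeric suffix n, which is what
    associates them with each other.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pairs = zip(paragraphs, time_codes)
    for n, (text, (start, end)) in enumerate(pairs, start=1):
        (out / f"manuscript{n}.txt").write_text(text, encoding="utf-8")
        (out / f"time{n}.txt").write_text(f"{start}\t{end}\n", encoding="utf-8")


def load_paragraph(out_dir, n):
    """Look up paragraph n and its time code through the shared suffix."""
    out = Path(out_dir)
    text = (out / f"manuscript{n}.txt").read_text(encoding="utf-8")
    start, end = (out / f"time{n}.txt").read_text(encoding="utf-8").split()
    return text, float(start), float(end)
```

A caller would write all paragraphs once with `write_paragraph_files` and later fetch, say, the second paragraph and its playback range with `load_paragraph(out_dir, 2)`.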
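[0053]
Another method generates, as management information, the start and end addresses of each paragraph's text information in the text media file and the start and end addresses of each paragraph's time code in the time code file. Referring to such management information reveals the correspondence between the text information and the time codes. In this case, the multimedia content apparatus is configured as shown in FIG. 2, with management file storage means 23 provided in the data storage means 12. The structuring means 11 outputs the management information to the data storage means 12 in addition to the text media file, the time code file, and the video media file. The text media storage means 14 stores the text media file, the time code storage means 15 stores the time code file, the video media storage means 13 stores the video media file, and the management file storage means 23 stores the management information.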
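A minimal sketch of such management information follows, assuming that an "address" is a byte offset into each file and that end addresses are exclusive; neither detail is stated in the description. The function name, the record layout, and the time code encoding are hypothetical.

```python
def build_management_info(paragraph_texts, time_codes, encoding="utf-8"):
    """Concatenate paragraphs and time codes into two blobs and record offsets.

    Returns (text_blob, time_blob, records), where each record holds the
    (start, end) byte offsets of one paragraph in each blob.
    """
    text_blob, time_blob, records = bytearray(), bytearray(), []
    for text, (start, end) in zip(paragraph_texts, time_codes):
        t = text.encode(encoding)
        c = f"{start}\t{end}\n".encode("ascii")
        records.append({
            "text": (len(text_blob), len(text_blob) + len(t)),
            "time": (len(time_blob), len(time_blob) + len(c)),
        })
        text_blob += t
        time_blob += c
    return bytes(text_blob), bytes(time_blob), records


text_blob, time_blob, records = build_management_info(
    ["第1段落", "第2段落"], [(0.0, 9.97), (9.97, 21.5)])
a, b = records[1]["text"]
print(text_blob[a:b].decode("utf-8"))  # text of the second paragraph
a, b = records[1]["time"]
print(time_blob[a:b].decode("ascii").split())  # its start and end times
```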
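[0054]
As shown in FIG. 5, the structuring means 11 may combine the text information and the time codes to generate structured text information containing time codes. In this case, as shown in FIG. 3, no time code storage means 15 is provided in the multimedia content apparatus. The structuring means 11 outputs the text media file and the video media file to the data storage means 12. The text media storage means 14 stores the text media file, and the video media storage means 13 stores the video media file.
[0055]
Further, as shown in FIG. 6, the structuring means 11 may generate an XML file containing a structural description written in XML. In this case, the multimedia content apparatus has no text media storage means 14 or time code storage means 15 and is instead provided with XML file storage means 24. The method of generating such an XML file is described next.
[0056]
In this method, the structuring means 11 produces a structural description in MPEG-7 format written in XML. The operation of the structuring means 11 is described with reference to FIG. 6, FIG. 8(A), and FIG. 9. FIG. 6 is an explanatory diagram showing an example of a structural description. FIG. 8(A) is an explanatory diagram showing the content of a news manuscript 500 as an example of text media. FIG. 9 is a flowchart explaining the operation of the structuring means when XML is used.
[0057]
The structuring means 11 first takes an XML template defined in advance by the MPEG-7 standard, with tags such as <?xml> and <Mpeg7>, and inserts into it the storage location (folder location) of the video information within the computer system, using the <MediaLocator> and <MediaUrl> tags (step S201). Here the location is assumed to be, for example, "C:\メタ情報\映像データ\ニュース映像020913.mpg". This path corresponds to the location in the video media storage means 13 at which the structuring means 11 intends to store the video information.
[0058]
Next, the program title information given in the text media is inserted using the <Title> tag inside the <CreationInformation> tag (step S202). In the example of FIG. 8(A), "ニュース番組020913" ("News Program 020913") is inserted. Then the start-time and end-time information for each paragraph is inserted using the <MediaTime>, <MediaRelTimePoint>, and <MediaDuration> tags (step S203).
[0059]
Further, speaker information is inserted using the <Name>, <GivenName>, and <FamilyName> tags (step S204). In the example of FIG. 8(A), "アナウンサ" ("announcer") is inserted. Next, the text information is inserted using the <TextAnnotation> and <FreeTextAnnotation> tags (step S205). The text information is inserted paragraph by paragraph. In the example of FIG. 8(A), the paragraph containing "こんばんは、７時のニュースです。" ("Good evening, this is the seven o'clock news.") is inserted.
[0060]
If the end time of the paragraph inserted in step S203 has not reached the final time of the time codes, the process returns to step S203 and inserts the information for the next paragraph (step S206). When the end time of the paragraph has reached the final time of the time codes, the XML file is output to the data storage means 12 (step S207). The XML file storage means 24 in the data storage means 12 stores the XML file.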
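Steps S201 to S207 can be sketched with Python's standard `xml.etree.ElementTree` module. This is only an illustration: the tag names are the ones quoted in the description, but the nesting, the `Segment` container element, the plain-seconds time format, and the function name `build_description` are assumptions and do not reproduce FIG. 6 or the actual MPEG-7 schema.

```python
import xml.etree.ElementTree as ET


def build_description(video_path, title, speaker, paragraphs):
    """Build an MPEG-7-style description.

    paragraphs: list of (start_seconds, end_seconds, text) in playback order.
    """
    root = ET.Element("Mpeg7")
    # S201: storage location of the video information
    locator = ET.SubElement(root, "MediaLocator")
    ET.SubElement(locator, "MediaUrl").text = video_path
    # S202: program title
    creation = ET.SubElement(root, "CreationInformation")
    ET.SubElement(creation, "Title").text = title
    # S203 to S206: repeat for each paragraph until the last one is written
    for start, end, text in paragraphs:
        segment = ET.SubElement(root, "Segment")  # hypothetical container element
        media_time = ET.SubElement(segment, "MediaTime")
        ET.SubElement(media_time, "MediaRelTimePoint").text = f"{start:.2f}"
        ET.SubElement(media_time, "MediaDuration").text = f"{end - start:.2f}"
        name = ET.SubElement(segment, "Name")       # S204: speaker information
        ET.SubElement(name, "GivenName").text = speaker
        annotation = ET.SubElement(segment, "TextAnnotation")  # S205: paragraph text
        ET.SubElement(annotation, "FreeTextAnnotation").text = text
    return ET.ElementTree(root)


tree = build_description(
    r"C:\メタ情報\映像データ\ニュース映像020913.mpg",
    "ニュース番組020913",
    "アナウンサ",
    [(0.0, 9.97, "こんばんは、７時のニュースです。"), (9.97, 21.5, "次の段落")],
)
# S207: serialize; writing to a file would use tree.write(path, encoding="utf-8")
print(ET.tostring(tree.getroot(), encoding="unicode"))
```

Iterating over the paragraph list plays the role of the loop test in step S206, since the loop ends once the paragraph whose end time equals the final time code has been written.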
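[0061]
Next, the operation when the user 4 searches for and extracts the stored data is described with reference to the flowchart shown in FIG. 10. The user 4 enters the desired search condition with a keyboard or the like on the search condition input screen of the input/output means 16 displayed on the computer screen, as shown in FIG. 11 (step S301). The input/output means 16 corresponds to, for example, the keyboard, display, and communication means of a personal computer owned by the user 4. The input/output means 16 sends the search condition to the search control means 17 via the Internet or the like. Based on the search condition entered by the user 4, the search control means 17 identifies, in the data storage means 12, the paragraphs of text information whose data match the search condition (step S302). General search techniques, such as full-text match search or an AND search over several search conditions, can be used.
[0062]
If the search finds no matching data, the input/output means 16 is notified that there were no search results. If matching data exist, all paragraphs of text information that match the search condition entered by the user 4 are extracted from the data storage means 12 and output to the input/output means 16 (steps S303, S304). It is preferable to also extract the video information corresponding to the paragraphs of text information.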
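Steps S302 to S304 can be sketched as a simple AND search over stored paragraphs. The `Paragraph` type, the substring-based matching, and the sample data are assumptions made for illustration; the description leaves the search technique open.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    text: str
    start: float  # time code start, in seconds (assumed unit)
    end: float    # time code end


def search(paragraphs, keywords):
    """AND search: return every paragraph whose text contains all keywords."""
    if not keywords:
        return []
    return [p for p in paragraphs if all(k in p.text for k in keywords)]


stored = [
    Paragraph("こんばんは、７時のニュースです。", 0.0, 9.97),
    Paragraph("タマちゃんが目撃者の遊覧船の近くに現れました。", 9.97, 21.5),
    Paragraph("タマちゃんの話題は以上です。", 21.5, 25.0),
]
hits = search(stored, ["タマちゃん", "遊覧船"])
if not hits:
    print("no results")          # S303: report that nothing matched
for p in hits:
    print(p.text, p.start, p.end)  # S304: matching paragraph and its playback range
```

Each hit carries its time code, so the matching segment of the video can be located without any further lookup.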
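[0063]
When the video information is also extracted, the video information and the text information can be displayed as shown in FIG. 11, for example. FIG. 11 shows a display example in which representative images are output as thumbnail images together with the text information. The still image at the start time of the paragraph may be extracted from the video information and output as the thumbnail image, or the most representative still image between the start and the end of the paragraph may be extracted and output. Instead of a still image, the moving image from the start time to the end time of the paragraph may be output.
[0064]
The example in FIG. 11 shows that the user 4 entered "タマちゃん" ("Tama-chan") as the keyword phrase and that 12 search results were output. Next, when the user 4 uses the input/output means 16 to select the paragraph containing "目撃者の遊覧船" (roughly, "the eyewitness's sightseeing boat"), for example by clicking it with a mouse, the input/output means 16 notifies the search control means 17 of the selection. The search control means 17 extracts the text information of the paragraph selected by the user 4 and the video information from the data storage means 12 and outputs them to the synchronization means 18. The synchronization means 18 notifies the input/output means 16 that playback of the video information is to begin at the start time of the selected paragraph's time code, and outputs the text information of the selected paragraph to the input/output means 16 (step S305).
[0065]
The input/output means 16 displays, for example, the screen shown in FIG. 12 (a screen that shows controls such as rewind), plays the video information, and scrolls the text information of the selected paragraph in time with the playback of the video information (step S306). The display of the rewind and other controls and the scrolling of the text information are performed by the synchronization means 18. In the example shown in FIG. 12, playback starts 9.97 seconds after the beginning of the video media, and the portion of the text information containing "目撃者の遊覧船" is shown in italics to indicate that it is the part being played. The text information of the paragraph may also be scrolled in step with the playback state of the video information.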
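One way to tie the scrolling to playback is to map the playback position linearly onto the scroll range. The description does not say how the scroll position is derived, so the linear mapping, the pixel-based parameters, and the function name `scroll_offset` in this sketch are all assumptions.

```python
def scroll_offset(playback_sec, start_sec, end_sec, text_height_px, view_height_px):
    """Return how far (in pixels) the paragraph text should be scrolled.

    The offset grows linearly from 0 at the paragraph's start time to the
    maximum scroll distance at its end time.
    """
    span = end_sec - start_sec
    if span <= 0:
        return 0
    # Clamp progress so positions outside the paragraph do not over-scroll.
    progress = min(max((playback_sec - start_sec) / span, 0.0), 1.0)
    max_scroll = max(text_height_px - view_height_px, 0)
    return round(progress * max_scroll)


# Halfway through a paragraph that runs from 10 s to 20 s
print(scroll_offset(15.0, 10.0, 20.0, text_height_px=600, view_height_px=200))
```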
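[0066]
As described above, the portion of the video information corresponding to the extracted time code is identified, and the identified video information is provided to the user 4 via the input/output means 16. The synchronization means 18 also provides the text information corresponding to the video information to the user in step with the video information.
[0067]
Next, a second embodiment of the present invention is described. It differs from the first embodiment in configuration in that time code generation means 19 is connected to structuring means 21; the rest of the configuration is the same as in the first embodiment. It differs from the first embodiment in operation in the method of detecting delimiter positions in the speech recognition text information. This detection relates to the operation of the synchronized data generation means 3, specifically to the operation of speech recognition means 28, the time code generation means 19, mapping means 20, and the structuring means 21. Means that operate as in the first embodiment are therefore given the same reference numerals as in FIG. 1, and their description is omitted.
[0068]
The configuration of the second embodiment is described first. FIG. 13 is a block diagram of this embodiment. In this embodiment, the time code generation means 19 is connected to the speech recognition means 28 and to the structuring means 21. The speech recognition means 28 judges a point in the audio information where a predetermined silent interval exists to be the end of a sentence, that is, a sentence boundary. The speech recognition means 28 also outputs the received video information to the time code generation means 19, and outputs information indicating the sentence boundaries to the time code generation means 19.
[0069]
Based on the playback start timing of the video information and audio information, the elapsed time from that start timing, and the sentence-boundary information from the speech recognition means 28, the time code generation means 19 generates sentence time codes, which give the start time and end time of each sentence in the audio information, and outputs them to the structuring means 21. In this embodiment, therefore, the time code generation means 19 generates sentence time codes as the information for generating time codes that indicate the start time and end time of each divided part when the speech recognition text information is divided into predetermined divided parts.
[0070]
Next, the operation of the second embodiment is described. FIG. 14 is a flowchart explaining this embodiment. The synchronized data generation means 3 receives the video media and the text media from the operator 2 (step S401). The text media is input to the text input means 7. When the text media is a digital medium, the text input means 7 outputs the text information recorded on it to the mapping means 20. When the text media is not a digital medium, the text input means 7 digitizes the text information recorded on it using OCR or the like (steps S402, S403) and outputs it to the mapping means 20.
[0071]
The video information (including audio information) is input to the video input means 6. When the video media is a digital medium, the video input means 6 outputs the video information recorded on it to the speech recognition means 28. When it is not a digital medium, the video input means 6 digitizes the video information recorded on it (steps S404, S405) and outputs it to the speech recognition means 28. The speech recognition means 28 performs speech recognition on the audio information recorded on the audio track of the video media (step S406), generates speech recognition text information, and outputs it to the mapping means 20. When the audio information contains a silent period of at least a predetermined length, the speech recognition means 28 judges that silent period to be a sentence boundary in the text information, divides the audio information there, and outputs information indicating the sentence boundaries to the time code generation means 19.
[0072]
The time code generation means 19 generates sentence time codes, which give the start time and end time of each sentence in the audio information (step S407), and outputs them to the structuring means 21.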
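The silence-based sentence detection and the resulting sentence time codes can be sketched as below. The input representation (one energy value per fixed-length audio frame), the threshold test, and the function name `sentence_time_codes` are assumptions made for illustration; the description only states that a silent period of at least a predetermined length marks a sentence boundary.

```python
def sentence_time_codes(energies, frame_sec, silence_threshold, min_silence_sec):
    """Split audio into sentences at sufficiently long silent runs.

    energies: per-frame energy values; a frame at or below silence_threshold
    counts as silent. Returns a list of (start_sec, end_sec) per sentence.
    """
    min_frames = max(1, round(min_silence_sec / frame_sec))
    sentences = []
    start = None        # first voiced frame of the current sentence
    last_voiced = None  # most recent voiced frame
    silent_run = 0
    for i, energy in enumerate(energies):
        if energy > silence_threshold:
            if start is None:
                start = i
            last_voiced = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_frames:
                # The silence is long enough: close the sentence at its last voiced frame.
                sentences.append((start * frame_sec, (last_voiced + 1) * frame_sec))
                start = None
    if start is not None:
        sentences.append((start * frame_sec, (last_voiced + 1) * frame_sec))
    return sentences


energies = [0, 5, 5, 5, 0, 5, 5, 0, 0, 0, 5, 5, 0]
print(sentence_time_codes(energies, frame_sec=0.5,
                          silence_threshold=1, min_silence_sec=1.5))
```

In the sample, the single silent frame inside the first sentence is too short to count as a boundary, while the run of three silent frames splits the audio into two sentences.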
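[0073]
As in the first embodiment, the mapping means 20 detects the delimiter positions in the speech recognition text information based on the delimiter positions in the text information and generates paragraph speech recognition text information. The method of comparing the speech recognition text information with the text information is the same as in the first embodiment, which uses DP matching. The mapping means 20 further generates correspondence information, which indicates the one-to-one correspondence between each paragraph of the text information (the delimited text information) and each piece of paragraph speech recognition text information (step S408). It then outputs the correspondence information to the structuring means 21 together with the paragraph speech recognition text information and the text information. A single piece of paragraph speech recognition text information may contain several sentences, that is, sentence boundaries detected by the speech recognition means 28. In that case, the mapping means 20 also outputs information to that effect to the structuring means 21.
[0074]
From the sentence time codes and the correspondence information, the structuring means 21 generates structured information, which indicates the paragraph-by-paragraph one-to-one correspondence between the text information and the time codes (step S409). As in the first embodiment, the structuring means 21 generates, based on the structured information, a text media file that stores the text information and a time code file that stores the time codes, associated paragraph by paragraph.
[0075]
The structuring method that associates the text information with the time codes and generates the text media file and the time code file is the same as in the first embodiment. In this embodiment, however, the start time and end time of a paragraph are calculated from the start time of the first sentence and the end time of the last sentence in the paragraph speech recognition text information. The structuring means 21 outputs the generated files to the data storage means 12. The video media storage means 13 stores the video media file, the text media storage means 14 stores the text media file, and the time code storage means 15 stores the time code file (step S410).
[0076]
In this embodiment, the time code generation means 19 outputs the sentence time codes to the structuring means 21, but, as in the first embodiment, it may instead output them to the speech recognition means 28. In that case, information indicating the start time and end time of each sentence is attached to each sentence in the paragraph speech recognition text information.
[0077]
[Effects of the Invention]
As described above, according to the present invention, time codes and text information are associated automatically by comparing the text information with the audio information, out of the audio information and video information corresponding to that text information. The text information and the video information can therefore be structured at low cost by way of the time codes.
[0078]
Further, by identifying the divided part of the text information that matches a search condition entered by the user, extracting the time code corresponding to the identified divided part, identifying the video information corresponding to the extracted time code, and providing the identified video information to the user, the video information the user wants can be provided.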
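[Brief Description of the Drawings]
[FIG. 1] A block diagram showing the first embodiment of the present invention.
[FIG. 2] A block diagram showing a configuration example in which the data storage means of the present invention includes management file storage means.
[FIG. 3] A block diagram showing a configuration example in which the data storage means of the present invention does not include time code storage means.
[FIG. 4] A block diagram showing a configuration example in which the data storage means of the present invention includes XML file storage means.
[FIG. 5] An explanatory diagram showing an example of the output of the structuring means.
[FIG. 6] An explanatory diagram showing an example of a structural description in MPEG-7 format written in XML.
[FIG. 7] A flowchart showing the operation of the synchronized data generation means in the first embodiment of the present invention.
[FIG. 8] An explanatory diagram for explaining the operation of the mapping means.
[FIG. 9] A flowchart for explaining the operation of the structuring means when XML is used.
[FIG. 10] A flowchart showing the operation of the synchronized data utilization means in the first embodiment of the present invention.
[FIG. 11] An explanatory diagram showing an output example of the search condition input screen of the input/output means.
[FIG. 12] An explanatory diagram showing an output example of search results.
[FIG. 13] A block diagram showing the second embodiment of the present invention.
[FIG. 14] A flowchart showing the operation of the synchronized data generation means in the second embodiment of the present invention.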
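[Explanation of Reference Numerals]
1 Content holder
2 Operator
3 Synchronized data generation means
4 User
5 Synchronized data utilization means
6 Video input means
7 Text input means
8 Speech recognition means
9 Time code generation means
10 Mapping means
11 Structuring means
12 Data storage means
13 Video media storage means
14 Text media storage means
15 Time code storage means
16 Input/output means
17 Search control means
18 Synchronization means
19 Time code generation means
20 Mapping means
21 Structuring means
23 Management file storage means
24 XML file storage means
28 Speech recognition means
500 News manuscript
510 Speech recognition result
[0001]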
[Technical Field of the Invention]
The present invention relates to a multimedia content structuring method, a multimedia content structuring apparatus, and a multimedia content structuring program for associating related video information and text information, and to a multimedia content providing method for retrieving and outputting the structured, stored content in response to a user's request.
[0002]
[Prior Art]
In recent years, information has increasingly taken multimedia form, and the volume of multimedia content containing video information, audio information, text information, and the like has grown rapidly. Such information can be used more effectively by storing it and recalling it later as needed. To make effective use of several types of stored information, the pieces of information must be associated with one another.
[0003]
One example of a system that associates several types of information is an electronic text creation system that generates an optimal third text from a first text, obtained by character recognition of a manuscript, and a second text, obtained by speech recognition of audio information, and thereby obtains text information as digital data (see, for example, Patent Document 1).
[0004]
There is also a system that performs speech recognition on the audio information attached to video information and structures the resulting text information together with the video information (see, for example, Patent Document 2).
[0005]
Further, there is a multimedia information processing apparatus that digitizes video information, audio information, text information, and the like, and stores that information together with time information indicating the temporal relationship among them (see, for example, Patent Document 3).
[0006]
[Patent Document 1]
JP 2001-282779 A (pages 3-8, FIG. 1)
[Patent Document 2]
JP 2002-189728 A (page 3)
[Patent Document 3]
JP H08-253209 A (pages 3-8, FIG. 5)
[0007]
[Problems to Be Solved by the Invention]
The system described in Patent Document 1 matches the manuscript against the text produced by speech recognition, so it can automatically create more accurate text information. However, Patent Document 1 discloses nothing about association with other information such as video information. Applying the technique of Patent Document 1 to the structuring of video information, audio information, and text information therefore requires the structuring to be performed manually.
[0008]
The system described in Patent Document 2 can structure video information and the audio information attached to it, but Patent Document 2 discloses nothing about additionally structuring text information.
[0009]
The multimedia information processing apparatus described in Patent Document 3 structures video information, audio information, text information, and the like using time information recorded at the time of input, but that time information must be added when the video, audio, and text information are created. When no time information has been added, the text information, video information, and audio information cannot be associated automatically, and the association requires manual work.
[0010]
Accordingly, an object of the present invention is to provide a multimedia content structuring method, a multimedia content structuring apparatus, and a multimedia content structuring program that can automatically associate related video information and text information, and a multimedia content providing method that can readily provide desired information in response to a user's request.
[0011]
[Means for Solving the Problems]
A multimedia content structuring method according to the present invention receives text information together with audio information and video information corresponding to that text information, and associates the text information with the video information. The method is characterized by: performing speech recognition on the audio information corresponding to the text information to generate speech recognition text information; comparing the speech recognition text information with the text information; dividing the speech recognition text information into divided parts, such as paragraphs, based on delimiter positions in the text information given by, for example, information indicating paragraphs; generating a time code indicating the start time and end time of each divided part; and structuring the generated time codes and the text information by associating them for each predetermined divided part. Because the audio information and the video information are related, associating the text information with time codes generated from the audio information (specifically, from the speech recognition text information) in effect associates the text information with the video information.
[0012]
The multimedia content structuring method may be configured such that text information and video information are stored in a storage unit, and a time code of each divided part is stored in a time code storage unit. According to such a configuration, desired video information can be searched for based on the text information and the time code stored in the time code storage unit.
[0013]
The multimedia content structuring method may be configured such that video information is stored in a storage unit, and time code, text information, and information indicating a storage position of the video information are described in an XML language. According to such a configuration, video information and text information can be structured by the XML language.
[0014]
A multimedia content structuring apparatus according to the present invention receives text information together with audio information and video information corresponding to that text information, and associates the text information with the video information. The apparatus is characterized by comprising: speech recognition means that performs speech recognition on the audio information corresponding to the text information to generate speech recognition text information; time code generation means that generates information for generating time codes indicating the start time and end time of each divided part obtained when the speech recognition text information is divided into predetermined divided parts; mapping means that compares the speech recognition text information with the text information, divides the speech recognition text information into divided parts based on the delimiter positions of the text information, and associates each divided part delimited in the text information with the corresponding divided part of the speech recognition text information; and structuring means that associates, for each predetermined divided part, the time code corresponding to the divided part of the speech recognition text information with the text information.
[0015]
The multimedia content structuring apparatus may include a storage unit that stores text information and video information, and a time code storage unit that stores a time code of each divided part. According to such a configuration, desired video information can be searched for based on the text information and the time code stored in the time code storage unit.
[0016]
The multimedia content structuring apparatus may comprise storage means that stores the video information, with the structuring means configured to describe the time codes, the text information, and information indicating the storage location of the video information in XML and to store the description in XML file storage means. With this configuration, the video information and the text information can be structured in XML.
[0017]
The multimedia content structuring apparatus may comprise search control means that identifies a divided part of the text information matching a search condition entered by a user and extracts the time code corresponding to the identified divided part, and synchronization means that identifies the portion of the video information corresponding to the extracted time code and provides the identified video information to the user together with the text information matching the search condition. With this configuration, the video information the user wants can be provided in response to the user's request.
[0018]
The synchronization means may be configured to provide text information corresponding to the video information to the user in conjunction with the video information. According to such a configuration, it is possible to provide text information and video information desired by the user in a format that is easy to view according to the user's request.
[0019]
A multimedia content structuring program according to the present invention causes a computer to execute: a process of performing speech recognition on the audio information corresponding to text information to generate speech recognition text information; a process of comparing the speech recognition text information with the text information and dividing the speech recognition text information into divided parts based on the delimiter positions of the text information; a process of generating a time code indicating the start time and end time of each divided part; and a process of structuring the generated time codes and the text information by associating them for each predetermined divided part.
[0020]
A multimedia content providing method according to the present invention provides a user with the text information the user requests and with the video information corresponding to that text information. The method is characterized by: performing speech recognition on the audio information corresponding to the text information to generate speech recognition text information; comparing the speech recognition text information with the text information; dividing the speech recognition text information into divided parts based on the delimiter positions of the text information; generating a time code indicating the start time and end time of each divided part; structuring the generated time codes and the text information by associating them for each predetermined divided part; storing the text information, the video information, and the time codes; identifying the divided part of the text information that matches a search condition entered by the user; extracting the time code corresponding to the identified divided part; identifying the portion of the video information corresponding to the extracted time code; and providing the identified video information to the user together with the text information matching the search condition.
[0021]
The multimedia content providing method according to the present invention may be configured such that, when providing video information to a user, text information corresponding to the video information is provided to the user in conjunction with the video information. According to such a configuration, it is possible to provide text information and video information desired by the user in a format that is easy to view according to the user's request.
[0022]
A multimedia content providing program according to the present invention causes a computer to execute: a process of generating speech recognition text information by performing speech recognition processing based on speech information corresponding to text information; a process of comparing the speech recognition text information with the text information and dividing the speech recognition text information into divided parts based on the delimiter positions of the text information; a process of generating time codes indicating the start time and end time of each divided part; a process of structuring by associating the generated time code and the text information with each predetermined divided part; a process of storing the text information, the video information, and the time code; a process of specifying the divided part of the text information that matches a search condition input by a user; a process of extracting the time code corresponding to the specified divided part; and a process of identifying the portion of the video information corresponding to the extracted time code and providing the identified video information to the user together with the text information matching the search condition.
[0023]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing a multimedia content structuring system according to a first embodiment of the present invention. In the system shown in FIG. 1, the content holder 1 owns video media on which video information is recorded and text media on which a document related to the video information is recorded. Both video information and audio information are recorded on the video media.
[0024]
An example of multimedia content including text information, audio information, and video information is a news program on a television broadcast. In the case of a news program, the news manuscript is the text media, the audio read out by the announcer is the audio information, and the footage related to that audio is the video information. The operator 2 receives the video media and the text media from the content holder 1, and inputs the video media and the text media to the multimedia content device.
[0025]
The multimedia content device includes a synchronous data generating unit 3 and a synchronous data using unit 5. The synchronous data generating means 3 comprises a video input means 6, a text input means 7, a voice recognition means 8, a time code generating means 9, a mapping means 10, a structuring means 11, and a data storing means 12. The data storage unit 12 includes a video media storage unit 13, a text media storage unit 14, and a time code storage unit 15. The synchronous data using unit 5 includes an input / output unit 16, a search control unit 17, and a synchronizing unit 18. However, the input / output means 16 shown in FIG. 1 does not have to be included in the multimedia content device, and may be, for example, an input / output means such as a keyboard of a personal computer of the user 4.
[0026]
When a video medium is input from the operator 2 and the video medium is a digital medium, the video input means 6 outputs the video information and audio information recorded on the video medium to the voice recognition means 8. When the video medium is an analog medium, the video input means 6 can be configured to include, for example, a video capture device having a function of converting analog video information and analog audio information into digital information in AVI format or MPEG format. In that case, the video input unit 6 digitizes the video information and the audio information recorded on the video media and outputs the digitized data to the voice recognition unit 8. The voice recognition unit 8 performs voice recognition processing on the input voice information and generates voice recognition text information, which is the text information obtained as a result of the voice recognition processing. Further, the voice recognition means 8 divides the voice information into individual words.
[0027]
The time code generation means 9 outputs to the voice recognition means 8 a word time code, which is information on the start time and end time of each word in the voice information. The voice recognition unit 8 adds the word time code to the voice recognition text information and outputs it together with the video information to the mapping unit 10. Thus, in this embodiment, the time code generation means 9 generates the word time code as information for generating the time code that indicates the start time and end time of each divided part when the speech recognition text information is divided into a plurality of divided parts (for example, a plurality of paragraphs).
[0028]
The text input means 7 receives text media from the operator 2. If the text medium is a digital medium, the text input means 7 outputs text information recorded on the text medium to the mapping means 10. If the text medium is an analog medium, the text input means 7 is configured to include, for example, an OCR (optical character reading) device. Then, digital conversion is performed on the text information recorded on the text medium, and the digitized text information is output to the mapping unit 10.
[0029]
The mapping means 10 divides the text information into divided parts at appropriate intervals. Here, for example, paragraphs, which are a group of sentences, are detected by detecting line breaks, indents, and the like, and text information is delimited using a line feed point or the like as a delimiter. Further, the mapping means 10 compares the text information with the speech recognition text information, and separates the speech recognition text information based on a break position in the text information to generate paragraph speech recognition text information. Further, correspondence information that is information indicating one-to-one correspondence between each paragraph (delimited text information) and each paragraph speech recognition text information in the text information is generated. Then, the corresponding information is output to the structuring unit 11 together with the paragraph speech recognition text information and the text information.
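The paragraph detection described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a paragraph starts at a blank line or at a line beginning with a space, a tab, or a full-width space; the function name `split_paragraphs` is hypothetical and does not appear in the specification.

```python
def split_paragraphs(text: str) -> list[str]:
    """Split text into paragraphs at blank lines or indented lines.

    Assumption (not from the specification): an indent is a leading
    space, tab, or full-width space (U+3000).
    """
    paragraphs: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        is_blank = not line.strip()
        is_indented = line[:1] in (" ", "\t", "\u3000")
        # A blank or indented line closes the paragraph collected so far.
        if (is_blank or is_indented) and current:
            paragraphs.append(" ".join(current))
            current = []
        if not is_blank:
            current.append(line.strip())
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Each returned string corresponds to one divided part of the text information; other delimiters could be substituted without changing the rest of the processing.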
[0030]
The structuring unit 11 calculates each time code that is a start time and an end time of each paragraph from the word time code added to each word in the paragraph speech recognition text information. Further, the structuring unit 11 generates structured information, which is information indicating a one-to-one correspondence between each paragraph of the text information and each time code, based on the time code and the correspondence information.
[0031]
The structuring unit 11 generates a text media file for storing text information and a time code file for storing time code based on the structured information, and also generates a video media file for storing video information.
[0032]
For example, a text media file stores the text information of each paragraph in paragraph order, and a time code file stores the time code corresponding to each paragraph in the same order as the paragraphs in the text information. Note that a plurality of text media files and time code files, one pair for each paragraph, may be generated instead.
[0033]
Next, the structuring unit 11 outputs the video media file, the text media file, and the time code file to the data storage unit 12. In the data storage means 12, the video media storage means 13 stores the video media file, the text media storage means 14 stores the text media file, and the time code storage means 15 stores the time code file. In this example, the storage means for storing the text information and the video information corresponds to the video media storage means 13 and the text media storage means 14, and the time code storage means for storing the time code of each divided portion corresponds to the time code storage means 15.
[0034]
The video information may be supplied from the video input unit 6 to the data storage unit 12 via the voice recognition unit 8, the mapping unit 10, and the structuring unit 11, or may be supplied to the data storage unit 12 directly from the video input unit 6.
[0035]
Further, the structuring unit 11 may generate, as management information, the start address and end address of each paragraph of the text information in the text media file and the start address and end address of the time code of each paragraph in the time code file. In this case, the multimedia content device has a configuration including a management file storage unit 23 for storing management information in the data storage unit 12, as shown in FIG. Then, the structuring unit 11 outputs the management information to the management file storage unit 23, and the management file storage unit 23 stores the management information.
[0036]
Alternatively, the structuring unit 11 may generate a structured text medium including a time code, which is a combination of text information and a time code, as shown in FIG. At this time, the multimedia content device does not have the time code storage means 15, and the data storage means 12 has a configuration shown in FIG.
[0037]
Further, there is a method in which the structuring unit 11 generates an XML file having a structural description in the MPEG7 (Moving Picture Experts Group 7) format written in XML (Extensible Markup Language), as shown in FIG. When an XML file is generated, the multimedia content apparatus has the configuration shown in FIG. 4, in which the data storage unit 12 does not include the text media storage unit 14 and the time code storage unit 15 but includes the XML file storage unit 24. Then, the structuring unit 11 outputs the XML file to the XML file storage unit 24, and the XML file storage unit 24 stores the XML file. In this example, the storage means for storing the video information corresponds to the video media storage means 13, and the XML file storage means for storing the XML description of the time code, the text information, and the information indicating the storage position of the video information corresponds to the XML file storage means 24.
[0038]
When requesting desired video information and text information, the user 4 inputs a word serving as a keyword to the synchronous data using means 5. Then, the input / output unit 16 outputs the phrase input by the user 4 to the search control unit 17. The search control unit 17 searches for the paragraph of the text information including the phrase from the text media storage unit 14 (in the configuration shown in FIGS. 2 and 3) or the XML file storage unit 24 (in the configuration shown in FIG. 4). The corresponding paragraph of text information is output to the input / output means 16. Further, when the user 4 selects a paragraph of certain text information, the input / output unit 16 requests the search control unit 17 to output video information synchronized with the text information selected by the user 4.
[0039]
The retrieval control unit 17 extracts the video information from the video media storage unit 13, extracts the time code of the paragraph of the text information selected by the user 4 from the time code storage unit 15 (in the case of the configuration shown in FIG. 2), from the text media storage unit 14 (in the case of the configuration shown in FIG. 3), or from the XML file storage means 24 (in the case of the configuration shown in FIG. 4), and outputs them to the synchronization means 18. The synchronization unit 18 outputs the video information to the input / output unit 16 with the start time indicated by the time code as the start time of the output of the video information and the end time indicated by the time code as the end time of the output of the video information. Further, the synchronizing means 18 processes the text information based on the time code and outputs the processed text information to the input / output means 16. One example of such processing is scrolling the text information.
[0040]
Note that the synchronous data generating means 3 and the synchronous data using means 5 can be realized by a computer system. However, the input / output unit 16 corresponds to an input / output unit such as a keyboard or a display unit of a microcomputer on the user side. When the synchronous data generating means 3 and the synchronous data using means 5 (excluding the input / output means) are realized by a computer system, the voice recognition means 8, the time code generating means 9, the mapping means 10, the structuring means 11, The search control unit 17 and the synchronization unit 18 are realized by software. The data storage unit 12 is realized by a storage medium such as a magnetic disk in a computer system.
[0041]
Specifically, the software implemented in the computer system is a program that causes the computer to execute: a process of performing speech recognition processing based on speech information corresponding to the text information to generate speech recognition text information; a process of comparing the speech recognition text information with the text information and dividing the speech recognition text information into divided parts based on the delimiter positions of the text information; a process of generating time codes indicating the start time and end time of each of the divided parts; a process of structuring the text information and the video information in association with each predetermined divided portion; a process of storing the text information, the video information, and the time code; a process of specifying the divided part in the text information that matches a search condition input by the user; a process of extracting the time code corresponding to the specified divided part; and a process of identifying the video information corresponding to the extracted time code and providing the identified video information to the user together with the text information matching the search condition.
[0042]
Next, the operation of this embodiment will be described with reference to the drawings. FIGS. 7, 9 and 10 are flowcharts for explaining the operation of this embodiment. FIGS. 11 and 12 are explanatory diagrams showing display examples when the input / output means 16 inputs and outputs information.
[0043]
Taking a news program in a television broadcast as an example, the content holder 1 corresponds to a news program production company, and the operator 2 corresponds to a management department in the production company or a management company entrusted with news distribution. Also, the user 4 requests information from the synchronous data using means 5 via, for example, a Web browser installed on a personal computer and the Internet, and receives the desired information from the synchronous data using means 5 via the Internet and the Web browser.
[0044]
First, the process of storing a video media file, a time code file, a text media file, a management file, or an XML file in the data storage unit 12 will be described with reference to FIG. It is assumed that a video medium on which video information (including audio information) of a news program is recorded and a text medium on which the manuscript read by the announcer in the news program is recorded are transmitted from the content holder 1 to the operator 2. The audio information recorded on the video media is the voice of the announcer reading the manuscript. That is, the text information recorded on the text media and the audio information recorded on the video media have the same contents. The operator 2 inputs the video media and the text media to the synchronous data generating means 3.
[0045]
The synchronous data generation means 3 receives the video media and the text media from the operator 2 (Step S101). The text medium is received by the text input means 7. When the text medium is a digital medium, the text input means 7 outputs the text information recorded on the text medium to the mapping means 10. If the text medium is not a digital medium, the text input means 7 digitizes the text information recorded on the text medium using OCR or the like (steps S102, S103) and outputs it to the mapping means 10. The mapping means 10 temporarily stores the text information; specifically, the text information is stored in a RAM or the like in the computer system.
[0046]
The video media is input to the video input means 6. When the video medium is a digital medium, the video input means 6 outputs the video information (including audio information) recorded on the video medium to the voice recognition means 8. When the video medium is not a digital medium, the video input means 6 digitizes the video information (including audio information) recorded on the video media (steps S104 and S105) and outputs it to the voice recognition means 8.
[0047]
The voice recognition unit 8 performs voice recognition processing on the voice information recorded on the audio track of the video media and generates voice recognition text information (step S106). Further, the speech recognition unit 8 performs a word cutout process on the speech recognition text information and divides the speech information into individual words. The voice recognition unit 8 outputs the video information and information indicating the division positions of the audio information to the time code generation unit 9. Based on the reproduction start timing of the video information and the audio information, the elapsed time from the reproduction start timing, and the information indicating the division positions of the audio information from the voice recognition unit 8, the time code generation unit 9 generates, for each word, a word time code, which is the start time and end time of the audio information of that word, and outputs it to the voice recognition means 8. The voice recognition means 8 assigns the corresponding word time code to each word. Then, the voice recognition means 8 adds the word time codes to the speech recognition text information and outputs it to the mapping means 10 together with the video information (steps S107 and S108).
[0048]
The mapping unit 10 detects paragraphs, each of which is a group of sentences, from the text information, for example, by line feed or indentation, and divides the text information into paragraphs. Further, the text information is compared with the speech recognition text information. Then, the speech recognition text information is divided into paragraphs based on the paragraph break positions of the text information, and paragraph speech recognition text information is generated. Further, correspondence information, which is information indicating a one-to-one correspondence between each paragraph of the text information and the paragraph speech recognition text information, is generated (step S109), and the correspondence information is output to the structuring means 11 together with the paragraph speech recognition text information and the text information.
[0049]
FIG. 8 shows an example of the association by the mapping means 10. In FIG. 8, the words “Good evening”, “7 o'clock”, “of”, “news”, and “is” in a news manuscript 500, which is the text information, are compared with the words “Good evening”, “primary”, “of”, “news”, and “at” in the speech recognition text information of a speech recognition result 510. In this way, words are compared sequentially. By using the DP matching technique for the comparison, the break position in the speech recognition result 510 that corresponds to a break position in the news manuscript 500 can be detected according to the degree of word matching.
[0050]
The structuring unit 11 calculates a time code that is a start time and an end time of each paragraph in the paragraph speech recognition text information from the word time code added to the paragraph speech recognition text information. The start time and end time of each paragraph can be calculated from the start time of the first word and the end time of the last word of each paragraph. Further, the structuring unit 11 generates, from the time code and the correspondence information, structured information that is information indicating a one-to-one correspondence between each paragraph of the text information and the time code (step S110).
[0051]
The structuring unit 11 generates a text media file for storing text information and a time code file for storing a time code in association with each paragraph based on the structured information. The structuring unit 11 outputs each generated file to the data storage unit 12. The text media storage means 14 stores a text media file, the time code storage means 15 stores a time code file, and the video media storage means 13 stores a video media file (step S111).
[0052]
Here, a structuring method in which the structuring unit 11 associates text information with a time code to generate a text media file and a time code file will be described. First, a method of generating a text media file and a time code file for each paragraph will be described. For example, assume that the file name of the text media file for the first paragraph is “document 1.txt” and the file name of the time code file for the first paragraph is “time 1.txt”. Then, the two files for the first paragraph can be associated with each other because their file names, excluding the extension, both end in “1”. Similarly, the files for the second paragraph can be associated with each other by naming them “document 2.txt” and “time 2.txt”. The files for the third and subsequent paragraphs can be associated in the same way.
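The file-name-based association can be sketched in Python as follows. The file names follow the example in the text; the contents of the time code file (start and end separated by a tab) and the function names are assumptions made only for this illustration.

```python
from pathlib import Path


def paired_time_file(document_name: str) -> str:
    """Map 'document N.txt' to 'time N.txt' by reusing the trailing paragraph number."""
    stem = Path(document_name).stem          # e.g. "document 1"
    number = stem.rsplit(" ", 1)[-1]         # trailing paragraph number
    if not number.isdigit():
        raise ValueError(f"no paragraph number in {document_name!r}")
    return f"time {number}.txt"


def write_paragraph_files(out_dir: str, paragraphs: list[str],
                          time_codes: list[tuple[float, float]]) -> None:
    """Write one text media file and one time code file per paragraph."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for n, (text, (start, end)) in enumerate(zip(paragraphs, time_codes), start=1):
        document_name = f"document {n}.txt"
        (out / document_name).write_text(text, encoding="utf-8")
        # Assumed layout of the time code file: "start<TAB>end".
        (out / paired_time_file(document_name)).write_text(
            f"{start}\t{end}\n", encoding="utf-8")
```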
[0053]
There is also a method of generating, as management information, a start address and an end address of text information of each paragraph in a text media file and a start address and an end address of time code of each paragraph in a time code file. The correspondence between the text information and the time code can be understood by referring to such management information. At this time, the configuration of the multimedia content device is a configuration in which a management file storage unit 23 is provided in the data storage unit 12, as shown in FIG. Then, the structuring unit 11 outputs management information to the data storage unit 12 in addition to the text media file, the time code file, and the video media file. The text media storage means 14 stores a text media file, the time code storage means 15 stores a time code file, the video media storage means 13 stores a video media file, and the management file storage means 23 stores management information.
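The management information can be pictured as a table of byte offsets into the two files. The sketch below is one possible layout under stated assumptions: addresses are byte offsets, each entry ends with a newline, and a time code is written as "start,end" with two decimal places. None of these details are given in the specification, and `build_management_info` is a hypothetical name.

```python
def build_management_info(paragraphs: list[str],
                          time_codes: list[tuple[float, float]]):
    """Build single-file text/time-code contents plus per-paragraph address records.

    Returns (text_bytes, time_bytes, records), where each record holds the
    (start address, end address) of the paragraph in both files.
    """
    text_blob = bytearray()
    time_blob = bytearray()
    records = []
    for text, (start, end) in zip(paragraphs, time_codes):
        text_entry = text.encode("utf-8") + b"\n"
        time_entry = f"{start:.2f},{end:.2f}\n".encode("ascii")
        records.append({
            "text": (len(text_blob), len(text_blob) + len(text_entry)),
            "time": (len(time_blob), len(time_blob) + len(time_entry)),
        })
        text_blob += text_entry
        time_blob += time_entry
    return bytes(text_blob), bytes(time_blob), records
```

Reading the record for a paragraph gives both address ranges at once, which is how the correspondence between text information and time code can be recovered from the management information.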
[0054]
The structuring unit 11 may combine the text information and the time code to generate structured text information including the time code, as shown in FIG. At this time, as shown in FIG. 3, the multimedia content device does not include the time code storage unit 15. The structuring unit 11 outputs the text media file and the video media file to the data storage unit 12. The text media storage means 14 stores a text media file, and the video media storage means 13 stores a video media file.
[0055]
Further, as shown in FIG. 6, the structuring unit 11 may generate an XML file with a structural description in the XML language. At this time, in the multimedia content device, the text media storage unit 14 and the time code storage unit 15 are not provided, but the XML file storage unit 24 is provided. Here, a method for generating an XML file with a structural description in the XML language will be described.
[0056]
In this method, the structuring means 11 writes the structural description in the MPEG7 format using the XML language. The operation of the structuring unit 11 will be described with reference to FIGS. 6, 8A and 9. FIG. 6 is an explanatory diagram showing a structural description example. FIG. 8A is an explanatory diagram showing the contents of a news manuscript 500 as an example of a text medium. FIG. 9 is a flowchart for explaining the operation of the structuring means using the XML language.
[0057]
First, in an XML template containing tags such as the <?xml> tag and the <Mpeg7> tag, which are defined in advance by the MPEG7 standard, the structuring unit 11 describes the storage location (folder position) of the video information in the computer system using the <MediaLocator> tag and the <MediaUrl> tag (step S201). Here, for example, the storage location is assumed to be “C:\Meta information\Video data\News video 020913.mpg”. Note that this path corresponds to the storage position in the video media storage unit 13 where the structuring unit 11 intends to store the video information.
[0058]
Next, the program title information described in the text media is inserted using a <Title> tag within a <CreationInformation> tag (step S202). In the example of FIG. 8A, “news program 020913” is inserted. Then, information on the start time and end time for each paragraph is inserted using the <MediaTime> tag, <MediaRelTimePoint> tag, and <MediaDuration> tag (step S203).
[0059]
Furthermore, the speaker information is inserted using a <Name> tag, a <GivenName> tag, and a <FamilyName> tag (step S204). In the example of FIG. 8A, “announcer” is inserted. Next, text information is inserted using a <TextAnnotation> tag and a <FreeTextAnnotation> tag (step S205). Here, text information is inserted for each paragraph. In the example of FIG. 8A, a paragraph including "Good evening is news at 7 o'clock" is inserted.
[0060]
If the end time of the paragraph inserted in step S203 has not reached the last time of the time code, the process returns to step S203 and inserts the information of the next paragraph (step S206). If the end time of the paragraph has reached the last time of the time code, the XML file is output to the data storage unit 12 (step S207). The XML file storage means 24 in the data storage means 12 stores the XML file.
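Steps S201 to S207 can be sketched with Python's standard `xml.etree.ElementTree` module as follows. The tag names are those mentioned in the text; the nesting, the hypothetical `Segment` wrapper element, and the plain-seconds time values are assumptions for illustration, and the output is not claimed to validate against the MPEG7 schema.

```python
import xml.etree.ElementTree as ET


def build_structured_xml(video_path: str, title: str, speaker: str,
                         paragraphs: list[tuple[float, float, str]]) -> str:
    """Build an MPEG7-style XML description; paragraphs are (start, end, text)."""
    root = ET.Element("Mpeg7")
    locator = ET.SubElement(root, "MediaLocator")
    ET.SubElement(locator, "MediaUrl").text = video_path            # step S201
    creation = ET.SubElement(root, "CreationInformation")
    ET.SubElement(creation, "Title").text = title                   # step S202
    for start, end, text in paragraphs:                             # loop of S203 to S206
        segment = ET.SubElement(root, "Segment")                    # hypothetical wrapper
        media_time = ET.SubElement(segment, "MediaTime")            # step S203
        ET.SubElement(media_time, "MediaRelTimePoint").text = f"{start:.2f}"
        ET.SubElement(media_time, "MediaDuration").text = f"{end - start:.2f}"
        name = ET.SubElement(segment, "Name")                       # step S204
        ET.SubElement(name, "GivenName").text = speaker
        annotation = ET.SubElement(segment, "TextAnnotation")       # step S205
        ET.SubElement(annotation, "FreeTextAnnotation").text = text
    return ET.tostring(root, encoding="unicode")                    # step S207
```

The loop ends when the last paragraph has been written, which plays the role of the end-time check in step S206.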
[0061]
Next, the operation when the user 4 searches and extracts the stored data will be described with reference to the flowchart shown in FIG. The user 4 uses a keyboard or the like to input desired search conditions using a search condition input screen of the input / output unit 16 displayed on the computer screen as shown in FIG. 12 (step S301). The input / output unit 16 corresponds to, for example, a keyboard, a display unit, and a communication unit of a personal computer of the user 4. The input / output unit 16 transmits the search condition to the search control unit 17 via the Internet or the like. The search control unit 17 specifies a paragraph of text information having data that matches the search condition from the data storage unit 12 based on the search condition input by the user 4 (step S302). When performing a search, a general search process such as a full-text match search or an AND search based on each search condition can be used.
[0062]
As a result of the search, if there is no corresponding data, the input / output unit 16 is notified that there is no search result. If there is corresponding data, all paragraphs of the text information that match the search condition input by the user 4 are extracted from the data storage unit 12 and output to the input / output means 16 (steps S303, S304). Here, it is preferable to also extract the video information corresponding to each such paragraph of the text information.
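The search in steps S302 to S304 can be illustrated by a simple AND search over the stored paragraphs. This is a minimal sketch assuming the paragraphs and their time codes are held in parallel lists; the function name `search_paragraphs` is hypothetical, and a real system might use an index rather than a linear scan.

```python
def search_paragraphs(paragraphs: list[str],
                      time_codes: list[tuple[float, float]],
                      keywords: list[str]) -> list[tuple[int, str, tuple[float, float]]]:
    """Return (index, text, time code) for every paragraph containing all keywords."""
    keywords = [k for k in keywords if k]
    if not keywords:
        return []  # no search condition: report no result
    hits = []
    for index, text in enumerate(paragraphs):
        if all(keyword in text for keyword in keywords):   # AND search
            hits.append((index, text, time_codes[index]))
    return hits
```

An empty result list corresponds to the case in which the input / output unit 16 is notified that there is no search result; otherwise the time code in each hit identifies the matching portion of the video information.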
[0063]
When video information is also extracted, video information and text information can be displayed, for example, as shown in FIG. FIG. 11 shows a display example when a representative image is output as a thumbnail image along with text information. Here, the still image at the start time of the paragraph may be extracted from the video information and output as a thumbnail image, or the most representative still image between the start and the end of the paragraph may be extracted and output. In addition, a moving image from the start time to the end time of a paragraph may be output instead of a still image.
[0064]
The example illustrated in FIG. 11 indicates that “Tama-chan” has been input as the keyword phrase by the user 4 and that 12 search results have been output. Next, when the user 4 selects a paragraph including, for example, “the sighting boat of the witness” by clicking the mouse on the input / output unit 16, the input / output unit 16 notifies the search control unit 17. The search control unit 17 extracts the text information of the paragraph selected by the user 4 and the video information from the data storage unit 12 and outputs the extracted information to the synchronization unit 18. The synchronization means 18 notifies the input / output means 16 that the start time of the time code of the selected paragraph in the video information is set as the head of the reproduction of the video information, and also outputs the text information of the selected paragraph to the input / output means 16 (step S305).
[0065]
The input / output unit 16 displays, for example, the screen shown in FIG. 12 (a screen that displays a switch unit for operations such as rewind), reproduces the video information, and scroll-displays the text information of the selected paragraph in accordance with the reproduction timing of the video information (step S306). The display of the switch unit such as rewinding and the process of scrolling the text information are executed by the synchronization unit 18. In the example shown in FIG. 12, reproduction is performed from the position of 9 seconds 97 from the beginning of the video media, and the text information including “the sighting boat of the witness” is shown in italic characters to indicate that it is the reproduction position. Further, the text information of the paragraph may be scrolled in conjunction with the reproduction status of the video information.
[0066]
As described above, the portion of the video information corresponding to the extracted time code is specified, and the specified video information is provided to the user 4 via the input / output unit 16. Further, the synchronization unit 18 provides the text information corresponding to the video information to the user in association with the video information.
[0067]
Next, a second embodiment according to the present invention will be described. In configuration, the second embodiment differs from the first embodiment in that the time code generation unit 19 is connected to the structuring unit 21; the other configurations are the same as those of the first embodiment. In operation, it differs from the first embodiment in the method of detecting the break position in the speech recognition text information. The detection of the break position in the speech recognition text information is related to the operation of the synchronous data generating means 3. Specifically, it relates to the operation of the voice recognition unit 28, the time code generation unit 19, the mapping unit 20, and the structuring unit 21. Therefore, means that operate in the same way as in the first embodiment are denoted by the same reference numerals as those in FIG. 1, and their description is omitted.
[0068]
The configuration of the second embodiment will be described. FIG. 13 is a block diagram of this embodiment. In this embodiment, the time code generation unit 19 is connected to the voice recognition unit 28 and the structuring unit 21. The voice recognition unit 28 determines that a location where a predetermined silent section exists in the voice information is a sentence end position, that is, a sentence break. Further, the voice recognition unit 28 outputs the received video information to the time code generation unit 19. Then, the voice recognition unit 28 outputs information indicating a sentence break to the time code generation unit 19.
[0069]
The time code generation unit 19 generates a sentence time code, which is information on the start time and end time of each sentence in the audio information, based on the reproduction start timing of the video information and audio information, the elapsed time from the reproduction start timing, and the information indicating the sentence break from the voice recognition unit 28, and outputs the sentence time code to the structuring unit 21. Therefore, in this embodiment, the time code generation unit 19 generates the sentence time code as the information for generating the time code that indicates the start time and end time of each divided part when the speech recognition text information is divided into predetermined divided parts.
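The generation of sentence time codes from silent sections can be sketched as follows. This is only an illustration under stated assumptions: the audio is assumed to have already been reduced to per-frame silent or voiced flags (for example by an energy threshold, which this description does not specify), and the names `sentence_time_codes`, `frame_sec`, and `min_silence_sec` are hypothetical.

```python
from typing import List, Tuple


def sentence_time_codes(
    frame_is_silent: List[bool],
    frame_sec: float,
    min_silence_sec: float,
) -> List[Tuple[float, float]]:
    """Return (start, end) times of sentences separated by long silent sections."""
    min_frames = max(1, round(min_silence_sec / frame_sec))
    codes = []
    start = None        # index of the first voiced frame of the open sentence
    last_voiced = None  # index of the most recent voiced frame
    silent_run = 0      # length of the current run of silent frames
    for i, silent in enumerate(frame_is_silent):
        if silent:
            silent_run += 1
            # A silent section of the predetermined length closes the sentence.
            if start is not None and silent_run >= min_frames:
                codes.append((start * frame_sec, (last_voiced + 1) * frame_sec))
                start = None
        else:
            if start is None:
                start = i
            last_voiced = i
            silent_run = 0
    if start is not None:  # audio ended while a sentence was still open
        codes.append((start * frame_sec, (last_voiced + 1) * frame_sec))
    return codes


# Illustrative input: 0.5 s frames, silence of 1.0 s or more ends a sentence.
frames = [False] * 3 + [True] * 5 + [False] * 2 + [True] + [False] * 2
codes = sentence_time_codes(frames, frame_sec=0.5, min_silence_sec=1.0)
# codes == [(0.0, 1.5), (4.0, 6.5)]
```

Note that the single silent frame inside the second sentence is shorter than the predetermined length, so it does not produce a sentence break.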
[0070]
Next, the operation of the second embodiment will be described. FIG. 14 is a flowchart for explaining this embodiment. The synchronous data generation means 3 receives the video media and the text media from the operator 2 (step S401). The text media is input to the text input means 7. When the text medium is a digital medium, the text input means 7 outputs the text information recorded on the text medium to the mapping means 20. When the text medium is not a digital medium, the text input means 7 digitizes the text information recorded on the text medium using OCR or the like (steps S402 and S403) and outputs it to the mapping means 20.
[0071]
Video information (including audio information) is input to the video input means 6. When the video medium is a digital medium, the video input means 6 outputs the video information recorded on the video medium to the voice recognition means 28. When the video medium is not a digital medium, the video input means 6 digitizes the video information (steps S404 and S405) and outputs it to the voice recognition means 28. The voice recognition means 28 performs voice recognition processing on the voice information recorded on the voice track of the video media (step S406), generates voice recognition text information, and outputs it to the mapping means 20. Also, if there is a silence period longer than a predetermined time in the audio information, the voice recognition means 28 determines that the silence period is a delimiter of a sentence, delimits the audio information there, and outputs information indicating the delimiter of the sentence to the time code generation means 19.
[0072]
The time code generation means 19 generates a sentence time code which is information on the start time and end time of each sentence in the voice information (step S407), and outputs it to the structuring means 21.
[0073]
As in the first embodiment, the mapping means 20 detects the break position in the speech recognition text information based on the break position in the text information, and generates paragraph speech recognition text information. The method of comparing the speech recognition text information and the text information at this time is the same as the method in the first embodiment using the DP matching technique. Further, the mapping means 20 generates correspondence information, which is information indicating a one-to-one correspondence between each paragraph in the text information (delimited text information) and each piece of paragraph speech recognition text information (step S408). Then, the correspondence information is output to the structuring means 21 together with the paragraph speech recognition text information and the text information. It should be noted that a single piece of paragraph speech recognition text information may include a plurality of sentences, that is, a plurality of sentence segments detected by the voice recognition means 28. In that case, the mapping means 20 also outputs information indicating that to the structuring means 21.
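The transfer of break positions from the text information to the speech recognition text information can be sketched with a dynamic programming alignment. This is a minimal sketch and not the exact DP matching of the first embodiment, whose cost function is not restated in this section: plain edit distance over tokens is assumed, and the function name `align_boundaries` and the sample sentences are hypothetical.

```python
from typing import List


def align_boundaries(ref: List[str], hyp: List[str], ref_breaks: List[int]) -> List[int]:
    """Map break positions in ref (manuscript tokens) onto hyp (recognized tokens)."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrack along one optimal path; pos[i] ends up as the smallest
    # hyp prefix length that the path pairs with the ref prefix of length i.
    pos = [0] * (n + 1)
    i, j = n, m
    pos[n] = m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
        pos[i] = j
    return [pos[b] for b in ref_breaks]


# Illustrative data: the manuscript has a paragraph break after 7 tokens;
# the recognition result contains one substitution and one dropped word.
ref = "the boat was seen near the port police are investigating".split()
hyp = "the boat was scene near port police are investigating".split()
breaks = align_boundaries(ref, hyp, [7])  # [6]
```

Here the break after the seventh manuscript token maps to position 6 of the recognition result, so `hyp[:6]` and `hyp[6:]` would form the two pieces of paragraph speech recognition text information. For Japanese text the tokens would more likely be characters than space-separated words.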
[0074]
The structuring unit 21 generates, from the sentence time code and the correspondence information, structured information that is information indicating a one-to-one correspondence between the text information and the time code for each paragraph (step S409). As in the case of the first embodiment, the structuring unit 21 generates, based on the structured information, a text media file storing the text information and a time code file storing the time code, associated with each other for each paragraph.
[0075]
A structuring method for generating a text media file and a time code file by associating text information with a time code is the same as in the first embodiment. However, in this embodiment, the paragraph start time and the end time are calculated from the start time of the first sentence and the end time of the last sentence in the paragraph speech recognition text information. The structuring unit 21 outputs the generated files to the data storage unit 12. The video media storage means 13 stores a video media file, the text media storage means 14 stores a text media file, and the time code storage means 15 stores a time code file (step S410).
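The calculation of paragraph start and end times from the sentence time codes can be sketched as follows. This is a minimal sketch that assumes the sentence time codes are ordered by time and that each paragraph contains at least one sentence; the names `paragraph_time_codes` and `sentences_per_paragraph` are hypothetical.

```python
from typing import List, Tuple

TimeCode = Tuple[float, float]  # (start, end) in seconds


def paragraph_time_codes(
    sentence_codes: List[TimeCode],
    sentences_per_paragraph: List[int],
) -> List[TimeCode]:
    """Collapse sentence time codes into one time code per paragraph."""
    result = []
    first = 0
    for count in sentences_per_paragraph:
        group = sentence_codes[first:first + count]
        # Paragraph start = start of its first sentence,
        # paragraph end = end of its last sentence.
        result.append((group[0][0], group[-1][1]))
        first += count
    return result


# Illustrative data: three sentences, the first paragraph holds two of them.
sentence_codes = [(0.0, 4.5), (5.0, 9.5), (10.0, 18.0)]
paragraph_codes = paragraph_time_codes(sentence_codes, [2, 1])
# paragraph_codes == [(0.0, 9.5), (10.0, 18.0)]
```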
[0076]
In this embodiment, the time code generation means 19 outputs the sentence time code to the structuring means 21, but it may instead output the sentence time code to the voice recognition means 28 as in the first embodiment. In that case, information indicating the start time and end time of each sentence is added to each sentence in the paragraph speech recognition text information.
[0077]
[Effects of the invention]
As described above, according to the present invention, the text information is compared with the audio information, out of the audio information and video information corresponding to the text information, so that the time code is automatically associated with the text information. Text information and video information can therefore be structured at low cost via the time code.
[0078]
In addition, a divided part in the text information that matches the search condition input by the user is specified, a time code corresponding to the specified divided part is extracted, the video information corresponding to the extracted time code is specified, and the specified video information is provided to the user. Desired video information can thus be provided to the user.
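The retrieval flow summarized above can be sketched as follows. This is only an illustration: the search condition is assumed to be a keyword matched by simple substring search, which this section does not specify, and the `Segment` record, the function name `search`, and the sample data are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    text: str     # text information of one divided part
    start: float  # time code start, in seconds
    end: float    # time code end, in seconds


def search(segments: List[Segment], keyword: str) -> List[Tuple[float, float, str]]:
    """Return the time code and text of every divided part containing the keyword."""
    return [(s.start, s.end, s.text) for s in segments if keyword in s.text]


# Illustrative data only.
segments = [
    Segment("weather report for the coast", 0.0, 9.5),
    Segment("a boat was sighted near the port", 10.0, 18.0),
]
hits = search(segments, "boat")
# Each hit carries the time range of the video information to reproduce.
```

A player would then seek the video information to the start time of a selected hit and reproduce it up to the end time.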
[Brief description of the drawings]
FIG. 1 is a block diagram showing a first embodiment of the present invention.
FIG. 2 is a block diagram showing a configuration example in a case where the data storage means of the present invention includes a management file storage means.
FIG. 3 is a block diagram showing a configuration example in a case where the data storage means of the present invention does not include a time code storage means.
FIG. 4 is a block diagram showing a configuration example in a case where an XML file storage unit is included in the data storage unit of the present invention.
FIG. 5 is an explanatory diagram showing an output example of a structuring unit.
FIG. 6 is an explanatory diagram showing an example of a structural description in an MPEG7 format in an XML language.
FIG. 7 is a flowchart illustrating an operation of a synchronous data generation unit according to the first embodiment of the present invention.
FIG. 8 is an explanatory diagram for explaining an operation of a mapping unit.
FIG. 9 is a flowchart for explaining the operation of the structuring unit using the XML language.
FIG. 10 is a flowchart illustrating an operation of a synchronous data using unit according to the first embodiment of the present invention.
FIG. 11 is an explanatory diagram showing an output example of a search condition input screen of the input / output means.
FIG. 12 is an explanatory diagram showing an output example of a search result.
FIG. 13 is a block diagram showing a second embodiment of the present invention.
FIG. 14 is a flowchart showing the operation of the synchronous data generation means in the second embodiment of the present invention.
[Explanation of symbols]
1 Content holder
2 Operator
3 Synchronous data generation means
4 User
5 Synchronous data utilization means
6 Video input means
7 Text input means
8 Voice recognition means
9 Time code generation means
10 Mapping means
11 Structuring means
12 Data storage means
13 Video media storage means
14 Text media storage means
15 Time code storage means
16 Input / output means
17 Search control means
18 Synchronization means
19 Time code generation means
20 Mapping means
21 Structuring means
23 Management file storage means
24 XML file storage means
28 Voice recognition means
500 News manuscript
510 Speech recognition result

Claims

A multimedia content structuring method that receives text information and audio information and video information corresponding to the text information, and associates the text information with the video information, the method comprising:
performing voice recognition processing based on the voice information corresponding to the text information to generate voice recognition text information;
comparing the speech recognition text information and the text information, and dividing the speech recognition text information into divided parts based on a delimiter position of the text information;
generating a time code indicating a start time and an end time of each of the divided parts; and
structuring the generated time code and the text information in association with each other for each predetermined divided portion.

2. The multimedia content structuring method according to claim 1, wherein the text information and the video information are stored in the storage means, and the time code of each divided part is stored in the time code storage means.

3. The multimedia content structuring method according to claim 1, wherein the video information is stored in a storage unit, and the time code, the text information, and information indicating a storage position of the video information are described in an XML language.

A multimedia content structuring apparatus that receives text information and audio information and video information corresponding to the text information, and associates the text information with the video information, the apparatus comprising:
voice recognition means for performing voice recognition processing based on the voice information corresponding to the text information to generate voice recognition text information;
time code generation means for generating information for generating a time code indicating a start time and an end time of each divided part when the speech recognition text information is divided into predetermined divided parts;
mapping means for comparing the speech recognition text information with the text information, dividing the speech recognition text information into the divided parts based on a delimiter position of the text information, and associating each divided part with the speech recognition text information; and
structuring means for associating the time code with the text information for each predetermined divided portion.

5. The multimedia content structuring apparatus according to claim 4, further comprising storage means for storing text information and video information, and time code storage means for storing a time code of each divided portion.

6. The multimedia content structuring apparatus according to claim 4, further comprising storage means for storing video information, wherein the structuring means describes the time code, the text information, and information indicating a storage position of the video information in an XML language and stores the description in XML file storage means.

7. The multimedia content structuring apparatus according to any one of claims 1 to 4, further comprising:
search control means for specifying a divided part in the text information that matches a search condition input by a user, and extracting a time code corresponding to the specified divided part; and
synchronizing means for identifying a portion of the video information corresponding to the extracted time code, and providing the identified video information to the user together with the text information matching the search condition.

The multimedia content structuring apparatus according to claim 7, wherein the synchronization means provides text information corresponding to the video information to the user in association with the video information.

A multimedia content structuring program for causing a computer to execute processing that associates text information with video information, using audio information and video information corresponding to the text information, the program causing the computer to execute:
a process of performing voice recognition processing based on the voice information corresponding to the text information to generate voice recognition text information;
a process of comparing the speech recognition text information and the text information, and dividing the speech recognition text information into divided parts based on a delimiter position of the text information;
a process of generating a time code indicating a start time and an end time of each of the divided parts; and
a process of structuring the generated time code and the text information in association with each other for each predetermined divided portion.

A multimedia content providing method for providing a user with text information requested by the user and video information corresponding to the text information, the method comprising:
performing voice recognition processing based on voice information corresponding to the text information to generate voice recognition text information;
comparing the speech recognition text information and the text information, and dividing the speech recognition text information into divided parts based on a delimiter position of the text information;
generating a time code indicating a start time and an end time of each of the divided parts;
structuring the generated time code and the text information in association with each other for each predetermined divided portion;
storing the text information, the video information, and the time code;
specifying a divided portion in the text information that matches a search condition input by the user;
extracting a time code corresponding to the specified divided portion; and
specifying a portion of the video information corresponding to the extracted time code, and providing the specified video information to the user together with the text information matching the search condition.

The multimedia content providing method according to claim 10, wherein when providing the video information to the user, text information corresponding to the video information is provided to the user in conjunction with the video information.

A multimedia content providing program for causing a computer to execute processing for providing video information and text information according to a user request, the program causing the computer to execute:
a process of performing voice recognition processing based on voice information corresponding to the text information to generate voice recognition text information;
a process of comparing the speech recognition text information and the text information, and dividing the speech recognition text information into divided parts based on a delimiter position of the text information;
a process of generating a time code indicating a start time and an end time of each of the divided parts;
a process of structuring the generated time code and the text information in association with each other for each predetermined divided portion;
a process of storing the text information, the video information, and the time code;
a process of specifying a divided portion in the text information that matches a search condition input by a user;
a process of extracting a time code corresponding to the specified divided portion; and
a process of specifying video information corresponding to the extracted time code, and providing the specified video information to the user together with the text information matching the search condition.