JP4172904B2 - Video / voice search device

Info

Publication number: JP4172904B2
Application number: JP2000242371A
Authority: JP
Inventors: 祐一望月; 誠喜井上; 英樹住吉; 雅規佐野; 香子有安
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2000-08-10
Filing date: 2000-08-10
Publication date: 2008-10-29
Anticipated expiration: 2020-08-10
Also published as: JP2002056006A

Description

【０００１】
【発明の属する技術分野】
本発明は、放送や通信により伝送され、またはＶＴＲやビデオディスクなどの記録装置に記録されたコンテンツのなかから、所望の映像や音声を検索する映像・音声検索装置に関するものである。
【０００２】
【従来の技術】
従来、コンテンツ内の所望の映像や音声を検索するには、映像や音声に対して何らかの信号処理を施して当該映像や音声を検索する方法や、映像や音声に関するキーワード等の内容を記述した文字情報を新たに人手により作成し、これを用いて検索するなどの方法があった。
【０００３】
【発明が解決しようとする課題】
しかし、従来の方法では、映像や音声に対して何らかの信号処理を施したり、映像や音声に関するキーワード等の内容を記述した文字情報を新たに人手により作成したりする必要があり、手間がかかるだけでなく、信号処理や文字情報の作成を誤ると正しい映像や音声の検索ができなくなるという解決すべき課題があった。
【０００４】
本発明の目的は、映像や音声に対して何らかの信号処理を施したり、映像や音声に関するキーワード等の内容を記述した文字情報を新たに人手により作成したりする必要のない新規な映像・音声検索装置を提供することにある。
【０００５】
【課題を解決するための手段】
上記目的を達成するために、本発明映像・音声検索装置は、字幕データと関連付けられた映像及び音声からなるコンテンツから、特定の映像及び／又は音声を検索する装置であって、前記字幕データは、予め映像内の話者の役柄を区別するために色分けされた字幕文に対応する色情報を有し、前記コンテンツを入力し、映像及び音声と字幕データとを分離して、当該映像及び音声と同期をとるための情報である時刻情報とともに当該字幕データを抽出する手段と、前記抽出した時刻情報から、当該字幕文データが映像に同期して提示されるタイミングを表す時刻の情報を提示タイミング情報として検出する手段と、前記抽出した字幕データから、字幕文の色情報を抽出し、抽出した色情報に関連付けられた映像内の話者の役柄を特定する話者の役柄情報を生成する手段と、前記字幕データを抽出する際に、前記映像における画面上の字幕文の提示位置を抽出し、抽出した提示位置に最も近い画面内の話者の画面上における登場位置を検出し、検出した登場位置の情報を該話者の登場位置情報として生成する手段と、字幕文に用いられている表音文字以外の記号文字によって、前記抽出した字幕文の種類を判別し、判別した字幕文の種類の情報を当該字幕文の付加情報として生成する手段と、前記映像及び音声の情報を記録する映像・音声蓄積手段と、該映像及び音声の情報に関連付けられた字幕データと前記生成した提示タイミング情報とを、字幕情報としての前記話者の役柄情報、話者の登場位置情報、及び付加情報とリンクさせた状態で記録する字幕情報蓄積手段と、所望の映像及び／又は音声を検索するのに指定された字幕情報に基づいて、合致する字幕情報を前記字幕情報蓄積手段から探し出し、探し出した字幕情報に対応する提示タイミング情報を特定する手段と、前記検出した提示タイミング情報に対応する映像及び音声の再生を前記映像・音声蓄積手段に指示する手段とを備えることを特徴とするものである。
【０００６】
【発明の実施の形態】
以下に添付図面を参照し、発明の実施の形態に基づいて本発明を詳細に説明する。
図１は、本発明映像・音声検索装置の構成をブロック図にて示している。
図１において、１は多重分離処理部、２は映像・音声蓄積部、３は字幕データ処理部、４は字幕情報蓄積部、５はユーザ操作部、６は字幕情報検索部、および７は検索結果出力部である。
なお、本発明において、コンテンツとは、テレビジョン放送やＶＴＲなどによる映像と音声、またはラジオ放送などによる音声で提示されるメディアにテキストデータとして符号化された字幕データが付加されたものであると定義する。
【０００７】
図１を参照して、本発明映像・音声検索装置の動作を説明する。
字幕データが、放送や通信により伝送され、またはＶＴＲやビデオディスクなどの記録装置に記録されたコンテンツに多重されている場合には、それらコンテンツは多重分離処理部１に入力される。
【０００８】
多重分離処理部１では、入力されたコンテンツを、映像・音声と字幕データとに分離し、さらに、字幕データについて映像・音声と同期を取るために必要な時刻情報も抽出する。分離された映像・音声は映像・音声蓄積部２に記録され、字幕データと時刻情報は字幕データ処理部３に入力される。
【０００９】
上記以外の通信型サービスにより伝送され、またはフロッピーディスクなどの外部記憶メディアにより受け渡しされたコンテンツの字幕データコンテンツの字幕データの場合は、コンテンツの映像や音声は映像・音声蓄積部２に記録され、字幕データについては、コンテンツの映像・音声と同期を取るための字幕データの時刻情報も同時に得られるため、時刻情報と併せて字幕データ処理部３に入力される。
【００１０】
字幕データ処理部３は、提示タイミング抽出部３−ａ、色情報抽出部３−ｂ、提示位置情報抽出部３−ｃ、および付加情報抽出部３−ｄで構成され、入力された字幕データと時刻情報とを用いて、字幕文とともに、その提示タイミング情報、話者の役柄情報、話者の画面上における登場位置情報、および以下に説明するような付加情報をそれぞれ検出して出力する。
【００１１】
まず、提示タイミング抽出部３−ａでは、字幕文の提示タイミング情報を抽出する。ここでは、入力された時刻情報から映像に同期して提示される際の字幕文の提示タイミング情報を検出する。
【００１２】
色情報抽出部３−ｂでは、字幕文の色情報を抽出する。字幕文の色は、例えば、聴覚障害者用字幕においては、ドラマの主人公やドキュメンタリー番組のナレーター等の主たる登場人物の字幕文は黄色、主たる登場人物以外はシアンを使用するなど、登場人物とその登場人物の字幕の対応関係を明確にするために字幕文を色分けしている。従って、抽出された字幕文の色情報からその字幕に該当する話者の役柄情報を検出することができる。
【００１３】
提示位置情報抽出部３−ｃでは、画面上の字幕文の提示位置を抽出する。字幕文は、同一画面上に複数の登場人物が存在し、話をしている場合、それぞれの登場人物に最も近い部分に、その登場人物が話している字幕が提示される。従って、抽出された字幕文の提示位置から話者の画面上での登場位置を検出することができる。
【００１４】
付加情報抽出部３−ｄでは、字幕文中の表音文字以外の記号文字を利用し、その字幕文の種類を判別し、字幕文の付加情報を作成する。字幕文中で使用される記号文字の種類と使用例を以下のａ．〜ｐ．に示す。
ａ．話者名：（）＋セリフ
使用例
（Ａさん）こんにちは
ｂ．音の説明：（）
使用例
（ノック）
（汽笛）
本例の場合、ａ．の話者名と異なり、（）以降に文がない
【００１５】
ｃ．セリフ中の引用文、ことわざ、歌詞など：「」
使用例
「猫に小判」
ｄ．書名、作品名、曲名、番組名など：『』
使用例
『万葉集』
ｅ．ナレーション：＜＞
使用例
＜ナレーションが記述されます＞
ｆ．モノローグ：《》
使用例
《ぶつぶつ》
【００１６】
ｇ．音楽：
【外１】

使用例
〔外１〕
ｈ．楽器：〔外１〕＋（）
使用例
〔外１〕（笛）
i ．曲名：〔外１〕＋（『』）
使用例
〔外１〕（『曲名』）
ｊ．歌詞：〔外１〕＋「」
使用例
〔外１〕「君が代は・・・」
【００１７】
ｋ．シーン外の音声：
【外２】

使用例
〔外２〕（男）こんにちは
ｌ．電話の呼鈴：
【外３】

使用例
〔外３〕（ただし、〔外３〕を点滅させて表示する）
ｍ．電話の声：〔外３〕＋セリフ、または〔外３〕＋（話者名）＋セリフ
使用例
〔外３〕（Ａさん）もしもし
【００１８】
ｎ．状況説明：
【外４】

使用例
【外５】

ｏ．インタホン、無線など：〔外４〕＋セリフ、または〔外４〕＋（話者名）＋セリフ
使用例
【外６】

（Ａさん）了解
ｐ．次のページの文と続く：⇒
使用例
１ページ目ひさかたのひかりのどけき春の日に⇒
２ページ目しづ心なく花のちるらむ
【００１９】
これらの記号文字から、付加情報抽出部３−ｄは、字幕文の付加情報として、例えば、字幕文の話者名や、字幕文がセリフか、音の説明か、あるいは状況説明かなど字幕の種類、また、セリフが通常セリフか、ナレーションか、モノローグか、あるいは歌詞かなどのセリフの種類、さらには、セリフにの音声がシーン外からか、電話からか、あるいは無線機からのいずれから出ているかなどの付加情報を検出することができる。
【００２０】
字幕データ処理部３によって検出され、出力された字幕文、字幕文の提示タイミング情報、話者の役柄情報、話者の画面上における登場位置情報、および付加情報は字幕文毎にリンクさせた状態で字幕情報蓄積部４に記録される。ここで、字幕文とリンクされた各種情報とをあわせて字幕情報と定義する。
【００２１】
以上により、コンテンツに付加された字幕の字幕情報がコンテンツに対応して字幕情報蓄積部４に記録され、蓄積されたので、ユーザは字幕情報を指定することにより所望の字幕情報が含まれているコンテンツを検索して出力することが可能となる。
【００２２】
図１および図２を参照してさらに具体的に説明する。
ユーザはユーザ操作部５から検索したい対象を指定する。例えば、コンテンツの主役が話しているシーンの指定、あるいは同じコンテンツで、ナレーションがついているシーンの指定などである。指定された情報（この場合、シーン情報）は字幕情報検索部６に入力される。
【００２３】
字幕情報検索部６では、まず、ユーザー操作部５から指定された情報により、字幕情報蓄積部４内の字幕情報から検索する。次に、この検索の結果から、該当する字幕文の提示タイミング情報を取得する。さらに、取得した提示タイミング情報からその字幕文の提示タイミングに相当する映像を映像・音声蓄積部から呼び出す。
【００２４】
図２は、字幕情報検索部６が字幕情報の検索を行うときに図１に示す他の部分（映像・音声蓄積部２、字幕情報蓄積部４、ユーザ操作部、および検索結果出力部７）とどのような順序で、どのようなやりとりを行って検索を行うかを示している。
なお、図２は、特に、番組内のナレーションのシーンを検索する場合について示している。
図２において、ユーザが、ユーザ操作部５からナレーションのシーンを検索するよう指示した場合、字幕情報検索部６は、ユーザー操作部５からの検索命令を取得し、字幕情報蓄積部４に対して蓄積されている字幕情報を転送するよう指示する。
【００２５】
字幕情報検索部６においては、この指示により字幕情報蓄積部４から蓄積されている字幕情報が転送されてきた後、転送されてきた字幕情報と検索条件（この場合ナレーションのシーン）とを比較し、条件が合致する字幕情報を探し出す。探し出した字幕情報の提示タイミング情報をもとに映像・音声蓄積部２に該当する映像・音声の再生を指示し、また、字幕情報蓄積部４には該当する字幕情報を再生するよう指示する。
【００２６】
これらの指示に従って、映像・音声蓄積部２では該当する提示タイミングの映像・音声を再生し、また、字幕情報蓄積部４では該当する字幕情報を再生し、これら再生された映像・音声および字幕情報は検索結果出力部７から出力される。
【００２７】
【発明の効果】
本発明によれば、コンテンツ内の映像や音声を検索する際に、映像および音声に対して何らの信号処理を施す必要がなく、また、映像や音声に関するキーワード等の内容を記述した文字情報を新たに人手により作成することも必要としないで映像インデックスを作成することができ、所望の映像や音声の検索が可能となる。
【図面の簡単な説明】
【図１】本発明映像・音声検索装置の構成をブロック図にて示している。
【図２】図１中の字幕情報検索部が字幕情報の検索を行うときに図１中の他の部分とどのような順序で、どのようなやりとりを行って検索を行うかを示している。
【符号の説明】
１多重分離処理部
２映像・音声蓄積部
３字幕データ処理部
４字幕情報蓄積部
５ユーザ操作部
６字幕情報検索部
７検索結果出力部
[0001]
[Technical field to which the invention belongs]
The present invention relates to a video / audio search apparatus for searching for a desired video or audio from contents transmitted by broadcasting or communication or recorded in a recording apparatus such as a VTR or a video disc.
[0002]
[Prior art]
Conventionally, a desired video or audio within content has been searched for either by applying some kind of signal processing to the video or audio, or by manually creating new text information that describes the content, such as keywords related to the video or audio, and searching with that text information.
[0003]
[Problems to be solved by the invention]
However, the conventional methods require either applying some kind of signal processing to the video or audio, or manually creating new text information that describes the content, such as keywords related to the video or audio. This is not only laborious: if the signal processing or the creation of the text information is done incorrectly, the correct video or audio can no longer be found. This is the problem to be solved.
[0004]
An object of the present invention is to provide a novel video / audio search apparatus that requires neither signal processing of the video or audio nor the manual creation of new text information describing the content, such as keywords related to the video or audio.
[0005]
[Means for Solving the Problems]
In order to achieve the above object, the video / audio search apparatus of the present invention is an apparatus for searching for specific video and / or audio in content consisting of video and audio associated with caption data, wherein the caption data has color information corresponding to caption text that is color-coded in advance to distinguish the roles of the speakers in the video, the apparatus comprising: means for receiving the content, separating the video and audio from the caption data, and extracting the caption data together with time information, which is information for synchronizing with the video and audio; means for detecting, from the extracted time information, time information representing the timing at which the caption text data is presented in synchronization with the video, as presentation timing information; means for extracting the color information of the caption text from the extracted caption data and generating speaker role information that identifies the role of the speaker in the video associated with the extracted color information; means for extracting, when the caption data is extracted, the presentation position of the caption text on the screen in the video, detecting the on-screen appearance position of the speaker in the screen who is closest to the extracted presentation position, and generating information on the detected appearance position as appearance position information of that speaker; means for determining the type of the extracted caption text from symbol characters other than phonetic characters used in the caption text, and generating information on the determined type as additional information of the caption text; video / audio storage means for recording the video and audio information; caption information storage means for recording the caption data associated with the video and audio information and the generated presentation timing information, linked with the speaker role information, the speaker appearance position information, and the additional information as caption information; means for finding, based on caption information specified for searching for desired video and / or audio, matching caption information in the caption information storage means and identifying the presentation timing information corresponding to the caption information found; and means for instructing the video / audio storage means to reproduce the video and audio corresponding to the detected presentation timing information.
[0006]
[Embodiments of the invention]
Hereinafter, the present invention will be described in detail based on an embodiment of the invention with reference to the accompanying drawings.
FIG. 1 is a block diagram showing the configuration of the video / audio search apparatus of the present invention.
In FIG. 1, 1 is a demultiplexing processing unit, 2 is a video / audio storage unit, 3 is a caption data processing unit, 4 is a caption information storage unit, 5 is a user operation unit, 6 is a caption information search unit, and 7 is a search result output unit.
In the present invention, content is defined as media presented as video and audio (such as television broadcasts or VTR recordings) or as audio (such as radio broadcasts), to which caption data encoded as text data has been added.
[0007]
The operation of the video / audio search apparatus of the present invention is described with reference to FIG. 1.
When the caption data is multiplexed into content that is transmitted by broadcasting or communication, or recorded on a recording device such as a VTR or a video disc, that content is input to the demultiplexing processing unit 1.
[0008]
The demultiplexing processing unit 1 separates the input content into video / audio and caption data, and also extracts time information necessary for synchronizing the caption data with the video / audio. The separated video / audio is recorded in the video / audio storage unit 2, and the caption data and time information are input to the caption data processing unit 3.
[0009]
In the case of caption data for content that is transmitted by a communication-type service other than the above, or handed over on an external storage medium such as a floppy disk, the video and audio of the content are recorded in the video / audio storage unit 2. Because the time information needed to synchronize the caption data with the video / audio of the content is obtained at the same time, the caption data is input to the caption data processing unit 3 together with that time information.
[0010]
The caption data processing unit 3 consists of a presentation timing extraction unit 3-a, a color information extraction unit 3-b, a presentation position information extraction unit 3-c, and an additional information extraction unit 3-d. Using the input caption data and time information, it detects and outputs, together with the caption text, the presentation timing information, the speaker role information, the speaker's on-screen appearance position information, and the additional information described below.
[0011]
First, the presentation timing extraction unit 3-a extracts the presentation timing information of the caption text. Specifically, it detects, from the input time information, the timing at which the caption text is presented in synchronization with the video.
[0012]
The color information extraction unit 3-b extracts the color information of the caption text. Caption text is color-coded to make clear which captions belong to which character. In captions for the hearing impaired, for example, the caption text of the main characters, such as the protagonist of a drama or the narrator of a documentary program, is shown in yellow, while cyan is used for everyone else. The role information of the speaker corresponding to a caption can therefore be detected from the extracted color information of the caption text.
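The color-to-role step can be sketched as a simple lookup. This is a minimal illustration only, assuming the example convention given in the text (yellow for main characters, cyan for others); the names `Role`, `COLOR_TO_ROLE`, and `role_from_color` are hypothetical and do not come from the patent.

```python
from enum import Enum


class Role(Enum):
    MAIN = "main character"
    OTHER = "other character"
    UNKNOWN = "unknown"


# Assumed mapping, taken from the example in the description:
# yellow marks a main character, cyan marks any other character.
COLOR_TO_ROLE = {
    "yellow": Role.MAIN,
    "cyan": Role.OTHER,
}


def role_from_color(color: str) -> Role:
    """Return the speaker role associated with a caption text color."""
    return COLOR_TO_ROLE.get(color.strip().lower(), Role.UNKNOWN)
```

A color that is not in the table maps to `Role.UNKNOWN` rather than raising an error, since a real caption stream may use colors beyond the two given as examples.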
[0013]
The presentation position information extraction unit 3-c extracts the presentation position of the caption text on the screen. When several characters appear on the same screen and are speaking, the caption for each character's speech is presented in the part of the screen closest to that character. The on-screen appearance position of the speaker can therefore be detected from the extracted presentation position of the caption text.
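One way to picture the position step is to reduce the caption's horizontal position to a coarse screen region. The sketch below is an assumption made for illustration: the patent does not specify coordinates or regions, and `appearance_region`, its parameters, and the three-way left / center / right split are all hypothetical.

```python
def appearance_region(caption_center_x: float, screen_width: float) -> str:
    """Map the horizontal center of a caption to a coarse screen region.

    Assumes (hypothetically) that the speaker appears in the same third
    of the screen as the caption presented closest to them.
    """
    if screen_width <= 0:
        raise ValueError("screen_width must be positive")
    ratio = caption_center_x / screen_width
    if ratio < 1 / 3:
        return "left"
    if ratio < 2 / 3:
        return "center"
    return "right"
```

A finer grid, or vertical position as well, could be used in the same way; the region label would then be stored as the speaker's appearance position information.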
[0014]
The additional information extraction unit 3-d uses symbol characters other than phonetic characters in the caption text to determine the type of the caption text and create additional information for it. The types of symbol characters used in caption text, with usage examples, are shown in a. to p. below.
a. Speaker name: () + line
Usage example
(Mr. A) Hello
b. Sound description: ()
Usage example
(Knock)
(Steam whistle)
In this case, unlike the speaker name in a., no text follows the ().
[0015]
c. Quotations, proverbs, lyrics, etc. within a line: 「」
Usage example
「Neko ni koban」 (a proverb, literally "gold coins to a cat")
d. Book titles, work titles, song titles, program titles, etc.: 『』
Usage example
『Manyoshu』
e. Narration: ＜＞
Usage example
＜Narration is written here＞
f. Monologue: 《》
Usage example
《Mutter, mutter》
[0016]
g. Music:
[External character 1]

Usage example
[External character 1]
h. Musical instrument: [External character 1] + ()
Usage example
[External character 1] (Flute)
i. Song title: [External character 1] + (『』)
Usage example
[External character 1] (『Song title』)
j. Lyrics: [External character 1] + 「」
Usage example
[External character 1] 「Kimigayo wa ...」
[0017]
k. Voice from outside the scene:
[External character 2]

Usage example
[External character 2] (Man) Hello
l. Telephone ring:
[External character 3]

Usage example
[External character 3] (displayed with [External character 3] blinking)
m. Voice on the telephone: [External character 3] + line, or [External character 3] + (speaker name) + line
Usage example
[External character 3] (Mr. A) Hello
[0018]
n. Situation description:
[External character 4]

Usage example
[External character 5]

o. Intercom, radio, etc.: [External character 4] + line, or [External character 4] + (speaker name) + line
Usage example
[External character 6]

(Mr. A) Roger
p. Continues to the text on the next page: ⇒
Usage example
Page 1: Hisakata no hikari nodokeki haru no hi ni ⇒
Page 2: Shizu-kokoro naku hana no chiruramu
[0019]
From these symbol characters, the additional information extraction unit 3-d can detect, as additional information of the caption text, for example: the speaker name of the caption text; the type of caption, such as whether the caption text is a line, a sound description, or a situation description; the type of line, such as whether it is an ordinary line, narration, a monologue, or lyrics; and whether the voice of the line comes from outside the scene, from a telephone, or from a radio.
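A few of the symbol rules can be sketched as a small classifier. This is a hedged illustration, not the patent's implementation: it covers only the rules whose symbols survive in the text (parentheses for speaker names and sound descriptions, ＜＞ for narration, 《》 for monologue, ⇒ for continuation), it skips the external-character symbols, and the names `AdditionalInfo` and `classify_caption` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdditionalInfo:
    kind: str                      # "line", "sound description", "narration", "monologue"
    text: str                      # caption text with the marking symbols removed
    speaker: Optional[str] = None  # set only when a speaker name is given
    continues: bool = False        # True when the caption ends with the arrow symbol


# Leading parenthesized label (full-width or ASCII), then optional text.
_PAREN = re.compile(r"[（(]([^（()）]+)[）)](.*)")


def classify_caption(caption: str) -> AdditionalInfo:
    """Classify one caption text using its non-phonetic symbol characters."""
    text = caption.strip()
    continues = text.endswith("⇒")
    if continues:
        text = text[:-1].rstrip()

    match = _PAREN.fullmatch(text)
    if match:
        label, rest = match.group(1), match.group(2).strip()
        if rest:
            # Rule a: (speaker name) followed by the spoken line.
            return AdditionalInfo("line", rest, speaker=label, continues=continues)
        # Rule b: parentheses with nothing after them describe a sound.
        return AdditionalInfo("sound description", label, continues=continues)
    if text.startswith("＜") and text.endswith("＞"):
        return AdditionalInfo("narration", text[1:-1], continues=continues)
    if text.startswith("《") and text.endswith("》"):
        return AdditionalInfo("monologue", text[1:-1], continues=continues)
    return AdditionalInfo("line", text, continues=continues)
```

For example, `classify_caption("（ノック）")` is treated as a sound description because nothing follows the closing parenthesis, which is exactly the distinction the text draws between rules a. and b.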
[0020]
The caption text, its presentation timing information, the speaker role information, the speaker's on-screen appearance position information, and the additional information detected and output by the caption data processing unit 3 are recorded in the caption information storage unit 4, linked together for each caption text. Here, the caption text together with the various pieces of information linked to it is defined as caption information.
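The linked record described here can be pictured as one structure per caption text, stored per content item. The following is a minimal sketch under assumptions: the field names, the use of seconds for timing, and the `CaptionInfoStore` class are illustrative choices, not details from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CaptionInfo:
    """One caption text with the information linked to it."""
    text: str
    start_sec: float      # presentation timing (assumed to be in seconds)
    role: str             # speaker role information, e.g. "main character"
    position: str         # speaker appearance position, e.g. "left"
    additional: Dict[str, str] = field(default_factory=dict)


class CaptionInfoStore:
    """Stand-in for caption information storage unit 4 (in-memory only)."""

    def __init__(self) -> None:
        self._by_content: Dict[str, List[CaptionInfo]] = {}

    def add(self, content_id: str, info: CaptionInfo) -> None:
        self._by_content.setdefault(content_id, []).append(info)

    def records(self, content_id: str) -> List[CaptionInfo]:
        # Return a copy so callers cannot modify the stored list.
        return list(self._by_content.get(content_id, []))
```

Keeping every attribute on the same record is what lets a later search on any one attribute (role, position, or caption type) lead directly to the presentation timing.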
[0021]
As a result, the caption information of the captions added to the content is recorded and accumulated in the caption information storage unit 4 in association with the content, so that the user can, by specifying caption information, search for and output the content that contains the desired caption information.
[0022]
A more specific description is given with reference to FIG. 1 and FIG. 2.
The user specifies the search target from the user operation unit 5, for example a scene in which the main character of the content is speaking, or a scene in the same content that has narration. The specified information (in this case, scene information) is input to the caption information search unit 6.
[0023]
The caption information search unit 6 first searches the caption information in the caption information storage unit 4 using the information specified from the user operation unit 5. Next, it obtains the presentation timing information of the matching caption text from the search result. It then uses the obtained presentation timing information to call up, from the video / audio storage unit 2, the video corresponding to the presentation timing of that caption text.
[0024]
FIG. 2 shows in what order, and through what exchanges with the other parts shown in FIG. 1 (the video / audio storage unit 2, the caption information storage unit 4, the user operation unit 5, and the search result output unit 7), the caption information search unit 6 carries out a search of the caption information.
FIG. 2 particularly shows a case where a narration scene in a program is searched.
In FIG. 2, when the user instructs a search for narration scenes from the user operation unit 5, the caption information search unit 6 receives the search command from the user operation unit 5 and instructs the caption information storage unit 4 to transfer the stored caption information.
[0025]
After the stored caption information has been transferred from the caption information storage unit 4 in response to this instruction, the caption information search unit 6 compares the transferred caption information with the search condition (in this case, narration scenes) and finds the caption information that matches the condition. Based on the presentation timing information of the caption information found, it instructs the video / audio storage unit 2 to reproduce the corresponding video / audio, and instructs the caption information storage unit 4 to reproduce the corresponding caption information.
[0026]
In accordance with these instructions, the video / audio storage unit 2 reproduces the video / audio at the corresponding presentation timing, and the caption information storage unit 4 reproduces the corresponding caption information. The reproduced video / audio and caption information are output from the search result output unit 7.
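The search sequence of FIG. 2 (transfer the stored caption information, compare it with the search condition, then request playback at the matching presentation timings) can be sketched as follows. Everything here is an assumption made for illustration: `CaptionInfo`, `AVStore`, `search_and_play`, and the start / end times in seconds are hypothetical, and the storage unit is replaced by a stub that only records playback requests.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class CaptionInfo(NamedTuple):
    text: str
    start_sec: float   # presentation timing, assumed to be a start time
    end_sec: float     # assumed end time of the caption's presentation
    kind: str          # caption type from the additional information


class AVStore:
    """Stand-in for video / audio storage unit 2: records play requests."""

    def __init__(self) -> None:
        self.played = []  # list of (start_sec, end_sec) tuples

    def play(self, start_sec: float, end_sec: float) -> None:
        self.played.append((start_sec, end_sec))


def search_and_play(
    stored: Iterable[CaptionInfo],
    condition: Callable[[CaptionInfo], bool],
    av_store: AVStore,
) -> List[CaptionInfo]:
    """Find caption information matching the condition and request playback."""
    hits = [info for info in stored if condition(info)]
    for info in hits:
        # The presentation timing of each hit selects the segment to reproduce.
        av_store.play(info.start_sec, info.end_sec)
    return hits
```

A narration search would pass `lambda info: info.kind == "narration"` as the condition; the returned records correspond to the caption information that is output alongside the reproduced video / audio.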
[0027]
[Effects of the invention]
According to the present invention, when searching for video or audio in content, there is no need to apply any signal processing to the video and audio, and a video index can be created without manually creating new text information describing the content, such as keywords related to the video or audio. The desired video or audio can thus be searched for.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of the video / audio search apparatus of the present invention.
FIG. 2 shows in what order, and through what exchanges with the other parts in FIG. 1, the caption information search unit in FIG. 1 carries out a search of the caption information.
[Explanation of symbols]
1 Demultiplexing processing unit
2 Video / audio storage unit
3 Caption data processing unit
4 Caption information storage unit
5 User operation unit
6 Caption information search unit
7 Search result output unit

Claims

An apparatus for searching for specific video and / or audio from content consisting of video and audio associated with subtitle data,
The caption data has color information corresponding to caption sentences that are color-coded in advance to distinguish the role of the speaker in the video,
Means for inputting the content, separating video and audio and caption data, and extracting the caption data together with time information which is information for synchronizing with the video and audio;
Means for detecting, from the extracted time information, time information representing the timing at which the caption text data is presented in synchronization with the video, as presentation timing information;
Means for extracting subtitle sentence color information from the extracted subtitle data, and generating speaker role information for specifying the speaker role in the video associated with the extracted color information;
Means for extracting, when the caption data is extracted, the presentation position of the caption text on the screen in the video, detecting the on-screen appearance position of the speaker in the screen who is closest to the extracted presentation position, and generating information on the detected appearance position as appearance position information of that speaker;
Means for determining the type of the extracted caption text from symbol characters other than phonetic characters used in the caption text, and generating information on the determined type as additional information of the caption text;
Video / audio storage means for recording the video and audio information;
Caption information storage means for recording the caption data associated with the video and audio information and the generated presentation timing information, linked with the speaker role information, the speaker appearance position information, and the additional information as caption information;
Means for searching for matching subtitle information from the subtitle information storage means based on subtitle information designated for searching for desired video and / or audio, and specifying presentation timing information corresponding to the searched subtitle information;
A video / audio search device comprising: means for instructing the video / audio storage means to reproduce video and audio corresponding to the detected presentation timing information.