JP2005115607A - Video retrieving device - Google Patents

Video retrieving device

Info

Publication number
JP2005115607A
Authority
JP
Japan
Prior art keywords
video
scene
subtitle
analysis
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2003348187A
Other languages
Japanese (ja)
Inventor
Shingo Miyauchi
進吾 宮内
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panasonic Holdings Corp
Original Assignee
Matsushita Electric Industrial Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Matsushita Electric Industrial Co Ltd filed Critical Matsushita Electric Industrial Co Ltd
Priority to JP2003348187A priority Critical patent/JP2005115607A/en
Publication of JP2005115607A publication Critical patent/JP2005115607A/en
Pending legal-status Critical Current

Abstract

PROBLEM TO BE SOLVED: To provide a means for extracting a desired scene from stored video content, so as to automate video retrieval. SOLUTION: Subtitle (teletext) information multiplexed with the video data is extracted and stored together with the video, then used as a clue for retrieval. First, a string search or the like over the stored subtitle information finds subtitles matching the scene the user requires, and the video corresponding to each subtitle's presentation timing becomes a candidate for the requested scene. Image analysis and audio analysis are then applied to the candidate scenes, and the scenes judged to meet the user's request are extracted. COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to an apparatus for retrieving a desired video scene from stored video content.

Conventionally, to retrieve a user-desired scene from stored video content, one method applied some form of image analysis or audio analysis to the video and extracted the scene based on the result. Alternatively, metadata describing the content was created by hand and used to search the video.
JP 2001-69437 A JP 2001-143451 A

The conventional methods using the image analysis described in Patent Document 1 or the audio analysis described in Patent Document 2 impose a heavy processing load, so retrieval takes a long time. Creating content metadata by hand, meanwhile, requires enormous effort.

To solve the above problems, the device of the present invention is characterized in that it extracts the subtitle (teletext) information multiplexed into the video data, stores it together with the video, and uses it as a clue for retrieving video.

First, a string search or the like is performed on the stored subtitle information to find subtitles that appear to match the scene the user requests; the video corresponding to each such subtitle's presentation timing becomes a candidate for the requested scene. Image analysis and audio analysis are then applied to these candidate scenes, and the scenes judged to satisfy the user's request are extracted.

According to the present invention, using subtitle information makes it possible to extract a desired scene faster than conventional video retrieval that relies on image or audio analysis alone. It also eliminates the effort of creating information about the video content by hand.

(Embodiment 1)
FIG. 1 is a block diagram showing the configuration of one embodiment of the video retrieval device of the present invention. An embodiment of the present invention is described below with reference to this figure. Video data in this embodiment is defined as moving images, audio, and the accompanying subtitles (utterances and the like rendered as text), encoded and multiplexed together. A scene to be extracted here means a video segment with a coherent meaning, such as a scoring scene in a sports broadcast, the appearance of a particular person in a drama or variety show, or a particular topic in a news program.

First, when video data is input to the device, the demultiplexing processing unit 100 extracts the subtitle data. The subtitle data is passed to the subtitle data processing unit 110, decoded, and associated with the time information of the input video. This produces subtitle information containing the utterance text and its presentation timing (the time information needed to synchronize with the video). The resulting subtitle information is stored in the video storage unit 200 together with the video.
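The step above — decode captions and attach the video's time information — can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `SubtitleEntry` record and `build_subtitle_track` helper are hypothetical names, and the actual teletext decoding is omitted.

```python
from dataclasses import dataclass

@dataclass
class SubtitleEntry:
    text: str     # decoded caption text (utterance, speaker name, narration, ...)
    start: float  # presentation start time in seconds, relative to the video
    end: float    # presentation end time in seconds

def build_subtitle_track(decoded_captions):
    """Associate each decoded caption string with the video time information
    at which it is presented, producing the subtitle information that is
    stored alongside the video."""
    return [SubtitleEntry(text, start, end)
            for (text, start, end) in decoded_captions]
```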

When the user request accepting unit 400 receives a request from the user to search the stored video for a particular scene, the scene extraction unit 300 begins extracting scenes that match the request. First, when subtitle-based features related to the target scene are set in the scene extraction unit, the subtitle analysis unit 310 detects locations in the subtitle information held in the video storage unit that match those settings. The scene extraction unit then refers to the time information of the detected subtitles and selects the video corresponding to that timing from the video storage unit as candidate scenes.

Because subtitle information contains utterance text, speaker names, scene narration, and the like, the subtitle-based features here are assumed to be, for example, the presence of scene-related keywords or person names, and the subtitle analysis to be string search, natural language processing, and so on. As an example of the processing, illustrated in FIG. 2: to retrieve soccer goal scenes, the string "goal" could be searched for in the stored subtitle information, and the video surrounding the presentation timing of the matching subtitles selected as scene candidates.
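As a rough sketch of this keyword search — assuming, hypothetically, that the subtitle information is held as `(text, start, end)` tuples, since the patent does not specify a data layout — each matching subtitle's presentation interval can be widened into a candidate time window:

```python
def find_candidate_scenes(subtitles, keyword, pad=10.0):
    """Search stored subtitle text for a keyword; each match's presentation
    interval, widened by `pad` seconds on both sides to cover the surrounding
    video, becomes a candidate scene window (start_sec, end_sec)."""
    candidates = []
    for text, start, end in subtitles:
        if keyword in text:
            candidates.append((max(0.0, start - pad), end + pad))
    return candidates
```

Here `pad` stands in for "the surrounding video corresponding to the presentation timing"; any real value would be tuned per content genre.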

Next, the candidate scenes obtained above are examined more closely using image analysis, audio analysis, or both. When image-based features related to the target scene are set in the scene extraction unit, the image analysis unit 320 determines whether each candidate scene matches those settings. The image analysis here may use, as in the prior art above, matching against a model image, object/face recognition, decisions based on luminance or color information, cut segmentation, and so on.

To retrieve the soccer goal scenes of this example, the candidate video obtained by the subtitle analysis is first divided into cuts using cut segmentation. The head frame of each cut is then matched against a model image of a goal scene featuring the goalposts. If the similarity exceeds a threshold, that cut (and several cuts before and after it) can be judged to correspond to the desired scene.
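A toy version of this step might look like the following, with each frame reduced to a color histogram. This is a deliberate simplification under stated assumptions — real cut segmentation and model matching would operate on decoded frames, and the function names and thresholds are hypothetical:

```python
def hist_distance(h1, h2):
    """Mean absolute difference between two equal-length histograms (0 = identical)."""
    return sum(abs(a - b) for a, b in zip(h1, h2)) / len(h1)

def split_into_cuts(frames, cut_threshold=0.3):
    """Cut segmentation: mark a cut boundary wherever consecutive frame
    histograms differ sharply; return the index of each cut's head frame."""
    cuts = [0]
    for i in range(1, len(frames)):
        if hist_distance(frames[i - 1], frames[i]) > cut_threshold:
            cuts.append(i)
    return cuts

def matching_cuts(frames, model_hist, cut_threshold=0.3, sim_threshold=0.8):
    """Keep the cuts whose head frame is similar enough to the model image
    (e.g. a goal-scene model featuring the goalposts)."""
    return [i for i in split_into_cuts(frames, cut_threshold)
            if 1.0 - hist_distance(frames[i], model_hist) >= sim_threshold]
```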

Similarly, when audio-based features related to the target scene are set, the audio analysis unit 330 determines whether each candidate scene matches them. The audio analysis here may use, as in the prior art above, frequency analysis, cheer or specific-sound detection, spectrum analysis, speaker identification, and so on. In the soccer goal-scene example, the power level and frequency of the audio in the candidate video obtained by the subtitle analysis are analyzed first; the portions before and after an interval containing loud cheering can then be judged to correspond to the desired scene.
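The power-level part of this audio analysis can be sketched as follows — a minimal illustration over plain sample lists, in which `loud_sections`, its window size, and its threshold are hypothetical, and the frequency analysis is omitted:

```python
import math

def rms_power(samples):
    """Root-mean-square power of a block of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def loud_sections(audio, window=4, threshold=0.5):
    """Slide a fixed window over the audio; windows whose RMS power exceeds
    the threshold are flagged as cheer-like intervals (start, end), whose
    surroundings can then be judged to correspond to the desired scene."""
    sections = []
    for i in range(0, len(audio) - window + 1, window):
        if rms_power(audio[i:i + window]) > threshold:
            sections.append((i, i + window))
    return sections
```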

The scene extraction unit makes an overall judgment from the above analysis results and extracts the scenes matching the user's request from the video storage unit. The extracted video is then presented to the user by the video presentation unit 500.

As described above, according to the present invention, using subtitle information makes it possible to extract a desired scene faster than conventional video retrieval that relies on image or audio analysis alone, and the effort of creating information about the video content by hand can be eliminated.

The video retrieval device of the present invention provides a scene retrieval function in environments where video content with subtitle information is stored, and is therefore useful in video equipment such as digital broadcast receiving terminals, video recording/playback devices, and media servers.

Block diagram showing the configuration of an embodiment of the present invention
Overview of an example of subtitle analysis

Explanation of symbols

100 Demultiplexing processing unit
110 Subtitle data processing unit
200 Video storage unit
300 Scene extraction unit
310 Subtitle analysis unit
320 Image analysis unit
330 Audio analysis unit
400 User request accepting unit
500 Video presentation unit

Claims (1)

A video retrieval device characterized in that subtitle information multiplexed into video data is extracted and stored together with the video; locations matching a scene desired by the user are detected in the stored subtitle information; and scenes are extracted by applying image analysis or audio analysis to the video corresponding to the presentation timing of those subtitles.
JP2003348187A 2003-10-07 2003-10-07 Video retrieving device Pending JP2005115607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2003348187A JP2005115607A (en) 2003-10-07 2003-10-07 Video retrieving device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2003348187A JP2005115607A (en) 2003-10-07 2003-10-07 Video retrieving device

Publications (1)

Publication Number Publication Date
JP2005115607A true JP2005115607A (en) 2005-04-28

Family

ID=34540456

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2003348187A Pending JP2005115607A (en) 2003-10-07 2003-10-07 Video retrieving device

Country Status (1)

Country Link
JP (1) JP2005115607A (en)


Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006332765A (en) * 2005-05-23 2006-12-07 Sharp Corp Contents searching/reproducing method, contents searching/reproducing apparatus, and program and recording medium
JP2006350477A (en) * 2005-06-13 2006-12-28 Canon Inc File management device, its control method, computer program, and computer readable storage medium
JP2007052626A (en) * 2005-08-18 2007-03-01 Matsushita Electric Ind Co Ltd Metadata input device and content processor
JP2008006214A (en) * 2006-06-30 2008-01-17 Toshiba Corp Operation control device of electronic apparatus
JP4668875B2 (en) * 2006-09-20 2011-04-13 株式会社日立製作所 Program recording / playback apparatus, program playback position control method, and program information providing apparatus
JP2008078876A (en) * 2006-09-20 2008-04-03 Hitachi Ltd Program recording reproducing device, program reproducing position control method and program information providing device
JP2008130050A (en) * 2006-11-24 2008-06-05 Canon Inc Image retrieval device and method therefor
US8265397B2 (en) 2006-11-24 2012-09-11 Canon Kabushiki Kaisha Image retrieval apparatus and method thereof
JP4695582B2 (en) * 2006-12-04 2011-06-08 日本放送協会 Video extraction apparatus and video extraction program
JP2008141621A (en) * 2006-12-04 2008-06-19 Nippon Hoso Kyokai <Nhk> Device and program for extracting video-image
JP2008172324A (en) * 2007-01-09 2008-07-24 Sony Corp Information processor, processing method and program
US8879885B2 (en) 2007-01-09 2014-11-04 Sony Corporation Information processing apparatus, information processing method, and program
JP2008219285A (en) * 2007-03-01 2008-09-18 Nintendo Co Ltd Video content display program and video content display device
WO2009032639A1 (en) * 2007-09-04 2009-03-12 Apple Inc. Display of video subtitles
US9602757B2 (en) 2007-09-04 2017-03-21 Apple Inc. Display of video subtitles
US10652500B2 (en) 2007-09-04 2020-05-12 Apple Inc. Display of video subtitles
US10003764B2 (en) 2007-09-04 2018-06-19 Apple Inc. Display of video subtitles
WO2009110491A1 (en) * 2008-03-07 2009-09-11 シャープ株式会社 Content display apparatus, content display method, program, and recording medium
JP2010161722A (en) * 2009-01-09 2010-07-22 Sony Corp Data processing apparatus and method, and program
US8693847B2 (en) 2009-02-06 2014-04-08 Sony Corporation Contents processing apparatus and method
JP2010218385A (en) * 2009-03-18 2010-09-30 Nippon Hoso Kyokai <Nhk> Content retrieval device and computer program
JP2011109292A (en) * 2009-11-16 2011-06-02 Canon Inc Imaging apparatus, control method and program thereof, and storage medium
JP2012043422A (en) * 2010-08-16 2012-03-01 Nhn Corp Retrieval result providing method and system using subtitle information
WO2013037082A1 (en) * 2011-09-12 2013-03-21 Intel Corporation Using gestures to capture multimedia clips
JP2015177470A (en) * 2014-03-17 2015-10-05 富士通株式会社 Extraction program, extraction method, and extraction device
US9892320B2 (en) 2014-03-17 2018-02-13 Fujitsu Limited Method of extracting attack scene from sports footage
JP2019003604A (en) * 2017-06-09 2019-01-10 富士ゼロックス株式会社 Methods, systems and programs for content curation in video-based communications
JP7069778B2 (en) 2017-06-09 2022-05-18 富士フイルムビジネスイノベーション株式会社 Methods, systems and programs for content curation in video-based communications
CN110035326A (en) * 2019-04-04 2019-07-19 北京字节跳动网络技术有限公司 Subtitle generation, the video retrieval method based on subtitle, device and electronic equipment

Similar Documents

Publication Publication Date Title
JP2005115607A (en) Video retrieving device
CN101650958B (en) Extraction method and index establishment method of movie video scene fragment
US6580437B1 (en) System for organizing videos based on closed-caption information
TWI233026B (en) Multi-lingual transcription system
KR101644789B1 (en) Apparatus and Method for providing information related to broadcasting program
JP4127668B2 (en) Information processing apparatus, information processing method, and program
JP2004152063A (en) Structuring method, structuring device and structuring program of multimedia contents, and providing method thereof
US20060136226A1 (en) System and method for creating artificial TV news programs
CN110781328A (en) Video generation method, system, device and storage medium based on voice recognition
WO2005055196A3 (en) System &amp; method for integrative analysis of intrinsic and extrinsic audio-visual data
US7349477B2 (en) Audio-assisted video segmentation and summarization
CN110691271A (en) News video generation method, system, device and storage medium
JP2012181358A (en) Text display time determination device, text display system, method, and program
JP2005025413A (en) Content processing device, content processing method, and program
JP2006115052A (en) Content retrieval device and its input device, content retrieval system, content retrieval method, program and recording medium
CN110992984B (en) Audio processing method and device and storage medium
US8538244B2 (en) Recording/reproduction apparatus and recording/reproduction method
KR20110080712A (en) Method and system for searching moving picture by voice recognition of mobile communication terminal and apparatus for converting text of voice in moving picture
JP2008022292A (en) Performer information search system, performer information obtaining apparatus, performer information searcher, method thereof and program
CN116017088A (en) Video subtitle processing method, device, electronic equipment and storage medium
JP2007519321A (en) Method and circuit for creating a multimedia summary of an audiovisual data stream
KR100348901B1 (en) Segmentation of acoustic scences in audio/video materials
JP2008059343A (en) Information processing system and program
WO2011161820A1 (en) Video processing device, video processing method and video processing program
JP4323937B2 (en) Video comment generating apparatus and program thereof