JP2008301340A

JP2008301340A - Digest creating apparatus

Info

Publication number: JP2008301340A
Application number: JP2007146917A
Authority: JP
Inventors: Shinji Nabeshima; 伸司鍋島; Takeshi Kawamura; 岳河村; Meiko Maeda; 芽衣子前田
Original assignee: Panasonic Corp
Current assignee: Panasonic Corp
Priority date: 2007-06-01
Filing date: 2007-06-01
Publication date: 2008-12-11

Abstract

PROBLEM TO BE SOLVED: To generate digest content in which the playback speed of each scene of video content in which a subtitle is displayed or audio is played back is changed in accordance with the scene.

SOLUTION: The number of characters in subtitle data contained in the video content, a subtitle display start time at which display of a subtitle starts, and a subtitle display end time at which display of the subtitle ends are detected. Furthermore, for audio data contained in the video content, its kind is determined, and its fundamental frequency, its volume, an audio start time at which playback of the audio starts, and an audio end time at which playback of the audio ends are detected. A playback speed of the video content is determined from the number of subtitle characters, the subtitle display start time, the subtitle display end time, the fundamental frequency, the volume, the audio start time, and the audio end time. Digest content of the video content is generated on the basis of this determination.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a digest creation device for video content, and more particularly to a digest creation device that summarizes content such as television programs and generates digest content for efficient viewing.

With the spread of digital broadcasting and Internet broadcasting and the widespread use of DVD (Digital Versatile Disc) recorders and similar devices, long-duration recording of television programs and so-called time-shifted playback have become common. However, because a user's television viewing time is limited, there is not always enough time to watch every recorded program. If a recorded program is simply fast-forwarded in order to watch it in a short time, important and unimportant scenes alike are played back at the same high speed, and the user may not be able to fully understand the content. For the user, therefore, an important issue when watching recorded television programs is how to reconcile a shorter viewing time with an accurate grasp of the program content.

As techniques for this purpose, Patent Documents 1 to 4 disclose digest creation techniques that analyze data such as video, audio, and subtitles contained in content, extract only the scenes judged to be important, and play back those scenes joined together, or techniques that let the user move between scenes easily by a user operation. Patent Document 5 discloses a technique that plays back scenes judged to be important at normal speed and plays back the other scenes at high speed. Patent Document 6 discloses a technique that temporarily stops playback of a scene containing a subtitle, according to the number of subtitle characters, so that high-speed playback does not cause the user to miss the subtitle content.
Japanese Patent Laid-Open No. 2000-23062
Japanese Patent Laid-Open No. 2002-344871
Japanese Patent Laid-Open No. 2005-115607
Japanese Patent Laid-Open No. H11-55603
Japanese Patent Laid-Open No. 2005-252372
Japanese Patent Laid-Open No. H09-200700

In content such as dramas, scenes containing dialogue often carry important meaning. To watch the whole of such content efficiently in a short time, it is therefore preferable to play back the dialogue portions at a speed as close to normal as possible so that the dialogue is easy to hear, and to play back the other portions as fast as possible. However, because the user must still be able to understand the content during high-speed playback, the playback speed naturally has an upper limit. To view content more efficiently in a short time, it is important to play back scenes with dialogue at the highest speed at which the user can still make out the dialogue, and to play back scenes without dialogue at a speed that still lets the user follow the development of the scene and that is higher than the speed used for scenes with dialogue.

However, the technique disclosed in Patent Document 4 extracts scenes containing dialogue but leaves it to the user's operation to decide how those scenes are played back. The user must therefore specify high-speed playback, normal playback, or skipping, which makes operation cumbersome.

In the technique disclosed in Patent Document 6, the playback speed is controlled based only on the number of subtitle characters, so the playback speed is unconditionally reduced for long subtitles. Even when the speech rate is low and the playback speed could be raised, the playback speed drops whenever there is a long subtitle, so this technique is not suited to shortening viewing time efficiently.

In view of these problems, an object of the present invention is to create a digest in which the playback speed is varied for each scene of the content within a range in which the user can still understand the content, thereby achieving both a shorter viewing time for the content as a whole and efficient content viewing.

To solve this problem, a digest creation device according to the present invention is a digest creation device that generates digest content summarizing video content, and comprises: a subtitle analysis unit that detects the number of characters and the subtitle display time of each subtitle displayed in the video content; a scene extraction unit that calculates a speech rate from the number of subtitle characters and the subtitle display time and determines, based on the speech rate, the playback speed of the video content during the subtitle display time; and a digest generation unit that generates digest content according to the playback speed determined by the scene extraction unit.

Here, the digest generation unit may generate digest content that is played back at a constant speed outside the subtitle display times.

Here, the digest generation unit may generate digest content that, outside the subtitle display times, is played back at a speed that increases with the time elapsed from the start of the video content.

Here, the digest generation unit may generate digest content that plays back only the subtitle display times, at the playback speed determined by the scene extraction unit.

To solve this problem, a digest creation device according to the present invention is a digest creation device that generates digest content summarizing video content, and comprises: a subtitle analysis unit that detects the number of characters and the subtitle display time of each subtitle displayed in the video content; an audio analysis unit that analyzes the audio of the video content played back during the subtitle display time and detects an audio playback time and at least one of a fundamental frequency and a volume; a scene extraction unit that calculates a speech rate from the number of subtitle characters and the audio playback time and determines the playback speed of the video content during the audio playback time based on the speech rate and at least one of the fundamental frequency and the volume; and a digest generation unit that generates digest content according to the playback speed determined by the scene extraction unit.

Here, the digest generation unit may generate digest content that is played back at a constant speed outside the audio playback times.

Here, the digest generation unit may generate digest content that, outside the audio playback times, is played back at a speed that increases with the time elapsed from the start of the video content.

Here, the digest generation unit may generate digest content that plays back only the audio playback times, at the playback speed determined by the scene extraction unit.

Here, the scene extraction unit may determine the playback speed of the video content during the subtitle display time based on the speech rate and the genre of the video content.

Here, the scene extraction unit may determine the playback speed of the video content during the audio playback time based on the speech rate, at least one of the fundamental frequency and the volume, and the genre of the video content.

Here, the scene extraction unit may determine the playback speed of the video content during the audio playback time based on the speech rate, at least one of the fundamental frequency and the volume, and the kind of audio in the audio playback time.

Here, the number of subtitle characters may be the number of kana characters obtained when all the characters of the subtitle are written in kana.

Here, the device may include an input processing unit that detects the number of subtitle characters from subtitles superimposed on a video signal contained in the video content and sends the detected number of subtitle characters to the subtitle analysis unit.

Here, the scene extraction unit may divide the speech rate into a plurality of levels and determine a playback speed corresponding to each level of the speech rate.

Here, the scene extraction unit may determine the corresponding playback speed based on the speech rate divided into a plurality of levels, the fundamental frequency divided into a plurality of levels, and the volume divided into a plurality of levels.

According to the present invention as described above, a playback speed suited to the emotion and excitement conveyed by the dialogue can be calculated for each scene of the video content, and digest content can be created accordingly. As a result, the user can not only watch the video content in a short time but also avoid missing the sight or sound of exciting scenes, which enables efficient viewing accompanied by an accurate understanding of the video content.

(First embodiment)
Hereinafter, a first embodiment of the present invention will be described with reference to the drawings. FIG. 1 shows the configuration of the digest creation device according to the present embodiment. The digest creation device 1 includes a storage unit 11, an input processing unit 12, a subtitle analysis unit 13, an audio analysis unit 14, a scene extraction unit 15, a digest generation unit 16, a control unit 17, and an output interface (I/F) 18.

The storage unit 11 stores video content data such as television programs and movies, as well as the subtitle table, audio table, playback speed table, and digest table described later, and digest content that summarizes the video content. Specifically, the storage unit 11 is a semiconductor memory, a hard disk drive, an optical disk drive, or another storage device.

The input processing unit 12 receives video content data by wire or wirelessly from broadcast waves or from a network such as the Internet, and stores it in the storage unit 11. The input processing unit 12 includes a tuner, a network adapter, and the like. The input processing unit 12 may also receive video content data via a bus such as USB or IEEE 1394, or via removable media such as a hard disk or a memory card.

The subtitle analysis unit 13 analyzes the video content data stored in the storage unit 11 and detects or calculates, for each subtitle contained in the video content data, the subtitle display start time at which display of the subtitle starts, the subtitle display end time at which display of the subtitle ends, and the number of subtitle characters. It stores the subtitle display start time, the subtitle display end time, and the number of subtitle characters in the storage unit 11 as the subtitle table described later. The period from the subtitle display start time to the subtitle display end time is called the subtitle display time.

The audio analysis unit 14 refers to the subtitle table stored in the storage unit 11, analyzes the audio data of the video content data stored in the storage unit 11, and generates the audio table described later. The audio analysis unit 14 classifies the audio data played back during each subtitle display time held in the subtitle table into one of several audio genres. It further detects the audio start time Tss at which playback of the speech or sound effect starts, the audio end time Tse at which it ends, the fundamental frequency of the audio, and its volume. The audio genre, audio start time, audio end time, fundamental frequency, and volume are recorded in the storage unit 11 as the audio table described later. The period from the audio start time to the audio end time is called the audio playback time.

The scene extraction unit 15 extracts the audio start time Tss and the audio end time Tse from the audio table stored in the storage unit 11. It also extracts, from the subtitle table stored in the storage unit 11, the number of subtitle characters Nc displayed during the audio playback time. Using Nc and the length of time from Tss to Tse during which the audio is played back, it calculates the speech rate Sv for each audio playback time by the following formula.
Sv = (Tse - Tss) / Nc
The scene extraction unit 15 then uses the playback speed table to designate the playback speed for each audio playback time based on this speech rate and on the fundamental frequency and volume held in the audio table, generates the digest table described later, and stores it in the storage unit 11.
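The calculation of Sv can be illustrated with a minimal Python sketch. It is not part of the disclosed embodiment: the function names `parse_hms` and `speech_rate` are hypothetical, times are assumed to be HH:MM:SS strings as in the tables, and pairing the interval 00:01:15 to 00:01:19 with a 3-character subtitle is an assumption made only for the example. As the formula is written, Sv is the playback time per character, so a larger value corresponds to slower speech; the reciprocal Nc / (Tse - Tss) would give characters per second.

```python
def parse_hms(text: str) -> int:
    """Convert an HH:MM:SS string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def speech_rate(tss: str, tse: str, nc: int) -> float:
    """Sv = (Tse - Tss) / Nc, exactly as the formula is written in the text.

    The result is in seconds per character.
    """
    if nc <= 0:
        raise ValueError("Nc must be a positive character count")
    return (parse_hms(tse) - parse_hms(tss)) / nc


if __name__ == "__main__":
    # Hypothetical pairing of an audio interval with a 3-character subtitle.
    print(speech_rate("00:01:15", "00:01:19", 3))
```
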

The digest generation unit 16 refers to the digest table stored in the storage unit 11, generates digest data of the content, and records the generated digest data in the storage unit 11.

The control unit 17 controls the operations of the subtitle analysis unit 13, the audio analysis unit 14, the scene extraction unit 15, and the digest generation unit 16. The control unit 17 also reads digest data from the storage unit 11 as needed and outputs the digest to the outside through the output interface 18.

In the present embodiment, the video content data is assumed to take the form of an MPEG-2 transport stream. FIG. 2 shows the data structure of the MPEG-2 transport stream in the present embodiment. The video content data contains a subtitle stream made up of subtitle PES (Packetized Elementary Stream) packets, an audio stream made up of audio PES packets, a video stream made up of video PES packets, a data stream of data accompanying the content data, and a PCR (Program Clock Reference) stream. The PCR is the reference time on the transmitter side and is the time information that serves as the reference for time stamps. The time at which the actual data of each subtitle, audio, or video item contained in the video content data is displayed or output is given in each PES as a time stamp. In the figure, a subtitle PES carries subtitle data and a time stamp, an audio PES carries audio data and a time stamp, and a video PES carries video data and a time stamp. These streams are multiplexed to form one transport stream. The details of the MPEG-2 transport stream and of each PES are specified in ISO/IEC 13818-1 (MPEG-2 Systems).

When video, audio, and subtitles are played back from an MPEG-2 transport stream, the content playback device sets its internal clock to the time given by the PCR. The content playback device then separates the video, audio, and subtitle PES packets from the stream and, when the internal clock reaches the time of the time stamp given in a PES, decodes the data in that PES and displays or outputs it.

Next, the subtitle PES will be described. FIG. 3 shows the structure of a subtitle PES contained in the subtitle stream of the video content data. One subtitle PES contains a time stamp and subtitle data. The subtitle data contains partial subtitle data items that hold character data and partial subtitle data items that hold control data for controlling the decoration and display of that character data. For example, the subtitle PES 20 in FIG. 3 contains a time stamp 21 and subtitle data 22, and the subtitle data 22 contains partial subtitle data 23 to 26. The time stamp 21 gives the time at which display of the character data contained in the subtitle PES 20 starts. Partial subtitle data 23 holds "こんにちは" ("Hello") as character data to be displayed as a subtitle. Partial subtitle data 24 holds "TIME" and "10", the display maintenance time Td, as control data. "TIME" instructs that the display of the character data contained in the preceding partial subtitle data 23 be maintained, and "10" indicates how long the display is maintained. Partial subtitle data 25 holds "CS" as control data. "CS" instructs that the currently displayed subtitle be erased. Partial subtitle data 26 holds "ありがとうございます" ("Thank you") as character data to be displayed as a subtitle. Because the data of a subtitle PES is read in order from the left of the figure, the items from time stamp 21 to partial subtitle data 26 are read and executed in that order.

The procedure for executing the content of a subtitle PES is as follows. In the case of the subtitle PES 20 shown in FIG. 3, the time at which the subtitle data becomes valid is first calculated with reference to the PCR in the video content data and the time stamp 21 in the subtitle PES 20. At the calculated time, the character data "こんにちは" ("Hello") of partial subtitle data 23 is displayed. Because the next partial subtitle data 24 holds the control data "TIME" and "10", and the following partial subtitle data 25 holds "CS", the display of the first character data "こんにちは" is maintained for 10 seconds and erased 10 seconds after its display start time. Then, at the display end time of "こんにちは", the character data "ありがとうございます" ("Thank you") of the next partial subtitle data 26 is displayed. The display end time of "ありがとうございます" is the display start time of the character data "ごきげんよう" ("Good day") contained in partial subtitle data 33 of the next subtitle PES 30. This time is given in the time stamp 31 of the subtitle PES 30.
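The execution procedure described for the subtitle PES can be sketched in Python as follows. This is an illustrative sketch only, not the disclosed implementation: the tuple encoding of the partial subtitle data ("TEXT", "TIME", "CS"), the names `SubtitleRow` and `interpret`, and the time stamps 100.0 and 125.0 are all hypothetical. It also assumes that character data still on screen when the next PES arrives ends at that PES's time stamp, and that new character data replaces any character data still displayed.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class SubtitleRow:
    text: str
    start: float                  # display start time in seconds
    end: Optional[float] = None   # display end time; None while unknown

    @property
    def char_count(self) -> int:
        return len(self.text)


def interpret(pes_list: Sequence[Tuple[float, Sequence[tuple]]]) -> List[SubtitleRow]:
    """Walk each PES left to right and derive display start and end times."""
    rows: List[SubtitleRow] = []
    shown: Optional[SubtitleRow] = None
    for timestamp, parts in pes_list:
        now = timestamp
        # Text left on screen by the previous PES ends when this PES starts.
        if shown is not None:
            shown.end = now
            shown = None
        for part in parts:
            kind = part[0]
            if kind == "TEXT":
                if shown is not None:   # assumption: new text replaces old text
                    shown.end = now
                shown = SubtitleRow(text=part[1], start=now)
                rows.append(shown)
            elif kind == "TIME":        # keep the current display for N seconds
                now += part[1]
            elif kind == "CS":          # erase the displayed subtitle
                if shown is not None:
                    shown.end = now
                    shown = None
    return rows


if __name__ == "__main__":
    pes20 = (100.0, [("TEXT", "こんにちは"), ("TIME", 10), ("CS",),
                     ("TEXT", "ありがとうございます")])
    pes30 = (125.0, [("TEXT", "ごきげんよう")])
    for row in interpret([pes20, pes30]):
        print(row.text, row.start, row.end, row.char_count)
```

Each resulting row carries the three values that the subtitle table holds: display start time, display end time, and character count.
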

Next, the subtitle table will be described. FIG. 4 shows the subtitle table of the present embodiment. The subtitle table holds, for each piece of character data contained in a partial subtitle data item, the subtitle display start time, the subtitle display end time, and the number of subtitle characters. For example, No. 1 in FIG. 4 holds, for one piece of character data, a subtitle display start time of 00:01:15, a subtitle display end time of 00:01:22, and a subtitle character count of 3.

Next, the audio table will be described. FIG. 5 shows the audio table of the present embodiment. The audio table holds the genre, playback start time, playback end time, fundamental frequency, and volume of the audio data played back between the subtitle display start time and the subtitle display end time of each entry in the subtitle table. The audio data played back between the subtitle display start time 00:01:15 and the subtitle display end time 00:01:22 shown in No. 1 of the subtitle table of FIG. 4 consists of the No. 1 data in FIG. 5, whose audio genre is "SPC", and the No. 2 data, whose audio genre is "EFF". Here, the audio genre is written as "APP" for cheering and applause, "EFF" for sound effects, "SCR" for screaming, "SPC" for commentary and speech, and "MSC" for music. The audio genres need not be classified in exactly this way; any classification that makes clear whether dialogue may be included is acceptable.

Note that the audio table of FIG. 5 contains no data corresponding to the subtitle data with the subtitle display start time 00:01:48 shown in No. 2 of the subtitle table of FIG. 4. This is because there is no audio data corresponding to that subtitle display.
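The relationship between the two tables can be illustrated with a short Python sketch that looks up the audio rows falling within a subtitle display time. This is only one possible way to model the tables: the class and function names are hypothetical, times are expressed in seconds, and the end time and character count of subtitle No. 2 and the boundaries of the two audio rows are placeholder values that do not come from the figures.

```python
from typing import List, NamedTuple


class SubtitleRow(NamedTuple):
    no: int
    start: int        # subtitle display start time in seconds
    end: int          # subtitle display end time in seconds
    char_count: int


class AudioRow(NamedTuple):
    no: int
    genre: str        # "APP", "EFF", "SCR", "SPC", or "MSC"
    start: int        # audio start time in seconds
    end: int          # audio end time in seconds


def audio_rows_for(subtitle: SubtitleRow, audio_table: List[AudioRow]) -> List[AudioRow]:
    """Return the audio rows whose playback time overlaps the subtitle display time."""
    return [a for a in audio_table
            if a.start < subtitle.end and subtitle.start < a.end]


if __name__ == "__main__":
    subtitle_table = [
        SubtitleRow(1, 75, 82, 3),     # 00:01:15 to 00:01:22, 3 characters
        SubtitleRow(2, 108, 112, 4),   # starts 00:01:48; end and count are placeholders
    ]
    audio_table = [
        AudioRow(1, "SPC", 75, 79),    # placeholder boundaries
        AudioRow(2, "EFF", 79, 82),    # placeholder boundaries
    ]
    for row in subtitle_table:
        print(row.no, [a.genre for a in audio_rows_for(row, audio_table)])
```

With these placeholder values, subtitle No. 1 is matched to the "SPC" and "EFF" rows, while subtitle No. 2 has no matching audio row.
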

Next, the playback speed table will be described. FIG. 6 shows the playback speed table of the present embodiment. In this table, one playback speed is designated by the combination of the fundamental frequency and volume of the audio data together with the speech rate. For example, when the fundamental frequency of the audio data is high and the volume is high, a high speech rate tends to make the speech difficult to follow, so the playback speed is set to 1.0x, the normal playback speed; when the speech rate is low, the speech remains easy to follow even at a higher playback speed, so the playback speed is set to 1.5x. The playback speed table is thus used to designate the playback speed for each audio playback time from the fundamental frequency, the volume, and the speech rate.
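A table lookup of this kind can be sketched in Python as follows. Only the two cells stated in the text (high fundamental frequency and high volume, at high and low speech rate) are taken from the description; the two-level classification, the thresholds of 200 Hz and 60 dB, the fallback speed, and all names are hypothetical assumptions made for illustration.

```python
from typing import Dict, Tuple

# Key: (fundamental frequency level, volume level, speech rate level).
PLAYBACK_SPEED_TABLE: Dict[Tuple[str, str, str], float] = {
    ("high", "high", "high"): 1.0,  # stated in the text
    ("high", "high", "low"): 1.5,   # stated in the text
}


def level(value: float, threshold: float) -> str:
    """Classify a measurement into one of two levels (assumed scheme)."""
    return "high" if value >= threshold else "low"


def playback_speed(f0_hz: float, volume_db: float, rate_level: str,
                   f0_threshold_hz: float = 200.0,     # hypothetical threshold
                   volume_threshold_db: float = 60.0,  # hypothetical threshold
                   fallback: float = 1.0) -> float:    # hypothetical default
    """Look up the playback speed for one audio playback time."""
    key = (level(f0_hz, f0_threshold_hz),
           level(volume_db, volume_threshold_db),
           rate_level)
    return PLAYBACK_SPEED_TABLE.get(key, fallback)


if __name__ == "__main__":
    print(playback_speed(250.0, 70.0, "high"))
    print(playback_speed(250.0, 70.0, "low"))
```

The speech rate level is passed in already classified so that the sketch does not depend on how the speech rate value itself is scaled.
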

Next, the digest table will be described. FIG. 7 shows the digest table of the present embodiment. This table collects the portions for which a playback speed has been specifically designated on the basis of the playback speed table of FIG. 6. FIG. 7 shows four such portions, No. 1 to No. 4; for example, the portion of No. 1 from the audio start time 00:01:15 to the playback end time 00:01:19 is to be played back at 2.0x speed.
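The effect of a digest table on viewing time can be estimated with a small Python sketch. It assumes the variant in which everything outside the designated portions is played back at one constant speed, and that the designated portions do not overlap. The function name, the 120-second content length, and the 4.0x speed for the remaining portions are hypothetical; only the entry from 00:01:15 to 00:01:19 at 2.0x comes from the text.

```python
from typing import Iterable, Tuple


def digest_duration(entries: Iterable[Tuple[float, float, float]],
                    content_length: float,
                    other_speed: float) -> float:
    """Estimate the viewing time of the digest in seconds.

    entries: (start, end, speed) rows of the digest table, in seconds.
    content_length: length of the original content in seconds.
    other_speed: constant speed applied outside the listed portions.
    """
    designated = 0.0   # original time covered by digest table entries
    total = 0.0        # resulting viewing time
    for start, end, speed in entries:
        designated += end - start
        total += (end - start) / speed
    total += (content_length - designated) / other_speed
    return total


if __name__ == "__main__":
    table = [(75.0, 79.0, 2.0)]   # 00:01:15 to 00:01:19 at 2.0x speed
    # Hypothetical 120-second content, remaining portions at 4.0x speed.
    print(digest_duration(table, 120.0, 4.0))
```
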

The operation of the digest creation device 1 configured in this way will be described with reference to the flowchart shown in FIG. 8. First, the input processing unit 12 receives video content data (S11) and records it in the storage unit 11. Next, the subtitle analysis unit 13 analyzes the video content data to generate the subtitle table (S12) and records the subtitle table in the storage unit 11. The audio analysis unit 14 then analyzes the audio data in the video content data, only for the subtitle display times listed in the subtitle table, to generate the audio table (S13), and records the audio table in the storage unit 11. The scene extraction unit 15 extracts the audio start time, audio end time, fundamental frequency, and volume from the audio table and calculates the speech rate. The scene extraction unit 15 then refers to the playback speed table based on these audio table data and the speech rate, and designates the playback speed for the audio playback time. It generates a digest table holding the audio start time, audio end time, and playback speed (S14) and stores it in the storage unit 11. After that, the digest generation unit 16 refers to the digest table to generate digest content, and either stores the generated digest content in the storage unit 11 or outputs it to the outside via the output interface 18 (S15).

The operation of the subtitle analysis unit 13 in S12 of FIG. 8, the operation of the audio analysis unit 14 in S13, and the operation of the scene extraction unit 15 in S14 are described in detail below.

<Operation of the Caption Analysis Unit 13>
FIGS. 9A and 9B are flowcharts showing the operation of the caption analysis unit 13. The caption analysis unit 13 first analyzes the data of the video content (video stream) shown in FIG. 2 and finds the first and second PCRs. From the difference between the times of these two PCRs, the difference between their positions in the stream, and the position of the first PCR in the stream, it approximates the head reference time of the stream (S21). That is, if the times of the first and second PCRs are Tp1 and Tp2 and their positions in the stream are Pp1 and Pp2, the head reference time Ts is given by the following equation.
Ts = (Tp1 * Pp2 - Tp2 * Pp1) / (Pp2 - Pp1)
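The equation for Ts is a linear extrapolation: it takes the straight line through the two PCR samples (Pp1, Tp1) and (Pp2, Tp2) and evaluates it at stream position 0. A minimal Python sketch of that calculation is shown here; the function name and the sample values are illustrative assumptions, not part of the original description.

```python
def head_reference_time(tp1: float, pp1: float, tp2: float, pp2: float) -> float:
    """Extrapolate the PCR clock back to stream position 0 (step S21).

    tp1, tp2: times of the first and second PCR (any consistent unit).
    pp1, pp2: positions of those PCRs within the stream (e.g. byte offsets).
    """
    if pp2 == pp1:
        raise ValueError("the two PCRs must be at different stream positions")
    return (tp1 * pp2 - tp2 * pp1) / (pp2 - pp1)


# Hypothetical values: the clock advances 10 time units per position unit and
# the first PCR (time 150) sits at position 10, so position 0 maps to time 50.
ts = head_reference_time(tp1=150, pp1=10, tp2=250, pp2=20)
```

Because the estimate relies on only two samples, it assumes a constant bit rate between the head of the stream and the second PCR, which is why the description calls it an approximation.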

In the following description, a count value n (an integer) distinguishes the partial caption data items that contain character data. A counter value of n = 0 indicates that no character data has been detected yet, and the first detected character data is assigned n = 1. Using n, the display start time and display end time of each character data item are written Tcs(n) and Tce(n), respectively.

Once the head reference time Ts has been determined, the caption analysis unit 13 searches the video content data for caption PESs in order from the head of the stream (S22) and determines whether a caption PES has been detected (S23). When the caption PES 20 in FIG. 3 is detected as the first caption PES, the time stamp Tpts written in its time stamp 21 is extracted, and the initial value of the analysis time Tcc, which serves as the candidate for the display start and display end times, is calculated (S24). The initial value of Tcc is the difference between the time stamp Tpts of the caption PES 20 and the head reference time Ts, given by the following equation.
Tcc = Tpts - Ts

If the detected caption PES does not specify a time stamp Tpts, the time at which the first caption takes effect may instead be calculated from the PCR-referenced position of the caption data in the video content data and the head reference time Ts, and used as the analysis time Tcc.

Next, the partial caption data 23 is read from the caption data 22 of the detected caption PES 20 (S25). The partial caption data 23 is analyzed to determine whether it contains data (S26). If it contains no data, the process returns to step S22 and searches for the next caption PES.

It is then determined whether the partial caption data 23 is character data (S27). Because it is character data, the number of characters is counted and held as the character count Nc (S28). Next, it is determined whether the counter value n is greater than 0 and the display end time Tce(n) is still undetermined (S29). Since the counter is at its initial value n = 0, the counter is incremented to n = 1 and the analysis time Tcc is set as the display start time Tcs(1) (S31).

Next, the partial caption data 24 is read (S25), and it is determined whether it is "TIME" (S32). The partial caption data 24 contains the control data "TIME" and the display hold time Td of "10", so the analysis time Tcc is updated by adding the 10 seconds of Td (S33). The partial caption data 25 is then read (S25), and it is determined whether it is "CS" (S34). Because the partial caption data 25 is the control data "CS", the process goes on to determine whether n is greater than 0 and Tce(n) is undetermined (S35). Here n = 1 and Tce(1) is undetermined, so Tcc is set as the display end time Tce(1) and the held character count Nc is set as Nc(1) (S36). At this point Tcs(1), Tce(1), and Nc(1) are all determined for n = 1.

Next, the partial caption data 26 is read (S25). Because it is character data, its number of characters is counted and held as Nc (S28). It is determined whether n is greater than 0 and Tce(n) is undetermined (S29). Here n = 1 and Tce(1) has already been determined, so the counter is incremented to n = 2 and Tcc is set as the display start time Tcs(2) (S31).

Next, an attempt is made to read further partial caption data (S25), and it is determined whether there is data (S26). The caption data 22 contains no partial caption data after the partial caption data 26, so it is determined that there is no data and the process returns to S22 to search for the next caption PES. It is then determined whether a caption PES has been detected (S23). When the next caption PES 30 is detected, the time stamp Tpts written in its time stamp 31 is extracted and the analysis time Tcc is calculated (S24). The partial caption data 33 is read from the caption data 32 of the caption PES 30 (S25), analyzed to determine whether it contains data (S26), and checked to see whether it is character data (S27). Because the partial caption data 33 contains character data, the number of characters is counted and held as Nc (S28). It is then determined whether n is greater than 0 and Tce(n) is undetermined (S29). Here n = 2 and Tce(2) is undetermined (YES at S29), so Tcc is set as the display end time Tce(2) and the held character count Nc is set as Nc(2) (S30). At this point Tcs(2), Tce(2), and Nc(2) are determined for n = 2. The counter is then incremented to n = 3 and Tcc is set as the display start time Tcs(3) (S31).

Next, an attempt is made to read further partial caption data (S25), and it is determined whether there is data (S26). The caption data 32 contains no partial caption data after the partial caption data 33, so it is determined that there is no data and the process returns to S22 to search for the next caption PES. This processing is repeated so that the caption PES 40, the caption PES 50, and all subsequent caption PESs are analyzed in the same way, and the display start time, display end time, and caption character count of each caption table entry are determined in turn.

When all caption PESs have been analyzed and no caption PES remains to be detected, it is determined whether n is greater than 0 and Tce(n) is undetermined (S37). In this walk-through n > 0 and Tce(n) is undetermined, so the end time of the stream is assigned to Tce(n) and the held character count Nc is set as Nc(n) (S38). With the display start time, display end time, and caption character count now determined for the last character data as well, the caption display start times Tcs(k), display end times Tce(k), and caption character counts Nc(k) (k being each integer from 1 to n) are recorded in the storage unit 11 (S39), and the caption table generation flow ends. If the condition at S37 does not hold, that is, if n is 0 or Tce(n) has already been determined, the caption table is recorded in the storage unit 11 immediately (S39) and the flow ends.
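The walk-through of steps S22 to S39 amounts to a small state machine over the partial caption data. The Python sketch below is one possible reading of it, under stated assumptions: each caption PES is modeled as a hypothetical `(pts, units)` pair, each unit is a tuple tagged `"TEXT"`, `"TIME"`, or `"CS"`, and the character count is recorded when an entry is opened rather than when it is closed, which is a simplification and not the order given in the flowchart.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaptionEntry:
    start: float                 # Tcs(n)
    chars: int                   # Nc(n)
    end: Optional[float] = None  # Tce(n); None while still undetermined


def build_caption_table(pes_list, ts, stream_end):
    """Build caption table rows from (pts, units) pairs (hypothetical model)."""
    table = []
    for pts, units in pes_list:              # S22/S23: next caption PES
        tcc = pts - ts                       # S24: Tcc = Tpts - Ts
        for unit in units:                   # S25/S26: next partial caption data
            kind = unit[0]
            if kind == "TEXT":               # S27/S28: character data
                if table and table[-1].end is None:
                    table[-1].end = tcc      # S30: close the open entry
                table.append(CaptionEntry(start=tcc, chars=len(unit[1])))  # S31
            elif kind == "TIME":             # S32/S33: advance the analysis time
                tcc += unit[1]
            elif kind == "CS":               # S34 to S36: clear screen
                if table and table[-1].end is None:
                    table[-1].end = tcc
    if table and table[-1].end is None:      # S37/S38: last caption runs to the end
        table[-1].end = stream_end
    return table


# Hypothetical input mirroring the shape of caption PES 20 and caption PES 30.
pes_list = [
    (75.0, [("TEXT", "abcde"), ("TIME", 10), ("CS",), ("TEXT", "xyz")]),
    (90.0, [("TEXT", "hello!")]),
]
table = build_caption_table(pes_list, ts=0.0, stream_end=100.0)
```

With this input the first caption is closed by the "CS" unit, the second by the arrival of new text in the next PES, and the third by the end of the stream, which are the three ways of closing an entry described in the text.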

When calculating the caption character count, the caption analysis unit 13 may convert kanji to their kana reading and count the resulting characters. For example, the character data "元気です" has 4 characters, but its kana reading "げんきです" has 5 characters, so in this case the caption character count of the character data is taken to be 5.

<Operation of the Audio Analysis Unit 14>
FIG. 10 is a flowchart showing the operation of the audio analysis unit 14. First, the audio analysis unit 14 reads the caption display start time Tcs and display end time Tce from the previously generated caption table, in order from the first entry (S51). It then determines whether all entries of the caption table have been read (S52). The display start time Tcs read from the caption table is assigned to the analysis time Tcc (S53). The audio data of the audio PES at the position corresponding to Tcc in the video content data is then analyzed to detect the audio type (audio genre), audio end time, fundamental frequency, and volume (S54).

Audio analysis here is typically done by comparing the audio data with the waveforms of sample audio data for each audio genre (cheering and applause, sound effects, screaming, commentary and speech, music, and so on) and classifying it into the genre whose waveform it most resembles.

The analysis result of S54 is stored in the storage unit 11 as an audio table entry (S55). It is then determined whether the audio end time obtained in S54 exceeds the display end time Tce given in the caption table (S56). If it does, the process returns to S51 and reads the display start time Tcs and display end time Tce of the next caption table entry. If it does not, the audio end time obtained in S54 is set as the analysis time Tcc (S57) and the process returns to step S54.
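The loop over S51 to S57 can be sketched as follows. The analyzer itself is outside the scope of this sketch, so it is passed in as a hypothetical callable `analyze_segment(tcc)` that returns `(genre, end_time, f0, volume)` for the audio segment starting at `tcc`. Two further assumptions are made for safety: the loop stops when the audio end time reaches the display end time (the text says "exceeds"), and an analyzer that fails to advance raises an error instead of looping forever.

```python
def build_audio_table(caption_table, analyze_segment):
    """caption_table: iterable of (Tcs, Tce) pairs.
    analyze_segment: hypothetical callable, tcc -> (genre, end, f0, volume).
    """
    audio_table = []
    for tcs, tce in caption_table:           # S51/S52: next caption table entry
        tcc = tcs                            # S53
        while True:
            genre, end, f0, volume = analyze_segment(tcc)   # S54
            if end <= tcc:
                raise ValueError("analyzer did not advance past the analysis time")
            audio_table.append(              # S55
                {"genre": genre, "start": tcc, "end": end, "f0": f0, "volume": volume}
            )
            if end >= tce:                   # S56: caption display time covered
                break
            tcc = end                        # S57: continue from the audio end time
    return audio_table
```

Restricting the outer loop to caption display times is what keeps the audio analysis from running over the whole stream, as stated in the overview of step S13.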

<Operation of the Scene Extraction Unit 15>
FIG. 11 is a flowchart showing the operation of the scene extraction unit 15. First, the scene extraction unit 15 reads the first audio genre, audio start time, audio end time, fundamental frequency, and volume from the audio table (S61). It then determines whether all entries of the audio table have been read (S62). If the audio table has been read to the end, the process ends; at this point only the first entry has been read, so the process proceeds to S63. It is determined whether the audio genre that was read is something other than sound effect "EFF" or music "MSC" (S63). If it is "EFF" or "MSC", the process returns to step S61 and reads the next entry. Otherwise, the audio start time and audio end time that were read are stored in the storage unit 11 as digest table data (S64). In other words, a period from audio start time to audio end time whose audio genre is sound effect or music is not regarded as a period containing dialogue and is not entered in the digest table.

Next, the caption character count Nc for the period from the audio start time to the audio end time read in S61 is read from the caption table, and the speech rate is calculated either by dividing Nc by the length of that period or by dividing the length of that period by Nc (S65).

The playback speed table is then consulted to obtain a playback speed from the calculated speech rate and the fundamental frequency and volume read in S61. This playback speed is added to the audio start time and audio end time stored in S64 and stored in the storage unit 11 as digest table data (S66).
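Steps S61 to S66 can be summarized in a short Python sketch. It assumes that each audio table row has already been joined with its caption character count (the `chars` field is hypothetical), computes the speech rate as characters per second (one of the two options in S65), and takes the playback speed table as a hypothetical callable `pick_speed(rate, f0, volume)` because the table contents of FIG. 6 are not reproduced in this section.

```python
NON_SPEECH_GENRES = {"EFF", "MSC"}  # sound effects and music are skipped (S63)


def extract_scenes(audio_rows, pick_speed):
    """Return digest rows as (start, end, playback_speed) tuples."""
    digest = []
    for row in audio_rows:                                   # S61/S62
        if row["genre"] in NON_SPEECH_GENRES:                # S63
            continue
        duration = row["end"] - row["start"]
        if duration <= 0 or row["chars"] <= 0:
            continue                                         # no usable speech rate
        rate = row["chars"] / duration                       # S65: chars per second
        speed = pick_speed(rate, row["f0"], row["volume"])   # S66: table lookup
        digest.append((row["start"], row["end"], speed))     # S64 + S66
    return digest
```

Passing the lookup in as a function keeps the sketch independent of how many levels the playback speed table uses or how its parameters are combined.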

The process then returns to S61, reads the next entry from the audio table, and repeats the same processing. S62 determines whether the audio table has been read to the end, and if so the process ends. At that point the storage unit 11 holds the digest table shown in FIG. 7.

The digest generation unit 16 refers to the digest table generated by the above processing, generates digest content from the video content data held in the storage unit 11, and stores it in the storage unit 11. In this digest content, the portions listed in the digest table are played back at their specified playback speeds. Portions not listed in the digest table may, for example, be skipped entirely, played back at a constant speed, played back faster the longer the portion is, or played back progressively faster as the program goes on.
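To see how these options affect viewing time, the following Python sketch computes the running time of a digest for two of the policies: skipping unlisted portions, or playing them at a constant speed. The function name, parameters, and sample numbers are illustrative assumptions, and the digest table entries are assumed not to overlap.

```python
def digest_duration(total_length, digest_table, gap_speed=None):
    """Running time of the digest.

    digest_table: (start, end, speed) tuples, assumed non-overlapping.
    gap_speed: constant speed for unlisted portions, or None to skip them.
    """
    listed_source = sum(end - start for start, end, _ in digest_table)
    listed_output = sum((end - start) / speed for start, end, speed in digest_table)
    if gap_speed is None:
        return listed_output
    return listed_output + (total_length - listed_source) / gap_speed


# Hypothetical 100-second program with two listed portions.
example_table = [(10, 20, 2.0), (30, 50, 1.0)]
skipped = digest_duration(100, example_table)                  # 5 + 20 seconds
constant = digest_duration(100, example_table, gap_speed=4.0)  # plus 70 / 4 seconds
```

In this example, skipping the unlisted portions gives a 25-second digest, while playing them at 4.0x gives 42.5 seconds.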

In the present embodiment the digest plays back the portions not listed in the digest table at a constant speed, but digest content may instead be generated by extracting only the portions listed in the digest table, which yields an even shorter digest. Furthermore, the playback speed of a listed portion may be treated as an indicator of importance, and digest content may be generated by extracting only the portions with a low playback speed as the important ones.

According to the digest creation device of the present embodiment, digest content can be generated with playback speeds suited to each scene, based on the fundamental frequency, volume, and speech rate. The user therefore neither misses dialogue nor overlooks important scenes, such as those in which a performer's emotions are running high, and can view the content efficiently in a short time.

In the playback speed table of the present embodiment, the fundamental frequency, volume, and speech rate parameters are each divided into two levels (high/low or large/small), and a playback speed is specified for each combination of levels. The number of levels is not limited to two and may be larger. The playback speed table may also combine the parameters differently, or may assign priorities to the parameters and determine the playback speed by evaluating them in order of priority.

(Second Embodiment)
A second embodiment of the present invention is described below with reference to the drawings. FIG. 12 shows the configuration of the digest creation device according to the present embodiment. The digest creation device 2 includes an input processing unit 12, a digest generation unit 16, an output interface (I/F) 18, a storage unit 21, a caption analysis unit 23, a scene extraction unit 25, and a control unit 27. The input processing unit 12, digest generation unit 16, control unit 17, and output interface (I/F) 18 are the same as those shown in FIG. 1 of the first embodiment. The caption table shown in FIG. 13, used in the following description, is the same as that shown in FIG. 4 of the first embodiment.

蓄積部２１は、例えばテレビ番組や映画等の映像コンテンツデータを始め、後述する字幕テーブル、再生速度テーブル、ダイジェストテーブル、及び映像コンテンツを要約したダイジェストコンテンツを格納するものである。具体的に蓄積部２１は、半導体メモリ、ハードディスクドライブ、光ディスクドライブ、またはその他の記憶装置である。 The storage unit 21 stores, for example, video content data such as a television program and a movie, a subtitle table, a playback speed table, a digest table, and digest content summarizing the video content, which will be described later. Specifically, the storage unit 21 is a semiconductor memory, a hard disk drive, an optical disk drive, or other storage device.

The caption analysis unit 23 analyzes the video content data stored in the storage unit 21 and detects, for each caption in the video content data, the caption display start time at which its display begins, the caption display end time at which its display ends, and the caption character count. It stores the caption display start time, caption display end time, caption character count, and speech rate in the storage unit 21 as the caption table described later. The period from the caption display start time to the caption display end time is called the caption display time.

The scene extraction unit 25 extracts the caption display start time Ts, the caption display end time Te, and the caption character count Nc from the caption table stored in the storage unit 21. Using Nc and the length of time the caption is displayed, from Ts to Te, it calculates the speech rate Sv by the following equation.
Sv = (Te - Ts) / Nc
The scene extraction unit 25 then uses this speech rate and the playback speed table to specify the playback speed for the period from the caption display start time to the caption display end time, generates the digest table described later, and stores it in the storage unit 21.
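As a worked illustration of the Sv equation, the short Python sketch below evaluates it for one caption. The function name and the sample values are assumptions chosen for the example. Note that Sv as defined here is the display time per character, so a smaller Sv corresponds to faster speech.

```python
def speech_rate(ts: float, te: float, nc: int) -> float:
    """Sv = (Te - Ts) / Nc, in seconds per caption character."""
    if nc <= 0:
        raise ValueError("caption character count must be positive")
    return (te - ts) / nc


# Hypothetical caption shown from 75 s to 82 s containing 14 characters.
sv = speech_rate(ts=75, te=82, nc=14)   # 7 seconds / 14 characters
chars_per_second = 1 / sv               # the reciprocal form of the same measure
```

Here the caption is on screen for 7 seconds, so Sv is 0.5 seconds per character, or equivalently 2 characters per second.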

制御部２７は、字幕解析部２３、シーン抽出部２５及びダイジェスト生成部１６の各動作を制御する。制御部２７はさらに、蓄積部２１からダイジェストコンテンツを適宜読み出して、出力インタフェース１８を通じて外部に出力するものである。 The control unit 27 controls each operation of the caption analysis unit 23, the scene extraction unit 25, and the digest generation unit 16. The control unit 27 further reads the digest content from the storage unit 21 as appropriate and outputs it to the outside through the output interface 18.

Next, the playback speed table is described. FIG. 14 shows the playback speed table of the present embodiment. It classifies the speech rate into three levels (low, medium, and high) and specifies one playback speed for each. For example, fast speech is hard to follow, so a high speech rate is assigned the normal playback speed of 1.0x. Speech at a medium rate can still be followed when playback is sped up somewhat, so it is assigned 1.5x. Slow speech, with a low speech rate, is assigned 2.0x. The playback speed table thus specifies the playback speed of the video content as a function of the speech rate.

Next, the digest table is described. FIG. 15 shows the digest table of the present embodiment. It is produced by adding a playback speed, taken from the playback speed table of FIG. 14, to each of entries No. 1 to No. 5 of the caption table of FIG. 13. For example, entry No. 1 indicates that the period from caption display start time 00:01:15 to caption display end time 00:01:22 is played back at 2.0x.
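The way the digest table follows from the caption table and the playback speed table can be sketched in Python as below. The three playback speeds (1.0x, 1.5x, 2.0x) come from the description of FIG. 14; the thresholds that separate low, medium, and high speech rates are not given in the text, so the values used here (4 and 7 characters per second) are purely hypothetical, as are the function names and the sample character count.

```python
def playback_speed(chars_per_second, low_max=4.0, high_min=7.0):
    """Map a speech rate to a playback speed (thresholds are assumptions)."""
    if chars_per_second >= high_min:
        return 1.0   # high speech rate: normal speed
    if chars_per_second >= low_max:
        return 1.5   # medium speech rate
    return 2.0       # low speech rate


def digest_from_captions(caption_table):
    """caption_table: (start, end, char_count) tuples, times in seconds."""
    rows = []
    for start, end, chars in caption_table:
        duration = end - start
        if duration <= 0:
            continue  # skip degenerate entries
        rows.append((start, end, playback_speed(chars / duration)))
    return rows


# Entry No. 1 runs from 00:01:15 (75 s) to 00:01:22 (82 s); the character
# count of 14 is a made-up value for illustration.
digest = digest_from_captions([(75, 82, 14)])
```

With these assumed thresholds the sample entry has a low speech rate of 2 characters per second and is therefore assigned 2.0x.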

The operation of the digest creation device 2 configured as above is described below. First, the input processing unit 12 receives video content data and records it in the storage unit 21. Next, the caption analysis unit 23 analyzes the video content data, generates a caption table in the same way as in the first embodiment, and records it in the storage unit 21. The scene extraction unit 25 then extracts the caption display start time, caption display end time, and caption character count from the caption table and calculates the speech rate. It looks up the playback speed corresponding to the calculated speech rate in the playback speed table and assigns that playback speed to the caption display time. The caption display start time, caption display end time, and playback speed are recorded in the storage unit 21 as the digest table.

The digest generation unit 16 then generates digest content of the video content by referring to the digest table, and either stores the generated digest in the storage unit 21 or outputs it externally via the output interface 18. In this way, digest content can be generated from the caption display time and caption character count alone, without analyzing the audio data.

なお、番組のジャンル別に、再生速度テーブルを用意してもよい。例えばスポーツ、ドラマ、ニュースなど、映像コンテンツのジャンル別に再生速度テーブルを用意して、映像コンテンツのジャンルに合わせて装置が再生速度テーブルを使い分けてもよい。また、例えばスポーツを野球、サッカー、相撲などに、ドラマをサスペンス、ＳＦ、アクションなどに、より細かくジャンル分けして再生速度テーブルを用意しておき、ユーザの指定によって再生速度テーブルを使いわけてもよい。この場合でも、本発明の実施の形態で説明したダイジェストテーブルを生成することができ、映像コンテンツのダイジェストを生成することができる。 A playback speed table may be prepared for each program genre. For example, a playback speed table may be prepared for each genre of video content such as sports, dramas, and news, and the apparatus may use the playback speed table in accordance with the genre of the video content. Further, for example, a playback speed table may be prepared by dividing a genre into finer categories such as sports for baseball, soccer, sumo, etc., drama for suspense, SF, action, etc., and the playback speed table may be used according to user designation. Even in this case, the digest table described in the embodiment of the present invention can be generated, and a digest of video content can be generated.

When many caption characters are displayed, the content is easier to understand by reading the captions, so the playback speed may be raised slightly above the specified playback speed. Also, although the playback speed is determined from the fundamental frequency, volume, and speech rate in the first embodiment and from the speech rate in the second embodiment, the program genre or the type of each audio segment may also be used in determining the playback speed.

The video content data may also be an analog AV signal or the like. For an analog AV signal, or when captions are superimposed on the video signal, the input processing unit 12 may extract caption data from video frames by image recognition. This allows the caption display start and end times to be calculated even when no caption stream exists. That is, the present invention achieves the same effects for analog video content.

本発明に係るダイジェスト作成装置は、ＤＶＤ記録再生装置、デジタルテレビジョン装置、携帯電話機、ポータブルコンテンツプレーヤ、カーナビゲーション装置などに有用である。 The digest creation device according to the present invention is useful for DVD recording / playback devices, digital television devices, mobile phones, portable content players, car navigation devices, and the like.

A diagram showing the configuration of the digest creation device in the first embodiment of the present invention.
A diagram showing the data structure of the MPEG2 transport stream in the first embodiment of the present invention.
A diagram showing the data structure of the caption stream in the first embodiment of the present invention.
A diagram showing the caption table in the first embodiment of the present invention.
A diagram showing the audio table in the first embodiment of the present invention.
A diagram showing the playback speed table in the first embodiment of the present invention.
A diagram showing the digest table in the first embodiment of the present invention.
A flowchart of the operation of the digest creation device in the first embodiment of the present invention.
A flowchart of the operation of the caption analysis unit in the first embodiment of the present invention.
A flowchart of the operation of the caption analysis unit in the first embodiment of the present invention.
A flowchart of the operation of the audio analysis unit in the first embodiment of the present invention.
A flowchart of the operation of the scene extraction unit in the first embodiment of the present invention.
A diagram showing the configuration of the digest creation device in the second embodiment of the present invention.
A diagram showing the caption table in the second embodiment of the present invention.
A diagram showing the audio table in the second embodiment of the present invention.
A diagram showing the digest table in the second embodiment of the present invention.

Explanation of symbols

1, 2 Digest creation device
12 Input processing unit
11, 21 Storage unit
13 Caption analysis unit
14 Audio analysis unit
15, 25 Scene extraction unit
16 Digest generation unit
17, 27 Control unit
18 Output I/F

Claims

A digest creation device that generates digest content summarizing video content, the device comprising:
a subtitle analysis unit that detects the number of subtitle characters displayed in the video content and the subtitle display time;
a scene extraction unit that calculates an utterance speed from the number of subtitle characters and the subtitle display time, and determines a playback speed of the video content during the subtitle display time based on the utterance speed; and
a digest generation unit that generates the digest content according to the playback speed determined by the scene extraction unit.
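
The relationship among the number of subtitle characters, the subtitle display time, the utterance speed, and the playback speed can be illustrated with a minimal sketch. The claim only states that the playback speed is determined "based on" the utterance speed; the concrete rule used here (speed slow speech up toward a target rate, clamped to a range) and the names `Subtitle`, `target_cps`, `min_rate`, and `max_rate` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Subtitle:
    char_count: int   # number of subtitle characters
    start: float      # subtitle display start time, in seconds
    end: float        # subtitle display end time, in seconds


def utterance_speed(sub: Subtitle) -> float:
    """Characters per second over the subtitle display time."""
    duration = sub.end - sub.start
    if duration <= 0:
        raise ValueError("subtitle display time must be positive")
    return sub.char_count / duration


def playback_speed(speed_cps: float, target_cps: float = 10.0,
                   min_rate: float = 1.0, max_rate: float = 2.0) -> float:
    """Assumed rule: play slow speech faster so it approaches target_cps,
    and clamp the result to [min_rate, max_rate]."""
    if speed_cps <= 0:
        return max_rate
    return max(min_rate, min(max_rate, target_cps / speed_cps))
```

Under these assumed parameters, a 20-character subtitle displayed for 4 seconds has an utterance speed of 5 characters per second and would be played back at 2.0x, while speech already faster than the target rate stays at 1.0x.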

The digest creation device according to claim 1, wherein the digest generation unit generates digest content that is played back at a constant speed outside the subtitle display time.

The digest creation device according to claim 1, wherein the digest generation unit generates digest content that, outside the subtitle display time, is played back at a higher speed in accordance with the elapsed time from the start of the video content.

The digest creation device according to claim 1, wherein the digest generation unit generates digest content that plays back only the subtitle display time, at the playback speed determined by the scene extraction unit.
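
The three dependent claims above differ only in how the time outside the subtitle display time is treated: constant speed, a speed that rises with elapsed time, or omission. The following sketch compares the resulting digest length under each treatment. The policy names, the constant `gap_speed`, and the linear ramp `speed(t) = 1 + ramp * t` are hypothetical choices; the claims do not specify them. Scenes are assumed not to overlap.

```python
import math
from typing import List, Tuple

# (start_s, end_s, playback_speed) for each subtitle scene
Scene = Tuple[float, float, float]


def gap_time(a: float, b: float, policy: str,
             gap_speed: float = 4.0, ramp: float = 0.01) -> float:
    """Digest time spent on the non-subtitle interval [a, b]."""
    if b <= a or policy == "skip":
        return 0.0
    if policy == "constant":
        return (b - a) / gap_speed
    if policy == "ramp":
        # Assumed speed(t) = 1 + ramp * t; digest time is the integral of dt / speed(t).
        return math.log((1 + ramp * b) / (1 + ramp * a)) / ramp
    raise ValueError(f"unknown policy: {policy}")


def digest_duration(total: float, scenes: List[Scene], policy: str) -> float:
    """Total digest length in seconds for content of length `total`."""
    out, cursor = 0.0, 0.0
    for start, end, speed in sorted(scenes):
        out += gap_time(cursor, start, policy)   # time before this scene
        out += (end - start) / speed             # the subtitle scene itself
        cursor = end
    return out + gap_time(cursor, total, policy)  # tail after the last scene
```

For example, with 100 seconds of content and one subtitle scene from 10 s to 20 s played at 2.0x, the "skip" policy yields a 5-second digest and the "constant" policy with the assumed 4.0x gap speed yields 27.5 seconds.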

A digest creation device that generates digest content summarizing video content, the device comprising:
a subtitle analysis unit that detects the number of subtitle characters displayed in the video content and the subtitle display time;
an audio analysis unit that analyzes audio of the video content played back during the subtitle display time and detects an audio playback time and at least one of a fundamental frequency and a volume;
a scene extraction unit that calculates an utterance speed from the number of subtitle characters and the audio playback time, and determines a playback speed of the video content during the audio playback time based on the utterance speed and at least one of the fundamental frequency and the volume; and
a digest generation unit that generates the digest content according to the playback speed determined by the scene extraction unit.
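
As a rough illustration of the two audio features named in this claim, the sketch below estimates volume as the root-mean-square amplitude and the fundamental frequency by a simple autocorrelation peak search. Both techniques are stand-ins chosen for brevity; the claim does not say how the audio analysis unit detects these values, and the search range `fmin` to `fmax` is an assumed range for speech.

```python
import math
from typing import Sequence


def rms_volume(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude, used here as a simple volume measure."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def fundamental_frequency(samples: Sequence[float], sample_rate: int,
                          fmin: float = 80.0, fmax: float = 400.0) -> float:
    """Estimate the fundamental frequency in Hz from the lag with the
    largest autocorrelation within [sample_rate/fmax, sample_rate/fmin]."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    if len(samples) <= max_lag:
        raise ValueError("not enough samples for the requested fmin")
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag
```

A production system would more likely use a windowed, normalized pitch tracker; this version is only meant to show what "fundamental frequency" and "volume" refer to in the claim.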

The digest creation device according to claim 5, wherein the digest generation unit generates digest content that is played back at a constant speed outside the audio playback time.

The digest creation device according to claim 5, wherein the digest generation unit generates digest content that, outside the audio playback time, is played back at a higher speed in accordance with the elapsed time from the start of the video content.

The digest creation device according to claim 5, wherein the digest generation unit generates digest content that plays back only the audio playback time, at the playback speed determined by the scene extraction unit.

The digest creation device according to any one of claims 1 to 4, wherein the scene extraction unit determines the playback speed of the video content during the subtitle display time based on the utterance speed and the genre of the video content.

The digest creation device according to item 1, wherein the scene extraction unit determines the playback speed of the video content during the audio playback time based on at least one of the fundamental frequency and the volume, the utterance speed, and the genre of the video content.

The scene extraction unit determines the playback speed of the video content during the audio playback time based on at least one of the fundamental frequency and the volume, the utterance speed, and the type of audio during the audio playback time. The digest creation device of any one of these.

The digest creation device according to any one of claims 1 to 11, wherein the number of subtitle characters is the number of kana characters obtained when all characters of the subtitle are written in kana.
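
Counting characters after conversion to kana makes the count track spoken length more closely than a raw count of kanji would. The sketch below shows the idea with a toy reading dictionary and greedy longest-match replacement. The dictionary `READINGS` and the decision to count small kana as one character each are assumptions for illustration; a real system would need a morphological analyzer to obtain readings.

```python
from typing import Dict

# Hypothetical toy dictionary mapping kanji words to kana readings.
READINGS: Dict[str, str] = {"今日": "きょう", "天気": "てんき", "晴": "は"}


def to_kana(text: str, readings: Dict[str, str]) -> str:
    """Replace dictionary words with their kana readings (greedy longest match)."""
    out = []
    max_len = max(map(len, readings), default=0)
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + n]
            if word in readings:
                out.append(readings[word])
                i += n
                break
        else:
            out.append(text[i])  # no reading known: keep the character as-is
            i += 1
    return "".join(out)


def is_kana(ch: str) -> bool:
    """True for hiragana, katakana, and the long-vowel mark."""
    return "\u3041" <= ch <= "\u3096" or "\u30a1" <= ch <= "\u30fa" or ch == "ー"


def kana_count(text: str, readings: Dict[str, str]) -> int:
    """Number of kana characters once the text is written in kana."""
    return sum(1 for ch in to_kana(text, readings) if is_kana(ch))
```

With this toy dictionary, the subtitle 今日は天気が晴れ。 has 8 characters excluding punctuation as written, but 10 kana characters once converted.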

The digest creation device according to claim 1, further comprising an input processing unit that detects the number of subtitle characters from a subtitle superimposed on a video signal included in the video content, and transmits the detected number of subtitle characters to the subtitle analysis unit.

The digest creation device according to claim 1, wherein the scene extraction unit determines, based on the utterance speed divided into a plurality of stages, a playback speed corresponding to the utterance speed of each stage.
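
A staged mapping of this kind can be sketched as a lookup over sorted stage boundaries. The boundary values, the per-stage playback speeds, and the direction of the mapping (slower speech played faster) are hypothetical; the claim does not give concrete numbers.

```python
from bisect import bisect_right

# Hypothetical stage boundaries in characters per second:
# stage 0 is below 4.0, stage 1 is 4.0 up to 8.0, stage 2 is 8.0 and above.
STAGE_BOUNDS = [4.0, 8.0]
# Hypothetical playback speed assigned to each stage.
STAGE_SPEEDS = [2.0, 1.5, 1.0]


def stage_of(utterance_cps: float) -> int:
    """Index of the stage that contains the given utterance speed."""
    return bisect_right(STAGE_BOUNDS, utterance_cps)


def playback_speed_for(utterance_cps: float) -> float:
    """Playback speed assigned to the stage of the given utterance speed."""
    return STAGE_SPEEDS[stage_of(utterance_cps)]
```

Using discrete stages keeps the playback speed stable when the measured utterance speed fluctuates slightly between adjacent subtitles.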

The digest creation device according to claim 5, wherein the scene extraction unit determines a corresponding playback speed based on the utterance speed divided into a plurality of stages, the fundamental frequency divided into a plurality of stages, and the volume divided into a plurality of stages.
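
When three staged quantities are combined, one plausible representation is a table keyed by the triple of stage indices. The sketch below assumes such a table; all boundary values, the table entries, and the fallback `default` speed are hypothetical and not taken from the patent.

```python
from bisect import bisect_right
from typing import Dict, Sequence, Tuple

# Hypothetical stage boundaries for each quantity.
SPEED_BOUNDS = (4.0, 8.0)      # utterance speed, characters per second
F0_BOUNDS = (150.0, 250.0)     # fundamental frequency, Hz
VOLUME_BOUNDS = (0.1, 0.3)     # volume, normalized RMS amplitude

StageKey = Tuple[int, int, int]

# Hypothetical playback speed table keyed by (speed, f0, volume) stages.
PLAYBACK_TABLE: Dict[StageKey, float] = {
    (0, 0, 0): 2.0,
    (1, 1, 1): 1.5,
    (2, 2, 2): 1.0,
}


def stage(value: float, bounds: Sequence[float]) -> int:
    """Index of the stage that contains `value`."""
    return bisect_right(bounds, value)


def lookup_playback_speed(cps: float, f0: float, volume: float,
                          table: Dict[StageKey, float],
                          default: float = 1.0) -> float:
    """Playback speed for the combined stages, or `default` if unlisted."""
    key = (stage(cps, SPEED_BOUNDS), stage(f0, F0_BOUNDS),
           stage(volume, VOLUME_BOUNDS))
    return table.get(key, default)
```

A full table would list all 27 stage combinations; the sparse table with a default is used here only to keep the example short.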