JP2011061675A

JP2011061675A - Video voice synchronization reproducing apparatus, video voice synchronization processing apparatus, and video voice synchronization reproducing program

Info

Publication number: JP2011061675A
Application number: JP2009211640A
Authority: JP
Inventors: Atsushi Imai; 篤今井; Toru Tsugi; 徹都木; Nobumasa Seiyama; 信正清山
Original assignee: Nippon Hoso Kyokai NHK; NHK Engineering Services Inc; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2009-09-14
Filing date: 2009-09-14
Publication date: 2011-03-24
Anticipated expiration: 2029-09-14
Also published as: JP5325059B2

Abstract

PROBLEM TO BE SOLVED: To provide a technique that, with a simple configuration, synchronizes video with adaptively speech-rate-converted audio whose speech rate changes from moment to moment, without accumulating time delay, when viewing a real-time broadcast program or playing back recorded content. SOLUTION: A video/audio synchronized playback apparatus divides audio data into voice intervals and non-voice intervals and converts the speech rate according to a predetermined expansion/contraction ratio so that the result stays within the utterance speed of the audio data before conversion. When the deviation between the playback time of the converted audio data and the playback time of the video data is less than a predetermined time, only the time stamp of the audio data is updated sequentially based on the expansion/contraction ratio. When the deviation is equal to or greater than the predetermined time, the video data is interpolated or deleted according to the deviation and the time stamp of the video data is updated. The audio and the video are then output in synchronization based on the time stamp of the audio data and the time stamp of the video data. COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to a technique for reducing the deviation between audio and video when, during viewing of video/audio content, the speech rate of the audio is made faster or slower by expanding or contracting the time length of the audio.

There has long been a demand, when watching a broadcast program or playing back recorded content, to hear speech at a perceptually slower or faster rate than the actual speech rate. For this reason, various techniques for converting the speech rate of audio (speech rate conversion) have been proposed.

For example, to obtain a speech rate conversion effect (a sense of slowness) when watching a real-time broadcast program, a method has been proposed that converts the speech rate of the audio without accumulating time delay relative to real time (Non-Patent Document 1).

A method has also been proposed that converts the speech rate of pre-recorded content by variable-speed playback at an arbitrary constant playback speed, and plays the content back with little deviation between the converted audio and the video.
Furthermore, as a method of synchronizing audio and video after speech rate conversion, there is strict synchronization by video signal processing, for example video field interpolation using motion vectors (Patent Document 1).

Patent Document 1: Japanese Patent No. 3793333

Non-Patent Document 1: "Adaptive speech rate conversion method for news speech without accumulating time delay," IEICE Transactions (A), vol. J83-A, no. 8, pp. 935-945, Aug. 2000.

In recent years, there has been a demand for a technique that synchronizes video with adaptively speech-rate-converted audio, whose speech rate changes from moment to moment, and that can play back real-time broadcast programs and recorded content with a simple configuration and without accumulating time delay relative to real time.

However, although the method described in Non-Patent Document 1 can convert the speech rate of a real-time broadcast program without accumulating audio time delay, it does not consider synchronized playback of audio and video, so the audio and video are synchronized in only a small part of the program.
The method of variable-speed playback of pre-recorded content at an arbitrary constant playback speed can synchronize the converted audio with the video, but slow playback inevitably introduces a time delay. Audio and video therefore could not be played back slowly and in synchronization within a predetermined time frame.
With the method described in Patent Document 1, when, for example, the audio expansion/contraction time is not an integral multiple of the video insertion/decimation time, the amount of computation becomes extremely large and a large-capacity, expensive memory is required. The resulting introduction cost is high, which has made the method difficult to realize as an industrial product.

The present invention has been made in view of the above problems. Its object is to provide a video/audio synchronized playback apparatus, a video/audio synchronization processing apparatus, and a video/audio synchronized playback program that, with a simple configuration, can synchronize video with adaptively speech-rate-converted audio whose speech rate changes from moment to moment, without accumulating time delay, when viewing a real-time broadcast program or playing back recorded content.

To achieve this object, the video/audio synchronized playback apparatus according to claim 1 converts the speech rate of the audio of video/audio content by expanding or contracting the time length of the audio, and plays back the video in synchronization with that speech rate. The apparatus comprises separation/extraction means, audio data decoding means, video data decoding means, ratio setting means, speech rate conversion means, audio data storage means, video data storage means, and synchronization control means.

In this configuration, the apparatus uses the separation/extraction means to accept input of a stream containing the video/audio content and to separate and extract from the stream the audio data, the video data, the time stamp of the audio data, and the time stamp of the video data.
The apparatus uses the audio data decoding means to decode the audio data extracted by the separation/extraction means.
The apparatus uses the video data decoding means to decode the video data extracted by the separation/extraction means, and uses the video data storage means to store the decoded video data. As an example, in an MPEG (Moving Picture Experts Group) signal, a header containing a time stamp is added to each data packet, and synchronized playback of video and audio is achieved by following this time stamp.

The apparatus uses the ratio setting means to set in advance an expansion/contraction ratio that serves as the reference for expanding or contracting the time length of the audio data. For example, the ratio setting means may be something the user can operate, such as a GUI (Graphical User Interface), and the expansion/contraction ratio may be set by user input.
The apparatus uses the speech rate conversion means to divide the audio data decoded by the audio data decoding means into voice sections, in which a voice is present, and non-voice sections, which are all other sections. It then converts the speech rate of the audio data by expanding or contracting the time lengths of the voice sections and the non-voice sections, according to the expansion/contraction ratio set in advance by the ratio setting means, so that the result stays within the speech rate of the audio data before conversion. In this way the time length of the audio data can be converted without accumulating time delay relative to real time. Here, a non-voice section is a portion that contains no human voice; it includes silent portions and portions in which faint background music (BGM) or noise is present as background sound.
The apparatus then uses the audio data storage means to store the audio data whose time length has been converted by the speech rate conversion means.
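The allocation of time between voice sections and non-voice sections described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the conversion method of the invention or of Non-Patent Document 1: the sections are assumed to be labeled already, only expansion (a ratio of 1 or more) is handled, and the pauses absorb the extra voice time so that the total duration is unchanged. The names `Section` and `plan_durations` and the `min_pause` floor are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Section:
    voiced: bool      # True for a voice section, False for a non-voice section
    duration: float   # length in seconds before conversion


def plan_durations(sections: List[Section], ratio: float,
                   min_pause: float = 0.1) -> Tuple[float, List[float]]:
    """Return (effective ratio, new durations) with the total length preserved.

    Assumption for illustration: voice sections are stretched by `ratio`
    and non-voice sections are shortened to pay for it, but no pause is
    cut below `min_pause` seconds (or below its original length).
    """
    if ratio < 1.0:
        raise ValueError("this sketch covers expansion only (ratio >= 1)")

    voiced_total = sum(s.duration for s in sections if s.voiced)
    # Time that each pause can give up without dropping below the floor.
    slack = [0.0 if s.voiced else max(s.duration - min_pause, 0.0)
             for s in sections]
    slack_total = sum(slack)

    # Cap the ratio so the stretched speech still fits in the original span.
    if voiced_total > 0:
        effective = min(ratio, 1.0 + slack_total / voiced_total)
    else:
        effective = 1.0
    extra = voiced_total * (effective - 1.0)   # time the pauses must absorb

    new_durations = []
    for s, room in zip(sections, slack):
        if s.voiced:
            new_durations.append(s.duration * effective)
        elif slack_total > 0:
            # Shorten each pause in proportion to its available slack.
            new_durations.append(s.duration - room * extra / slack_total)
        else:
            new_durations.append(s.duration)
    return effective, new_durations
```

Because the requested ratio is capped by the available pause time, the converted audio never runs longer than the original span, which is one simple way to avoid accumulating delay; when the pauses are too short, the speech is simply stretched less than requested.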

The apparatus uses the time stamp conversion means as follows. When the deviation between the time length of the audio data after speech rate conversion and the time length of the video data is equal to or greater than a predetermined time, it sequentially updates the time stamp of the audio data according to the expansion/contraction ratio, interpolates or deletes video data so as to eliminate the deviation, and updates the time stamp of the video data according to the video data after interpolation or deletion. When the deviation is less than the predetermined time, it sequentially updates the time stamp of the audio data according to the expansion/contraction ratio.
The apparatus also uses the control signal output means to output a control signal that causes the converted audio data stored in the audio data storage means and the video data stored in the video data storage means to be output in synchronization. The control signal is based on the time stamp of the audio data updated by the time stamp conversion means and on either the updated time stamp of the video data or the time stamp of the video data input from the separation/extraction means.
Because the time information of the video is synchronized in accordance with changes in the synchronization information of the audio, the audio data whose time length has been converted and the video data can be output in synchronization, which prevents the audio and video from drifting apart during playback of the video/audio content.
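The threshold decision made by the time stamp conversion means can be sketched as follows. This is a hedged Python illustration only: times are assumed to be in seconds, the number of frames to insert or remove is obtained by simple rounding, and the function names `update_audio_timestamp` and `video_adjustment` are hypothetical rather than taken from the specification.

```python
from typing import Tuple


def update_audio_timestamp(prev_out: float, prev_in: float,
                           cur_in: float, ratio: float) -> float:
    """Advance the output audio time stamp by the input step scaled by ratio."""
    return prev_out + (cur_in - prev_in) * ratio


def video_adjustment(audio_time: float, video_time: float,
                     threshold: float, frame_period: float) -> Tuple[str, int]:
    """Decide how to adjust the video for the current deviation.

    Returns ("none", 0) while the deviation is below the threshold,
    ("interpolate", n) when the converted audio runs longer than the video,
    and ("delete", n) when it runs shorter, where n is a frame count.
    """
    deviation = audio_time - video_time
    if abs(deviation) < threshold:
        return ("none", 0)
    # Assumption: correct the deviation in whole frames, nearest count.
    frames = round(abs(deviation) / frame_period)
    return ("interpolate", frames) if deviation > 0 else ("delete", frames)
```

In this sketch the audio time stamp is updated on every step regardless of the deviation, while the video is touched only once the deviation reaches the threshold, which mirrors the two cases in the text.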

The video/audio synchronized playback apparatus according to claim 2 is the apparatus according to claim 1, wherein the ratio setting means allows any one of a plurality of predetermined expansion/contraction ratios to be selected and set, and the speech rate conversion means converts the time length of the audio data based on the expansion/contraction ratio selected and set by the ratio setting means.
Because the expansion/contraction ratio for the time length of the audio data can be selected freely from a moderate number of steps, the user can view the video/audio content at a speed that is easy to hear.

The video/audio synchronized playback apparatus according to claim 3 is the apparatus according to claim 1 or claim 2, wherein the video/audio content is a broadcast program and the apparatus further comprises receiving means and emergency information detection means.

In this configuration, the apparatus uses the receiving means to receive the stream containing the video/audio content.
The apparatus uses the emergency information detection means to detect whether the stream received by the receiving means contains predetermined emergency information, and then outputs the stream to the separation/extraction means. When the emergency information detection means detects emergency information, it notifies the ratio setting means of the detection. Here, emergency information means highly urgent information such as breaking news, earthquake early warnings, and weather warnings.
When the ratio setting means receives notification from the emergency information detection means that emergency information has been detected, it sets the expansion/contraction ratio to unity (1x) and outputs that ratio to the speech rate conversion means.

The apparatus uses the speech rate conversion means to output the unity expansion/contraction ratio received from the ratio setting means to the time stamp conversion means.
When the unity ratio is input from the speech rate conversion means, the time stamp conversion means outputs the time stamp of the audio data and the time stamp of the video data received from the separation/extraction means to the control signal output means without updating them.
The control signal output means then outputs a control signal that causes the audio data stored in the audio data storage means and the video data stored in the video data storage means to be output in synchronization, based on the time stamps of the audio data and the video data received from the time stamp conversion means.

Accordingly, when predetermined emergency information is detected while a broadcast program is being viewed, the audio data is played back at normal speed without speech rate conversion, which preserves the immediacy of the emergency information.
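The emergency override can be sketched as a small state holder. This is an illustrative Python sketch under assumptions: the class name `RatioSetting` and its methods are hypothetical, and restoring the user's ratio once the emergency information is no longer detected is an assumption that the text does not state.

```python
class RatioSetting:
    """Holds the user-selected ratio and forces unity during an emergency."""

    def __init__(self, user_ratio: float = 1.0) -> None:
        self.user_ratio = user_ratio
        self.emergency = False

    def notify_emergency(self, detected: bool) -> None:
        # Called by the (hypothetical) emergency information detector.
        self.emergency = detected

    def current(self) -> float:
        # A unity ratio means no speech rate conversion and untouched time stamps.
        return 1.0 if self.emergency else self.user_ratio
```

For example, a setting created with `RatioSetting(1.4)` reports 1.4 in normal operation and 1.0 after `notify_emergency(True)`.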

The video/audio synchronization processing apparatus according to claim 4 converts the speech rate of audio data contained in a stream to be input to a broadcast receiving apparatus and synchronizes the converted audio data with the video data. It comprises the video/audio synchronized playback apparatus according to claim 1 or claim 2, emergency information detection means, and encoding means.

In this configuration, the processing apparatus uses the emergency information detection means to accept input of a broadcast program stream containing the video/audio content and to detect whether the stream contains predetermined emergency information. When emergency information is detected in the stream, the emergency information detection means outputs the stream to the outside; when it is not detected, the means outputs the stream to the video/audio synchronized playback apparatus.
The processing apparatus uses the encoding means to encode and output the audio data and the video data received from the video/audio synchronized playback apparatus.

Accordingly, when predetermined emergency information is detected during viewing of video/audio content, the stream bypasses the video/audio synchronization processing apparatus and is output unchanged to the external broadcast receiving apparatus, so the content is played back at normal speed. The immediacy of the emergency information is thereby preserved.
In addition, installing this video/audio synchronization processing apparatus in an existing digital television set makes it possible, even on that set, to convert the speech rate of the audio data and to view video/audio content in which the converted audio and the video are synchronized.
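The bypass behavior can be sketched as a routing function. This is a minimal Python illustration under assumptions: the stream is modeled as an iterable of byte chunks, and `has_emergency` and `convert` are hypothetical callables standing in for the emergency information detection means and for the conversion, synchronization, and encoding path.

```python
from typing import Callable, Iterable, Iterator


def route(chunks: Iterable[bytes],
          has_emergency: Callable[[bytes], bool],
          convert: Callable[[bytes], bytes]) -> Iterator[bytes]:
    """Forward emergency chunks untouched and convert everything else."""
    for chunk in chunks:
        if has_emergency(chunk):
            yield chunk            # bypass: pass the stream through as received
        else:
            yield convert(chunk)   # speech rate conversion, sync, re-encoding
```

Keeping the decision in front of the conversion path means that the receiving apparatus downstream needs no knowledge of speech rate conversion at all.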

The video/audio synchronized playback program according to claim 5 causes a computer to function as separation/extraction means, audio data decoding means, video data decoding means, ratio setting means, speech rate conversion means, and synchronization control means, in order to convert the speech rate of the audio of video/audio content by expanding or contracting the time length of the audio and to play back the video in synchronization with that speech rate.

In this configuration, the program uses the separation/extraction means to accept input of the video/audio content and to separate and extract from it the audio data, the video data, the time stamp of the audio data, and the time stamp of the video data.

The program uses the audio data decoding means to decode the audio data extracted by the separation/extraction means.
The program uses the video data decoding means to decode the video data extracted by the separation/extraction means and stores the decoded video data in the video data storage means.
The program uses the ratio setting means to set in advance an expansion/contraction ratio that serves as the reference for expanding or contracting the time length of the audio data.

The program uses the speech rate conversion means to divide the audio data decoded by the audio data decoding means into voice sections, in which a voice is present, and non-voice sections, which are all other sections, and to convert the time lengths of the voice sections and the non-voice sections, according to the expansion/contraction ratio set in advance by the ratio setting means, so that the result stays within the speech rate of the audio data before conversion.

The program uses the time stamp conversion means as follows. When the deviation between the time length of the audio data after speech rate conversion and the time length of the video data is equal to or greater than a predetermined time, it sequentially updates the time stamp of the audio data according to the expansion/contraction ratio, interpolates or deletes video data so as to eliminate the deviation, and updates the time stamp of the video data according to the video data after interpolation or deletion. When the deviation is less than the predetermined time, it sequentially updates the time stamp of the audio data according to the expansion/contraction ratio.
The program then uses the control signal output means to output a control signal that causes the converted audio data stored in the audio data storage means and the video data stored in the video data storage means to be output in synchronization. The control signal is based on the time stamp of the audio data updated by the time stamp conversion means and on either the updated time stamp of the video data or the time stamp of the video data input from the separation/extraction means.

The present invention provides the following advantageous effects.
According to the inventions of claims 1 and 5, when viewing a real-time broadcast program or playing back recorded video/audio content, the time length of the audio can be expanded or contracted without accumulating time delay, and the speech-rate-converted audio can be synchronized with the video, so the user can view the content comfortably.
Because the invention can be applied to viewing real-time broadcast programs, it can help promote real-time communication between elderly or hearing-impaired people and people without such difficulties.
Furthermore, users who take in the information of video/audio content by combining listening with lip reading can take in that information more easily. In addition, the configuration is simple, so the introduction cost can be reduced.
According to the invention of claim 2, the expansion/contraction ratio for the audio time length can be selected freely from a plurality of steps, so the user can view the video/audio content at a speed that is easy to hear.
According to the inventions of claims 3 and 4, when the received stream contains predetermined emergency information in addition to the video/audio content, the content is played back at normal speed, which preserves the immediacy of the emergency information.
According to the invention of claim 4, an existing broadcast receiving apparatus can be used as it is to view video/audio content in which speech-rate-converted audio and video are synchronized, which reduces the introduction cost for the user.

A block diagram showing the configuration of the video/audio synchronized playback apparatus according to the first embodiment of the present invention.
A conceptual diagram explaining how the video/audio synchronized playback apparatus according to the first embodiment synchronizes speech-rate-converted audio with video.
A flowchart showing the operation of the video/audio synchronized playback apparatus according to the first embodiment of the present invention.
A block diagram showing the configuration of the video/audio synchronized playback apparatus according to the second embodiment of the present invention.
A block diagram showing the configuration of the video/audio synchronized playback apparatus according to the third embodiment of the present invention.

<First embodiment>
[Configuration of the video/audio synchronized playback apparatus]
First, the specific configuration of the video/audio synchronized playback apparatus 1A according to the first embodiment will be described with reference to FIG. 1.
As shown in FIG. 1, the video/audio synchronized playback apparatus 1A converts the speech rate of audio by expanding or contracting the time length of the audio and plays back video in synchronization with that speech rate. It comprises separation/extraction means 11, audio data decoding means 12, speech rate conversion means 13, ratio setting means 14, audio data storage means 15, video data decoding means 16, video data storage means 17, and synchronization control means 18.

A receiving antenna A is connected externally to the video/audio synchronized playback apparatus 1A and receives a stream, containing at least video/audio content, transmitted from a content transmission apparatus (not shown). The stream received by the receiving antenna A is output to the separation/extraction means 11. The stream is in, for example, the MPEG (Moving Picture Experts Group)-2 data format, but is not limited to this; any other format may be used as long as the content carries time stamps.

The separation/extraction means 11 separates the stream received by the receiving antenna A, based on headers and the like, into audio data, video data, the time stamp of the audio data, and the time stamp of the video data, and extracts each. The separated audio data is output to the audio data decoding means 12, and the separated video data is output to the video data decoding means 16. The separated time stamps of the audio data and the video data are output to the synchronization control means 18.

The audio data decoding means 12 decodes the audio data separated by the separation/extraction means 11. The decoded audio data is output to the speech rate conversion means 13.

The speech rate conversion means 13 converts the time length of the audio data decoded by the audio data decoding means 12 based on the expansion/contraction ratio set in advance by the ratio setting means 14.
Before the speech rate conversion means 13 is described in detail, the ratio setting means 14 is described.

The magnification setting means 14 sets the expansion/contraction ratio that serves as the reference when the speech speed conversion means 13 expands or contracts the time length of the audio data.
For a real-time broadcast program, the expansion/contraction ratio is a ratio by which the audio data is expanded; for recorded content, it is a ratio by which the audio data is expanded or shortened. It expresses the playback ratio of the audio data after speech speed conversion, taking the playback ratio of the audio data before speech speed conversion as "1".
The magnification setting means 14 can be implemented, for example, as a user interface such as a GUI. It holds in advance a set of predetermined steps of suitable granularity, such as "Slow 1", "Slow 2", "Slow 3", "Fast 1", and "Fast 2", each associated with an expansion/contraction ratio (for example, "expand to 1.2 times" for "Slow 1", "expand to 1.4 times" for "Slow 2", "expand to 1.6 times" for "Slow 3", "shorten to 0.8 times" for "Fast 1", and "shorten to 0.6 times" for "Fast 2"). The steps are presented in a form the user can select from, and when the user selects one of them, the ratio associated with that step is set as the target expansion/contraction ratio for speech speed conversion. Alternatively, a fixed expansion/contraction ratio may be set in advance, for example as an initial setting. The expansion/contraction ratio set in this way is output to the speech speed conversion means 13.
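The step-to-ratio association described above can be sketched as a small lookup table. This is a minimal illustration, not the patented implementation: the names `SCALING_PRESETS` and `select_scaling_ratio` are hypothetical, and rejecting shortening ratios for real-time broadcasts is an assumption drawn from the statement that the ratio only expands audio in that case.

```python
# Hypothetical preset table; labels and ratios follow the examples in the text.
SCALING_PRESETS = {
    "Slow 1": 1.2,
    "Slow 2": 1.4,
    "Slow 3": 1.6,
    "Fast 1": 0.8,
    "Fast 2": 0.6,
}


def select_scaling_ratio(step=None, default=1.0, realtime=False):
    """Return the ratio for a user-selected step, or a fixed default ratio."""
    if step is None:
        # No selection: fall back to the ratio fixed by the initial setting.
        return default
    ratio = SCALING_PRESETS[step]
    if realtime and ratio < 1.0:
        # Assumption: a live broadcast cannot be played ahead of real time.
        raise ValueError("real-time broadcasts can only be expanded")
    return ratio


print(select_scaling_ratio("Slow 2"))
```

Keeping the presets in one table means the user interface and the speech speed conversion share a single definition of each step.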

In accordance with the expansion/contraction ratio set by the magnification setting means 14, the speech speed conversion means 13 divides the audio data decoded by the audio data decoding means 12 (the audio data before speech speed conversion), for each predetermined time length, into speech sections, in which speech is present, and non-speech sections, which are all other sections. It then converts the time lengths of the speech sections and of the non-speech sections according to the preset ratio so that the result stays within the utterance speed of the audio data before speech speed conversion.

Here, the speech speed conversion means 13 divides the audio data into speech sections and non-speech sections and also detects the fundamental period of the speech from the audio data. It then converts the time length of each speech section and each non-speech section according to the ratio preset by the magnification setting means 14, so that the result stays within the utterance speed of the audio data before speech speed conversion. Any method may be used for this conversion; for example, the method of Patent Document 1 (Japanese Patent No. 3220043) can be used.

The speech speed conversion means 13 outputs the audio data after speech speed conversion to the audio data storage means 15. It also sequentially outputs the expansion/contraction ratio to the synchronization control means 18. Speech speed conversion by this ratio is realized adaptively by expanding or contracting the waveform through repetition or thinning-out of fundamental-period units.
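The idea of repeating or thinning out fundamental-period units can be illustrated with a deliberately simplified sketch. It assumes a fixed, already known period length in samples, whereas the apparatus detects the fundamental period from the audio and adapts to it; the function name `stretch_by_periods` is hypothetical, and a practical converter would also smooth the joins between units.

```python
def stretch_by_periods(samples, period, ratio):
    """Expand or contract a voiced segment by repeating or dropping whole periods."""
    if period <= 0 or ratio <= 0:
        raise ValueError("period and ratio must be positive")
    # Cut the segment into period-sized units (the last unit may be shorter).
    units = [samples[i:i + period] for i in range(0, len(samples), period)]
    out = []
    for k, unit in enumerate(units):
        # After k + 1 input units, about (k + 1) * ratio output units are due;
        # emit however many of them fall to this unit (0 = thinned out).
        copies = round((k + 1) * ratio) - round(k * ratio)
        for _ in range(copies):
            out.extend(unit)
    return out


segment = list(range(8))  # stand-in for 4 periods of 2 samples each
print(len(stretch_by_periods(segment, 2, 1.5)))
print(len(stretch_by_periods(segment, 2, 0.5)))
```

Working in whole periods keeps the pitch of each unit unchanged, so only the duration of the segment varies.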

The audio data storage means 15 sequentially and temporarily stores the audio data converted by the speech speed conversion means 13 and functions as a FIFO (First In First Out) buffer. It can hold several seconds of converted audio data. The maximum delay from real time that arises when the audio data is converted can be set in advance in the audio data storage means 15, which stores converted audio data corresponding to this maximum delay.
The audio data stored in the audio data storage means 15 is output at the timing of the control signal output from the synchronization control means 18.

The video data decoding means 16 decodes the video data separated by the separation/extraction means 11. The decoded video data is output to the video data storage means 17.

The video data storage means 17 sequentially and temporarily stores the video data decoded by the video data decoding means 16 and functions as a FIFO (First In First Out) buffer. The maximum delay from real time that arises when the audio data is converted can be set in advance in the video data storage means 17, which stores video data corresponding to this maximum delay. It can hold several seconds of video data. The video data stored in the video data storage means 17 is output at the timing of the control signal output from the synchronization control means 18.

The synchronization control means 18 synchronizes the output timing of the audio data converted by the speech speed conversion means 13 with that of the original video data, and comprises a time stamp conversion means 18a and a control signal output means 18b.
The processing performed by the synchronization control means 18 when the speech speed of the audio data has been converted by the speech speed conversion means 13 is described below with reference to FIG. 2.
First, the contents of the audio data and the video data are described.
As shown in FIG. 2(a), assume that the audio data of the video/audio content is "ashita no tenki wa hare desu" ("Tomorrow's weather will be sunny"). A time stamp (time t0) is attached immediately before the start of the speech section "ashita no tenki wa", and a time stamp (time t1) is attached immediately before the start of the speech section "hare desu". A further time stamp (time t2) is attached at an interval after the time stamp (time t1). Here, "ashita no tenki wa" and the non-speech section that follows it are reproduced between times t0 and t1, and "hare desu" and the non-speech section that follows it are reproduced between times t1 and t2. The intervals t0 to t1 and t1 to t2 are each one second long.

As also shown in FIG. 2(a), the video data consists of video data D1, D2, D3, ..., Dn, each composed of 60 image frames. Here, video data D1 and D2 correspond to the audio data "ashita no tenki wa hare desu": video data D1 is reproduced between times t0 and t1, and video data D2 is reproduced between times t1 and t2. Next, the processing itself is described.

For example, suppose that the magnification setting means 14 sets the expansion/contraction ratio of the audio data to an expansion of 1.2 times, and that the speech speed conversion means 13, based on this ratio, adaptively converts the audio data "ashita no tenki wa hare desu" so that its speech speed becomes 1.2 times slower. In this case, as shown for example in FIG. 2(b), the speech speed conversion means 13 expands the speech sections "ashita no tenki wa" and "hare desu" so that each is spoken 1.2 times slower, and shortens the non-speech section after "ashita no tenki wa" and the non-speech section after "hare desu" by the amount of the expansion. In this way, the speech speed is converted while staying within the utterance speed of the original audio data.

In the example shown in FIG. 2(b), the playback time of the speech section "ashita no tenki wa" in the converted audio data exceeds one second.
The time stamp conversion means 18a of the synchronization control means 18 therefore converts the time stamps of the audio data before speech speed conversion, input from the separation/extraction means 11, so that they correspond to the audio data after speech speed conversion, based on the expansion/contraction ratio input from the speech speed conversion means 13.
Specifically, the time stamp attached immediately before the start of "hare desu" in the audio data before speech speed conversion is converted from time t1 to time t1'.
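The time stamp update can be sketched as follows. The text does not give the update formula, so this is an assumption: each interval between consecutive time stamps is scaled by the effective ratio reported for that interval, and the results are accumulated. The function name `convert_audio_timestamps`, the use of milliseconds, and the sample ratios 1.08 and 0.92 are all hypothetical.

```python
def convert_audio_timestamps(timestamps_ms, ratios):
    """Map original audio time stamps to time stamps of the converted audio.

    timestamps_ms: original time stamps (t0, t1, t2, ...) in milliseconds.
    ratios: one effective expansion/contraction ratio per interval.
    """
    if len(ratios) != len(timestamps_ms) - 1:
        raise ValueError("need one ratio per interval between time stamps")
    converted = [timestamps_ms[0]]  # the first time stamp is left unchanged
    for i, ratio in enumerate(ratios):
        original_span = timestamps_ms[i + 1] - timestamps_ms[i]
        # The converted interval is the original interval scaled by its ratio.
        converted.append(converted[-1] + round(original_span * ratio))
    return converted


# t0 = 0 ms, t1 = 1000 ms, t2 = 2000 ms; the first interval runs long and
# the second runs short, so t1 moves to t1' while t2 stays in place.
print(convert_audio_timestamps([0, 1000, 2000], [1.08, 0.92]))
```

Because the second interval gives back what the first one borrowed, the final time stamp returns to its original position and no delay accumulates.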

Furthermore, based on the time stamps of the audio data after speech speed conversion and the time stamps of the video data, the time stamp conversion means 18a compares the playback time of the converted audio data with the playback time of the video data and measures the deviation of the video playback time from the converted audio playback time.
In this embodiment, video data is reproduced in synchronization with the time stamps corresponding to the converted audio data, in order to reduce the sense of incongruity caused by video drifting away from the converted audio during playback. The video and audio need not be perfectly synchronized, however; a deviation small enough that a person does not notice it is acceptable.

For this reason, the time stamp conversion means 18a sets a threshold for the deviation of the video playback time from the audio playback time. The threshold can be set based on the human detection limit for time offsets between video and audio, for which a known value can be used. For example, according to Non-Patent Document 2 ("A study on the relative time difference between television video and audio", Proceedings of the Spring Meeting of the Acoustical Society of Japan, 3-3-6, pp. 461-462, March 1996), when the offset between video and audio is 100 ms or less, people perceive the video and audio as synchronized (they do not detect the offset). Accordingly, the threshold can be set to, for example, 100 ms.

The time stamp conversion means 18a then measures the deviation of the video playback time from the audio playback time and determines whether the deviation is equal to or greater than the threshold. When the deviation is determined to be less than the threshold, as in the example shown in FIG. 2(b), the synchronization control means 18 judges that the video and audio are synchronized and does not interpolate (or delete) video data.

On the other hand, when the time stamp conversion means 18a determines that the deviation is equal to or greater than the threshold, as in the example shown in FIG. 2(c), it interpolates (or deletes) frames of the video data so that the time length of the video data approximately matches the time length of the converted audio data.
That is, the time stamp conversion means 18a compares the time length of video data D1 (the interval from time t0 to t1) with the time length of the converted audio data "ashita no tenki wa" (the interval from time t0 to t1'), and when the interval from time t1 to t1' is 100 ms or more, it determines that the deviation is equal to or greater than the threshold. It then interpolates frames of video data D1 to extend its time length so that it approximately matches the time length of the converted audio data "ashita no tenki wa". As a result, the video data is reproduced from time t0 to approximately time t1'.

Likewise, the time length of video data D2 (the interval from time t1 to t2) is compared with the time length of the converted audio data "hare desu" (the interval from time t1' to t2), and when the interval from time t1 to t1' is 100 ms or more, the deviation is determined to be equal to or greater than the threshold. The time stamp conversion means 18a then deletes some of the 60 image frames constituting video data D2, shortening its time length to fit the audio data so that it approximately matches the time length of the converted audio data "hare desu". In this case, the time stamp at the head of video data D2 is converted from time t1 to time t1'. As a result, the video data is reproduced from approximately time t1' to time t2.
The time stamp conversion means 18a outputs, to the control signal output means 18b of the synchronization control means 18, the time stamps of the converted audio data together with either the updated time stamps of the video data or the non-updated time stamps of the video data (those input from the separation/extraction means 11).
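The threshold decision and the resulting frame count can be worked through numerically. The sketch below assumes the deviation is evaluated per video segment and that the number of frames scales in proportion to the target duration; how frames are actually interpolated or which frames are deleted is not modeled. The name `adjust_video_frames` and the sample durations are hypothetical.

```python
SYNC_THRESHOLD_MS = 100  # detection limit cited in the text


def adjust_video_frames(frame_count, video_ms, audio_ms,
                        threshold_ms=SYNC_THRESHOLD_MS):
    """Return (new_frame_count, frames_added) for one video segment.

    frames_added is positive for interpolation, negative for deletion,
    and zero when the deviation stays below the threshold.
    """
    deviation_ms = audio_ms - video_ms
    if abs(deviation_ms) < threshold_ms:
        # Viewers do not notice the offset, so the video is left untouched.
        return frame_count, 0
    # Keep the frame rate constant: frames scale with the target duration.
    target = round(audio_ms * frame_count / video_ms)
    return target, target - frame_count


# A 60-frame, 1-second segment against three converted audio durations.
print(adjust_video_frames(60, 1000, 1080))  # below threshold
print(adjust_video_frames(60, 1000, 1150))  # audio longer: interpolate
print(adjust_video_frames(60, 1000, 850))   # audio shorter: delete
```

With a 150 ms deviation in either direction, nine frames are added to or removed from the 60-frame segment, which keeps the display rate at 60 frames per second.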

Based on the time stamps of the converted audio data and the time stamps of the video data input from the time stamp conversion means 18a, the control signal output means 18b outputs control signals that cause the audio data storage means 15 and the video data storage means 17 to output audio data and video data, respectively.
That is, at time t0 the control signal output means 18b first outputs a control signal to the audio data storage means 15 so that it outputs the audio data "ashita no tenki wa", and outputs a control signal to the video data storage means 17 so that it outputs video data D1.

Next, at time t1' the control signal output means 18b outputs a control signal to the audio data storage means 15 so that it outputs the audio data "hare desu".
When video data D1 has not been interpolated, the control signal output means 18b outputs a control signal to the video data storage means 17 at time t1 so that it outputs video data D2; when video data D1 has been interpolated, it outputs that control signal at time t1'.

In this way, at timings that follow the time stamps of the converted audio data and the time stamps of the video data, the control signal output means 18b successively outputs control signals that cause the audio data storage means 15 to output audio data and control signals that cause the video data storage means 17 to output video data. The video data is thereby synchronized with the converted audio data and can be reproduced without accumulating time delay.
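The ordering of control signals can be sketched as a merge of the two time stamp sequences. This is only an illustration of the timing relationship: the function name `build_output_schedule` and the tuple layout are hypothetical, and a real device would issue the signals against a running clock rather than build a list up front.

```python
def build_output_schedule(audio_ts_ms, video_ts_ms):
    """Merge audio and video time stamps into one time-ordered signal list.

    Each entry is (time_ms, stream, index), meaning: at time_ms, tell the
    buffer for that stream to output its segment number index.
    """
    events = [(t, "audio", i) for i, t in enumerate(audio_ts_ms)]
    events += [(t, "video", i) for i, t in enumerate(video_ts_ms)]
    return sorted(events)


# Case of FIG. 2(b): the audio time stamp moved from t1 = 1000 ms to
# t1' = 1080 ms, while the video keeps its original time stamp because the
# deviation is below the threshold.
for event in build_output_schedule([0, 1080], [0, 1000]):
    print(event)
```

When video data D1 has been interpolated, the second video time stamp would instead equal the second audio time stamp, so both buffers are triggered together at t1'.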

[Operation of Video/Audio Synchronized Playback Apparatus 1A]
Next, the operation of the video/audio synchronized playback apparatus 1A according to the first embodiment is described with reference to FIG. 3. It is assumed here that the apparatus 1A receives, via the receiving antenna A, a stream transmitted from a content transmission apparatus (not shown) and outputs the received stream to the separation/extraction means 11.

First, the video/audio synchronized playback apparatus 1A sets the expansion/contraction ratio by the magnification setting means 14 (step S101).
Next, the apparatus 1A uses the separation/extraction means 11 to separate and extract, from the stream, the audio data, the video data, the time stamps of the audio data, and the time stamps of the video data (step S102). The separation/extraction means 11 outputs the extracted audio data to the speech speed conversion means 13 and the extracted video data to the video data decoding means 16. It also outputs the extracted time stamps of the audio data and of the video data to the synchronization control means 18. Step S101 and steps S102 to S104 may be performed in either order or simultaneously.

The video/audio synchronized playback apparatus 1A decodes the video data by the video data decoding means 16 and outputs it to the video data storage means 17 (step S103).
The apparatus 1A decodes the audio data by the audio data decoding means 12 and outputs it to the audio data storage means 15 (step S104).
When the speech speed conversion means 13 receives the audio data, it divides the audio data into speech sections and non-speech sections and detects the fundamental-period units of the speech from the audio data. When it receives the expansion/contraction ratio from the magnification setting means 14, it uses this ratio as a guide to adaptively expand or contract the speech sections and non-speech sections of the audio data, and sequentially outputs the converted audio data to the audio data storage means 15 (step S105).

The video/audio synchronized playback apparatus 1A also outputs the expansion/contraction ratio from the speech speed conversion means 13 to the synchronization control means 18.
The time stamp conversion means 18a receives the expansion/contraction ratio from the speech speed conversion means 13, and receives the time stamps of the audio data and of the video data from the separation/extraction means 11.
Based on the input ratio, the time stamp conversion means 18a then converts the time stamps of the audio data before speech speed conversion so that they correspond to the audio data after speech speed conversion (step S106).

The video/audio synchronized playback apparatus 1A uses the time stamp conversion means 18a to measure the deviation of the video playback time from the playback time of the converted audio data and to determine whether the deviation is equal to or greater than the threshold (step S107).
When the time stamp conversion means 18a determines that the deviation is less than the threshold, the apparatus 1A judges that the video and audio are synchronized and does not interpolate (or delete) video data.

On the other hand, when the time stamp conversion means 18a determines that the deviation is equal to or greater than the threshold, the apparatus 1A interpolates (or deletes) frames of the video data so that the time length of the video data approximately matches the time length of the converted audio data (step S108).
The time stamp conversion means 18a also updates the time stamps of the video data so that they correspond to the video data after interpolation (or deletion) (step S109).

The video/audio synchronized playback apparatus 1A then uses the control signal output means 18b to output, at timings that follow the time stamps of the video data, control signals that cause the video data storage means 17 to output video data, and to output, at timings that follow the time stamps of the converted audio data, control signals that cause the audio data storage means 15 to output audio data (step S110).

At the timing of the control signals output from the control signal output means 18b, the video/audio synchronized playback apparatus 1A outputs video data from the video data storage means 17 and audio data from the audio data storage means 15 (step S111).
In this way, the apparatus 1A can output the video data and the audio data in synchronization.

When the video/audio synchronized playback apparatus 1A according to the first embodiment is built into a television receiver, the audio data output from the audio data storage means 15 is reproduced by a speaker (not shown), and the video data output from the video data storage means 17 is displayed on a display device (not shown). Video and audio can thus be reproduced in synchronization.

With the video/audio synchronized playback apparatus 1A according to the first embodiment, when a real-time broadcast program is viewed or recorded video/audio content is played back, the speech speed of the audio data can be converted without accumulating delay from real time, and the converted audio can be reproduced in synchronization with the video, so the user can view the content comfortably.
In addition, because the expansion/contraction ratio of the audio data can be freely selected from multiple steps by the magnification setting means 14, the user can view the video/audio content at a speed that is easy to follow.

The video/audio synchronized playback apparatus 1A according to the first embodiment may be built into a television receiver or into an external recording device equipped with a tuner. For example, when the apparatus 1A is built into an external recording device with a tuner, connecting that device to a television receiver provides the effect of the apparatus 1A when viewing real-time broadcast programs, even if the television receiver itself does not include the apparatus 1A. The same effect is obtained not only when viewing real-time broadcast programs but also for recorded content.
In this case too, the series of processing steps can be realized by the apparatus 1A according to the first embodiment, and the format in which the video data and audio data handled by the separation/extraction means 11 are digitized is not limited to MPEG-2 and may be any format. Decoding, encoding, and synchronization control appropriate to each coding scheme are then required.

<Second Embodiment>
Next, a video/audio synchronized playback apparatus according to a second embodiment is described. In addition to the functions of the apparatus according to the first embodiment, the apparatus according to the second embodiment can detect whether the received stream contains predetermined emergency information.

Here, emergency information means highly urgent information such as breaking news, earthquake early warnings, and tsunami warnings. By its nature, ensuring the immediacy of such information is extremely important. Accordingly, when the video/audio synchronized playback apparatus 1B according to the second embodiment of the present invention receives emergency information superimposed on video/audio content, it ensures the immediacy of the information by reproducing the content at normal playback speed, without speech speed conversion, even if the user has requested speech speed conversion.
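The override described above amounts to forcing the ratio back to 1 while emergency information is present. A minimal sketch, assuming a boolean detection result is available; the names `NORMAL_SPEED` and `ratio_for_playback` are hypothetical:

```python
NORMAL_SPEED = 1.0  # playback ratio of the unconverted audio


def ratio_for_playback(user_ratio, emergency_detected):
    """Ignore the user's setting while emergency information is present."""
    return NORMAL_SPEED if emergency_detected else user_ratio


print(ratio_for_playback(1.4, emergency_detected=False))
print(ratio_for_playback(1.4, emergency_detected=True))
```

The user's selection is left untouched, so the chosen ratio can apply again once emergency information is no longer detected.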

The video/audio synchronized playback apparatus 1B according to the second embodiment of the present invention, described below, differs from the apparatus 1A according to the first embodiment in that it includes an emergency information detection means. In the following description of the apparatus 1B, components shared with the apparatus 1A according to the first embodiment are given the same reference numerals, and duplicate description is omitted.

The video/audio synchronized playback apparatus 1B according to the second embodiment includes separation/extraction means 11, audio data decoding means 12, speech speed conversion means 13, magnification setting means 14, audio data storage means 15, video data decoding means 16, video data storage means 17, synchronization control means 18, and emergency information detection means 19.

The emergency information detection means 19 detects whether the stream received by the receiving antenna A includes predetermined emergency information. The emergency information to be detected can be set in advance, for example at initial setup. The emergency information detection means 19 outputs the determination result to the magnification setting means 14 and outputs the received stream to the separation/extraction means 11.

[Operation of video/audio synchronized playback apparatus 1B]
Next, the operation of the video/audio synchronized playback apparatus 1B will be described with reference to FIG. 3 as appropriate.
The description here focuses on the operations of the emergency information detection means 19, the magnification setting means 14, the speech speed conversion means 13, and the synchronization control means 18 of the video/audio synchronized playback apparatus 1B.

When the emergency information detection means 19 receives the stream received by the receiving antenna A, the video/audio synchronized playback apparatus 1B detects whether the received stream includes predetermined emergency information.
When the emergency information detection means 19 determines that the stream includes predetermined emergency information, the video/audio synchronized playback apparatus 1B outputs a detection result indicating that emergency information has been detected to the magnification setting means 14.

When the magnification setting means 14 receives, from the emergency information detection means 19, a detection result indicating that emergency information has been detected, the video/audio synchronized playback apparatus 1B sets the expansion/contraction magnification to unity (1×, that is, no expansion or contraction) and outputs this magnification to the speech speed conversion means 13.
When the speech speed conversion means 13 receives the expansion/contraction magnification set to unity, the video/audio synchronized playback apparatus 1B outputs the audio data to the audio data storage means 15 as-is, without speech speed conversion. The speech speed conversion means 13 also outputs the unity expansion/contraction magnification to the synchronization control means 18.
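The override described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class and function names are hypothetical, the magnification is assumed to be a multiplier on the audio time length, the return to the user's setting once the emergency ends is an assumption the text does not state, and the resampling in `convert_speech_speed` is only a placeholder for a real speech speed conversion algorithm.

```python
from dataclasses import dataclass

UNITY = 1.0  # "same magnification": the audio time length is left unchanged


@dataclass
class MagnificationSetter:
    """Hypothetical stand-in for the magnification setting means 14."""
    user_ratio: float = 1.0   # ratio chosen by the user (assumed: time-length multiplier)
    emergency: bool = False   # latest result from the emergency information detector

    def on_detection_result(self, emergency_detected: bool) -> None:
        # Called with the determination result from the detection means 19.
        self.emergency = emergency_detected

    def effective_ratio(self) -> float:
        # Emergency information overrides the user's preference.
        return UNITY if self.emergency else self.user_ratio


def convert_speech_speed(samples, ratio):
    """Hypothetical stand-in for the speech speed conversion means 13."""
    if ratio == UNITY:
        return samples  # passed through unmodified, no conversion
    # Placeholder only: naive nearest-neighbour stretch, NOT a real
    # speech speed conversion (which would treat voice and non-voice
    # sections differently).
    n_out = round(len(samples) * ratio)
    return [samples[min(int(i / ratio), len(samples) - 1)] for i in range(n_out)]
```

With `user_ratio=1.5`, the converter stretches the audio until `on_detection_result(True)` is called, after which the same input object is returned untouched.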

When the time stamp conversion means 18a receives the expansion/contraction magnification set to unity from the speech speed conversion means 13, the video/audio synchronized playback apparatus 1B outputs the time stamp of the audio data and the time stamp of the video data received from the separation/extraction means 11 to the control signal output means 18b as-is, without updating them. In other words, when the magnification setting means 14 receives a detection result indicating that emergency information has been detected from the emergency information detection means 19, the video/audio synchronized playback apparatus 1B skips steps S104 to S109 shown in FIG. 3.
Then, by the control signal output means 18b, the video/audio synchronized playback apparatus 1B outputs a control signal that causes the audio data storage means 15 to output the audio data at a timing according to the time stamp of the audio data and the time stamp of the video data input from the time stamp conversion means 18a, and outputs a control signal that causes the video data storage means 17 to output the video data at a timing according to the time stamp of the video data (step S110).
Then, at the timing of the control signals output from the control signal output means 18b, the video/audio synchronized playback apparatus 1B outputs the audio data from the audio data storage means 15 and the video data from the video data storage means 17 (step S111). As a result, the audio data and the video data are played back in real time.
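The pass-through of time stamps at unity magnification can be illustrated with a short sketch. The function name and the integer time stamp representation are assumptions made for illustration. The non-unity branch is deliberately simplified: it only rescales the audio time stamps and omits the threshold-based interpolation or deletion of video data.

```python
def convert_timestamps(audio_pts, video_pts, ratio):
    """Hypothetical stand-in for the time stamp conversion means 18a.

    Returns the (audio, video) time stamp lists that would be handed
    to the control signal output means 18b.
    """
    if ratio == 1.0:
        # Emergency case: steps S104 to S109 are skipped, so both sets
        # of time stamps are forwarded without being updated.
        return list(audio_pts), list(video_pts)

    # Simplified non-emergency case (assumption): rescale the audio time
    # stamps by the magnification relative to the first stamp. Video
    # frame interpolation/deletion is intentionally left out.
    origin = audio_pts[0] if audio_pts else 0
    scaled = [origin + round((t - origin) * ratio) for t in audio_pts]
    return scaled, list(video_pts)
```

Because the unity branch returns the stamps unchanged, the downstream scheduling in steps S110 and S111 runs on the original broadcast timing.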

According to the video/audio synchronized playback apparatus 1B according to the second embodiment, when a stream in which emergency information is superimposed on video/audio content is received, the emergency information can be detected. When emergency information is detected, the apparatus switches to the normal broadcast state without converting the speech speed of the audio data, which ensures the immediacy of the emergency information.

<Third embodiment>
Next, a video/audio synchronization processing apparatus according to a third embodiment of the present invention will be described with reference to FIG. 5.
The video/audio synchronization processing apparatus according to the third embodiment is designed so that the video/audio synchronized playback apparatus according to the first embodiment can be attached externally to a broadcast receiving apparatus (for example, a digital television apparatus). Here, both the input and the output of the video/audio synchronization processing apparatus are television signal streams. Because the video/audio synchronization processing apparatus includes the video/audio synchronized playback apparatus according to the first embodiment, the components of that playback apparatus are denoted by the same reference numerals as in the first embodiment, and redundant description is omitted.

As shown in FIG. 6, the video/audio synchronization processing apparatus 2 according to the third embodiment includes the video/audio synchronized playback apparatus 1A, emergency information detection means 21, and encoding means 22.
The emergency information detection means 21 receives a television signal stream and detects whether the stream includes emergency information. When emergency information is detected, the emergency information detection means 21 outputs the stream to the outside as-is. When emergency information is not detected, it outputs the stream to the video/audio synchronized playback apparatus 1A.
The encoding means 22 encodes the video data and the audio data input from the video/audio synchronized playback apparatus 1A into a television signal stream and outputs the stream to the outside.

[Operation of video/audio synchronization processing apparatus 2]
Next, the operation of the video/audio synchronization processing apparatus 2 will be described. The operation of the video/audio synchronized playback apparatus 1A is as described in the first embodiment, so its description is omitted here.
When the emergency information detection means 21 receives a television signal stream, the video/audio synchronization processing apparatus 2 detects whether the stream includes emergency information. When the emergency information detection means 21 detects emergency information, the video/audio synchronization processing apparatus 2 outputs the stream to the outside as-is.

On the other hand, when the emergency information detection means 21 does not detect emergency information, the video/audio synchronization processing apparatus 2 outputs the stream to the video/audio synchronized playback apparatus 1A. The video/audio synchronized playback apparatus 1A converts the speech speed of the audio data, synchronizes the speech-speed-converted audio data with the video data, and outputs both to the encoding means 22. The encoding means 22 then encodes the audio data and the video data into a television signal stream and outputs the stream to the outside.
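The two paths through the processing apparatus 2 amount to a simple routing decision, sketched below. All names are hypothetical: the detector, the playback apparatus, and the encoder are passed in as plain callables so that the sketch makes no assumption about how any of them is actually implemented.

```python
def process_stream(stream, contains_emergency, playback_1a, encoder_22):
    """Hypothetical routing logic of the video/audio synchronization
    processing apparatus 2.

    contains_emergency(stream) -> bool     stand-in for detection means 21
    playback_1a(stream) -> (audio, video)  stand-in for playback apparatus 1A
    encoder_22(audio, video) -> stream     stand-in for encoding means 22
    """
    if contains_emergency(stream):
        # Bypass path: the original stream goes out untouched, so the
        # television shows the emergency information without added delay.
        return stream
    # Normal path: speech speed conversion and synchronization, then
    # re-encoding into a television signal stream.
    audio, video = playback_1a(stream)
    return encoder_22(audio, video)
```

Returning the input object itself on the bypass path mirrors the "as-is" output in the text: nothing is decoded or re-encoded when emergency information is present.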

According to the video/audio synchronization processing apparatus 2 according to the third embodiment, when a stream in which emergency information is superimposed on video/audio content is received, the emergency information can be detected. When it is detected, the television signal stream bypasses the video/audio synchronized playback apparatus 1A and is output as-is to the external digital television apparatus, which places the digital television apparatus in its normal viewing state and thus ensures the immediacy of the emergency information. In addition, because the effect of the video/audio synchronized playback apparatus 1A can be obtained simply by attaching the processing apparatus to an existing digital television apparatus, the user can continue to use the digital television apparatus they already own, which reduces the introduction cost.

1A Video/audio synchronized playback apparatus
11 Separation/extraction means
12 Audio data decoding means
13 Speech speed conversion means
14 Magnification setting means
15 Audio data storage means
16 Video data decoding means
17 Video data storage means
18 Synchronization control means
18a Time stamp conversion means
18b Control signal output means
1B Video/audio synchronized playback apparatus
19 Emergency information detection means
2 Video/audio synchronization processing apparatus
21 Emergency information detection means
22 Encoding means
A Receiving antenna

Claims

A video/audio synchronized playback apparatus that converts the speech speed of audio by expanding or contracting the time length of the audio of video/audio content, and plays back the video in synchronization with the converted speech speed, the apparatus comprising:
separation/extraction means for receiving input of a stream including the video/audio content, and separating and extracting audio data, video data, a time stamp of the audio data, and a time stamp of the video data from the stream;
audio data decoding means for decoding the audio data separated and extracted by the separation/extraction means;
video data decoding means for decoding the video data separated and extracted by the separation/extraction means;
magnification setting means for presetting an expansion/contraction magnification that serves as a reference for expanding or contracting the time length of the audio data;
speech speed conversion means for dividing the audio data decoded by the audio data decoding means into voice sections, which are sections of speech, and non-voice sections, which are sections other than the voice sections, and converting the speech speed of the audio data by expanding or contracting the time lengths of the voice sections and the non-voice sections, according to the expansion/contraction magnification preset by the magnification setting means, so as to be within the speech speed of the audio data before the speech speed conversion;
audio data storage means for storing the audio data whose speech speed has been converted by the speech speed conversion means;
video data storage means for storing the video data decoded by the video data decoding means;
time stamp conversion means which, when the deviation between the time length of the audio data after the speech speed conversion and the time length of the video data is equal to or greater than a predetermined time, sequentially updates the time stamp of the audio data according to the expansion/contraction magnification, interpolates or deletes the video data so as to eliminate the deviation, and updates the time stamp of the video data according to the video data after the interpolation or deletion, and which, when the deviation is less than the predetermined time, sequentially updates the time stamp of the audio data according to the expansion/contraction magnification; and
control signal output means for outputting a control signal for outputting, in synchronization, the speech-speed-converted audio data stored in the audio data storage means and the video data stored in the video data storage means, based on the time stamp of the audio data updated by the time stamp conversion means and on either the updated time stamp of the video data or the time stamp of the video data input from the separation/extraction means.

The video/audio synchronized playback apparatus according to claim 1, wherein
the magnification setting means allows any one of a plurality of predetermined expansion/contraction magnifications to be selected and set, and
the speech speed conversion means converts the time length of the audio data based on the expansion/contraction magnification selected and set by the magnification setting means.

The video/audio synchronized playback apparatus according to claim 1 or 2, wherein the video/audio content is a broadcast program, the apparatus further comprising:
receiving means for receiving a stream including the video/audio content; and
emergency information detection means for detecting whether predetermined emergency information is included in the stream received by the receiving means, and outputting the stream to the separation/extraction means, wherein
the emergency information detection means, upon detecting the emergency information, notifies the magnification setting means to that effect,
the magnification setting means, upon receiving from the emergency information detection means the notification that the emergency information has been detected, sets the expansion/contraction magnification to unity and outputs the expansion/contraction magnification to the speech speed conversion means,
the speech speed conversion means outputs the expansion/contraction magnification set to unity, input from the magnification setting means, to the time stamp conversion means,
the time stamp conversion means, when the expansion/contraction magnification set to unity is input from the speech speed conversion means, outputs the time stamp of the audio data and the time stamp of the video data input from the separation/extraction means to the control signal output means without updating them, and
the control signal output means outputs a control signal for outputting, in synchronization, the audio data stored in the audio data storage means and the video data stored in the video data storage means,
based on the time stamp of the audio data and the time stamp of the video data input from the time stamp conversion means.

A video/audio synchronization processing apparatus that converts the speech speed of audio included in a stream to be input to a broadcast receiving apparatus and synchronizes the audio data after the speech speed conversion with video data, the apparatus comprising:
the video/audio synchronized playback apparatus according to claim 1 or 2;
encoding means for encoding and outputting the audio data and the video data input from the video/audio synchronized playback apparatus; and
emergency information detection means for receiving input of a stream of a broadcast program including the video/audio content and detecting whether the stream includes predetermined emergency information, wherein
the emergency information detection means outputs the stream to the outside when the emergency information is detected in the stream, and outputs the stream to the video/audio synchronized playback apparatus when the emergency information is not detected in the stream.

A video/audio synchronized playback program which, in order to convert the speech speed of audio by expanding or contracting the time length of the audio of video/audio content and to play back the video in synchronization with the converted speech speed, causes a computer to function as:
separation/extraction means for receiving input of a stream and separating and extracting audio data, video data, a time stamp of the audio data, and a time stamp of the video data from the stream;
audio data decoding means for decoding the audio data separated and extracted by the separation/extraction means;
video data decoding means for decoding the video data separated and extracted by the separation/extraction means, and storing the video data in video data storage means;
magnification setting means for presetting an expansion/contraction magnification that serves as a reference for expanding or contracting the time length of the audio data;
speech speed conversion means for dividing the audio data decoded by the audio data decoding means into voice sections, which are sections of speech, and non-voice sections, which are sections other than the voice sections, and converting the speech speed of the audio data by expanding or contracting the time lengths of the voice sections and the non-voice sections, according to the expansion/contraction magnification preset by the magnification setting means, so as to be within the speech speed of the audio data before the speech speed conversion;
time stamp conversion means which, when the deviation between the time length of the audio data after the speech speed conversion and the time length of the video data is equal to or greater than a predetermined time, sequentially updates the time stamp of the audio data according to the expansion/contraction magnification, interpolates or deletes the video data so as to eliminate the deviation, and updates the time stamp of the video data according to the video data after the interpolation or deletion, and which, when the deviation is less than the predetermined time, sequentially updates the time stamp of the audio data according to the expansion/contraction magnification; and
control signal output means for outputting a control signal for outputting, in synchronization, the speech-speed-converted audio data stored in audio data storage means and the video data stored in the video data storage means, based on the time stamp of the audio data updated by the time stamp conversion means and on either the updated time stamp of the video data or the time stamp of the video data input from the separation/extraction means.