JP2012204876A

JP2012204876A - Reproduction device, reproduction method and program

Info

Publication number: JP2012204876A
Application number: JP2011064900A
Authority: JP
Inventors: Masahiro Sumiya; 政宏角谷; Yoshikazu Shimada; 美和嶋田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2011-03-23
Filing date: 2011-03-23
Publication date: 2012-10-22
Anticipated expiration: 2031-03-23
Also published as: JP5696552B2

Abstract

PROBLEM TO BE SOLVED: To provide a playback device, a playback method, and a program capable of synchronizing video and audio without causing discomfort. SOLUTION: In a device that plays back video and audio, a deviation determination unit determines the amount of deviation, at playback time, between the video data and the audio data played back by a decoding and playback unit. A video importance calculation unit calculates a video importance indicating the degree of complexity of the video played back from the video data. An audio importance calculation unit calculates an audio importance indicating the volume characteristics of the audio played back from the audio data. An automatic correction unit controls playback of the video data and the audio data so that the deviation between them during playback is corrected based on the video importance, the audio importance, and the deviation amount.

Description

The present invention relates to a playback device, a playback method, and a program.

In moving-image content, video and audio are separate pieces of information and therefore must be synchronized. However, in content with a frame rate of 30 fps, for example, the video consists of a fixed 30 frames per second, whereas the audio is a continuous stream with no segments corresponding to individual frames. Video and audio also differ in data volume. In particular, the volume of video data grows dramatically as image quality increases, and even within a single video the amount of data per unit time varies widely with the level of detail on screen. The volume of audio data, by contrast, does not vary much.

Thus, video data is delimited by frames and the like while audio data has no such delimiters, and the two differ in data volume; these factors cause video and audio to drift apart. Synchronizing video and audio, however, relies heavily on manual work and is not necessarily accurate, so achieving synchronization with little deviation is a challenge.

One example that addresses this problem is a playback device that plays back video and audio using separate clocks. Before playing back the moving-image data, this device plays the audio data silently, calculates the error between the audio-data clock and the video-data clock from the playback time, and adjusts the timing during playback. Another method synchronizes the playback positions of the video data and the audio data by changing the playback speed of the audio data in sections where the volume level is below a specific value. In yet another example, the playback speed is changed by dividing the audio data at fixed time intervals and joining the pieces. The playback speed is changed by choosing the cut-out start and end points according to the waveform shape of the audio data near each division point and then joining the waveforms.

Japanese Patent Laid-Open No. 2004-7140
Japanese Patent Laid-Open No. 2006-050362
Japanese Patent Laid-Open No. H06-259093

However, with methods that simply synchronize video and audio by changing the playback speed, as described above, the audio, the video, or both may be interrupted at the moment of synchronization, which viewers find unnatural or unpleasant. In addition, if real-time processing remains impossible for some time, for instance because other processing interrupts, and the video and audio fall out of synchronization, the deviation is difficult to correct. That is, insufficient processing capacity causes problems that viewers find unnatural or unpleasant, such as dropped frames in the played-back video, interrupted audio, and a deviation that persists at a constant offset without being corrected.

Accordingly, an object of the present invention is to provide a playback device and a playback method capable of synchronizing video and audio without causing discomfort.

A playback device according to one aspect acquires a stream containing encoded video and audio and plays back the video and the audio. In this playback device, a signal acquisition unit acquires the stream. A decoding and playback unit separates and decodes the acquired stream to generate video data and audio data, and plays back first video and first audio during a first deviation determination time. A deviation determination unit determines the amount of deviation, at playback time, between the first video and the first audio in the first deviation determination time. A video importance calculation unit calculates a video importance indicating the degree of complexity of second video to be played back in a second deviation determination time that follows the first deviation determination time. An audio importance calculation unit calculates an audio importance indicating the volume characteristics of second audio to be played back in the second deviation determination time. An automatic correction unit controls playback of the second video and the second audio, based on the video importance, the audio importance, and the deviation amount, so as to correct the playback deviation between the video and the audio.

A playback method according to another aspect acquires a stream containing encoded video and audio and plays back the video and the audio. In this playback method, the stream is acquired, first video and first audio are played back based on the stream during a first deviation determination time, and the amount of deviation, at playback time, between the first video and the first audio in the first deviation determination time is determined. In addition, a video importance indicating the degree of complexity of second video to be played back in a second deviation determination time that follows the first deviation determination time is calculated, and an audio importance indicating the volume characteristics of second audio to be played back in the second deviation determination time is calculated. Playback of the second video and the second audio in the second deviation determination time is then controlled, based on the video importance, the audio importance, and the deviation amount, so as to correct the playback deviation between the video and the audio.

A program that causes a computer to carry out the method according to the present invention described above also solves the problem described above, because executing the program on the computer produces the same operation and effects as the method.

According to the aspects described above, a playback device, a playback method, and a program capable of synchronizing video and audio without causing discomfort are provided.

A block diagram showing the configuration of a playback device according to one embodiment.
A block diagram showing the functions of the playback device according to one embodiment.
A block diagram showing the functions of the playback device according to one embodiment.
A flowchart showing the operation of the playback device according to one embodiment.
A flowchart showing the operation of the playback device according to one embodiment.
A flowchart showing the operation of the playback device according to one embodiment.
A flowchart showing the operation of the playback device according to one embodiment.
A diagram explaining a method of calculating the volume for each frequency domain according to one embodiment.
A block diagram showing an example of the hardware configuration of a standard computer.

Embodiments are described below with reference to the drawings. First, the configuration of a playback device 1 according to one embodiment is described with reference to FIGS. 1, 2A, and 2B. FIG. 1 is a block diagram showing the configuration of the playback device according to the present embodiment, and FIGS. 2A and 2B are block diagrams showing the functions of the playback device 1 according to the present embodiment. The playback device 1 has a function of synchronizing audio and video when decoding and playing back video and audio content data in a local environment. The playback device 1 is realized as, for example, a portable information terminal (Personal Data Assistant: PDA), a personal computer (PC), or a mobile phone.

As shown in FIG. 1, the playback device 1 includes an input playback unit 5, a deviation determination unit 7, an audio importance determination unit 9, a video importance determination unit 11, a comparative evaluation unit 13, an automatic correction unit 15, a storage unit 35, and a timer 37, which are connected to one another by a system bus 17 and controlled by a main control unit 3. The input playback unit 5 includes a signal acquisition unit 19, a separation unit 21, an audio decoding unit 23, an audio playback unit 25, a video decoding unit 27, and a video playback unit 29. The automatic correction unit 15 includes an audio operation unit 31 and a video operation unit 33.

As shown in FIGS. 1, 2A, and 2B, the signal acquisition unit 19 acquires a stream 53 containing encoded video data and encoded audio data. The separation unit 21 separates the stream 53 acquired by the signal acquisition unit 19 into a video stream 57 of encoded video data and an audio stream 55 of encoded audio data.

The audio decoding unit 23 decodes the audio stream 55 while performing operations according to the output of the automatic correction unit 15, and generates playable audio data. The audio playback unit 25 plays back audio based on the playback start position information from the automatic correction unit 15 and the audio data from the audio decoding unit 23. The video decoding unit 27 decodes the video stream 57, performs operations according to the output of the automatic correction unit 15, and generates playable video data. The video playback unit 29 plays back video based on the playback start position information from the automatic correction unit 15 and the video data from the video decoding unit 27.

The deviation determination unit 7 determines the amount of deviation between the audio played back by the audio playback unit 25 and the video played back by the video playback unit 29. The audio importance determination unit 9 calculates an audio importance SL based on the audio data decoded by the audio decoding unit 23. The video importance determination unit 11 calculates a video importance IL based on the video stream 57 separated by the separation unit 21. The comparative evaluation unit 13 compares the calculated audio importance SL with the video importance IL.

The automatic correction unit 15 is a device that automatically corrects the playback state of the content data based on the comparison result from the comparative evaluation unit 13; its audio operation unit 31 operates on the audio data and its video operation unit 33 operates on the video data. The audio operation unit 31 performs at least one of a silence thinning operation 61, a silence interpolation operation 63, and a playback speed change operation 65, based on the deviation amount and the audio data. The video operation unit 33 performs a frame thinning operation 67 or a frame interpolation operation 69, based on the deviation amount and the video data.

The storage unit 35 is a Random Access Memory (RAM), a Read Only Memory (ROM), or the like, and stores the program for performing the operations described above, the video stream 57, the decoded video data, the audio stream 55, the decoded audio data, and so on. The main control unit 3 is an arithmetic processing unit that controls the operation of the playback device 1.

The operation of the playback device 1 according to the present embodiment is described below with reference to FIGS. 3 to 5. FIGS. 3 and 4A to 4C are flowcharts showing the operation of the playback device 1. As shown in FIG. 3, the signal acquisition unit 19 acquires a stream containing encoded audio data and video data, such as moving-image content, and the separation unit 21 separates the stream into a video stream and an audio stream (S100). The main control unit 3 causes the deviation determination unit 7 to set threshold A (S101). The audio decoding unit 23 decodes the audio stream 55 a certain buffer's worth at a time; this buffer length is used as threshold A and can be, for example, 1 second.

Next, the main control unit 3 causes the deviation determination unit 7 to calculate threshold B as the number of video frames corresponding to threshold A (S101). That is, the deviation determination unit 7 calculates, from the stream 53, the number of frames corresponding to the duration of threshold A and uses it as threshold B. For example, when threshold A = 1 second and the frame rate of the video is 30 fps, threshold B = 30 frames.
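The relationship between threshold A and threshold B can be sketched as follows. This is a minimal illustration, not the device's implementation: the function name `frames_for_duration` and the choice to round to the nearest whole frame are assumptions not stated in the source.

```python
def frames_for_duration(threshold_a_seconds: float, frame_rate_fps: float) -> int:
    """Return threshold B: the number of video frames spanning threshold A.

    Rounding to the nearest frame is an assumption; the source only gives
    the example 1 second at 30 fps -> 30 frames.
    """
    return round(threshold_a_seconds * frame_rate_fps)


# Example from the text: threshold A = 1 second at 30 fps.
threshold_b = frames_for_duration(1.0, 30.0)
```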

The input playback unit 5 starts playback of the video and audio and starts an audio timer and a video counter (not shown) (S102). The audio timer has a timekeeping function and measures the duration of the audio played back so far by counting the number of audio samples played. That is, the audio timer value is the number of played audio samples divided by the audio sampling rate. The video counter has a counting function and counts the number of video frames displayed.

The input playback unit 5 plays back one frame of the video (S103). That is, the audio decoding unit 23 decodes the audio stream 55 to create audio data, and the audio playback unit 25 plays back the audio data. Likewise, the video decoding unit 27 decodes the video stream 57 to create video data, and the video playback unit 29 plays back the video.

Once one frame has been played back, the deviation determination unit 7 compares the video counter value ICt, which indicates the number of video frames played back so far, with threshold B (S104). If the deviation determination unit 7 determines that the video counter value ICt is less than threshold B (S104: Yes), it determines whether the frame currently being played back by the video playback unit 29 is at the end of a Group of Pictures (GOP) of the video stream (S107). If the frame currently being played back is not at a GOP end (S107: No), the process returns to S103 and one more frame is played back.

If the portion currently being played back is at a GOP end (S107: Yes), the deviation between audio and video is determined at the point where the number of frames in one GOP has been played back. The section over which the deviation is determined is referred to here as the deviation determination section JA. In this case, the deviation determination section JA equals the GOP length.

If the video counter value ICt is not less than threshold B (S104: No) and equals threshold B (S105: Yes), the main control unit 3 advances the process to S108. In this case, the deviation determination section JA equals threshold B. If the video counter value ICt is greater than threshold B (S105: No), the main control unit 3 outputs an error (S106) and ends the process.

Next, the deviation determination unit 7 calculates the difference between the audio timer value ST and the number of played video frames multiplied by threshold A and divided by threshold B (the time corresponding to the played video frames). That is, the deviation determination unit 7 calculates, as the deviation amount L1, the value (audio timer value ST − threshold A × (video counter value ICt / threshold B)). The deviation determination unit 7 then determines whether the calculated value is less than a predetermined allowable deviation time AT (S108). The allowable deviation time AT can be, for example, 1/30 second.

When (audio timer value ST − threshold A × (video counter value ICt / threshold B)) < (allowable deviation time AT) (S108: Yes), the deviation determination unit 7 determines that the video and audio are synchronized and that no correction is needed. In this case, the deviation determination unit 7 sets threshold A = audio timer value ST and threshold B = video counter value ICt, then resets the audio timer value ST to 0 and the video counter value ICt to 0 (S109), and the process returns to S103.

When (audio timer value ST − threshold A × (video counter value ICt / threshold B)) < (allowable deviation time AT) does not hold, the deviation determination unit 7 determines that the video and audio are not synchronized (S108: No), and the main control unit 3 advances the process to the flowchart of FIG. 4A.

As shown in FIG. 4A, the main control unit 3 causes the automatic correction unit 15 to set (deviation time L1) = (audio timer value ST − threshold A × (video counter value ICt / threshold B)) (S131). The deviation time L1 represents the time by which video and audio have drifted apart: it is the played audio time minus the played video time, and it is negative when the audio is lagging and positive when the video is lagging.
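The deviation check of S108 and the deviation time of S131 can be illustrated with a short sketch. The function names below are hypothetical; the formulas follow the text (audio timer value = played samples / sampling rate, L1 = ST − A × (ICt / B)). Note that the text compares the signed value against AT, and this sketch does the same; comparing the magnitude instead would be a design change not described in the source.

```python
def audio_timer_value(played_samples: int, sampling_rate_hz: int) -> float:
    """Audio timer value ST: played audio time in seconds."""
    return played_samples / sampling_rate_hz


def deviation_time(st: float, threshold_a: float, ict: int, threshold_b: int) -> float:
    """Deviation L1 = ST - A * (ICt / B).

    Negative when the audio lags the video, positive when the video lags.
    """
    return st - threshold_a * (ict / threshold_b)


def is_synchronized(l1: float, allowable_time: float = 1 / 30) -> bool:
    """S108 as written in the text: the signed deviation is below AT."""
    return l1 < allowable_time


# One second of audio played (48 kHz), but only 27 of 30 frames shown.
st = audio_timer_value(48_000, 48_000)
l1 = deviation_time(st, threshold_a=1.0, ict=27, threshold_b=30)
needs_correction = not is_synchronized(l1)
```

In this example the video is about 0.1 second behind the audio, which exceeds the 1/30 second tolerance, so the flow would proceed to the correction steps.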

The main control unit 3 causes the video decoding unit 27 to determine the GOP structure of the portion of the video stream 57 to be played back next, and determines whether it contains a bi-directional predicted frame (B-frame) (S132). For example, it is determined whether the GOP structure is "Inter Video Bitrate Balance Profile: IBBP".

If no B-frame is included (S132: No), the audio is operated on preferentially, so the automatic correction unit 15 advances the process to S135 in FIG. 4B. If a B-frame is included (S132: Yes), the audio importance SL and the video importance IL are used to determine whether operating preferentially on the video or on the audio would cause less discomfort.

In S133, the video importance IL, which indicates how important the video is, and the audio importance SL, which indicates how important the audio is, are calculated. The processing of S133 is described below. The video importance IL is calculated as a value indicating the complexity of the video. The audio importance SL is calculated as a value indicating the volume characteristics of the audio.

First, the method of calculating the video importance IL is described. Before the video is decoded, the video importance determination unit 11 analyzes, for example, H.264 parameter information, and calculates the video importance IL by normalizing and then adding three parameters: (−1) × the quantization coefficient (Quantization Parameter: QP), the amount of data per frame before decoding, and the total amount of motion vectors.

Because analyzing the video itself while playing it back imposes a heavy processing load, the video importance IL is obtained from the quantization coefficient, the frame (picture) size, and the total amount of motion vectors that are attached when the video is encoded. Here, the video importance based on the quantization coefficient is called the quantization importance IL1, the video importance based on the frame size is called the size importance IL2, and the video importance based on the total amount of motion vectors is called the vector importance IL3.

The method of calculating the quantization importance IL1 is described below. The quantization importance IL1 is calculated based on the quantization coefficient QP that is set for each macroblock (MB) at encoding time. The quantization coefficient QP is a parameter that is set, when video data is compressed, according to the complexity of the video and how hard degradation is to notice, so that the target data size is achieved. The quantization coefficient QP is stored in each MB header as a difference from the QP of the immediately preceding MB, and the quantization value per picture, QPp, can be calculated by Equation 1 below.
QPp = 26 + PIQM + Σ(SQD + (ΣMQD / Mb)) / (NSGM + 1) ... (Equation 1)

Here, the first Σ denotes the sum over all slices in the picture, and the second Σ denotes the sum over all macroblocks in each slice. The variables are as follows.

In Equation 1, PIQM denotes pic_init_qp_minus26, a value defined in the Picture Parameter Set (PPS) that sets the initial value of QP; it holds the actual initial value minus 26. SQD denotes slice_qp_delta, a value defined in the slice header that sets the initial value of QP for each slice. MQD denotes mb_qp_delta, a value defined for each macroblock that is the difference between the quantization parameter QP of that macroblock and that of the immediately preceding macroblock. Mb denotes Macroblocks, a value defined in the slice header that gives the number of macroblocks in the slice. NSGM denotes num_slice_groups_minus1, a value defined in the PPS that represents the number of slices in the picture minus 1.
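Equation 1 can be read as the following sketch, which assumes the syntax elements have already been parsed from the bitstream. The function name `picture_qp` and the input layout (a list of `(slice_qp_delta, [mb_qp_delta, ...])` pairs, one per slice) are hypothetical; the number of slices is taken as the length of that list, which plays the role of NSGM + 1 in the equation.

```python
from typing import List, Sequence, Tuple


def picture_qp(pic_init_qp_minus26: int,
               slices: Sequence[Tuple[int, List[int]]]) -> float:
    """Per-picture quantization value QPp following Equation 1.

    Each slice contributes SQD plus the mean of its MQD values (sum / Mb);
    the slice contributions are then averaged over the slice count (NSGM + 1).
    """
    per_slice_total = 0.0
    for slice_qp_delta, mb_qp_deltas in slices:
        mb_count = len(mb_qp_deltas)  # Mb: macroblocks in this slice
        per_slice_total += slice_qp_delta + sum(mb_qp_deltas) / mb_count
    slice_count = len(slices)  # NSGM + 1
    return 26 + pic_init_qp_minus26 + per_slice_total / slice_count


# Two slices: deltas of +2 and +4, macroblock deltas averaging to zero.
qpp = picture_qp(0, [(2, [1, -1]), (4, [0, 0])])
```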

Furthermore, summing the per-picture quantization value QPp over the pictures in the deviation determination section JA and taking the average gives the picture average QPa:
Picture average QPa = ΣQPp / deviation determination section JA ... (Equation 2)
Here, Σ denotes the sum over the number of pictures included in the deviation determination section JA (that is, the GOP length or threshold B).

The video importance determination unit 11 calculates the quantization importance IL1 as in Equation 3. That is, by setting
Quantization importance IL1 = 102 − (2 × picture average QPa) ... (Equation 3)
the quantization importance IL1 is normalized to a value in the range of 1 to 100.
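Equations 2 and 3 together amount to averaging the per-picture values over the deviation determination section and mapping the result linearly. A minimal sketch follows, with the hypothetical function name `quantization_importance` and a plain list of per-picture QPp values as input. It does not clamp the result, since the text describes no clamping step.

```python
from typing import Sequence


def quantization_importance(picture_qps: Sequence[float]) -> float:
    """IL1 = 102 - 2 * QPa, where QPa is the mean QPp over section JA."""
    qpa = sum(picture_qps) / len(picture_qps)  # Equation 2
    return 102 - 2 * qpa                       # Equation 3


# A lower average QP (finer quantization) yields a higher importance.
il1 = quantization_importance([26, 30])
```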

The method of calculating the size importance IL2 is described below. The size importance IL2 is calculated, as one of the parameters for calculating the video importance, from the data size of each picture. The data size of a picture is measured according to the nal_unit_type of its Network Abstraction Layer (NAL) units. The data size can be measured as the binary size of the NAL units whose nal_unit_type is 5 (slices of an Instantaneous Decoding Refresh (IDR) picture) or 1 (slices of a non-IDR picture). To normalize this with respect to the dimensions of the video, the picture data size is divided by the picture size (the number of vertical pixels × the number of horizontal pixels of one picture). The video importance determination unit 11 then totals these values over the pictures in the deviation determination section JA and uses the result as the video data amount DV. That is,
Video data amount DV = Σ(picture data size / picture size) ... (Equation 4)

Here, "Σ" denotes the sum over the pictures included in the shift determination section JA. The video importance determination unit 11 also keeps the average of the video data amount DV over the portion reproduced so far. Denoting this average as (average size), the size importance IL2 is given by Expression 5:
Size importance IL2 = min(((video data amount DV) / (average size)) × 50, 100) (Expression 5)
Expression 5 normalizes the size importance IL2 to a value in the range of 1 to 100.
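
A minimal sketch of Expressions 4 and 5 follows. The cap is implemented as a minimum, since the text states that the result does not exceed 100. The function names and the byte-count inputs are assumptions made for illustration.

```python
def video_data_amount(picture_bytes, width, height):
    """Expression 4: sum of (picture data size / picture size) over the section JA."""
    pixels = width * height
    return sum(size / pixels for size in picture_bytes)


def size_importance(dv, average_dv):
    """Expression 5: DV relative to the running average, scaled by 50 and capped at 100."""
    return min(dv / average_dv * 50, 100)
```

A section whose data amount equals the running average therefore scores 50, and anything at or above twice the average saturates at 100.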

The method of calculating the vector importance IL3 from the total motion vector amount of a picture is described below. The video importance determination unit 11 first sums the motion vector lengths over all macroblocks of each frame and takes the average per macroblock. The vector importance IL3 is this average multiplied by, for example, 10, with the upper limit capped at 100. An example is Expression 6:
Vector importance IL3 = min((Σ (motion vector length of each macroblock) / number of macroblocks in the picture) × 10, 100) (Expression 6)

Here, "Σ" denotes the sum over the macroblocks in one picture. Expression 6 normalizes the vector importance IL3 to a value in the range of 1 to 100.
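
Expression 6 can be sketched as below, again with the cap read as a minimum. The sketch assumes one (dx, dy) motion vector per macroblock and uses the Euclidean length; both are illustrative assumptions, as the text does not define how the vector length is measured.

```python
import math


def vector_importance(motion_vectors):
    """Average motion vector length per macroblock, times 10, capped at 100."""
    if not motion_vectors:
        return 0.0
    total = sum(math.hypot(dx, dy) for dx, dy in motion_vectors)
    return min(total / len(motion_vectors) * 10, 100)
```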

The video importance determination unit 11 calculates the video importance IL from the quantization importance IL1, the size importance IL2, and the vector importance IL3 obtained above. For example, the video importance IL is their arithmetic mean, that is, video importance IL = (IL1 + IL2 + IL3) / 3.

Next, the calculation of the audio importance SL is described. The audio importance determination unit 9 transforms, for example, one frame's worth of the audio data decoded by the audio decoding unit 23 into the frequency domain. It then integrates the amplitude over each predetermined frequency range and takes the result as the volume for each frequency over the duration of one frame.

FIG. 5 is a diagram explaining how the volume for each frequency range is calculated. In FIG. 5, the vertical axis indicates the amplitude x(i) corresponding to the volume, and the horizontal axis indicates the frequency i on a logarithmic scale. FIG. 5 shows, for example, the result of converting one frame of audio data into frequency-domain amplitudes. The horizontal axis is divided into frequency ranges of 10^k to 10^(k+1) (k is an integer). The integral of the amplitude over each frequency range corresponds to the volume of that range (hereinafter, the volume for each frequency). Each frame is given a frame number j indicating its temporal order, and the volume SVa(j) for each frequency at frame number j is expressed by Expression 7.

According to Expression 7, the volume SVa(j) for each frequency is normalized to the range of 0 to 100.

The method of calculating the variation importance SL1 from the volume SVa(j) is described below. The variation importance SL1 is the average amount of change in audio level per frequency band. For each volume SVa(j) of Expression 7, the difference from the previous sample (here, the previous frame) is taken; the absolute differences are summed over the shift determination time corresponding to the audio timer value ST in S108 and over the frequency ranges covering all frequencies, and the average is calculated. With FN the number of frames in the shift determination time and 10^0 to 10^N (N is a natural number) the entire frequency range, the variation importance SL1 is expressed by Expression 8:
Variation importance SL1 = (1 / N) Σ (Σ |SVa(j) − SVa(j−1)|) / FN (Expression 8)

Here, the first "Σ" denotes the sum over all frequency ranges, and the second "Σ" denotes the sum over the frames in the shift determination time (j = 1 to FN). Expression 8 yields the variation importance SL1 as a value from 0 to 100.
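
Expression 8 can be sketched as a double sum. The input layout is an assumption for illustration: `band_volumes[j][b]` holds the volume SVa(j) of frequency band b in frame j, and row 0 is the frame immediately before the shift determination time, so that frames 1 to FN each have a predecessor.

```python
def variation_importance(band_volumes):
    """SL1: mean absolute frame-to-frame volume change, averaged over bands and frames."""
    frame_count = len(band_volumes) - 1  # FN
    band_count = len(band_volumes[0])    # N
    if frame_count < 1 or band_count < 1:
        raise ValueError("need at least two frames and one frequency band")
    total = 0.0
    for b in range(band_count):
        for j in range(1, frame_count + 1):
            total += abs(band_volumes[j][b] - band_volumes[j - 1][b])
    return total / band_count / frame_count
```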

Next, the method of calculating the volume importance SL2 from the volume SVa(j) is described. The volume importance SL2 represents the overall volume level: the volumes SVa for each frequency are summed over the frames in the shift determination time corresponding to the audio timer value ST in S108 and over the frequency ranges covering all frequencies, and the average is calculated, as in Expression 9:
Volume importance SL2 = (1 / N) ΣΣ SVa(j) / FN (Expression 9)

Here, the first "Σ" denotes the sum over all frequency ranges (10^0 to 10^N), and the second "Σ" denotes the sum over the frames in the shift determination time (j = 1 to FN). The volume importance SL2 is calculated as a value from 0 to 100.

The audio importance determination unit 9 sets the audio importance SL from the variation importance SL1 and the volume importance SL2 calculated above. For example, audio importance SL = (variation importance SL1 + volume importance SL2) / 2.
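
Expression 9 and the example combination of SL1 and SL2 can be sketched as follows. The input layout (`band_volumes[j][b]` for the FN frames of the shift determination time and N bands) and the function names are illustrative assumptions.

```python
def volume_importance(band_volumes):
    """SL2: volume SVa(j) averaged over all bands and all frames of the section."""
    frame_count = len(band_volumes)    # FN
    band_count = len(band_volumes[0])  # N
    total = sum(sum(frame) for frame in band_volumes)
    return total / band_count / frame_count


def audio_importance(sl1, sl2):
    """SL as the plain mean of the variation importance and the volume importance."""
    return (sl1 + sl2) / 2
```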

In S134 of FIG. 4A, the comparative evaluation unit 13 compares the video importance IL and the audio importance SL calculated above. If the video importance IL is greater than the audio importance SL (S134: Yes), the comparative evaluation unit 13 advances the process to S135 of FIG. 4B so that the audio is operated on preferentially. If the audio importance SL is greater than the video importance IL (S134: No), it advances the process to S145 of FIG. 4C so that the video is operated on preferentially. The automatic correction unit 15 then synchronizes the video and audio according to the result of this determination.
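
The averaging of the three video terms and the S134 branch can be sketched as below. The text does not say what happens when IL equals SL; this sketch treats a tie as "No" (operate on the video), which is an assumption, as are the function names.

```python
def video_importance(il1, il2, il3):
    """IL as the arithmetic mean of IL1, IL2 and IL3."""
    return (il1 + il2 + il3) / 3


def stream_to_operate(il, sl):
    """S134: operate on the audio when IL > SL, otherwise on the video."""
    return "audio" if il > sl else "video"
```

The more important stream is left untouched and the less important one absorbs the correction.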

First, the case where the audio is operated on preferentially is described. As shown in FIG. 4B, in S135 the audio operation unit 31 of the automatic correction unit 15 determines whether the audio portion to be operated on (the portion of the shift determination time about to be reproduced) contains a part that is close to silence or a part whose audio variation can be judged small. A part can be judged close to silence when, for example, the volume importance SL2 is no greater than the audio importance SL calculated in advance for white noise. The audio variation can be judged small when, for example, the variation importance SL1 is no greater than the variation importance SL1 calculated in advance for white noise.

If it is determined in S135 that the audio portion to be operated on contains a part that is close to silence or a part whose audio variation can be judged small (hereinafter, these two kinds of parts are collectively called the silence-equivalent portion) (S135: Yes), the audio operation unit 31 determines whether the shift time L1 is positive (S136). If the shift time L1 is positive (S136: Yes), the audio is ahead, so the audio operation unit 31 stretches the audio by the shift time L1 by repeatedly reproducing the silence-equivalent portion, which completes the synchronization process (S137).

If it is determined in S136 that the shift time L1 is negative, that is, the audio lags behind the video (S136: No), the audio operation unit 31 determines whether the absolute value of the shift time L1 is longer than the duration of the silence-equivalent portion (the silence-equivalent time) (S138). If the absolute value of the shift time L1 is longer than the silence-equivalent time (S138: Yes), the audio operation unit 31 deletes the silence-equivalent portion without reproducing it (S139), sets shift time L1 = shift time L1 + silence-equivalent time (S140), and advances the process to S142. If the absolute value of the shift time L1 is no longer than the silence-equivalent time (S138: No), the audio operation unit 31 deletes a part of the silence-equivalent portion equal in length to the shift time L1, which completes the synchronization process (S141), and advances the process to S153.

If there is no silence-equivalent portion in S135 (S135: No), the audio operation unit 31 determines whether the speed change rate VC = audio timer value ST / (audio timer value ST + |shift time L1|) is at least a certain value (S142). The certain value used for this determination can be, for example, 0.8.

If the speed change rate VC is less than the certain value (S142: No), the audio operation unit 31 advances the process to S145. This is because, when the audio playback speed is changed beyond a certain degree, the unnaturalness caused by the changed audio speed is judged to be greater than that caused by operating on the video. Alternatively, the audio operation unit 31 may first change the audio playback speed by the ratio corresponding to the certain value and then advance the process to S145.

If the speed change rate VC is at least the certain value (S142: Yes), the audio operation unit 31 sets audio playback speed change value = (audio timer value ST + shift time L1) / audio timer value ST (S143) and changes the audio playback speed according to this value, which completes the synchronization process (S144).
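
The branch from S142 to S144 can be sketched as below. The function name, the tuple return value and the default threshold argument are illustrative assumptions; only the two formulas and the example threshold of 0.8 come from the text.

```python
def audio_speed_decision(st, l1, min_rate=0.8):
    """Return ("audio", speed_change_value) or ("video", None) following S142 to S144."""
    vc = st / (st + abs(l1))      # speed change rate VC
    if vc < min_rate:             # S142: No, fall through to the video branch (S145)
        return ("video", None)
    return ("audio", (st + l1) / st)  # S143: audio playback speed change value
```

With ST = 2.0 s and L1 = 0.25 s, VC is about 0.89, so the audio branch is taken with a speed change value of 1.125.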

The method of changing the audio playback speed is described here. The method adopted converts the playback speed of digital audio data without changing its pitch, by omitting or inserting parts of the digital audio data.

The audio operation unit 31 first divides the audio data of the portion to be operated on, that is, the audio corresponding to the shift determination section JA about to be reproduced, into sections of a fixed period and decides the ratio to omit or stretch. For example, to omit 10%, the audio operation unit 31 omits roughly one section in every 10 of the divided audio data; to stretch by 10%, it inserts roughly one section in every 10. The audio operation unit 31 then operates on the audio data according to that ratio. Various conventional methods, such as the method described in Patent Document 3, can be applied to change the audio playback speed in this way.
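
The "one section in every 10" idea can be sketched naively as below. This is not the method of Patent Document 3: it simply drops or repeats whole sections and does no cross-fading or pitch-period alignment, which a real implementation would likely need to avoid audible clicks. The function and parameter names are assumptions.

```python
def change_speed_by_sections(samples, section_len, percent, stretch):
    """Drop (stretch=False) or repeat (stretch=True) about one section in every 100/percent."""
    step = round(100 / percent)
    sections = [samples[i:i + section_len] for i in range(0, len(samples), section_len)]
    out = []
    for index, section in enumerate(sections, start=1):
        if index % step == 0:
            if stretch:
                out.extend(section)  # keep the section...
                out.extend(section)  # ...and insert a copy of it
            # when shortening, the section is simply omitted
        else:
            out.extend(section)
    return out
```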

When the video is operated on, as shown in FIG. 4C, the video operation unit 33 first determines the sign of the shift time L1 (S145). If the shift time L1 is positive (S145: Yes), the video lags behind the audio, so frame thinning is performed; if it is negative (S145: No), the video is ahead of the audio, so frame interpolation is performed.

In the frame thinning process, the video operation unit 33 first calculates the number of frames to thin out (S146), namely number of frames to thin out = shift time L1 / frame rate (Frames per Second: FPS). The result is rounded to the nearest integer.

If the video to be operated on (the portion of the shift determination section JA about to be reproduced) contains non-reference pictures, the video operation unit 33 thins them out, up to the number of frames to thin out, so that they are not included in the reproduced video data (S147). If the number of non-reference pictures falls short of the number of frames to thin out, the video operation unit 33 next thins out B pictures that follow a P picture, starting with those with the highest quantization coefficient QP (S148). If the number is still not reached after S148, the video operation unit 33 thins out reference pictures such as I frames and P frames. When a reference picture is thinned out in this way, the picture is still decoded for the sake of the following frames (S149).

As described above, to reduce the effect on subsequent frames, the video operation unit 33 first thins out non-reference frames and then B frames in an IBBP structure. It then thins out frames in an IPPP structure in descending order of quantization coefficient QP. When a thinned frame is not a non-reference frame, the video operation unit 33 still decodes it for the sake of subsequent frames. This completes the synchronization process.
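
The thinning priority of S147 to S149 can be sketched as a sort key. The dictionary layout (`index`, `type`, `is_reference`, `qp`) is hypothetical, and ordering non-reference pictures among themselves by QP is an assumption, since the text gives no order within that tier.

```python
def select_frames_to_drop(pictures, drop_count):
    """Pick pictures to skip: non-reference first, then B pictures, then I/P, high QP first."""
    def priority(picture):
        if not picture["is_reference"]:
            tier = 0                     # S147: non-reference pictures
        elif picture["type"] == "B":
            tier = 1                     # S148: B pictures
        else:
            tier = 2                     # S149: reference pictures such as I and P
        return (tier, -picture["qp"])    # within a tier, highest QP goes first

    ranked = sorted(pictures, key=priority)
    return sorted(picture["index"] for picture in ranked[:drop_count])
```

Pictures selected from tiers 1 and 2 would still have to be decoded, only not displayed, so that later pictures can reference them.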

If the shift time L1 is negative (S145: No), frame interpolation is performed. The video operation unit 33 calculates number of frames to interpolate = −shift time L1 / FPS (S150). The result is rounded to the nearest integer.
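
The frame counts of S146 and S150 can be sketched as below. The text writes the count as shift time L1 divided by FPS; this sketch assumes L1 is given in seconds and converts it to frames by dividing by the frame period (1 / fps), that is, multiplying by the frame rate. That reading, the function name and the return format are assumptions. Rounding is half-up, as the text specifies rounding rather than truncation.

```python
import math


def frames_to_adjust(l1_seconds, fps):
    """Return ("thin" | "interpolate" | "none", frame_count) for a shift time in seconds."""
    count = math.floor(abs(l1_seconds) * fps + 0.5)  # round half up
    if l1_seconds > 0:
        return ("thin", count)         # video lags the audio: drop frames
    if l1_seconds < 0:
        return ("interpolate", count)  # video leads the audio: insert frames
    return ("none", 0)
```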

As frames to interpolate, the video operation unit 33 creates the average frame of a frame and its next frame, preferring frames with as high a quantization coefficient QP as possible (S151). For example, it calculates the average of each pixel value of the frames before and after the target position and inserts the result between those two frames as the interpolated frame.
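
The averaged frame of S151 can be sketched for a single plane of integer pixel values. Representing a frame as a list of rows and rounding down with integer division are illustrative assumptions.

```python
def average_frame(prev_frame, next_frame):
    """Per-pixel average of two equally sized frames given as lists of rows."""
    return [
        [(a + b) // 2 for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]
```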

During frame interpolation, the video operation unit 33 performs interpolation for the number of frames to interpolate so that the interpolated frames are distributed as evenly as possible over the threshold B (S152). After the required number of frames has been interpolated, the process advances to the initialization process of S153.
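
One way to spread the interpolated frames evenly, as S152 asks, is to place each one at the middle of an equal share of the section. The sketch ignores the preference for high-QP frames mentioned for S151, and the function name is an assumption.

```python
def interpolation_positions(frame_count, insert_count):
    """Frame indices after which to insert, centered in insert_count equal slices."""
    return [
        ((2 * k + 1) * frame_count) // (2 * insert_count)
        for k in range(insert_count)
    ]
```

For a 30-frame section and three frames to insert, this gives positions 5, 15 and 25.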

In the initialization process of S153, the main control unit 3 assigns the reproduced audio time (audio timer value ST) to the threshold A and the number of displayed video frames (moving image counter value ICt) to the threshold B. It then resets the audio timer value ST = 0 and the moving image counter value ICt = 0, returns to S103, and repeats the process.

なお、本実施の形態の分離部２１、音声復号部２３、音声再生部２５、映像復号部２７、映像再生部２９は、本発明の復号再生部の一例である。音声重要度判定部９は、音声重要度算出部の一例であり、映像重要度判定部１１は、映像重要度算出部の一例である。 Note that the separation unit 21, the audio decoding unit 23, the audio reproduction unit 25, the video decoding unit 27, and the video reproduction unit 29 of the present embodiment are examples of the decoding and reproduction unit of the present invention. The audio importance level determination unit 9 is an example of an audio importance level calculation unit, and the video importance level determination unit 11 is an example of a video importance level calculation unit.

As described above, the playback device 1 according to the present embodiment calculates the video importance IL and the audio importance SL. The video importance IL indicates the degree of complexity of the video and is calculated from the video stream before decoding. The audio importance SL indicates the characteristics of the audio volume and is calculated from the decoded audio data. The playback device 1 synchronizes by determining, from the video importance IL, the audio importance SL, and the shift time L1 in the shift determination section JA, which of the video and the audio to operate on preferentially.

Therefore, in the playback device 1 according to the present embodiment, the video importance IL and the audio importance SL are not calculated while the corresponding video itself is being reproduced, so the calculation does not affect reproduction. The playback device 1 can also perform synchronization using a method that is expected to have little effect on the viewer. As a result, the viewer is less likely to experience discomfort caused by insufficient processing capacity, such as dropped frames in the reproduced video, interrupted audio, or a shift that persists at a fixed offset without being corrected.

The present invention is not limited to the embodiment described above, and various configurations or embodiments can be adopted without departing from the gist of the invention. For example, in the initialization process of S153 the audio timer value ST is set as the threshold A and the moving image counter value ICt as the threshold B, but the invention is not limited to this. For example, each time S153 is performed, the audio timer value ST and the moving image counter value ICt may be stored in the storage unit 35, and their respective averages may be calculated and assigned as the initial values of the threshold A and the threshold B.

The threshold A may be made smaller when the variation importance SL1 is large and larger when it is small. For example, the calculated variation importance SL1 is stored in the storage unit 35 and its statistical distribution is calculated; whether the amount of audio change is large is determined by whether the current variation importance SL1 falls within the top 20% of that distribution. The threshold A may then be updated successively, for example to 0.5 seconds when the amount of audio change is determined to be large and to 1 second otherwise. This is because viewers are considered more likely to notice a lip-sync shift in portions where the amount of change is large.
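
The adaptive threshold A can be sketched with an empirical rank over the stored SL1 values. Using the fraction of stored values above the current one as the "statistical distribution", and falling back to the longer threshold when no history exists, are assumptions of this sketch, as are the names.

```python
def updated_threshold_a(sl1_history, current_sl1, short=0.5, long=1.0, top_fraction=0.2):
    """Return the short threshold (seconds) when current SL1 ranks in the top 20% of history."""
    if not sl1_history:
        return long
    higher = sum(1 for value in sl1_history if value > current_sl1)
    return short if higher / len(sl1_history) < top_fraction else long
```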

The video importance IL can be calculated from at least one of the quantization importance IL1, the size importance IL2, and the vector importance IL3. The size importance IL2 may be reflected only after the (average size) is expected to have stabilized, for example from 30 seconds after the start of reproduction. The audio importance SL can be calculated from at least one of the variation importance SL1 and the volume importance SL2.

The video importance IL and the audio importance SL may also be calculated by other methods. For example, when the motion is small and the quantization coefficient is large, the content is likely a finely detailed picture with little motion (for example, when music is the main content and the video serves as a background), so the audio importance may be given a slight boost. In addition, because people tend to notice differences in loudness more readily for quiet sounds, the integrand x(i) of Expression 7 used in calculating the audio importance SL may be replaced with 10 × (x(i))^(1/2). This brings the synchronization operation closer to what a human viewer actually finds noticeable.

The playback device according to the present embodiment can be applied as a playback device for digital video encoded with data compression, such as H.264 or MPEG-2.
An example of a computer commonly applied to have a computer perform the operations of the video and audio reproduction method according to the above embodiment is described here. FIG. 6 is a block diagram showing an example of the hardware configuration of a standard computer. As shown in FIG. 6, in the computer 300, a Central Processing Unit (CPU) 302, a memory 304, an input device 306, an output device 308, an external storage device 312, a medium driving device 314, a network connection device 318, and the like are connected via a bus 310.

ＣＰＵ３０２は、コンピュータ３００全体の動作を制御する演算処理装置である。メモリ３０４は、コンピュータ３００の動作を制御するプログラムを予め記憶したり、プログラムを実行する際に必要に応じて作業領域として使用したりするための記憶部である。メモリ３０４は、例えばＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ（ＲＡＭ）、ＲｅａｄＯｎｌｙＭｅｍｏｒｙ（ＲＯＭ）等である。入力装置３０６は、コンピュータの使用者により操作されると、その操作内容に対応付けられている使用者からの各種情報の入力を取得し、取得した入力情報をＣＰＵ３０２に送付する装置であり、例えばキーボード装置、マウス装置などである。出力装置３０８は、コンピュータ３００による処理結果を出力する装置であり、表示装置などが含まれる。例えば表示装置は、ＣＰＵ３０２により送付される表示データに応じてテキストや画像を表示する。 The CPU 302 is an arithmetic processing unit that controls the operation of the entire computer 300. The memory 304 is a storage unit for storing in advance a program for controlling the operation of the computer 300 or using it as a work area when necessary when executing the program. The memory 304 is, for example, a random access memory (RAM), a read only memory (ROM), or the like. The input device 306 is a device that, when operated by a computer user, acquires various information input from the user associated with the operation content and sends the acquired input information to the CPU 302. Keyboard device, mouse device, etc. The output device 308 is a device that outputs a processing result by the computer 300, and includes a display device and the like. For example, the display device displays text and images according to display data sent by the CPU 302.

The external storage device 312 is a storage device such as a hard disk and stores the various control programs executed by the CPU 302, acquired data, and the like. The medium driving device 314 is a device for writing to and reading from the portable recording medium 316. The CPU 302 can also perform various control processes by reading, via the medium driving device 314, a predetermined control program recorded on the portable recording medium 316 and executing it. The CPU 302 can also read and reproduce video content recorded on the portable recording medium 316. The portable recording medium 316 is, for example, a Compact Disc (CD)-ROM, a Digital Versatile Disc (DVD), or a Universal Serial Bus (USB) memory.

ネットワーク接続装置３１８は、有線または無線により外部との間で行われる各種データの授受の管理を行うインタフェース装置である。ＣＰＵ３０２は、ネットワーク接続装置３１８を介して外部の動画コンテンツを取得し、再生させるようにすることもできる。バス３１０は、上記各装置等を互いに接続し、データのやり取りを行う通信経路である。 The network connection device 318 is an interface device that manages transmission / reception of various data performed between the outside by wired or wireless. The CPU 302 can also acquire and reproduce external video content via the network connection device 318. A bus 310 is a communication path for connecting the above devices and the like to exchange data.

上記実施の形態による映像音声再生方法をコンピュータに実行させるプログラムは、例えば外部記憶装置３１２に記憶させる。ＣＰＵ３０２は、外部記憶装置３１２からプログラムを読み出し、コンピュータ３００に映像音声再生の動作を行なわせる。このとき、まず、映像音声再生の処理をＣＰＵ３０２に行わせるための制御プログラムを作成して外部記憶装置３１２に記憶させておく。そして、入力装置３０６から所定の指示をＣＰＵ３０２に与えて、この制御プログラムを外部記憶装置３１２から読み出させて実行させるようにする。また、このプログラムは、可搬記録媒体３１６に記憶するようにしてもよい。
ＣＰＵ３０２は、可搬記録媒体３１６に記録された動画コンテンツを読み出して、再生させるようにすることもできる。 A program that causes a computer to execute the video / audio reproduction method according to the above-described embodiment is stored in, for example, the external storage device 312. The CPU 302 reads the program from the external storage device 312 and causes the computer 300 to perform an audio / video reproduction operation. At this time, first, a control program for causing the CPU 302 to perform video / audio reproduction processing is created and stored in the external storage device 312. Then, a predetermined instruction is given from the input device 306 to the CPU 302 so that the control program is read from the external storage device 312 and executed. The program may be stored in the portable recording medium 316.
The CPU 302 can read out and reproduce the moving image content recorded on the portable recording medium 316.

Regarding the above embodiments, the following appendices are further disclosed.
(Appendix 1)
A playback device that acquires a stream including encoded video and audio and plays back the video and the audio, the playback device comprising:
a signal acquisition unit that acquires the stream;
a decoding playback unit that generates video data and audio data by separating and decoding the acquired stream, and plays back a first video and a first audio in a first deviation determination time;
a deviation determination unit that determines a deviation amount during playback between the first video and the first audio in the first deviation determination time;
a video importance calculation unit that calculates a video importance indicating a degree of complexity of a second video to be played back in a second deviation determination time following the first deviation determination time;
an audio importance calculation unit that calculates an audio importance indicating a volume characteristic of a second audio to be played back in the second deviation determination time following the first deviation determination time; and
an automatic correction unit that controls playback of the second video and the second audio, based on the video importance, the audio importance, and the deviation amount, so as to correct the deviation during playback between the video and the audio.
(Appendix 2)
The playback device according to appendix 1, wherein the video importance is calculated based on at least one of a quantization coefficient added when the video is encoded, a data amount, and a total motion vector length.
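As a rough illustration of the three inputs named in appendix 2, the following Python sketch gathers them for the frames of one deviation determination time. The `FrameInfo` record, its field names, and the function name are hypothetical. The appendix does not state how the quantities are combined into a single importance value, so the sketch only returns them side by side.

```python
import math
from dataclasses import dataclass


@dataclass
class FrameInfo:
    """Hypothetical per-frame metadata read from the encoded stream."""
    quantization_coefficient: float
    data_bytes: int
    motion_vectors: list  # list of (dx, dy) tuples, one per block


def video_importance_inputs(frames):
    """Collect the three quantities named in appendix 2 for one interval.

    How they map to one importance score is not specified in the appendix,
    so no combination is attempted here.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    mean_q = sum(f.quantization_coefficient for f in frames) / len(frames)
    total_bytes = sum(f.data_bytes for f in frames)
    # Total motion vector length: sum of Euclidean lengths over all blocks.
    total_mv = sum(math.hypot(dx, dy) for f in frames for dx, dy in f.motion_vectors)
    return {
        "mean_quantization_coefficient": mean_q,
        "data_amount": float(total_bytes),
        "total_motion_vector_length": total_mv,
    }
```

A caller could use any one of the returned values, or several, since the appendix only requires "at least one".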
(Appendix 3)
The playback device according to appendix 1 or appendix 2, wherein the audio importance is calculated based on at least one of an average value of the temporal change in volume for each predetermined frequency range of the audio data and an average value of the volume for each predetermined frequency range of the audio data.
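The two audio quantities in appendix 3 can be sketched as follows, assuming the volume has already been measured per frequency range at regular time steps. The input layout `band_volumes[t][b]`, the function name, and the use of the absolute difference between consecutive steps as the "temporal change" are assumptions made for illustration.

```python
def audio_importance_inputs(band_volumes):
    """Return (mean_change, mean_volume), each a list with one value per band.

    band_volumes[t][b] is the volume of frequency band b at time step t
    (hypothetical layout; the appendix does not define the representation).
    """
    n_steps = len(band_volumes)
    if n_steps < 2:
        raise ValueError("at least two time steps are required")
    n_bands = len(band_volumes[0])
    # Average volume of each band over the interval.
    mean_volume = [
        sum(row[b] for row in band_volumes) / n_steps for b in range(n_bands)
    ]
    # Average absolute step-to-step change of each band over the interval.
    mean_change = [
        sum(abs(band_volumes[t + 1][b] - band_volumes[t][b])
            for t in range(n_steps - 1)) / (n_steps - 1)
        for b in range(n_bands)
    ]
    return mean_change, mean_volume
```

Under these assumptions, a band whose volume stays constant has a mean change of zero regardless of how loud it is, which is why the appendix lists the two quantities separately.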
(Appendix 4)
The playback device according to any one of appendix 1 to appendix 3, wherein the automatic correction unit further corrects the deviation during playback between the second video and the second audio based on whether or not the second video before decoding contains a non-reference picture.
(Appendix 5)
The playback device according to appendix 4, wherein the automatic correction unit includes:
an audio operation unit that operates on the audio data; and
a video operation unit that operates on the video data,
and determines which of the audio operation unit and the video operation unit is given priority when correcting the deviation during playback, based on whether or not the video before decoding contains a non-reference picture and on the magnitude relationship between the video importance and the audio importance.
(Appendix 6)
The playback device according to appendix 5, wherein the automatic correction unit
gives priority to operation by the audio operation unit when the video importance is greater than the audio importance, and
corrects the deviation during playback between the video and the audio by performing an operation with the video operation unit when the audio importance is greater than the video importance.
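The priority rule of appendix 6 amounts to manipulating whichever stream is rated less important. A minimal Python sketch follows. The names `Operation` and `choose_operation` are hypothetical, and the `tie_break` argument is an assumption, because the appendix does not say what happens when the two importance values are equal.

```python
from enum import Enum


class Operation(Enum):
    AUDIO = "audio operation unit"
    VIDEO = "video operation unit"


def choose_operation(video_importance, audio_importance, tie_break=Operation.AUDIO):
    """Pick the unit that should correct the deviation (appendix 6)."""
    if video_importance > audio_importance:
        # Video matters more, so adjust the audio side first.
        return Operation.AUDIO
    if audio_importance > video_importance:
        # Audio matters more, so adjust the video side.
        return Operation.VIDEO
    # Equal importance is not covered by the appendix; assumed fallback.
    return tie_break
```

The sketch leaves out the non-reference-picture condition of appendix 5, since the appendices do not spell out how it interacts with the comparison.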
(Appendix 7)
The playback device according to appendix 5 or appendix 6, wherein the audio operation unit:
in a case where the section corresponding to the second deviation determination time contains a silence-equivalent section in which the audio importance is equal to or less than a predetermined value,
repeatedly plays back the silence-equivalent section when playback of the first audio is ahead of playback of the first video, and
performs at least one of an operation of deleting the silence-equivalent section according to the deviation amount, an operation of changing the playback speed of the audio, and an operation of having the video operation unit perform its operation, when playback of the first video is ahead of playback of the first audio; and
in a case where the section for which the audio importance is calculated contains no silence-equivalent section in which the audio importance is equal to or less than the predetermined value,
changes the playback speed of the audio when the deviation amount is less than a predetermined ratio of the second deviation determination time, and
has the video operation unit perform its operation when the deviation amount is equal to or greater than the predetermined ratio of the second deviation determination time.
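The branching in appendix 7 can be read as a small decision tree. The Python sketch below is one possible reading under stated assumptions: a positive `deviation_s` means the audio is ahead of the video, `ratio_threshold` stands in for the unspecified "predetermined ratio", and when the video is ahead and a silence-equivalent section exists the sketch simply picks deletion, although the appendix allows any of three operations. All names are hypothetical.

```python
from enum import Enum, auto


class AudioAction(Enum):
    NONE = auto()
    REPEAT_SILENCE = auto()
    DELETE_SILENCE = auto()
    CHANGE_SPEED = auto()
    USE_VIDEO_OPERATION = auto()


def choose_audio_action(deviation_s, interval_s, has_silence, ratio_threshold):
    """Select an audio-side correction following the cases in appendix 7.

    deviation_s > 0: audio ahead of video (assumed sign convention).
    interval_s: length of the second deviation determination time.
    """
    if deviation_s == 0:
        return AudioAction.NONE
    if has_silence:
        if deviation_s > 0:
            # Audio is ahead: stretch the audio by repeating silence.
            return AudioAction.REPEAT_SILENCE
        # Video is ahead: shorten the audio (one of three allowed options).
        return AudioAction.DELETE_SILENCE
    if abs(deviation_s) < ratio_threshold * interval_s:
        # Small deviation: a speed change is enough.
        return AudioAction.CHANGE_SPEED
    # Large deviation without silence: hand over to the video operation unit.
    return AudioAction.USE_VIDEO_OPERATION
```

For example, with a 1-second interval and a threshold of 0.1, a 50 ms deviation without silence leads to a speed change, while a 200 ms deviation is handed to the video operation unit.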
(Appendix 8)
The playback device according to appendix 5 or appendix 6, wherein the video operation unit
performs at least processing of not playing back a non-reference picture when playback of the first audio is ahead of playback of the first video, and
performs processing of interpolating a frame whose pixel values are the average of those of the preceding and following frames when playback of the first video is ahead of playback of the first audio.
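The two video operations in appendix 8 can be sketched in Python as follows. Frames are modelled as nested lists of integer pixel values, which is an assumption for illustration, as are the function names, the `max_drop` limit, and the use of integer floor division for the average. Non-reference pictures are the natural candidates for skipping because no other picture depends on them for decoding.

```python
def drop_non_reference(frames, is_reference, max_drop):
    """Skip up to max_drop non-reference pictures (audio ahead of video)."""
    kept = []
    dropped = 0
    for frame, ref in zip(frames, is_reference):
        if not ref and dropped < max_drop:
            dropped += 1  # this picture is not played back
            continue
        kept.append(frame)
    return kept


def average_frame(prev_frame, next_frame):
    """Build a frame whose pixels are the average of its two neighbours."""
    return [
        [(a + b) // 2 for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]


def insert_interpolated(frames, index):
    """Insert an averaged frame between frames[index] and frames[index + 1]
    (video ahead of audio)."""
    middle = average_frame(frames[index], frames[index + 1])
    return frames[: index + 1] + [middle] + frames[index + 1:]
```

Dropping a picture shortens the video timeline by one frame period, and inserting an averaged frame lengthens it by one, so the number of frames to drop or insert would follow from the deviation amount and the frame rate.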
(Appendix 9)
A playback method for acquiring a stream including encoded video and audio and playing back the video and the audio, the playback method comprising:
acquiring the stream;
playing back a first video and a first audio based on the stream during a first deviation determination time;
determining a deviation amount during playback between the first video and the first audio in the first deviation determination time;
calculating a video importance indicating a degree of complexity of a second video to be played back in a second deviation determination time following the first deviation determination time;
calculating an audio importance indicating a volume characteristic of a second audio to be played back in the second deviation determination time; and
controlling playback of the second video and the second audio in the second deviation determination time, based on the video importance, the audio importance, and the deviation amount, so as to correct the deviation during playback between the video and the audio.
(Appendix 10)
The playback method according to appendix 9, wherein playing back the first video and the first audio, determining the deviation amount, calculating the video importance, calculating the audio importance, and controlling the playback are performed repeatedly.
(Appendix 11)
A program for causing a computer to execute processing of acquiring a stream including encoded video and audio and playing back the video and the audio, the processing comprising:
acquiring the stream;
playing back a first video and a first audio based on the stream during a first deviation determination time;
determining a deviation amount during playback between the first video and the first audio in the first deviation determination time;
calculating a video importance indicating a degree of complexity of a second video to be played back in a second deviation determination time following the first deviation determination time;
calculating an audio importance indicating a volume characteristic of a second audio to be played back in the second deviation determination time; and
controlling playback of the second video and the second audio in the second deviation determination time, based on the video importance, the audio importance, and the deviation amount, so as to correct the deviation during playback between the video and the audio.
(Appendix 12)
The program according to appendix 11, causing the computer to repeatedly execute the processing of playing back the first video and the first audio, the processing of determining the deviation amount, the processing of calculating the video importance, the processing of calculating the audio importance, and the processing of controlling the playback.

DESCRIPTION OF SYMBOLS
1 Video/audio playback device
3 Main control unit
5 Input playback unit
7 Deviation determination unit
9 Audio importance determination unit
11 Video importance determination unit
13 Comparison evaluation unit
15 Automatic correction unit
17 System bus
19 Signal acquisition unit
21 Separation unit
23 Audio decoding unit
25 Audio playback unit
27 Video decoding unit
29 Video playback unit
31 Audio operation unit
33 Video operation unit

Claims

1. A playback device that acquires a stream including encoded video and audio and plays back the video and the audio, the playback device comprising:
a signal acquisition unit that acquires the stream;
a decoding playback unit that generates video data and audio data by separating and decoding the acquired stream, and plays back a first video and a first audio in a first deviation determination time;
a deviation determination unit that determines a deviation amount during playback between the first video and the first audio in the first deviation determination time;
a video importance calculation unit that calculates a video importance indicating a degree of complexity of a second video to be played back in a second deviation determination time following the first deviation determination time;
an audio importance calculation unit that calculates an audio importance indicating a volume characteristic of a second audio to be played back in the second deviation determination time following the first deviation determination time; and
an automatic correction unit that controls playback of the second video and the second audio, based on the video importance, the audio importance, and the deviation amount, so as to correct the deviation during playback between the video and the audio.

2. The playback device according to claim 1, wherein the video importance is calculated based on at least one of a quantization coefficient added when the video is encoded, a data amount, and a total motion vector length.

3. The playback device according to claim 1 or 2, wherein the audio importance is calculated based on at least one of an average value of the temporal change in volume for each predetermined frequency range of the audio data and an average value of the volume for each predetermined frequency range of the audio data.

4. The playback device according to any one of claims 1 to 3, wherein the automatic correction unit further corrects the deviation during playback between the second video and the second audio based on whether or not the second video before decoding contains a non-reference picture.

5. The playback device according to claim 4, wherein the automatic correction unit includes:
an audio operation unit that operates on the audio data; and
a video operation unit that operates on the video data,
and determines which of the audio operation unit and the video operation unit is given priority when correcting the deviation during playback, based on whether or not the video before decoding contains a non-reference picture and on the magnitude relationship between the video importance and the audio importance.

6. The playback device according to claim 5, wherein the automatic correction unit
gives priority to operation by the audio operation unit when the video importance is greater than the audio importance, and
corrects the deviation during playback between the video and the audio by performing an operation with the video operation unit when the audio importance is greater than the video importance.

7. A playback method for acquiring a stream including encoded video and audio and playing back the video and the audio, the playback method comprising:
acquiring the stream;
playing back a first video and a first audio based on the stream during a first deviation determination time;
determining a deviation amount during playback between the first video and the first audio in the first deviation determination time;
calculating a video importance indicating a degree of complexity of a second video to be played back in a second deviation determination time following the first deviation determination time;
calculating an audio importance indicating a volume characteristic of a second audio to be played back in the second deviation determination time; and
controlling playback of the second video and the second audio in the second deviation determination time, based on the video importance, the audio importance, and the deviation amount, so as to correct the deviation during playback between the video and the audio.

8. A program for causing a computer to execute processing of acquiring a stream including encoded video and audio and playing back the video and the audio, the processing comprising:
acquiring the stream;
playing back a first video and a first audio based on the stream during a first deviation determination time;
determining a deviation amount during playback between the first video and the first audio in the first deviation determination time;
calculating a video importance indicating a degree of complexity of a second video to be played back in a second deviation determination time following the first deviation determination time;
calculating an audio importance indicating a volume characteristic of a second audio to be played back in the second deviation determination time; and
controlling playback of the second video and the second audio in the second deviation determination time, based on the video importance, the audio importance, and the deviation amount, so as to correct the deviation during playback between the video and the audio.