JP5099697B2

JP5099697B2 - Audio / video output method, audio / video output method realization program, and audio / video output device

Info

Publication number: JP5099697B2
Application number: JP2008089079A
Authority: JP
Inventors: 修二田坂; 光吉見; 敏郎布目
Original assignee: 国立大学法人名古屋工業大学
Priority date: 2008-03-31
Filing date: 2008-03-31
Publication date: 2012-12-19
Anticipated expiration: 2028-03-31
Also published as: JP2009246584A

Description

The present invention relates to an audio/video output method, a program for realizing the audio/video output method, and an audio/video output device, for use in a terminal that receives packet-transmitted audio/video streams.

As access lines have become faster, applications that handle audio and video over IP (Internet Protocol) networks have become widespread. However, IP networks are fundamentally best-effort, so packet loss and transmission delay occur.

Packet loss and transmission delay disturb the temporal structure of audio and the temporal and spatial structure of video, and thereby degrade the quality of service (QoS) of audio/video transmission. When the QoS of audio/video transmission degrades, the quality of experience (QoE; see Non-Patent Document 1) perceived by the user also falls.

Therefore, to suppress the drop in QoE caused by packet loss and transmission delay, video output control often implements an error compensation method, a frame skip method, or both (see Non-Patent Document 2).

Video error compensation deals with information that is lost in the network or discarded because of loss of synchronization by interpolating the lost information from other information. Because video frames can be output with the missing parts interpolated, the temporal quality of the video improves. However, interpolation cannot always restore the original image information, so error compensation may degrade spatial quality. In addition, when a frame with degraded spatial quality is used as the reference frame for subsequent motion prediction, the degradation propagates in units of a GOP (Group of Pictures).

In the frame skip method, on the other hand, when a video frame is corrupted by packet loss or the like, output of that frame is suspended, and the suspension continues until the next intact I frame appears. That is, video frames whose spatial structure has been disturbed by loss are not output, and only frames with high spatial quality are output. However, the resulting freeze of the stream makes the user perceive a drop in the temporal quality of the video.

As the above shows, error compensation preserves the temporal quality of video, while frame skipping preserves its spatial quality.

Conventionally, in packet transmission of audio and video, the degradation of video spatial quality caused by error compensation and the degradation of temporal quality caused by frame skipping have been treated as separate problems. Most studies have considered only one of video spatial quality and video temporal quality, and audio quality has not been taken into account.

Meanwhile, in the field of broadcasting, several inventions concern methods of selectively using error compensation and frame skipping. However, they do not consider the degradation of temporal quality caused by late packet arrival, which is specific to packet transmission.

Patent Document 1 discloses a recording/reproducing apparatus for digital broadcasting. The apparatus suppresses degradation of the image quality of reproduced video by deleting the video-decoding data units or audio-decoding data units in portions where the number of decoding error signals in the transport stream data is at or above a predetermined threshold. Here the video-decoding data unit is the GOP. However, the apparatus assumes recording followed by reproduction and does not cover streaming, in which content is played while it is being received. Consequently, temporal quality factors such as the effect of late packet arrival on the reproduced video are not considered. Nor is the interaction between audio and video considered.

Patent Document 2 discloses a digital broadcast receiving apparatus that includes video data processing means that performs processing in a plurality of modes corresponding to the degree of degradation of the received video. In this scheme, video is output after thinning out inter-frame predictive coded pictures of the MPEG compression scheme according to the mode of image degradation. The mode of image degradation is detected from several kinds of parameters, and the thresholds that determine the processing mode vary with the genre. The degree of video degradation in the displayed output, as entered by the operator, can also be set as a threshold. For audio, the volume is reduced according to the degree of degradation. However, it is not self-evident that simply thinning out inter-frame predictive coded pictures according to video degradation actually improves the user's quality of experience. Moreover, although the apparatus has a mechanism for reflecting the user's quality of experience in the threshold setting, it requires information input from the user, which is burdensome for the user. Furthermore, this patent document also does not consider temporal quality degradation caused by late arrival of information.

Patent Document 1: JP 2007-124445 A
Patent Document 2: JP 2004-112654 A
Non-Patent Document 1: ITU-T Rec. G.100/P.10 Amendment 1, "Amendment 1: new appendix I definition of Quality of Experience (QoE)," Jan. 2007.
Non-Patent Document 2: N. Feamster and H. Balakrishnan, "Packet loss recovery for streaming video," Proc. IEEE 12th International Packetvideo Workshop (PV2002), Apr. 2002.
Non-Patent Document 3: ITU-T Rec. P.800.1, "Mean Opinion Score (MOS) terminology," Mar. 2003.
Non-Patent Document 4: Yoshihisa Tanaka, "心理学的測定法 第二版" (Psychological Measurement Methods, 2nd ed.), University of Tokyo Press, 1997.
Non-Patent Document 5: J. P. Guilford, Psychometric methods, McGraw-Hill, N. Y., 1954.
Non-Patent Document 6: Mindcraft Inc, "WebStone benchmark information," http://www.mindcraft.com/webstone/.
Non-Patent Document 7: "H.264/MPEG-4 AVC reference software JM11.0," http://iphome.hhi.de/suehring/tml/index.htm
Non-Patent Document 8: The Video Quality Experts Group, "The video quality experts group web site," http://www.its.bldrdoc.gov/vqeg/.
Non-Patent Document 9: ITU-T Rec. P.911, "Subjective audiovisual quality assessment methods for multimedia applications," Dec. 1998.
Non-Patent Document 10: S. Tasaka and Y. Ito, "Psychometric analysis of the mutually compensatory property of multimedia QoS," Conf. Rec. IEEE ICC 2003, pp. 1880-1886, May 2003.
Non-Patent Document 11: F. Mosteller, "Remarks on the method of paired comparisons: III. a test of significance for paired comparisons when equal standard deviations and equal correlations are assumed," Psychometrika, vol. 16, no. 2, pp. 207-218, 1951.

In previous research on packet transmission of audio and video, the spatial quality degradation caused by error compensation and the temporal quality degradation caused by frame skipping were treated as separate problems. Most studies considered only one of video spatial quality and video temporal quality, and audio quality was not taken into account. Consequently, audio/video output that uses only one of the conventional error compensation and frame skip methods does not necessarily give the best quality of experience.

The present invention therefore exploits the trade-off between the temporal quality and the spatial quality of video so that audio and video can be output in a way that maximizes the quality of experience (QoE) of the user viewing packet-transmitted audio/video.

The audio/video output method of the first invention, for receiving and outputting packet-transmitted audio and video, comprises a buffer unit that absorbs packet arrival delay jitter, a decoding unit that decodes audio and video, a determination unit that determines whether a decoded video frame may be output, and an output unit that outputs the decoded audio and video. In the determination unit, the method switches between displaying an error-compensated frame and skipping that frame according to the buffering time in the buffer unit and the error compensation rate of the video frame decoded by the decoding unit. The method is characterized in that an estimation unit is provided that estimates the user's quality of experience in real time from the audio temporal quality information, video temporal quality information, and video spatial quality information obtained in the output unit and the video decoding unit during audio/video output, and in that the threshold at which the determination unit switches is set using the estimate of the estimation unit.

The program for realizing the audio/video output method of the second invention, for receiving and outputting packet-transmitted audio and video, comprises a buffer unit that absorbs packet arrival delay jitter, a decoding unit that decodes audio and video, a determination unit that determines whether a decoded video frame may be output, and an output unit that outputs the decoded audio and video. In the determination unit, the program switches between displaying an error-compensated frame and skipping that frame according to the buffering time in the buffer unit and the error compensation rate of the video frame decoded by the decoding unit. The program is characterized in that an estimation unit is provided that estimates the user's quality of experience in real time from the audio temporal quality information, video temporal quality information, and video spatial quality information obtained in the output unit and the video decoding unit during audio/video output, and in that the threshold at which the determination unit switches is set using the estimate of the estimation unit.

The audio/video output device of the third invention, for receiving and outputting packet-transmitted audio and video, comprises a buffer unit that absorbs packet arrival delay jitter, a decoding unit that decodes audio and video, a determination unit that determines whether a decoded video frame may be output, and an output unit that outputs the decoded audio and video. In the determination unit, the device switches between displaying an error-compensated frame and skipping that frame according to the buffering time in the buffer unit and the error compensation rate of the video frame decoded by the decoding unit. The device is characterized in that an estimation unit is provided that estimates the user's quality of experience in real time from the audio temporal quality information, video temporal quality information, and video spatial quality information obtained in the output unit and the video decoding unit during audio/video output, and in that the threshold at which the determination unit switches is set using the estimate of the estimation unit.

The present invention relates to an output method provided in the receiving-side device in packet transmission of audio and video. It requires no special function in the transmitting-side device, so existing transmitting-side devices can be used as they are.

As shown in the block diagram of FIG. 1, the present invention comprises a buffer unit that absorbs packet arrival delay jitter, a decoding unit that decodes audio and video, a determination unit that determines whether a decoded video frame may be output, and an output unit that outputs the decoded audio and video. According to the buffering time in the buffer unit and the degree of error compensation of the video frame decoded by the decoding unit, the determination unit switches between displaying the error-compensated frame and skipping it. This makes it possible to achieve a high quality of experience by exploiting the trade-off between the temporal quality and the spatial quality of video.

In the method of the present invention, received audio and video packets are buffered in the buffer unit, which absorbs the delay jitter introduced during transmission over the network. An arbitrary upper limit is set on the buffering time, that is, the time for which packets are buffered. A packet that arrives too late is therefore judged to carry no useful information and is discarded. The buffering time thus affects both temporal quality and spatial quality.
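The buffering rule described above can be illustrated with a minimal Python sketch. It assumes that each packet carries a sender timestamp, that sender and receiver clocks are comparable, and that a packet is useful only if it arrives by its scheduled output time (timestamp plus buffering time). The `Packet` class, its fields, and `split_by_deadline` are hypothetical names introduced for illustration; they do not appear in the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Packet:
    seq: int             # sequence number (hypothetical field)
    send_time: float     # sender timestamp, in seconds
    arrival_time: float  # arrival time at the receiver, in seconds


def split_by_deadline(packets: List[Packet],
                      buffering_time: float) -> Tuple[List[Packet], List[Packet]]:
    """Separate packets into those kept for output and those discarded.

    Assumption: a packet's scheduled output time is send_time + buffering_time,
    and a packet arriving after that time is treated as useless.
    """
    kept, discarded = [], []
    for p in packets:
        deadline = p.send_time + buffering_time
        if p.arrival_time <= deadline:
            kept.append(p)
        else:
            discarded.append(p)
    return kept, discarded
```

With a buffering time of 1 second, a packet delayed by 1.5 seconds would be discarded under this rule, while packets delayed by less than 1 second would be kept. A longer buffering time keeps more packets (better spatial quality) at the cost of a longer wait before output.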

For this reason, to improve the user's quality of experience it is important to perform appropriate control that takes into account both the improvement in temporal and spatial quality due to buffering and the trade-off between the spatial quality improvement due to error compensation and its effect on temporal quality.

The method of the present invention is also characterized in that it achieves a high quality of experience by exploiting the interaction between modalities in audio/video transmission.

In audio/video transmission, two different media, audio and video, are transmitted. Audio is perceived through hearing and video through vision. These different senses (modalities) interact with each other. For example, when the video output quality is poor, good audio quality can compensate for the missing video information.

The present invention exploits this interaction between modalities. To this end, as the measure of QoE, it uses the distance scale obtained by the method of successive categories (see Non-Patent Documents 4 and 5), which is a psychometric method, instead of the MOS (Mean Opinion Score; see Non-Patent Document 3) used in many ITU-T standards. With MOS, only the order of the scale values is guaranteed; equal spacing between the values is not. MOS is suited to evaluating the quality of single-medium transmission. In multimedia transmission, however, the interaction between modalities makes it likely that the intervals between scale values are not constant for the user. MOS therefore cannot be said to be suitable for evaluating the quality of multimedia transmission.

Using the distance scale obtained by the method of successive categories makes it possible to evaluate and estimate QoE appropriately, including the interaction between modalities. Performing control in view of the QoE evaluated in this way makes a high QoE achievable.

Furthermore, as shown in the block diagram of FIG. 2, the method of the present invention is characterized in that an estimation unit is provided that estimates QoE in real time from the audio temporal quality information, video temporal quality information, and video spatial quality information obtained in the output unit and the video decoding unit during audio/video output, and in that the threshold with which the determination unit decides the switching, based on the trade-off between the temporal quality and the spatial quality of video, is set using this estimate, thereby achieving a high QoE.

Using the distance scale as the measure of QoE makes it possible to estimate QoE with high accuracy from the temporal quality information and spatial quality information obtained in the output unit, that is, from application-level QoS. QoE is then estimated for several output modes, and the output mode with the highest estimate is adopted, which yields a high QoE.
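The selection step described above (estimate QoE for several output modes, adopt the best one) can be sketched as follows. This is only an illustration: the function `pick_best_mode`, the dictionary keys, and in particular `toy_estimator` with its equal weights are hypothetical placeholders, not the estimation model of the patent. Any estimator that maps application-level QoS measurements to a QoE value could be passed in.

```python
from typing import Callable, Dict

# Application-level QoS measurements for one output mode (hypothetical keys).
QoS = Dict[str, float]


def pick_best_mode(qos_by_mode: Dict[str, QoS],
                   estimate_qoe: Callable[[QoS], float]) -> str:
    """Return the output mode whose estimated QoE is highest."""
    return max(qos_by_mode, key=lambda mode: estimate_qoe(qos_by_mode[mode]))


def toy_estimator(qos: QoS) -> float:
    """Placeholder estimator: NOT the patent's model; the weights are made up."""
    return (1.0 * qos["audio_temporal"]
            + 1.0 * qos["video_temporal"]
            + 1.0 * qos["video_spatial"])
```

Passing the estimator as a parameter keeps the selection logic independent of how QoE is actually estimated, so the placeholder can be swapped for a model fitted to subjective assessment data.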

By dynamically deciding, on the basis of the QoE estimated in real time, whether to give priority to temporal quality or to spatial quality, control that is closer to the user's perception becomes possible, which leads to improved QoE.

With the configuration of the present invention, a high quality of experience can be provided to users who view packet-transmitted audio/video.

Embodiments 1 and 2, which embody the present invention, are described below with reference to the drawings.

As Embodiment 1 of the present invention, a method is conceivable that switches between error compensation and frame skipping according to the proportion of error-compensated slices in one video frame, and thereby controls the spatial quality and temporal quality of the video. Its operation is described along the flowchart of FIG. 3. First, the information of the frame scheduled for output is obtained from the buffer that stores received frames, and is decoded. Error compensation is then applied to the slices that were lost in the decoded frame. The error compensation method is not limited to a particular one; any method may be used as long as the proportion of error-compensated slices can be calculated. This proportion is called the error compensation rate (unit: %) and is defined by the following equation.

Error compensation rate = (number of error-compensated slices / total number of slices in the frame) × 100 [%]

If the computed error compensation rate is greater than a preset threshold, output of the frame is cancelled and the frame becomes a non-output frame. Even when the error compensation rate is at or below the threshold, output is likewise cancelled if the motion-prediction reference frame of the target frame is a non-output frame. The frame is output only when the error compensation rate is at or below the threshold and the motion-prediction reference frame of the target frame is an output frame. This procedure is carried out for every frame, thereby switching between the error compensation method and the frame skip method.
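The per-frame decision rule can be sketched in Python as below. The sketch assumes, for simplicity, that every P frame uses the immediately preceding frame as its motion-prediction reference and that I frames have no reference; the function names and the tuple layout of `frames` are hypothetical and chosen only for this illustration.

```python
from typing import List, Tuple

# (frame type "I" or "P", number of error-compensated slices, total slices)
Frame = Tuple[str, int, int]


def error_compensation_rate(compensated_slices: int, total_slices: int) -> float:
    """Error compensation rate in percent, as defined in the text."""
    return 100.0 * compensated_slices / total_slices


def decide_outputs(frames: List[Frame], threshold: float) -> List[bool]:
    """Return True for frames that are output, False for non-output frames.

    Assumption: a P frame references the immediately preceding frame;
    an I frame has no reference frame.
    """
    decisions = []
    previous_was_output = False  # no frame has been output yet
    for frame_type, compensated, total in frames:
        rate = error_compensation_rate(compensated, total)
        reference_ok = True if frame_type == "I" else previous_was_output
        output = rate <= threshold and reference_ok
        decisions.append(output)
        previous_was_output = output
    return decisions
```

For a sequence I, P, P, I, P with 15 slices per frame and 0, 3, 0, 6, 0 compensated slices (rates 0%, 20%, 0%, 40%, 0%), a threshold of 100% outputs every frame, whereas a threshold of 0% outputs only the first I frame: the second frame exceeds the threshold, the third loses its reference, the second I frame exceeds the threshold, and the last P frame again has no output reference.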

FIG. 4 shows examples of the operation when the threshold is 100%, 40%, 20%, and 0%. In the figure, parallelograms drawn with solid lines represent output frames, and parallelograms drawn with faint solid lines represent non-output frames. Error-compensated slices are shown as small gray parallelograms. The value in parentheses is the error compensation rate of the frame. Note that a threshold of 100% corresponds to pure error compensation and a threshold of 0% to pure frame skipping.

The experiment conducted to evaluate the effectiveness of this method is described next. In the experiment, QoE is assessed while the threshold of the switching method described above is varied, and the effect of the threshold on QoE is examined from the assessment results. The experimental system and the QoE assessment method are described below.

FIG. 5 shows the network configuration used in the experiment. It consists of four PCs and two routers. The four PCs function as an audio/video transmitting terminal, an audio/video receiving terminal, a Web server, and a Web client, respectively. All links are 100BASE-TX. The routers are RS3000 units made by Alcatel Lucent (formerly Riverstone Networks).

Audio and video are transmitted from the audio/video transmitting terminal to the audio/video receiving terminal as separate transport streams. RTP/UDP is used as the transport protocol. The receiving terminal performs playout buffering control with a buffering time of 1 second to absorb the jitter in transmission delay.

Table 1 shows the audio and video specifications. An MU (Media Unit) in the table is the processing unit for media synchronization; it corresponds to 20 milliseconds of sampled data for audio and to one frame for video. Each audio MU is packed into a single UDP datagram. Each video MU is divided by slice into 15 UDP datagrams for transmission. Three picture patterns are considered, each consisting of an I frame followed by n-1 P frames (n = 1, 5, 15).
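The packetization figures above can be turned into a small worked example. The constants come from the text (20 ms per audio MU, 15 datagrams per video MU, picture patterns with n = 1, 5, 15). The video frame rate is not stated in this section, so `video_fps` is left as a parameter; the function names are hypothetical.

```python
AUDIO_MU_MS = 20             # one audio MU holds 20 ms of sampled data (from the text)
VIDEO_DATAGRAMS_PER_MU = 15  # one video MU (frame) is sent as 15 UDP datagrams


def picture_pattern(n: int) -> str:
    """Picture pattern: one I frame followed by n-1 P frames."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return "I" + "P" * (n - 1)


def datagrams_per_second(video_fps: float) -> float:
    """UDP datagrams per second for audio plus video.

    video_fps is an assumed input; the frame rate is given in Table 1,
    which is not reproduced here.
    """
    audio_datagrams = 1000 / AUDIO_MU_MS  # one datagram per audio MU
    return audio_datagrams + VIDEO_DATAGRAMS_PER_MU * video_fps
```

For example, `picture_pattern(5)` gives `IPPPP`, and the audio stream alone accounts for 50 datagrams per second.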

In this network, Web traffic is transmitted from the Web server to the Web client. The Web client runs WebStone (see Non-Patent Document 6), a Web server benchmarking tool. The number of clients configured in WebStone takes six values: 20, 30, 40, 50, 75, and 100.

The video error compensation method is the one implemented in JM11.0 (see Non-Patent Document 7). For I frames, lost information is interpolated from the surrounding information. For P frames, two error compensation methods are available; in this embodiment, the method that copies the lost portion from the previously output frame (Frame Copy) is selected.

Switching between the error compensation method and the frame skip method is controlled according to the procedure described above, with thresholds of 100%, 40%, 20%, and 0%.

The content was selected with reference to the VQEG test plan (see Non-Patent Document 8). As summarized in Table 2, three content types were chosen and two pieces of content were prepared for each, giving six pieces of content in total. Table 3 shows the average bit rates. For reference, the TI (Temporal Information) value proposed in ITU-T P.911 (see Non-Patent Document 9) is also shown for every piece of content. TI represents the amount of motion; the larger the value, the greater the motion. The TI values in Table 3 were calculated excluding scene changes.

The audio and video output at the media receiving terminal are recorded and used as stimuli for QoE assessment by the method of successive categories, following the method of Non-Patent Document 10. The five-grade impairment scale (DCR: Degradation Category Rating) shown in Table 4 is used as the categories of the rating scale method. The reference for the assessment is the audio/video output when no Web traffic is generated.

The subjects were 32 men in their twenties. The number of stimuli to be assessed is 432 (= 6 × 3 × 4 × 6): six pieces of content, three picture patterns, four switching thresholds, and six numbers of Web clients. The order in which the pieces of content are presented is randomized for each subject. When each piece of content is assessed, the order in which the combinations of switching threshold and number of Web clients are presented is also randomized. The playback time of each stimulus was 10 seconds, and the assessment took about 4.5 hours per subject, including breaks.
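The stimulus count can be checked by enumerating the experimental conditions. The values of the thresholds, picture-pattern lengths, and Web client counts are taken from the text; the content items are represented only by index, and the variable names are hypothetical.

```python
from itertools import product

contents = range(6)                      # six pieces of content (indices only)
picture_lengths = [1, 5, 15]             # n in "I frame followed by n-1 P frames"
thresholds = [100, 40, 20, 0]            # switching thresholds in percent
web_clients = [20, 30, 40, 50, 75, 100]  # numbers of Web clients

# Every combination of the four factors is one stimulus.
stimuli = list(product(contents, picture_lengths, thresholds, web_clients))

# Pure playback time per subject at 10 seconds per stimulus, in minutes.
playback_minutes = len(stimuli) * 10 / 60
```

This gives 432 stimuli and 72 minutes of pure playback per subject; the remainder of the roughly 4.5-hour session is presumably taken up by rating and breaks.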

Mosteller's goodness-of-fit test (see Non-Patent Document 11) was applied to the scale values obtained by the method of successive categories. At a significance level of 5%, the scale values did not fit the measured results. Stimuli were therefore removed from the 432 one at a time, in descending order of the error between the estimated and measured values. After 40 stimuli had been removed, the obtained scale values fit the measured results at a significance level of 5%. In this embodiment, these scale values are used as the QoE parameter.

Because the origin of an interval scale can be chosen arbitrarily, in this embodiment the origin was set so that the minimum value of the QoE parameter is 0. The resulting lower bounds of the categories were 4.649 (category 5), 3.532 (category 4), 2.371 (category 3), and 1.010 (category 2).
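As an illustration of how these boundaries are read, the sketch below shifts a set of scale values so that the minimum is 0 and maps a shifted value to its category. The function names are hypothetical, and treating each lower bound as inclusive is an assumption.

```python
# Lower bounds of categories 5 to 2, as reported in the text.
CATEGORY_LOWER_BOUNDS = [(5, 4.649), (4, 3.532), (3, 2.371), (2, 1.010)]


def shift_origin(values):
    """Shift scale values so that the smallest one becomes 0."""
    lowest = min(values)
    return [v - lowest for v in values]


def category_of(scale_value):
    """Return the category (1 to 5) that a shifted scale value falls into."""
    for category, lower_bound in CATEGORY_LOWER_BOUNDS:
        if scale_value >= lower_bound:  # assumption: bounds are inclusive
            return category
    return 1  # anything below the category 2 bound


if __name__ == "__main__":
    for value in (0.0, 1.5, 3.0, 4.7):
        print(value, "-> category", category_of(value))
```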

FIG. 6 shows the psychological scale value versus the number of Web clients when the content is Sports 2 and the picture pattern consists of I frames only. Evaluation values removed as a result of Mosteller's goodness-of-fit test are not plotted. The figure shows that, for every number of Web clients, the frame skip method (a threshold of 0%) gives a high QoE. When the picture pattern consisted of I frames only, a threshold of 0%, that is, using only the frame skip method, was effective in obtaining a high QoE for all contents.

FIG. 7 shows the psychological scale value versus the number of Web clients when the content is Sports 2 and the picture pattern is IPPPP. The figure shows that, when the number of Web clients is large and the network is congested, the QoE obtained with a threshold greater than 0% is higher than that obtained with a threshold of 0%. Therefore, when the picture pattern includes P frames, setting the threshold to a value greater than 0% is effective for obtaining a high QoE for video-dominant content.

FIG. 8 shows the psychological scale value versus the number of Web clients when the content is Animation 2 and the picture pattern is IPPPP, and FIG. 9 shows the psychological scale value when the picture pattern is IPPPPPPPPPPPPPP. These figures show that the more P frames the picture pattern contains, the more effective it is to set the threshold to a large value.

FIGS. 7, 8, and 9 also show that, for animation, the picture pattern must contain more P frames than for video-dominant content before a larger threshold begins to take effect. The same tendency as for animation was observed for audio-dominant content.

It was also found that, for content with intense video motion and for content whose picture pattern contains many P frames, the highest QoE is obtained with a threshold of 100%, that is, with the error compensation method only.

These results show that a trade-off exists between video temporal quality and video spatial quality, and that this trade-off depends on the type of content, the degree of video motion, and the picture pattern. The output method of the present invention was therefore found to be effective.

The method of Embodiment 2 can be derived from the QoE assessment results of Embodiment 1. In this method, the error compensation method and the frame skip method are switched according to the user experience quality estimated during a fixed learning period. The operation of this method is described with reference to the flowchart of FIG. 10.

In the method of Embodiment 2, the method to use is first selected according to the picture pattern. If the picture pattern consists of I frames only, the threshold is set to 0% and only the frame skip method is used. Otherwise, the method used at the start of output is determined by the content type. For video-oriented content, the threshold is set to 100% and output starts with the error compensation method. For audio-oriented content, the threshold is set to 0% and output starts with the frame skip method.
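The choice of the starting threshold can be written as a small decision function. This is a minimal sketch: the function name, the string encoding of the picture pattern (for example "IPPPP"), and the content type labels "video" and "audio" are assumptions made for illustration.

```python
def initial_threshold(picture_pattern: str, content_type: str) -> int:
    """Return the starting threshold in percent (0 = frame skip only,
    100 = error compensation only)."""
    if set(picture_pattern) == {"I"}:
        return 0  # I frames only: frame skip method, whatever the content type
    if content_type == "video":
        return 100  # video-oriented content starts with error compensation
    if content_type == "audio":
        return 0  # audio-oriented content starts with frame skip
    raise ValueError(f"unknown content type: {content_type!r}")


if __name__ == "__main__":
    print(initial_threshold("I", "video"))      # I-only pattern
    print(initial_threshold("IPPPP", "video"))  # video-oriented, with P frames
    print(initial_threshold("IPPPP", "audio"))  # audio-oriented, with P frames
```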

A fixed learning period follows the start of output. During this period, audio and video are output according to the settings above, while the decoded video is used to derive the QoE estimates that would result from applying the switching method with several thresholds (for example, 0%, 20%, 40%, and 100%). At the end of the learning period, these QoE estimates are compared, and the threshold that gave the highest QoE estimate is used for subsequent video output. This threshold selection is repeated at regular intervals.
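The selection at the end of each learning period reduces to picking the candidate threshold with the highest QoE estimate. The sketch below assumes the estimates are already available as a mapping from threshold (in percent) to estimated QoE; how the estimator derives them is not shown, and the numbers in the usage example are made up. Ties are resolved in favor of the first candidate in the mapping, which is also an assumption.

```python
from typing import Dict, Iterable, List


def select_threshold(qoe_estimates: Dict[int, float]) -> int:
    """Return the threshold whose estimated QoE is highest."""
    if not qoe_estimates:
        raise ValueError("at least one candidate threshold is required")
    return max(qoe_estimates, key=qoe_estimates.get)


def thresholds_over_time(start_threshold: int,
                         estimates_per_period: Iterable[Dict[int, float]]) -> List[int]:
    """Return the threshold in use during each period, starting with the
    initial setting and re-selecting after every learning period."""
    used = [start_threshold]
    for estimates in estimates_per_period:
        used.append(select_threshold(estimates))
    return used


if __name__ == "__main__":
    # Hypothetical estimates for two consecutive learning periods.
    periods = [
        {0: 2.1, 20: 3.0, 40: 3.4, 100: 2.8},
        {0: 3.2, 20: 3.0, 40: 2.9, 100: 2.5},
    ]
    print(thresholds_over_time(100, periods))
```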

The above describes embodiments of the present invention. Instead of using QoE estimates, it is also possible to use a threshold uniquely determined from the attribute information of the content exchanged when the audio/video transmission session is established.

The present invention can be widely used for audio/video streaming services over packet transmission networks.

FIG. 1: Block configuration diagram of the method of claims 1 and 2
FIG. 2: Block configuration diagram of the method of claim 3
FIG. 3: Flowchart of Embodiment 1
FIG. 4: Video stream output example
FIG. 5: Experimental system
FIG. 6: Psychological scale value when the content is Sports 2 and the picture pattern is I
FIG. 7: Psychological scale value when the content is Sports 2 and the picture pattern is IPPPP
FIG. 8: Psychological scale value when the content is Animation 2 and the picture pattern is IPPPP
FIG. 9: Psychological scale value when the content is Animation 2 and the picture pattern is IPPPPPPPPPPPPPP
FIG. 10: Flowchart of Embodiment 2

Claims

An audio/video output method for receiving and outputting packet-transmitted audio and video, using a buffer unit that absorbs fluctuations in packet arrival delay, a decoding unit that decodes the audio and video, a determination unit that determines whether a decoded video frame may be output, and an output unit that outputs the decoded audio and video, in which the determination unit switches between displaying an error-compensated frame and skipping the frame according to the buffering time in the buffer unit and the error compensation rate of the video frame decoded by the decoding unit, the method being characterized in that an estimation unit is provided that, during audio/video output, estimates the user experience quality in real time from audio temporal quality information, video temporal quality information, and video spatial quality information obtained at the output unit and the video decoding unit, and in that the threshold used for the switching in the determination unit is set using the estimate from the estimation unit.

An audio/video output method realization program for receiving and outputting packet-transmitted audio and video, using a buffer unit that absorbs fluctuations in packet arrival delay, a decoding unit that decodes the audio and video, a determination unit that determines whether a decoded video frame may be output, and an output unit that outputs the decoded audio and video, in which the determination unit switches between displaying an error-compensated frame and skipping the frame according to the buffering time in the buffer unit and the error compensation rate of the video frame decoded by the decoding unit, the program being characterized in that an estimation unit is provided that, during audio/video output, estimates the user experience quality in real time from audio temporal quality information, video temporal quality information, and video spatial quality information obtained at the output unit and the video decoding unit, and in that the threshold used for the switching in the determination unit is set using the estimate from the estimation unit.

An audio/video output device for receiving and outputting packet-transmitted audio and video, comprising a buffer unit that absorbs fluctuations in packet arrival delay, a decoding unit that decodes the audio and video, a determination unit that determines whether a decoded video frame may be output, and an output unit that outputs the decoded audio and video, in which the determination unit switches between displaying an error-compensated frame and skipping the frame according to the buffering time in the buffer unit and the error compensation rate of the video frame decoded by the decoding unit, the device being characterized in that an estimation unit is provided that, during audio/video output, estimates the user experience quality in real time from audio temporal quality information, video temporal quality information, and video spatial quality information obtained at the output unit and the video decoding unit, and in that the threshold used for the switching in the determination unit is set using the estimate from the estimation unit.