JP6652990B2

JP6652990B2 - Apparatus and method for surround audio signal processing

Info

Publication number: JP6652990B2
Application number: JP2018136700A
Authority: JP
Inventors: ゾンシャンリュウ; 田中　直也; 直也田中
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 2018-07-20
Filing date: 2018-07-20
Publication date: 2020-02-26
Anticipated expiration: 2034-03-26
Also published as: JP2018196133A

Description

The present invention relates to surround audio signal processing systems, and in particular to the encoding and decoding of audio signals for use in any digitized and compressed audio signal storage or transmission application, and in rendering for audio playback applications.

When listening to music or watching video with audio, a high degree of audio envelopment is desirable for the audience because it gives a better sense of the audio and video scene. Audio envelopment encompasses immersive 3D audio and accurate audio localization. Immersive 3D audio means that the audio system can virtualize a sound source at any position in space. Accurate audio localization means that the audio system can place sound sources so that they match the original audio scene in both direction and distance [1].

The sensation of audio envelopment can be provided by a 3D audio system that uses a large number of loudspeakers. The loudspeakers surround the audience and can be placed at high, middle, and low vertical positions.

Three types of input signals and formats are commonly used in 3D audio systems: channel-based input, object-based input, and higher-order ambisonics.

Channel-based input is commonly used in today's 2D and 3D audio production and media (for example, 22.2, 9.1, 8.1, 7.1, and 5.1), where each produced audio signal channel is intended to directly drive a loudspeaker at a designated position.

For object-based input, each produced audio signal channel represents an audio source that is intended to be rendered at a specified spatial position, regardless of the number or positions of the loudspeakers actually available.

For higher-order ambisonics (HOA), each produced audio signal channel is part of an overall description of the entire sound scene, regardless of the number or positions of the loudspeakers actually available.

Among the three formats, HOA is a representation of the audio scene from which the ambisonic signals can be rendered to any playback setup, including non-standard loudspeaker layouts.

In the prior art, such as the model for MPEG-H 3D Audio standardization, an HOA signal is first reconstructed on the decoder side from the decoded core signals and then rendered to the loudspeaker setup.

FIG. 1 shows the decoder for the HOA format in the model for MPEG-H 3D Audio standardization.

First, the input bitstream is demultiplexed (101) into N bitstreams originally produced by AAC-family mono encoders, plus the parameters needed to reassemble the full HOA representation from these bitstreams.

In the multichannel perceptual decoding components (102, 103, and 104), the N bitstreams are decoded individually by AAC-family mono decoders to produce N signals.

In the subsequent spatial decoding component, the actual value ranges of these signals are first reconstructed by inverse gain control processing (105). In the next step, the N signals are redistributed to provide M predominant signals and (N − M) HOA coefficient signals representing the more ambient HOA component (105).

A fixed subset of the (N − M) HOA coefficient signals is re-correlated, which reverses the decorrelation applied in the HOA encoding stage (107).

Next, all of the (N − M) HOA coefficient signals are used to create the ambient HOA component (107).

The predominant HOA component is synthesized from the M predominant signals and the corresponding parameters.

Finally, the predominant and ambient HOA components are composed into the desired full HOA representation (108), which is then rendered to the given loudspeaker setup (109).

The detailed processes of predominant sound synthesis, ambience synthesis, HOA composition, and rendering are described below.

In the predominant sound synthesis (PSS) block (106), the HOA representation of the predominant component is computed by one of two methods, referred to as "direction-based" and "vector-based".

In vector-based PSS, the predominant sound is computed from the vector-based signals X_VEC(k). The X_VEC(k) signals represent time-domain audio signals decoupled from their spatial characteristics. The reconstructed HOA coefficients are computed by multiplying the vector-based signals X_VEC(k) by the corresponding transform vectors, which are collected in M_VEC(k). M_VEC(k) thus contains the spatial characteristics (such as directivity and width) of the corresponding X_VEC(k) time-domain audio signals. The computation is as follows.

$$C_{\mathrm{VEC}}(k) = M_{\mathrm{VEC}}(k)\,X_{\mathrm{VEC}}(k)$$

where
X_VEC(k) denotes the decoded vector-based predominant sound,
M_VEC(k) denotes the matrix that reconstructs the HOA coefficients from the vector-based predominant sound, and
C_VEC(k) denotes the HOA coefficients reconstructed from the vector-based predominant sound.

In direction-based PSS, the HOA coefficients are computed from all direction-based predominant sound signals X_PS(k). Using the tuple set M_DIR(k), the computation is as follows.

$$C_{\mathrm{DIR}}(k) = M_{\mathrm{DIR}}(k)\,X_{\mathrm{PS}}(k)$$

where
X_PS(k) denotes the decoded direction-based predominant sound,
M_DIR(k) denotes the matrix that reconstructs the HOA coefficients from the direction-based predominant sound, and
C_DIR(k) denotes the HOA coefficients reconstructed from the direction-based predominant sound.
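Both PSS variants amount to multiplying a matrix of spatial information by a matrix of decoded signals. The following is a minimal sketch of the vector-based case in plain Python. The helper `matmul`, the toy sizes (4 HOA coefficients, 2 predominant signals, 3 samples per frame), and all numeric values are illustrative assumptions, not values from the patent.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [[sum(row[i] * b[i][j] for i in range(inner))
             for j in range(len(b[0]))]
            for row in a]

# Assumed toy frame: 4 HOA coefficients, M = 2 predominant signals, 3 samples.
m_vec = [            # 4 x 2: one spatial transform vector per column
    [1.0, 1.0],
    [0.5, -0.5],
    [0.0, 0.5],
    [0.5, 0.0],
]
x_vec = [            # 2 x 3: one decoded predominant signal per row
    [1.0, 2.0, 3.0],
    [0.0, 1.0, 0.0],
]

# C_VEC(k) = M_VEC(k) X_VEC(k): one row of samples per HOA coefficient.
c_vec = matmul(m_vec, x_vec)
```

The direction-based case has the same shape, with M_DIR(k) and X_PS(k) taking the place of `m_vec` and `x_vec`.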

In ambience synthesis, the ambient HOA component frame C_AMB(k) is obtained as follows, according to reference [2].

1) The first O_MIN coefficients of the ambient HOA component are obtained by:

where
O_MIN denotes the minimum number of ambient HOA coefficients,
Ψ_MIN denotes the mode matrix with respect to certain fixed, predefined directions, and
c_{I,AMB,n}(k) denotes the decoded ambient sound signals.

2) The sample values of the remaining coefficients of the ambient HOA component are computed according to:

Finally, in the HOA composition, the ambient HOA component and the predominant HOA component are superimposed to provide the decoded HOA frame. If prediction is not active for direction-based predominant synthesis, the decoded HOA frame C(k) is computed by:

$$C(k) = C_{\mathrm{AMB}}(k) + C_{\mathrm{DIR}}(k)$$
(for direction-based synthesis)

$$C(k) = C_{\mathrm{AMB}}(k) + C_{\mathrm{VEC}}(k)$$
(for vector-based synthesis)

where
C_VEC(k) denotes the HOA coefficients reconstructed from the vector-based predominant sound,
C_DIR(k) denotes the HOA coefficients reconstructed from the direction-based predominant sound,
C_AMB(k) denotes the HOA coefficients reconstructed from the ambient signals, and
C(k) denotes the final reconstructed HOA coefficients.

If near-field compensation is not applied, the decoded HOA coefficients C(k) are converted to the loudspeaker signal representation W(k) by multiplication with the rendering matrix D.

$$W(k) = D\,C(k)$$

where
C(k) denotes the final reconstructed HOA coefficients,
W(k) denotes the loudspeaker signals, and
D denotes the rendering matrix.
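The composition and rendering steps can be illustrated together: the ambient and predominant HOA frames are added element by element, and the sum is multiplied by the rendering matrix. The sketch below assumes toy sizes (4 HOA coefficients, 2 loudspeakers, 2 samples) and made-up values; `matmul` and `add` are hypothetical helpers, not part of any codec API.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [[sum(row[i] * b[i][j] for i in range(inner))
             for j in range(len(b[0]))]
            for row in a]

def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Assumed toy HOA frames: 4 coefficients x 2 samples.
c_amb = [[0.5, 0.5], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
c_vec = [[1.0, 2.0], [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]]

# HOA composition: C(k) = C_AMB(k) + C_VEC(k).
c = add(c_amb, c_vec)

# Rendering: W(k) = D C(k), with an assumed 2-loudspeaker x 4-coefficient D.
d = [[0.5, 0.5, 0.0, 0.0],
     [0.5, -0.5, 0.0, 0.0]]
w = matmul(d, c)
```

Each row of `w` is the sample sequence for one loudspeaker.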

The following notation is used to calculate the complexity of the above processing.
1) The order of the HOA signal is O_HOA, and the number of HOA coefficients is (O_HOA + 1)^2.
2) The number of playback loudspeakers is L.
3) The total number of core signal channels is N.
4) The number of predominant sound channels is M.
5) The number of ambient sound channels is N − M.

The complexity (amount of computation) of predominant sound synthesis is:

where
COM_PSS denotes the complexity of predominant sound synthesis,
M denotes the number of predominant sound channels,
O_HOA denotes the HOA order, and
F_s denotes the sampling frequency.

The complexity of rendering is:

where
COM_RENDER denotes the complexity of rendering,
L denotes the number of playback loudspeakers,
O_HOA denotes the HOA order, and
F_s denotes the sampling frequency.

The number of HOA coefficients is very large in typical HOA formats. For example, if O_HOA = 4, the number of HOA coefficients is (4 + 1)^2 = 25.

The number of playback channels is also very large in order to give a better sensation of 3D audio. For example, a 22.2 setup has 24 loudspeakers in total.

The sampling frequency for audio signals is typically 44.1 kHz or 48 kHz.

As an example, for M = 4, O_HOA = 4, L = 24, and F_s = 48 kHz, the complexity of predominant sound synthesis and rendering is estimated as:

The example shows that both the synthesis and rendering processes are computationally very expensive, so it is desirable to reduce the complexity.
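To give a feel for the orders of magnitude, the sketch below counts multiply-accumulate operations (MACs) per second for the example parameters. The cost model, one MAC per matrix entry per sample, is an assumption made for illustration and is not necessarily the one the patent's complexity formulas use. The function names are hypothetical.

```python
def num_hoa_coefficients(order):
    """Number of HOA coefficients for a given HOA order: (order + 1)^2."""
    return (order + 1) ** 2

def macs_per_second(rows, cols, sample_rate_hz):
    """MACs per second to apply a rows x cols matrix to every sample.

    Assumed cost model: one multiply-accumulate per matrix entry per sample.
    """
    return rows * cols * sample_rate_hz

order, m, num_speakers, fs = 4, 4, 24, 48_000
k = num_hoa_coefficients(order)

# Predominant sound synthesis: a k x m matrix applied to m signals.
com_pss = macs_per_second(k, m, fs)
# Rendering: a num_speakers x k matrix applied to k HOA coefficient signals.
com_render = macs_per_second(num_speakers, k, fs)

print(k, com_pss, com_render)  # prints: 25 4800000 28800000
```

Under this assumed model, rendering dominates because both of its matrix dimensions (24 loudspeakers and 25 coefficients) are large.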

As shown in the HOA composition process (equations (1) and (2)), predominant sound synthesis is performed according to:

$$C_{\mathrm{VEC}}(k) = M_{\mathrm{VEC}}(k)\,X_{\mathrm{VEC}}(k)$$
(for vector-based synthesis)

$$C_{\mathrm{DIR}}(k) = M_{\mathrm{DIR}}(k)\,X_{\mathrm{PS}}(k)$$
(for direction-based synthesis)

Ambient sound synthesis is performed according to:

Rendering is performed according to equation (7):

$$W(k) = D\,C(k)$$

The HOA composition and rendering processes are combined into a single channel conversion process.

(for vector-based synthesis)

(for direction-based synthesis)

As an example, for O_HOA = 4, M = 4, N = 8, L = 24, and F_s = 48 kHz, the complexity of predominant sound synthesis and rendering is estimated as:

The above example shows that implementing the idea of the present invention can greatly reduce the complexity.
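A rough comparison can be sketched under stated assumptions. The two-step path is modelled as a 25 x M synthesis matrix followed by an L x 25 rendering matrix, and the combined path as a single L x N matrix applied directly to the N core signals. One multiply-accumulate (MAC) per matrix entry per sample is assumed. Ambience synthesis in the two-step path and the per-frame cost of forming the combined matrix are both ignored, so the figures are illustrative only.

```python
def macs_per_second(rows, cols, sample_rate_hz):
    """Assumed cost model: one MAC per matrix entry per sample."""
    return rows * cols * sample_rate_hz

O_HOA, M, N, L, FS = 4, 4, 8, 24, 48_000
K = (O_HOA + 1) ** 2  # 25 HOA coefficients

# Two-step path: synthesize HOA coefficients, then render them.
two_step = macs_per_second(K, M, FS) + macs_per_second(L, K, FS)

# Combined path (assumption): one L x N matrix applied to the N core signals.
one_step = macs_per_second(L, N, FS)

ratio = one_step / two_step
print(two_step, one_step)  # prints: 33600000 9216000
```

Under these assumptions the combined path needs roughly a quarter of the operations, because the intermediate 25-coefficient HOA representation is never formed.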

In the MPEG-H 3D Audio model, there is a prediction component for some input sequences, and near-field compensation before rendering under some conditions. The present invention does not apply when the prediction component is present or when near-field compensation is performed.

In the MPEG-H 3D Audio model, the computation of the HOA representation from the directional signals is based on the overlap-add concept, in order to avoid artifacts caused by direction changes between successive frames (for direction-based synthesis).

Thus, the HOA representation C_DIR(k) of the active directional signals is computed as the sum of a fade-out component and a fade-in component:

The fade-in and fade-out are performed in the HOA domain, which poses a problem for the method of the present invention. To solve this problem, the following idea is conceived.
1) Define X'_PS(k-1) = X_PS(k-1) w_out and X'_PS(k) = X_PS(k) w_in.
2) Modify equation (11) as follows:

The above principle can also be applied to vector-based synthesis if fade-in and fade-out are performed in the HOA domain for vector-based synthesis.
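Moving the fade from the HOA domain to the signal domain relies on linearity: weighting each sample by a window value gives the same result whether it is done before or after the matrix multiplication. The sketch below checks this numerically for a fade-in window. The matrix, signals, window values, and helper names are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [[sum(row[i] * b[i][j] for i in range(inner))
             for j in range(len(b[0]))]
            for row in a]

def apply_window(x, w):
    """Weight every row of x sample by sample with the window w."""
    return [[v * wi for v, wi in zip(row, w)] for row in x]

m_dir = [[1.0, 0.5], [0.0, 2.0], [1.0, -1.0]]        # assumed 3 x 2 matrix
x_ps = [[1.0, 2.0, 4.0, 8.0], [2.0, 0.0, -2.0, 4.0]]  # 2 signals x 4 samples
w_in = [0.0, 0.25, 0.75, 1.0]                         # assumed fade-in window

# Fade applied in the HOA domain, after synthesis.
hoa_domain = apply_window(matmul(m_dir, x_ps), w_in)
# Fade applied in the signal domain: X'_PS(k) = X_PS(k) w_in, then synthesis.
signal_domain = matmul(m_dir, apply_window(x_ps, w_in))
```

Both results are identical here, which is what allows the windowed signals X'_PS to be fed directly into a combined conversion matrix.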

If fade-in and fade-out are performed in the vector domain for vector-based synthesis, then:
1) Define X'_VEC(k) = w_out X_VEC(k-1) + w_in X_VEC(k).
2) Modify equation (10) as follows:

FIG. 1 is a decoder diagram of the MPEG-H 3D Audio standard for HOA input.
FIG. 2 is a decoder diagram according to Embodiment 1 of the present invention.
FIG. 3 is a decoder diagram according to Embodiment 2 of the present invention.
FIG. 4 is a decoder diagram according to Embodiment 3 of the present invention.
FIG. 5 is a decoder diagram according to Embodiment 4 of the present invention.
FIG. 6A is one decoder diagram according to Embodiment 5 of the present invention.
FIG. 6B is another decoder diagram according to Embodiment 5 of the present invention.
FIG. 7A is one decoder diagram according to Embodiment 6 of the present invention.
FIG. 7B is another decoder diagram according to Embodiment 6 of the present invention.
FIG. 8 shows an example of a bitstream according to Embodiment 7 of the present invention.
FIG. 9 is a decoder diagram according to Embodiment 7 of the present invention.
FIG. 10 is an encoder diagram according to Embodiment 8 of the present invention.
FIG. 11 is an encoder diagram according to Embodiment 9 of the present invention.
FIG. 12 is an encoder diagram according to Embodiment 10 of the present invention.

The following embodiments merely illustrate the principles of the various inventive steps. Variations of the details described herein will be apparent to those skilled in the art, who can modify and apply the present invention without departing from its spirit.

1. Embodiment 1
As Embodiment 1 of the present invention, a surround sound decoder according to the present invention includes: a bitstream demultiplexer that unpacks a bitstream into spatial parameters and core parameters; a set of core decoders that decode the core parameters into a set of core signals; a matrix derivation unit that derives a rendering matrix from the spatial parameters and the playback loudspeaker layout; and a renderer that renders the decoded core signals to playback signals using the rendering matrix.

FIG. 2 shows the decoder according to Embodiment 1.

The bitstream demultiplexer (200) unpacks the bitstream into spatial parameters and core parameters.

The set of core decoders (201, 202, 203) decodes the core parameters into a set of core signals. The decoders may be any existing or new codec, such as MPEG-1 Audio Layer III, AAC, HE-AAC, Dolby AC-3, or the MPEG USAC standard.

The matrix derivation unit (204) calculates the rendering matrix from the spatial parameters and the playback loudspeaker layout. The rendering matrix may be derived using some or all of the following parameters:
the number of target loudspeakers (5.1, 7.1, 10.1, 22.2, ...),
the loudspeaker positions (distance from the sweet spot, horizontal angle, and elevation angle),
the spherical modeling positions (horizontal angle and elevation angle),
the HOA order (first order (4 HOA coefficients), second order (9 HOA coefficients), third order (16 HOA coefficients), ...), and
the HOA decomposition parameters (direction-based decomposition, PCA, or SVD).

Techniques are available for deriving the rendering matrix from the reconstructed input signal to the desired loudspeaker layout, such as VBAP (vector-based amplitude panning) [3], DBAP (direction-based amplitude panning) [4], or the method described in the published reference model for MPEG-H 3D for the HOA format [2].

As an example, if the input signal is fourth-order HOA, it has 25 HOA coefficients covering 25 directions on the sphere, and the playback loudspeaker setup is the standard 22.2-channel setup. The rendering matrix then maps the 25 HOA coefficients to the 24 loudspeaker channels.

When VBAP is used to derive the rendering matrix, it uses a set of 24 unit vectors l_1, ..., l_24 pointing to the loudspeakers of the 22.2 loudspeaker setup, and a triangular mesh is formed between the loudspeakers. Each of the 25 HOA spherical directions p lies within one of the triangles formed by the loudspeakers. The three loudspeakers forming that triangle are selected as the active loudspeakers, and the spherical direction p can be expressed as a linear combination of their loudspeaker vectors:

$$p = g_{n_1} l_{n_1} + g_{n_2} l_{n_2} + g_{n_3} l_{n_3}$$

where
p denotes the HOA spherical direction,
l_n denotes a loudspeaker vector,
g_n denotes the scaling factor applied to l_n, and
{n_1, n_2, n_3} denotes the triplet of active loudspeakers.

In R^3, the vector space is spanned by a basis of three vectors. This leads to the following solution, with the three loudspeaker vectors as the columns of a 3 x 3 matrix:

$$\begin{bmatrix} g_{n_1} \\ g_{n_2} \\ g_{n_3} \end{bmatrix} = \begin{bmatrix} l_{n_1} & l_{n_2} & l_{n_3} \end{bmatrix}^{-1} p$$

The above procedure is repeated for all 25 HOA spherical directions. All gain parameters for each spherical direction can then be derived, forming the rendering matrix D.
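The gain computation for one direction is a 3 x 3 linear solve. The sketch below solves p = g_1 l_1 + g_2 l_2 + g_3 l_3 by Cramer's rule in plain Python. The function names and the axis-aligned loudspeaker triplet are illustrative assumptions; triangle selection and any gain normalization are outside this sketch.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix stored as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def vbap_gains(p, l1, l2, l3):
    """Solve p = g1*l1 + g2*l2 + g3*l3 for [g1, g2, g3] by Cramer's rule."""
    speakers = [l1, l2, l3]
    # Base matrix whose columns are the three loudspeaker vectors.
    base = [[speakers[c][r] for c in range(3)] for r in range(3)]
    d = det3(base)
    if abs(d) < 1e-12:
        raise ValueError("loudspeaker vectors do not span R^3")
    gains = []
    for i in range(3):
        # Replace column i with p and take the ratio of determinants.
        replaced = [row[:] for row in base]
        for r in range(3):
            replaced[r][i] = p[r]
        gains.append(det3(replaced) / d)
    return gains

# Assumed toy triplet: loudspeakers on the coordinate axes.
l1, l2, l3 = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
p = (0.6, 0.0, 0.8)
g = vbap_gains(p, l1, l2, l3)

# Rebuild p from the gains as a check on the linear combination.
rebuilt = [sum(g[i] * (l1, l2, l3)[i][r] for i in range(3)) for r in range(3)]
```

Repeating this for every HOA direction, and placing each direction's three gains in the rows of the active loudspeakers, is one way the gain parameters could be gathered into a matrix.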

The rendering from HOA coefficients to loudspeaker outputs can be described by the following equation:

$$W(k) = D\,C'(k) \tag{17}$$

where
C'(k) denotes the fully reconstructed audio signal,
W(k) denotes the loudspeaker signals, and
D denotes the rendering matrix.

However, in the present invention, the fully reconstructed audio signal is not available. Assume that the reconstructed audio signal can be derived according to the following equation:

$$C'(k) = M\,S'(k) \tag{18}$$

where
C'(k) denotes the fully reconstructed audio signal,
S'(k) denotes the decoded signals, and
M denotes the transform matrix.

Combining equations (17) and (18) gives:

$$W(k) = D\,M\,S'(k) = D'\,S'(k), \qquad D' = D\,M$$

where
W(k) denotes the loudspeaker signals,
S'(k) denotes the decoded signals,
D denotes the rendering matrix,
M denotes the transform matrix, and
D' denotes the new rendering matrix.
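The step from D and M to a single matrix D' rests on the associativity of matrix multiplication: D (M S') equals (D M) S'. The sketch below checks this on toy data. The sizes (2 loudspeakers, 3 HOA coefficients, 2 core signals, 2 samples), the values, and the helper `matmul` are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [[sum(row[i] * b[i][j] for i in range(inner))
             for j in range(len(b[0]))]
            for row in a]

d = [[1.0, 0.0, 1.0],      # rendering matrix D: 2 loudspeakers x 3 coefficients
     [0.0, 1.0, 1.0]]
m = [[1.0, 0.0],           # transform matrix M: 3 coefficients x 2 core signals
     [0.0, 1.0],
     [1.0, 1.0]]
s = [[1.0, 2.0],           # decoded core signals S'(k): 2 signals x 2 samples
     [3.0, 4.0]]

# Two-step path: reconstruct C'(k) = M S'(k), then render W(k) = D C'(k).
two_step = matmul(d, matmul(m, s))

# One-step path: form D' = D M once, then render W(k) = D' S'(k).
d_new = matmul(d, m)
one_step = matmul(d_new, s)
```

Because `d_new` does not depend on the samples, it only needs to be recomputed when D or M changes, while the per-sample work shrinks to a loudspeakers-by-core-signals product.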

Besides the approach described above, it is also possible to derive the rendering matrix directly from the decoded core signals and the loudspeaker layout information.

The above procedures and equations are given as examples of how to implement the present invention. Those skilled in the art can modify and apply the invention without departing from its spirit.

Finally, the renderer (205) renders the decoded core signals to playback signals using the rendering matrix.

Effect: In this embodiment, the surround sound signal is reconstructed and rendered to the desired loudspeaker layout in a single step, which improves efficiency and greatly reduces the complexity.

2. Embodiment 2
A surround sound decoder according to the present invention includes: a bitstream demultiplexer that unpacks a bitstream into predominant sound parameters, ambience parameters, channel assignment parameters, and core parameters; a set of core decoders that decode the core parameters into a set of core signals; a predominant sound/ambience switch that assigns the decoded core signals to predominant sound and ambience according to the channel assignment parameters; a matrix derivation unit that derives a predominant sound rendering matrix from the predominant sound parameters and the playback loudspeaker layout; a matrix derivation unit that derives an ambience rendering matrix from the ambience parameters and the playback loudspeaker layout; a predominant sound renderer that renders the predominant sound to playback signals using the rendering matrix; an ambience renderer that renders the ambience to playback signals using the rendering matrix; and an output signal composition unit that composes the playback signals from the rendered predominant sound and ambient sound.

FIG. 3 shows the decoder according to Embodiment 2.

The bitstream demultiplexer (300) unpacks the bitstream into predominant sound parameters, ambience parameters, channel assignment parameters, and core parameters.

The set of core decoders (301, 302, 303) decodes the core parameters into a set of core signals. The decoders may be any existing or new codec, such as MPEG-1 Audio Layer III, AAC, HE-AAC, Dolby AC-3, or the MPEG USAC standard.

The predominant sound/ambience switch (304) assigns each decoded core signal to predominant sound or ambience according to the channel assignment parameters.

The rendering matrix calculation unit (305) calculates a rendering matrix from the predominant sound parameters and the playback loudspeaker layout. The detailed derivation is omitted in this embodiment, and the rendering matrix derived for the predominant sound is assumed to be D'.

The predominant sound renderer (306) converts the decoded predominant sound to playback signals using the PS rendering matrix.

$$W_{ps}(k) = D'\,C_{ps}(k)$$

where
W_ps(k) denotes the playback signals derived from the predominant sound,
C_ps(k) denotes the decoded predominant sound signals, and
D' denotes the PS rendering matrix.

The rendering matrix calculation unit (307) calculates a rendering matrix from the ambience parameters and the playback loudspeaker layout. The detailed derivation is omitted in this embodiment, and the rendering matrix derived for the ambient sound is assumed to be D_AMB.

If the ambient sound was converted to some other format or otherwise processed before encoding, the signals may be post-processed before rendering to reconstruct the original ambient sound.

The ambience renderer (308) converts the decoded ambient sound to playback signals using the ambience rendering matrix.

$$W_{\mathrm{AMB}}(k) = D_{\mathrm{AMB}}\,C_{\mathrm{AMB}}(k)$$

where
W_AMB(k) denotes the playback signals derived from the ambient sound,
C_AMB(k) denotes the decoded ambient sound signals, and
D_AMB denotes the ambience rendering matrix.

The output signal composition unit composes the playback signals from the rendered predominant sound and ambient sound.

$$W(k) = W_{ps}(k) + W_{\mathrm{AMB}}(k)$$

where
W_AMB(k) denotes the playback signals derived from the ambient sound,
W_ps(k) denotes the playback signals derived from the predominant sound, and
W(k) denotes the final playback signals.
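The signal flow of this embodiment (switch, two renderers, and output composition) can be sketched as follows. The list of `"ps"` and `"amb"` labels is a hypothetical stand-in for the channel assignment parameters, whose actual bitstream format is not described here. The matrices, signals, and helper names are likewise illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [[sum(row[i] * b[i][j] for i in range(inner))
             for j in range(len(b[0]))]
            for row in a]

def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

core = [[1.0, 2.0], [10.0, 20.0], [3.0, 4.0]]   # 3 core signals x 2 samples
assignment = ["ps", "amb", "ps"]                 # hypothetical assignment labels

# Predominant sound / ambience switch.
c_ps = [sig for sig, kind in zip(core, assignment) if kind == "ps"]
c_amb = [sig for sig, kind in zip(core, assignment) if kind == "amb"]

d_ps = [[1.0, 0.0], [0.0, 1.0]]   # assumed PS rendering matrix D' (2 x 2)
d_amb = [[0.5], [0.5]]            # assumed ambience rendering matrix D_AMB (2 x 1)

w_ps = matmul(d_ps, c_ps)         # W_ps(k) = D' C_ps(k)
w_amb = matmul(d_amb, c_amb)      # W_AMB(k) = D_AMB C_AMB(k)
w = add(w_ps, w_amb)              # W(k) = W_ps(k) + W_AMB(k)
```

Keeping the two paths separate lets each use a rendering matrix sized to its own channel count before the loudspeaker signals are summed.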

Effect: In this embodiment, the predominant sound signals are reconstructed and rendered to the desired loudspeaker layout in a single step, which improves efficiency and greatly reduces the complexity.

3. Embodiment 3
A surround sound decoder according to the present invention includes: a bitstream demultiplexer that unpacks a bitstream into spatial parameters and core parameters; a set of core decoders that decode the core parameters into a set of core signals; a matrix derivation unit that derives a rendering matrix from the spatial parameters and the playback loudspeaker layout; a windowing unit that applies windowing to the decoded core signals of the previous frame and the current frame; a summation unit that sums the windowed decoded core signals of the previous frame and the windowed decoded core signals of the current frame to derive smoothed core signals; and a renderer that renders the smoothed core signals to playback signals using the rendering matrix.

Windowing is commonly applied in audio signal processing to avoid artifacts across frame boundaries.

As shown in FIG. 4, windowing is applied to the decoded core signals (404), and equations (17) and (18) are modified as follows.

where
C'(k) denotes the fully reconstructed audio signal,
S'(k) denotes the decoded signal for the current frame,
S'(k-1) denotes the decoded signal for the previous frame,
win_cur denotes the windowing function for the current frame,
win_pre denotes the windowing function for the previous frame, and
M denotes the transformation matrix.

where
S'(k) denotes the decoded signal for the current frame,
S'(k-1) denotes the decoded signal for the previous frame,
win_cur denotes the windowing function for the current frame,
win_pre denotes the windowing function for the previous frame,
W(k) denotes the loudspeaker signals, and
D' denotes the rendering matrix.
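
The modified equations are not reproduced in this text. As a minimal sketch, assuming the modified equation (18) has the form W(k) = D' (win_cur S'(k) + win_pre S'(k-1)) with the windows applied sample by sample, the windowing unit, summation unit, and renderer of this embodiment could be written as follows. The function names, the list-of-lists signal layout (one row per core signal or loudspeaker), and the squared-sine window shape are illustrative assumptions and are not taken from the patent.

```python
import math

def crossfade_windows(length):
    """Return (win_cur, win_pre): a fade-in and a fade-out that sum to 1 at every sample.

    The squared-sine shape is an assumption; the patent does not specify the window.
    """
    win_cur = [math.sin(math.pi * (n + 0.5) / (2 * length)) ** 2 for n in range(length)]
    win_pre = [1.0 - w for w in win_cur]
    return win_cur, win_pre

def smooth_core_signals(s_cur, s_pre, win_cur, win_pre):
    """Windowing + summation: win_cur*S'(k) + win_pre*S'(k-1), per core signal."""
    return [
        [wc * c + wp * p for c, p, wc, wp in zip(row_cur, row_pre, win_cur, win_pre)]
        for row_cur, row_pre in zip(s_cur, s_pre)
    ]

def render(matrix, signals):
    """Renderer: (speakers x core signals) matrix times (core signals x samples)."""
    n_samples = len(signals[0])
    return [
        [sum(m * sig[n] for m, sig in zip(row, signals)) for n in range(n_samples)]
        for row in matrix
    ]

def decode_frame(d_prime, s_cur, s_pre):
    """Assumed form of the modified equation (18): one rendering pass per frame."""
    win_cur, win_pre = crossfade_windows(len(s_cur[0]))
    return render(d_prime, smooth_core_signals(s_cur, s_pre, win_cur, win_pre))
```

Because the two windows in this sketch sum to one, a signal that does not change between frames passes through the smoothing step unchanged, and only one matrix multiplication is needed per frame.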

Effect: In this embodiment, windowing is applied to avoid audible artifacts at frame boundaries.

4. Embodiment 4
A surround sound decoder according to the present invention includes: a bitstream demultiplexer that unpacks a bitstream into predominant sound parameters, ambience parameters, channel assignment parameters, and core parameters; a set of core decoders that decode the core parameters into a set of core signals; a predominant sound ambience switch that assigns the decoded core signals to predominant sound and ambience according to the channel assignment parameters; a matrix deriving unit that derives a predominant sound rendering matrix from the predominant sound parameters and the layout of the playback speakers; a matrix deriving unit that derives an ambience rendering matrix from the ambience parameters and the layout of the playback speakers; a windowing unit that performs windowing on the predominant sound signals of the previous frame and the current frame; a predominant sound renderer that renders the smoothed predominant sound into a reproduced signal using the rendering matrix; and an output signal composition unit that composes the reproduced signal from the rendered predominant sound and the ambience sound.

As shown in FIG. 5, windowing is applied to the predominant sound (506) to ensure that the sound field evolves continuously and smoothly across frame boundaries.

Since windowing is applied to the predominant sound, equation (20) is modified as follows:

where
W_PS(k) denotes the reproduced signal derived from the predominant sound,
C_PS(k) denotes the decoded predominant sound signal for the current frame,
C_PS(k-1) denotes the decoded predominant sound signal for the previous frame, and
D' denotes the PS rendering matrix.
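
The modified equation (20) is not reproduced in this text. A form consistent with the definitions above and with the windowing functions win_cur and win_pre of Embodiment 3 is given below as an assumption; the symbol \odot denotes sample-wise multiplication by the window and is notation introduced here for illustration.

```latex
W_{PS}(k) = D' \left( \mathrm{win}_{cur} \odot C_{PS}(k) + \mathrm{win}_{pre} \odot C_{PS}(k-1) \right)
```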

Effect: In this embodiment, windowing is applied to ensure that the sound field evolves continuously and smoothly across frame boundaries.

5. Embodiment 5
As shown in FIG. 6A, a surround sound decoder according to the present invention includes: a bitstream demultiplexer that unpacks a bitstream into spatial parameters and core parameters; a set of core decoders (601, 602, and 603) that decode the core parameters into a set of core signals; a matrix deriving unit (604) that derives, from the spatial parameters and the layout of the playback speakers, a rendering matrix for the decoded signals of the current frame; a windowing and rendering unit (605) that performs windowing and rendering on the decoded core signals of the current frame using the rendering matrix; a windowing and rendering unit (606) that performs windowing and rendering on the decoded core signals of the previous frame using the rendering matrix; and an addition unit (607) that adds the reproduced signal of the previous frame and the reproduced signal of the current frame to form the final reproduced signal.

In Embodiment 1, the decoded core signals of the previous frame and the current frame have different spatial directions. If windowing therefore cannot be applied to the decoded core signals, it must be applied to the reconstructed HOA coefficients.

Equation (18) is then modified as follows:

where
S'(k) denotes the decoded signal for the current frame,
S'(k-1) denotes the decoded signal for the previous frame,
S''(k) denotes the windowed signal for the current frame,
S''(k-1) denotes the windowed signal for the previous frame,
win_cur denotes the windowing function for the current frame,
win_pre denotes the windowing function for the previous frame,
W(k) denotes the loudspeaker signals,
D'_cur denotes the new rendering matrix for the current frame,
D'_pre denotes the new rendering matrix for the previous frame,
C'(k) denotes the fully reconstructed audio signal for the current frame,
C'(k-1) denotes the fully reconstructed audio signal for the previous frame,
D denotes the rendering matrix,
M_cur denotes the transformation matrix for the current frame, and
M_pre denotes the transformation matrix for the previous frame.
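
The modified equation (18) is not reproduced in this text. The relations below are one reading that is consistent with the definitions above; they are an assumption, not the patent's own formula. Here \odot denotes sample-wise multiplication by the window, C'(k) = M_cur S'(k), and C'(k-1) = M_pre S'(k-1).

```latex
\begin{aligned}
S''(k) &= \mathrm{win}_{cur} \odot S'(k), & S''(k-1) &= \mathrm{win}_{pre} \odot S'(k-1) \\
D'_{cur} &= D\,M_{cur}, & D'_{pre} &= D\,M_{pre} \\
W(k) &= D'_{cur}\,S''(k) + D'_{pre}\,S''(k-1) \\
     &= D \left( \mathrm{win}_{cur} \odot C'(k) + \mathrm{win}_{pre} \odot C'(k-1) \right)
\end{aligned}
```

The last line holds because the window scales every channel by the same factor at a given sample, so it commutes with the matrix products. Under this reading, windowing the core signals and rendering each frame with its own matrix gives the same loudspeaker signals as windowing the reconstructed HOA coefficients.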

As shown in FIG. 6A, windowing and rendering are first performed independently on the decoded core signals of the current frame and of the previous frame (605 and 606); the rendered signals of the previous frame and the current frame are then added together to form the final output (607).

For windowing and rendering of the decoded core signals of the previous frame, the rendering matrix of the previous frame can be reused from the previous frame's computation if it is available, that is, if it has been stored. If it is not available, the rendering matrix can be computed in the same way as in (604), but using the spatial parameters and speaker layout information of the previous frame.
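
The reuse-or-recompute rule for the previous frame's rendering matrix can be sketched as a small cache. The class name, the frame index used as a key, and the `derive_matrix` callable (standing in for the matrix deriving unit) are hypothetical names introduced for illustration only.

```python
class RenderingMatrixCache:
    """Keeps the most recently stored rendering matrix for reuse in the next frame."""

    def __init__(self, derive_matrix):
        # derive_matrix(spatial_params, speaker_layout) is a hypothetical stand-in
        # for the matrix deriving unit.
        self._derive = derive_matrix
        self._stored = None  # (frame_index, matrix) or None

    def store(self, frame_index, matrix):
        """Remember the matrix computed for frame_index."""
        self._stored = (frame_index, matrix)

    def matrix_for(self, frame_index, spatial_params, speaker_layout):
        """Return the stored matrix for frame_index if available, else derive it again."""
        if self._stored is not None and self._stored[0] == frame_index:
            return self._stored[1]
        return self._derive(spatial_params, speaker_layout)
```

When frame k is processed, a decoder built this way would ask for the matrix of frame k-1 and pay for a second derivation only if nothing was stored.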

An alternative method is shown in FIG. 6B. First, rendering is performed on the decoded signals of the current frame (615). Windowing is then applied to the rendered signal of the previous frame and the rendered signal of the current frame. Finally, the windowed rendered signals of the previous frame and the current frame are added together to form the final output (616).
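
The two orderings of FIG. 6A and FIG. 6B can be compared in a short sketch. It assumes that windowing is a sample-wise scaling applied identically to every channel, in which case it commutes with the rendering matrix and both orders give the same output. All function names and the data layout (one row per signal) are illustrative assumptions.

```python
def render(matrix, signals):
    """Multiply a (speakers x inputs) matrix by (inputs x samples) signals."""
    n_samples = len(signals[0])
    return [
        [sum(m * sig[n] for m, sig in zip(row, signals)) for n in range(n_samples)]
        for row in matrix
    ]

def window(signals, win):
    """Scale every row sample by sample with the same window."""
    return [[w * x for w, x in zip(win, row)] for row in signals]

def add(a, b):
    """Element-wise sum of two equally shaped signal blocks."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def window_then_render(d_cur, d_pre, s_cur, s_pre, win_cur, win_pre):
    """FIG. 6A order: window each frame's core signals, render them, then add."""
    return add(render(d_cur, window(s_cur, win_cur)),
               render(d_pre, window(s_pre, win_pre)))

def render_then_window(d_cur, d_pre, s_cur, s_pre, win_cur, win_pre):
    """FIG. 6B order: render, window the rendered signals, then add.

    A real decoder would keep render(d_pre, s_pre) from the previous frame;
    it is recomputed here only to keep the sketch self-contained.
    """
    return add(window(render(d_cur, s_cur), win_cur),
               window(render(d_pre, s_pre), win_pre))
```

In the second ordering the windows are applied to loudspeaker signals rather than core signals, and the previous frame's rendered output can be carried over instead of being rendered again.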

Effect: In this embodiment, windowing is applied to avoid audible artifacts at frame boundaries.

6. Embodiment 6
As shown in FIG. 7A, a surround sound decoder according to the present invention includes: a bitstream demultiplexer (700) that unpacks a bitstream into predominant sound parameters, ambience parameters, channel assignment parameters, and core parameters; a set of core decoders (701, 702, and 703) that decode the core parameters into a set of core signals; a predominant sound ambience switch (704) that assigns the decoded core signals to predominant sound and ambience according to the channel assignment parameters; a matrix deriving unit (705) that derives, from the predominant sound parameters and the layout of the playback speakers, a predominant sound rendering matrix for the predominant sound signal of the current frame; a windowing and rendering unit (706) that performs windowing and rendering on the predominant sound signal of the current frame; a windowing and rendering unit (707) that performs windowing and rendering on the predominant sound signal of the previous frame; an addition unit (708) that adds the rendered predominant sound of the previous frame and that of the current frame to form the rendered predominant sound; a matrix deriving unit (709) that derives an ambience rendering matrix from the ambience parameters and the layout of the playback speakers; an ambience renderer (710) that renders the ambience into a reproduced signal using the rendering matrix; and an output signal composition unit (711) that composes the reproduced signal from the rendered predominant sound and the ambient sound.

In Embodiment 2, the predominant sound signals of the previous frame and the current frame have different spatial directions. If windowing therefore cannot be applied to the decoded predominant sound signals, it must be applied to the reconstructed HOA coefficients.

Equation (19) is then modified as follows:

where
C'_PS(k) denotes the decoded predominant sound signal for the current frame,
C'_PS(k-1) denotes the decoded predominant sound signal for the previous frame,
C''_PS(k) denotes the windowed predominant sound signal for the current frame,
C''_PS(k-1) denotes the windowed predominant sound signal for the previous frame,
win_cur denotes the windowing function for the current frame,
win_pre denotes the windowing function for the previous frame,
W_PS(k) denotes the loudspeaker signals derived from the predominant sound,
D'_cur denotes the new rendering matrix for the current frame,
D'_pre denotes the new rendering matrix for the previous frame,
C'(k) denotes the reconstructed audio signal for the current frame,
C'(k-1) denotes the reconstructed audio signal for the previous frame,
D denotes the rendering matrix,
M_cur denotes the transformation matrix for the current frame, and
M_pre denotes the transformation matrix for the previous frame.

As shown in FIG. 7A, windowing and rendering are first performed independently on the decoded predominant sound signals of the current frame and of the previous frame (706 and 707); the rendered signals of the previous frame and the current frame are then added together to form the final predominant sound output (708).

For windowing and rendering of the predominant sound of the previous frame, the PS rendering matrix of the previous frame can be reused from the previous frame's computation if it is available, that is, if it has been stored. If it is not available, the PS rendering matrix can be computed in the same way as in (705), but using the spatial parameters and speaker layout information of the previous frame.

An alternative method is shown in FIG. 7B. First, rendering is performed on the decoded predominant sound signal of the current frame (716). Windowing is then applied to the rendered signal of the previous frame and the rendered signal of the current frame. Finally, the windowed rendered signals of the previous frame and the current frame are added together to form the final predominant sound output (717).

7. Embodiment 7
A surround sound decoder according to the present invention includes: a bitstream demultiplexer that unpacks a bitstream into a rendering flag, predominant sound parameters, ambience parameters, channel assignment parameters, and core parameters; a set of core decoders that decode the core parameters into a set of core signals; a predominant sound ambience switch that assigns the decoded core signals to predominant sound and ambience according to the channel assignment parameters; a matrix deriving unit that derives a predominant sound rendering matrix from the predominant sound parameters and the layout of the playback speakers, using the calculation method specified by the rendering flag; a matrix deriving unit that derives an ambience rendering matrix from the ambience parameters and the layout of the playback speakers; a predominant sound renderer that renders the predominant sound into a reproduced signal using the rendering matrix; an ambience renderer that renders the ambience into a reproduced signal using the rendering matrix; and an output signal composition unit that composes the reproduced signal from the rendered predominant sound and the ambient sound.

In this embodiment, the bitstream carries a rendering flag that indicates whether the bitstream contains any other data that would make the invented idea impractical to implement.

FIG. 8 shows an example bitstream.

When the bitstream contains only PS parameter data, ambience parameter data, channel assignment parameter data, and core coder data, use of the invented idea is recommended in order to achieve low-complexity composition and rendering; the rendering flag LC_RENDER_FLAG is therefore set to 1.

When the bitstream contains prediction data and near-field compensation data, the invented idea becomes impractical and the conventional decoding, composition, and rendering tools are recommended; the rendering flag LC_RENDER_FLAG is therefore set to 0.

FIG. 9 shows the decoder of this embodiment.

The bitstream demultiplexer (901) unpacks the bitstream into LC_RENDER_FLAG and the other parameters.

If LC_RENDER_FLAG is equal to 1, the decoder of the present invention (902) is selected to perform decoding, composition, and rendering, giving a low-complexity solution.

If LC_RENDER_FLAG is equal to 0, the conventional decoder (903) is selected to perform decoding, composition, and rendering.
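
The selection between the two decoders can be sketched as a simple dispatch on the flag. The dictionary-based frame representation and the two callables are hypothetical placeholders for the demultiplexed parameters and for decoders (902) and (903); the patent does not define such an interface.

```python
def dispatch_frame(frame, low_complexity_decoder, conventional_decoder):
    """Route a demultiplexed frame by its LC_RENDER_FLAG value.

    frame is assumed to be a dict holding the flag and the other parameters.
    """
    flag = frame["LC_RENDER_FLAG"]
    if flag == 1:
        return low_complexity_decoder(frame)   # plays the role of decoder (902)
    if flag == 0:
        return conventional_decoder(frame)     # plays the role of decoder (903)
    raise ValueError(f"unexpected LC_RENDER_FLAG value: {flag!r}")
```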

Effect: In this embodiment, the problem of bitstream incompatibility is solved.

8. Embodiment 8
In this embodiment, the encoder includes: a spatial encoder that analyzes an input signal and encodes it into spatial parameters and N generated signals; a set of core encoders that encode the N generated signals into a set of core parameters; and a bitstream multiplexer that packs the spatial parameters and core parameters into a bitstream.

A surround sound decoder according to the present invention includes: a bitstream demultiplexer that unpacks a bitstream into spatial parameters and core parameters; a set of core decoders that decode the core parameters into a set of core signals; a matrix deriving unit that derives a rendering matrix from the spatial parameters and the layout of the playback speakers; and a renderer that renders the decoded core signals into a reproduced signal using the rendering matrix.

FIG. 10 shows the encoder and decoder of this embodiment.

The spatial encoder (1001) analyzes the input signal and encodes it into spatial parameters and N generated signals.

Based on an analysis of the audio scene, spatial encoding may determine how many sound sources or audio objects are present in the input audio scene and decide how to extract and encode them. For example, principal component analysis (PCA) may be used to extract the sound sources or audio objects, so that N sound sources are extracted and encoded. During this process, PCA parameters and N audio signals are derived. The PCA parameters and the N generated audio signals are encoded and sent to the decoder side.

The generated signals may be derived according to the following equation.

where
C(k) denotes the input audio signal,
S(k) denotes the generated audio signals, and
M denotes the transformation matrix.

The set of core encoders (1002, 1003, 1004) encodes the N generated signals into a set of core parameters. Each core encoder may be any existing or new codec, such as MPEG-1 Audio Layer III, AAC, HE-AAC, Dolby AC-3, or the MPEG USAC standard.

The bitstream multiplexer (1005) packs the spatial parameters and core parameters into a bitstream.

The corresponding decoder may be the decoder shown in FIG. 2.

9. Embodiment 9
In Embodiment 9 of the present invention, the encoder includes: an audio scene analysis and spatial encoder that analyzes an input signal and encodes it into a plurality of predominant sounds and a plurality of ambience sounds, together with the corresponding predominant sound parameters and ambience parameters; a channel assignment unit that assigns core encoders to encode the predominant sounds and the ambience sounds; a set of core encoders that encode the N-channel audio signal, which includes encoding both the predominant sounds and the ambience sounds into a set of core parameters; and a bitstream multiplexer that packs the predominant sound parameters, ambience parameters, channel assignment information, and core parameters into a bitstream.

A surround sound decoder according to the present invention includes: a bitstream demultiplexer that unpacks a bitstream into predominant sound parameters, ambience parameters, channel assignment parameters, and core parameters; a set of core decoders that decode the core parameters into a set of core signals; a predominant sound ambience switch that assigns the decoded core signals to predominant sound and ambience; a matrix deriving unit that derives a predominant sound rendering matrix from the predominant sound parameters and the layout of the playback speakers; a matrix deriving unit that derives an ambience rendering matrix from the ambience parameters and the layout of the playback speakers; a predominant sound renderer that renders the predominant sound into a reproduced signal using the rendering matrix; an ambience renderer that renders the ambience into a reproduced signal using the rendering matrix; and an output signal composition unit that composes the reproduced signal from the rendered predominant sound and the ambience sound.

FIG. 11 shows the above-described encoder of the second embodiment.

The encoder includes: an audio scene analysis and spatial encoder that analyzes an input signal and encodes it into a plurality of predominant sounds and a plurality of ambience sounds, together with the corresponding predominant sound parameters and ambience parameters; a channel assignment unit that assigns core encoders to encode the predominant sounds and the ambience sounds; a set of core encoders that encode the N-channel audio signal, which includes encoding both the predominant sounds and the ambience sounds into a set of core parameters; and a bitstream multiplexer that packs the predominant sound parameters, ambience parameters, channel assignment information, and core parameters into a bitstream.

The audio scene analysis and spatial encoder (1101) analyzes the input signal and encodes it into a plurality of predominant sounds and a plurality of ambience sounds, together with the corresponding predominant sound parameters and ambience parameters.

Audio scene analysis and spatial encoding analyze the audio scene, determine how many sound sources or audio objects are present in the input audio scene, and decide how to extract and encode them. For example, principal component analysis (PCA) may be used to extract the sound sources or audio objects, so that M sound sources are extracted and encoded. During this process, PCA parameters and M predominant sound signals are derived. The PCA parameters and the M predominant sound signals are encoded and sent to the decoder side.

The generated signals may be derived according to the following equation.

where
C(k) denotes the input audio signal,
C_PS(k) denotes the generated audio signals, and
M denotes the transformation matrix.

The audio scene analysis and spatial encoder may also extract and encode the residual between the input signal and the signal synthesized from the predominant sound signals; this residual may be called the ambient signal. Spatial encoding extracts the ambient signal from the difference between the input signal and the signal synthesized from the predominant sound signals. The predominant sound may be synthesized according to the following equation.

where
C'(k) denotes the audio signal reconstructed from the predominant sound,
C_PS(k) denotes the decoded predominant sound signal, and
M denotes the transformation matrix.

The ambient signal may be derived according to the following equation.

where
C'(k) denotes the audio signal reconstructed from the predominant sound,
C(k) denotes the input audio signal, and
C_AMB(k) denotes the ambience signal.
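
The synthesis and residual steps can be illustrated as follows. The equations are not reproduced in this text, so the sketch assumes the forms C'(k) = M C_PS(k) and C_AMB(k) = C(k) - C'(k), which match the description of the ambient signal as the difference between the input and the signal synthesized from the predominant sounds. Function names and the list-of-lists layout are illustrative assumptions.

```python
def synthesize(m, c_ps):
    """Assumed C'(k) = M C_PS(k): (coefficients x sources) times (sources x samples)."""
    n_samples = len(c_ps[0])
    return [
        [sum(w * sig[n] for w, sig in zip(row, c_ps)) for n in range(n_samples)]
        for row in m
    ]

def ambient_residual(c_input, m, c_ps):
    """Assumed C_AMB(k) = C(k) - C'(k), computed coefficient by coefficient."""
    c_rec = synthesize(m, c_ps)
    return [
        [x - y for x, y in zip(row_in, row_rec)]
        for row_in, row_rec in zip(c_input, c_rec)
    ]
```

If the input is exactly the synthesized predominant sound plus some ambience, the residual returned by `ambient_residual` is that ambience.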

It is then determined which of the ambient signals are to be encoded. The ambient signals may be processed or converted into another format so that they can be encoded more efficiently.

The channel assignment unit (1101) assigns core encoders to encode the predominant sounds and the ambience sounds. Information about the selection of the ambient HOA coefficient sequences to be transmitted, their assignment, and the assignment of the predominant sound signals to the given N channels is sent to the decoder side.

The set of core encoders (1102, 1103, 1104) encodes the M predominant sound signals and the (N - M) ambient signals into a set of core parameters. Each core encoder may be any existing or new codec, such as MPEG-1 Audio Layer III, AAC, HE-AAC, Dolby AC-3, or the MPEG USAC standard.
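
A minimal sketch of the channel assignment and of the matching decoder-side switch is given below. The ordering (predominant sounds first) and the rule for choosing which ambient signals are kept (the first N - M offered) are assumptions made for illustration; the text only states that the selection and assignment are signaled to the decoder. All names are hypothetical.

```python
def assign_channels(n_channels, predominant, ambient):
    """Map M predominant signals and up to (N - M) ambient signals onto N channels.

    Returns (assignment, channels). assignment is a list of (kind, index) pairs,
    one per channel, standing in for the channel assignment information.
    """
    m = len(predominant)
    if m > n_channels:
        raise ValueError("more predominant signals than channels")
    kept_ambient = list(ambient[: n_channels - m])  # assumed selection rule
    assignment = [("PS", i) for i in range(m)]
    assignment += [("AMB", i) for i in range(len(kept_ambient))]
    channels = list(predominant) + kept_ambient
    return assignment, channels

def split_channels(assignment, decoded_channels):
    """Decoder-side switch: route decoded channels back to predominant sound and ambience."""
    ps = [sig for (kind, _), sig in zip(assignment, decoded_channels) if kind == "PS"]
    amb = [sig for (kind, _), sig in zip(assignment, decoded_channels) if kind == "AMB"]
    return ps, amb
```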

The bitstream multiplexer (1105) packs the predominant sound parameters, ambience parameters, channel assignment information, and core parameters into a bitstream.

The corresponding decoder may be the decoder shown in FIG. 3.

10. Embodiment 10
FIG. 12 shows the encoder of this embodiment.

The audio scene analysis and spatial encoder (1201) analyzes the input signal and encodes it.

Audio scene analysis and spatial encoding analyze the audio scene, determine whether the generated parameters are compatible with the invented idea, and signal the result by transmitting LC_RENDER_FLAG.

If all generated parameters, such as the PS parameter data, ambience parameter data, channel assignment parameter data, and core coder data, are compatible with the invented idea, use of the invented idea on the decoder side is recommended in order to achieve low-complexity composition and rendering; the rendering flag LC_RENDER_FLAG is therefore set to 1.

If not all generated parameters are compatible with the invented idea, using it is impractical and the conventional decoding, composition, and rendering tools are recommended on the decoder side; the rendering flag LC_RENDER_FLAG is therefore set to 0.
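
The encoder-side decision can be sketched as a set check. The string labels for the kinds of generated data are hypothetical and are not bitstream syntax from the patent; the set of compatible kinds follows the four data types named above.

```python
# Hypothetical labels for the kinds of data the encoder may generate.
LOW_COMPLEXITY_KINDS = frozenset({
    "ps_parameters",
    "ambience_parameters",
    "channel_assignment",
    "core_coder_data",
})

def choose_lc_render_flag(generated_kinds):
    """Return 1 if every generated data kind is compatible with low-complexity rendering, else 0."""
    return 1 if set(generated_kinds) <= LOW_COMPLEXITY_KINDS else 0
```

With this rule, adding any other kind of data, such as prediction data or near-field compensation data, switches the flag to 0.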

Effect: In this embodiment, the problem of bitstream incompatibility is solved.

References
[1] ISO/IEC JTC1/SC29/WG11/N13411, "Call for Proposals for 3D Audio"
[2] ISO/IEC JTC1/SC29/WG11/N14264, "WD1-HOA Text of MPEG-H 3D Audio"
[3] V. Pulkki, "Virtual Sound Source Positioning Using Vector Base Amplitude Panning," J. Audio Eng. Soc., vol. 45, 1997
[4] T. Lossius, P. Baltazar, and T. de la Hogue, "DBAP - Distance-based amplitude panning," in International Computer Music Conference (ICMC), Montreal, 2009.

Claims

1. A device for decoding a surround audio signal, comprising:
a bitstream demultiplexer that unpacks a bitstream into predominant sound parameters, ambience parameters, channel assignment parameters, core parameters, and a rendering flag;
a set of core decoders that decode the core parameters into a set of core signals;
a predominant sound/ambience switch that assigns the decoded core signals to predominant sound and ambience according to the channel assignment parameters;
a matrix deriving unit that derives a predominant sound rendering matrix using the predominant sound parameters and layout information of the playback speakers, and derives an ambience rendering matrix using the ambience parameters and the layout information of the playback speakers;
a predominant sound renderer that renders the predominant sound into a playback signal using the predominant sound rendering matrix;
an ambience renderer that renders the ambience into a playback signal using the ambience rendering matrix; and
an output signal composition unit that composes the playback signal using the rendered predominant sound and the rendered ambience,
wherein the matrix deriving unit determines, based on the value of the rendering flag, not to use the rendering matrix derivation process.

2. The device according to claim 1, wherein the core decoders conform to the MPEG-1 Audio Layer III, AAC, HE-AAC, Dolby AC-3, or MPEG USAC standard.

3. The device according to claim 1, wherein the surround audio signal is a higher-order ambisonics (HOA) signal.

4. The device according to claim 1, wherein the matrix derivation is performed using some or all of a parameter group consisting of the number of target speakers, the speaker positions, the spherical modeling positions (horizontal and elevation), the HOA order, and the HOA decomposition parameters.

5. The device according to claim 1, wherein the value of the rendering flag indicates that the rendering matrix derivation process in the matrix deriving unit is not used when the bitstream includes at least near-field compensation data.

6. The device according to claim 1, wherein the value of the rendering flag indicates that the rendering matrix derivation process in the matrix deriving unit is used when the bitstream does not include at least near-field compensation data.

7. A device for encoding a surround audio signal, comprising:
a spatial encoder that encodes an input signal into a plurality of predominant sounds with corresponding predominant sound parameters and a plurality of ambience sounds with corresponding ambience parameters, based on an audio scene analysis result of the input signal;
a channel assignment unit that assigns a core decoder for encoding the predominant sounds and the ambience sounds;
a rendering flag determination unit that determines a rendering flag to be used on the decoder side;
a set of core encoders that encode the generated audio signals, including encoding both the predominant sounds and the ambience sounds into a set of core parameters; and
a bitstream multiplexer that packs the rendering flag, the predominant sound parameters, the ambience parameters, the channel assignment information, and the core parameters into a bitstream.

8. A method of decoding a surround audio signal, comprising:
unpacking a bitstream into predominant sound parameters, ambience parameters, channel assignment parameters, core parameters, and a rendering flag;
decoding the core parameters into a set of core signals;
assigning the decoded core signals to predominant sound and ambience according to the channel assignment parameters;
deriving a predominant sound rendering matrix using the predominant sound parameters and layout information of the playback speakers;
deriving an ambience rendering matrix using the ambience parameters and the layout information of the playback speakers;
rendering the predominant sound into a playback signal using the predominant sound rendering matrix;
rendering the ambience into a playback signal using the ambience rendering matrix; and
composing the playback signal using the rendered predominant sound and the rendered ambience,
wherein, in deriving the matrices, it is determined, based on the value of the rendering flag, not to use the rendering matrix derivation process.

9. The method according to claim 8, wherein the core decoder conforms to the MPEG-1 Audio Layer III, AAC, HE-AAC, Dolby AC-3, or MPEG USAC standard.

10. The method according to claim 8, wherein the surround audio signal is a higher-order ambisonics (HOA) signal.

11. The method according to claim 8, wherein the matrix derivation is performed using some or all of a parameter group consisting of the number of target speakers, the speaker positions, the spherical modeling positions (horizontal and elevation), the HOA order, and the HOA decomposition parameters.

12. The method according to claim 8, wherein the value of the rendering flag indicates that the rendering matrix derivation process in the matrix derivation is not used when the bitstream includes at least near-field compensation data.

13. The method according to claim 8, wherein the value of the rendering flag indicates that the rendering matrix derivation process in the matrix derivation is used when the bitstream does not include at least near-field compensation data.

14. A method of encoding a surround audio signal, comprising:
encoding an input signal into a plurality of predominant sounds with corresponding predominant sound parameters and a plurality of ambience sounds with corresponding ambience parameters, based on an audio scene analysis result of the input signal;
performing channel assignment by assigning a core decoder for encoding the predominant sounds and the ambience sounds;
determining a rendering flag to be used on the decoder side;
encoding the generated audio signals with a set of core encoders, including encoding both the predominant sounds and the ambience sounds into a set of core parameters; and
packing the rendering flag, the predominant sound parameters, the ambience parameters, the channel assignment information, and the core parameters into a bitstream.