HK40112969A

HK40112969A - Time-domain superwideband bandwidth expansion for cross-talk scenarios

Info

Publication number: HK40112969A
Application number: HK62024101175.3A
Authority: HK
Inventors: V. Malenovsky; M. Jelinek
Original assignee: VoiceAge Corporation
Priority date: 2022-02-03
Filing date: 2023-01-27
Publication date: 2025-01-28

Description

Time-domain superwideband bandwidth extension for cross-talk scenarios

Technical Field

The present disclosure relates to a method and device for time-domain bandwidth extension of an excitation signal during encoding/decoding of a cross-talk sound signal.

In the present disclosure and the appended claims:

- The term "cross-talk" is generally intended to designate a sound segment in which a first sound element is superimposed on a second sound element, for example, but not exclusively, a speech segment in which a first person speaks over a second person.

- The term "low band" is intended to designate a lower frequency range. Although the frequency ranges 0 kHz to 6.4 kHz and 0 kHz to 8 kHz are given in the present disclosure as examples of "low band", the frequency boundaries of the low-band range can be modified/adapted to the bit rate of the codec and/or to meet specific objectives, such as compliance with application, system, network, and design- or business-related constraints.

- The term "high band" is intended to designate a higher frequency range. Although the frequency ranges 6.4 kHz to 14 kHz and 8 kHz to 16 kHz are given in the present disclosure as examples of "high band", the frequency boundaries of the high-band range can be modified/adapted to the bit rate of the codec and/or to meet specific objectives, such as compliance with application, system, network, and design- or business-related constraints.

Background

In many conversational applications, one person frequently speaks over another. As mentioned above, this situation is usually referred to as "cross-talk". Cross-talk speech segments can be problematic in modern speech coding/decoding systems. Since conventional speech coding techniques are mainly designed and optimized for single-talk content (only one person speaking), the quality of cross-talk speech may be severely affected by the coding/decoding operations. As an example, one of the most serious problems when coding/decoding cross-talk speech with the 3GPP EVS codec (Reference [1], of which the full content is incorporated herein by reference) is the occasional presence of "clicking noise". Clicking noise is a strong, annoying sound produced at frequencies from 8 kHz to 14 kHz (i.e., within the example high-band frequency range defined above).

At low bit rates of the 3GPP EVS codec, the high-band frequency content is encoded/decoded using the super-wideband time-domain bandwidth extension (SWB TBE) tool described in Reference [1]. Because of the limited number of bits available to the SWB TBE tool, the high-band excitation signal in the high-band frequency range is not encoded directly. Instead, the low-band excitation signal in the low-band frequency range is computed using the ACELP (Algebraic Code-Excited Linear Prediction) encoder (Reference [2], of which the full content is incorporated herein by reference), then upsampled and extended to 14 kHz or 16 kHz depending on the high-band frequency range, and used as a substitute for the high-band excitation signal. If the low-band excitation signal and the high-band excitation signal do not match, the synthesized sound may differ from the original sound. When the low-band excitation signal is voiced but the high-band excitation signal is unvoiced, the synthesized sound is perceived as the clicking noise defined above. The problem of clicking noise in cross-talk content is illustrated in the spectrum of Figure 1.

Figure 1 shows the power spectrum P versus frequency f of an example cross-talk sound in which two speakers (talkers) produce different types of sound. While the sound from the first speaker (speaker 1) consists mainly of voiced content, the sound from the second speaker (speaker 2) contains unvoiced segments. Assuming a mono capture device (such as a smartphone or an omnidirectional microphone), the sounds from the two speakers are mixed together within the capture device. As a result, the spectral content of the input sound signal, as seen by the encoder, resembles a superset of the two spectra. A similar situation arises with multi-channel capture devices such as stereo or surround microphones: if the encoder includes a downmixing module, the resulting mono input signal may contain different types of sound that are clearly distinguishable in the spectral domain.

Summary

The present disclosure relates to the following aspects:

- A method for time-domain bandwidth extension of an excitation signal during decoding of a cross-talk sound signal, comprising: decoding a high-band mixing factor received in a bitstream, and mixing a low-band excitation signal and a random noise excitation signal using the high-band mixing factor to produce a time-domain bandwidth-extended excitation signal.

- A method for time-domain bandwidth extension of an excitation signal during encoding of a cross-talk sound signal, comprising: (a) calculating a high-band residual signal using the sound signal and (b) calculating a time envelope of the high-band residual signal; and calculating a high-band voicing factor based on the time envelope of the high-band residual signal.

- A method for time-domain bandwidth extension of an excitation signal during encoding of a cross-talk sound signal, comprising: calculating a high-band mixing factor usable for mixing a low-band excitation signal and a random noise excitation signal to produce a time-domain bandwidth-extended excitation signal.

- A method for time-domain bandwidth extension of an excitation signal during encoding of a cross-talk sound signal, comprising: (a) calculating a high-band residual signal using the sound signal and (b) calculating a time envelope of the high-band residual signal; calculating a high-band voicing factor based on the time envelope of the high-band residual signal; calculating a high-band mixing factor usable for mixing a low-band excitation signal and a random noise excitation signal to produce a time-domain bandwidth-extended excitation signal; and estimating gain/shape parameters using the high-band voicing factor.

- A device for time-domain bandwidth extension of an excitation signal during decoding of a cross-talk sound signal, comprising: a decoder of a high-band mixing factor received in a bitstream, and a mixer of a low-band excitation signal and a random noise excitation signal using the high-band mixing factor to produce a time-domain bandwidth-extended excitation signal.

- A device for time-domain bandwidth extension of an excitation signal during encoding of a cross-talk sound signal, comprising: a calculator of (a) a high-band residual signal using the sound signal and (b) a time envelope of the high-band residual signal; and a calculator of a high-band voicing factor based on the time envelope of the high-band residual signal.

- A device for time-domain bandwidth extension of an excitation signal during encoding of a cross-talk sound signal, comprising: a calculator of a high-band mixing factor usable for mixing a low-band excitation signal and a random noise excitation signal to produce a time-domain bandwidth-extended excitation signal.

- A device for time-domain bandwidth extension of an excitation signal during encoding of a cross-talk sound signal, comprising: a calculator of (a) a high-band residual signal using the sound signal and (b) a time envelope of the high-band residual signal; a calculator of a high-band voicing factor based on the time envelope of the high-band residual signal; a calculator of a high-band mixing factor usable for mixing a low-band excitation signal and a random noise excitation signal to produce a time-domain bandwidth-extended excitation signal; and an estimator of gain/shape parameters using the high-band voicing factor.

The foregoing and other objects, advantages and features of the method and device for time-domain bandwidth extension of an excitation signal during encoding/decoding of a cross-talk sound signal will become more apparent upon reading the following non-restrictive description of illustrative embodiments, given by way of example only with reference to the appended drawings.

Brief Description of the Drawings

In the appended drawings:

Figure 1 is a graph of the power spectrum P (dB) versus frequency f (kHz) of an example cross-talk sound in which two speakers (speaker 1 and speaker 2) produce different types of sound (voiced and unvoiced);

Figure 2 is a schematic block diagram showing concurrently the calculation/calculator of a high-band voicing factor in a method and device for time-domain bandwidth extension of an excitation signal during encoding of a cross-talk sound signal;

Figure 3 is a graph showing how the time envelope of the high-band residual signal is determined;

Figure 4 is a graph showing interpolation of segment normalization factors calculated using the mean values of consecutive segments of the downsampled time envelope of the high-band residual signal;

Figure 5 is a schematic block diagram showing concurrently, at the decoder, the calculation/calculator of a time-domain bandwidth-extended excitation signal in the method and device for time-domain bandwidth extension of an excitation signal;

Figure 6 is a schematic block diagram showing concurrently, at the encoder, the calculation/calculator of a high-band mixing factor, formed/represented by a quantized normalized gain, in the method and device for time-domain bandwidth extension of an excitation signal;

Figure 7 is a schematic block diagram of the gain/shape estimation/estimator in the method and device for time-domain bandwidth extension of an excitation signal;

Figure 8 is a graph showing interpolation of subframe gains; and

Figure 9 is a simplified block diagram of an example configuration of hardware components forming the method and device for time-domain bandwidth extension of an excitation signal during encoding/decoding of a cross-talk sound signal.

Detailed Description

The following description relates to a technique for encoding/decoding cross-talk sound signals. In the present disclosure, this encoding/decoding technique is based on the SWB TBE tool of the 3GPP EVS codec described in Reference [1]. However, the technique can also be used in combination with other encoding/decoding techniques.

More specifically, the present disclosure proposes a series of modifications to the SWB TBE tool. Their purpose is to improve the quality of synthesized cross-talk sound signals (such as cross-talk speech signals), in particular, but not exclusively, by eliminating the clicking noise defined above. The modifications relate to time-domain bandwidth extension of the excitation signal and are distributed over one or more of the following three areas:

- In the encoder, calculation of a high-band voicing factor using the time envelope of the high-band residual signal. In the SWB TBE tool, the high band corresponds to the SHB (super high band).

- In the encoder and the decoder, calculation of a high-band mixing factor for the high-band excitation signal.

- In the encoder and the decoder, improved estimation of the gain/shape parameters and of the frame gain.

The calculation of the high-band voicing factor according to the present disclosure uses a high-band autocorrelation function, which is computed, for example in a downsampled domain, from the time envelope of the high-band residual signal. In the encoder, the high-band voicing factor replaces the so-called voice factor that the SWB TBE tool derives from low-band voicing parameters.

The calculation of the high-band mixing factor according to the present disclosure replaces the corresponding method of the SWB TBE tool. The high-band mixing factor determines the proportions of the low-band excitation signal (for example from the ACELP core) and of the random noise (also called "white noise") excitation signal used to produce the time-domain bandwidth-extended excitation signal. In the disclosed implementation, the high-band mixing factor is calculated, for example in a downsampled domain, by minimizing the MSE (mean square error) between the time envelope of the random noise excitation signal and the time envelope of the low-band excitation signal. The high-band mixing factor can be quantized with an existing quantizer of the SWB TBE tool. Adding the quantized high-band mixing factor to the SWB TBE bitstream causes only a small increase in bit rate. The mixing operation is performed in both the encoder and the decoder. Other features of the mixing operation may include rescaling of the random noise excitation signal at the beginning of each frame and interpolation of the high-band mixing factor to ensure a smooth transition between the previous frame and the current frame.

The estimation of the gain/shape parameters according to the present disclosure comprises post-processing the gain/shape parameters (in the encoder) by adaptive smoothing of the unquantized gain/shape parameters, implemented as a weighting between the original and the interpolated gain/shape parameters. The gain/shape parameters can be quantized with an existing quantizer of the SWB TBE tool. Adaptive smoothing is applied twice: first to the unquantized gain/shape parameters (in the encoder), and then to the quantized gain/shape parameters (in both the encoder and the decoder). An adaptive attenuation is applied to the unquantized frame gain in the encoder. This adaptive attenuation is based on the MSE excess error, a by-product of the SHB voicing parameter calculation in the SWB TBE tool.

Figure 2 is a schematic block diagram showing concurrently the calculation/calculator of the high-band voicing factor in the method 200 and device 250 for time-domain bandwidth extension of an excitation signal during encoding of a cross-talk sound signal.

1. Low-band excitation signal

Referring to Figure 2, the input sound signal s_inp(n) of the 3GPP EVS codec is represented, for example, by relation (1):

$$ s_{\mathrm{inp}}(n), \quad n = 0, \ldots, N_{32k}-1 \tag{1} $$

where N_32k is the number of samples in a frame (frame length). In this particular, non-limiting example, the input sound signal s_inp(n) is sampled at a rate F_s = 32 kHz and a single frame is N_32k = 640 samples long, which corresponds to a time interval of 20 ms. In the field of sound signal coding, sound signals are processed in frames of given duration, each frame comprising a given number of subframes and a given number of consecutive sound signal samples; further information about such frames can be found, for example, in Reference [1].

Method 200 comprises a downsampling operation 201, and device 250 comprises a downsampler 251 to perform operation 201. Depending on the bit rate of the encoder, downsampler 251 downsamples the input sound signal s_inp(n) from 32 kHz to 12.8 kHz or 16 kHz. For example, in the 3GPP EVS codec, the input sound signal is downsampled to 12.8 kHz for all bit rates up to 24.4 kbps, and to 16 kHz otherwise. The resulting signal is a low-band signal 202, which is encoded by an ACELP encoder 253 in an ACELP encoding operation 203.
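The frame sizes quoted above follow from simple arithmetic on the 20 ms frame duration. The Python sketch below, with hypothetical helper names, only reproduces that arithmetic and the bit-rate rule stated for the 3GPP EVS codec (12.8 kHz up to 24.4 kbps, 16 kHz otherwise); it is an illustration, not code taken from the codec.

```python
FRAME_MS = 20  # frame duration in milliseconds (from the text)


def internal_sampling_rate_hz(bitrate_bps: int) -> int:
    """Low-band (core) sampling rate selected from the bit rate.

    Hypothetical helper reproducing the rule described above:
    12.8 kHz for bit rates up to 24.4 kbps, 16 kHz otherwise.
    """
    return 12_800 if bitrate_bps <= 24_400 else 16_000


def frame_length(sampling_rate_hz: int, frame_ms: int = FRAME_MS) -> int:
    """Number of samples in one frame at the given sampling rate."""
    return sampling_rate_hz * frame_ms // 1000


if __name__ == "__main__":
    print(frame_length(32_000))                             # input frame N_32k
    print(frame_length(internal_sampling_rate_hz(13_200)))  # 12.8 kHz core
    print(frame_length(internal_sampling_rate_hz(32_000)))  # 16 kHz core
```

Running the demo prints 640, 256 and 320. The last value matches the frame length N = 320 used for the 16 kHz signals in relations (2) and (3).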

Method 200 comprises the ACELP encoding operation 203, and device 250 comprises the ACELP encoder 253 of the 3GPP EVS codec to perform the ACELP encoding. ACELP encoder 253 generates two types of excitation signals, an adaptive codebook excitation signal 204 and a fixed codebook excitation signal 205, as described in Reference [1].

In method 200 and device 250, the SWB TBE tool of the 3GPP EVS codec performs a low-band excitation signal generation operation 207 and comprises a corresponding generator 257 of a low-band excitation signal 208. Generator 257 takes the two excitation signals 204 and 205 as inputs, mixes them together, and applies a non-linear transformation to produce a mixed signal with a flipped spectrum, which is further processed in the SWB TBE tool to obtain the low-band excitation signal 208 of Figure 2. Details about the generation of the low-band excitation signal can be found in Reference [1]; specifically, Section 5.2.6.1 describes SWB TBE encoding and Section 6.1.3.1 describes SWB TBE decoding.

As a non-limiting example, the low-band excitation signal 208 with flipped spectrum is sampled at 16 kHz and is represented by relation (2):

$$ l_{\mathrm{LB}}(n), \quad n = 0, \ldots, N-1 \tag{2} $$

where N = 320 is the frame length.

2. High-band target signal

Referring to Figure 2, the high-band target signal 210 is essentially an extraction of the input sound signal s_inp(n) containing the spectral components in the frequency range from 6.4 kHz to 14 kHz or from 8 kHz to 16 kHz, depending on the bit rate of the codec. Regardless of the codec bit rate, the high-band target signal 210 is always sampled at 16 kHz and its spectral content is flipped: the first frequency bin of the high-band target spectrum corresponds to the last frequency bin of the original (non-flipped) spectrum, and vice versa. In method 200 and device 250, the high-band target signal 210 can be generated, for example, by a QMF (Quadrature Mirror Filter) analysis operation 209 performed by the QMF analysis filter bank 259 of the 3GPP EVS codec, as described in Reference [1]. Alternatively, the high-band target signal 210 can be generated by filtering the input sound signal s_inp(n) with a band-pass filter, shifting it in the frequency domain, flipping its spectral content as described above, and finally downsampling it from 32 kHz to 16 kHz. In the present disclosure, QMF processing is assumed, and the high-band target signal 210 is represented, for example, by relation (3):

$$ s_{\mathrm{HB}}(n), \quad n = 0, \ldots, N-1 \tag{3} $$
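A common way to flip the spectrum of a real sampled signal is to multiply it by (-1)^n, which moves a component at frequency f to F_s/2 - f (a mirror image about F_s/4). The Python sketch below illustrates only this generic modulation trick, as it might be used in the alternative band-pass/shift/flip/downsample path; it is an assumption made for illustration and does not describe how the QMF analysis filter bank 259 of the 3GPP EVS codec is implemented. Function names are hypothetical.

```python
import math


def flip_spectrum(x):
    """Flip the spectrum of a real sequence by multiplying it by (-1)^n.

    At sampling rate fs, a component at frequency f moves to fs/2 - f,
    i.e. the band [0, fs/2] is mirrored about fs/4.
    """
    return [v if n % 2 == 0 else -v for n, v in enumerate(x)]


def tone(freq_hz, fs_hz, length):
    """Real cosine test tone of the given frequency."""
    return [math.cos(2.0 * math.pi * freq_hz * n / fs_hz) for n in range(length)]


if __name__ == "__main__":
    fs = 16_000
    flipped = flip_spectrum(tone(1_000, fs, 64))  # a 1 kHz tone ...
    expected = tone(fs / 2 - 1_000, fs, 64)       # ... becomes a 7 kHz tone
    print(max(abs(a - b) for a, b in zip(flipped, expected)))  # close to 0
```

The identity behind it is cos(2πfn/F_s)·cos(πn) = cos(2π(F_s/2 - f)n/F_s) for integer n, so the mapping is exact rather than approximate.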

Following the processing in QMF filter bank 259, method 200 comprises an operation 211 of estimating high-band filter coefficients 212, and device 250 comprises an estimator 261 to perform operation 211. For each frame, estimator 261 estimates high-band LP (Linear Prediction) filter coefficients 212 from the high-band target signal 210 in four consecutive subframes of 80 samples each. Estimator 261 computes the high-band LP filter coefficients 212 using the Levinson-Durbin algorithm as described in Reference [1]. The high-band LP filter coefficients 212 can be represented by relation (4):

$$ a_{\mathrm{HB}}^{j}(i), \quad i = 0, \ldots, P, \quad j = 0, \ldots, 3 \tag{4} $$

where P = 10 is the order of the high-band LP filter and j = 0, ..., 3 is the subframe index. The first LP filter coefficient in each subframe is unity, i.e.

$$ a_{\mathrm{HB}}^{j}(0) = 1, \quad j = 0, \ldots, 3 $$

Method 200 comprises an operation 213 of generating a high-band residual signal 214, and device 250 comprises a generator 263 of the high-band residual signal to perform operation 213. Generator 263 produces the high-band residual signal 214 by filtering the high-band target signal 210 from QMF analysis filter bank 259 through the high-band LP filter (LP filter coefficients 212) from estimator 261. The high-band residual signal 214 can be expressed, for example, by relation (5):

$$ r_{\mathrm{HB}}(n) = s_{\mathrm{HB}}(n) + \sum_{k=1}^{P} a_{\mathrm{HB}}^{j}(k)\, s_{\mathrm{HB}}(n-k), \quad n = 0, \ldots, N-1, \quad j = \lfloor n/80 \rfloor \tag{5} $$

The first P samples of the high-band residual signal 214 are calculated using the high-band target signal 210 from the previous frame. This is indicated by the negative indices s_HB(-k), k = 1, ..., P, in the summation term; negative indices refer to samples of the high-band target signal 210 at the end of the previous frame.
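The LP analysis and residual filtering described above can be sketched in plain Python as follows. The sketch assumes the analysis-filter convention A(z) = 1 + sum of a(k)·z^-k with a(0) = 1, computes the autocorrelation of an unwindowed segment (the windowing and lag conditioning used in the 3GPP EVS codec are not reproduced), and keeps the last P samples of the previous frame as filter memory. All function names are hypothetical.

```python
P = 10             # LP order (from the text)
SUBFRAME_LEN = 80  # four 80-sample subframes per 320-sample frame


def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of x (no analysis window applied)."""
    n = len(x)
    return [sum(x[i] * x[i - lag] for i in range(lag, n)) for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Levinson-Durbin recursion.

    Returns (a, err) where a[0] = 1 and A(z) = 1 + sum_{k>=1} a[k] z^-k is the
    prediction-error (analysis) filter; err is the final prediction error.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate input: stop the recursion
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def hb_residual(s_hb, coeffs, memory, subframe_len=SUBFRAME_LEN):
    """r(n) = sum_{k=0}^{P} a_j[k] * s(n - k), with j the subframe of sample n.

    `coeffs` holds one coefficient list per subframe; `memory` holds at least
    the last P samples of the previous frame (the negative indices).
    """
    order = len(coeffs[0]) - 1
    if len(memory) < order:
        raise ValueError("memory must contain at least P previous samples")
    ext = list(memory)[len(memory) - order:] + list(s_hb)  # ext[n + order] == s(n)
    out = []
    for n in range(len(s_hb)):
        a = coeffs[n // subframe_len]
        out.append(sum(a[k] * ext[n + order - k] for k in range(order + 1)))
    return out
```

For a decaying sequence 0.9^n, a first-order analysis gives a(1) close to -0.9, and filtering with [1, -0.9] leaves a residual that is zero except for the first sample, which depends only on the (zero) memory.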

3. High-band autocorrelation function and voicing factor

Section 3 (High-band autocorrelation function and voicing factor) relates to features of the encoder.

The high-band residual signal 214, calculated by generator 263 using relation (5), is used to calculate the high-band autocorrelation function and the high-band voicing factor. The high-band autocorrelation function is not calculated directly on the high-band residual signal 214, for three reasons: a direct calculation would require substantial computational resources; the dynamics of the high-band residual signal 214 are usually low; and the spectral flipping process tends to blur the difference between voiced and unvoiced sound signals. To avoid these problems, the high-band autocorrelation function is estimated on the time envelope of the high-band residual signal 214, for example in a downsampled domain.

Method 200 comprises an operation 215 of calculating the time envelope of the high-band residual signal 214, and device 250 comprises a calculator 265 to perform operation 215. To calculate the time envelope R_TD(n) 216 of the high-band residual signal 214, calculator 265 processes the high-band residual signal 214 through a sliding moving-average (MA) filter which, in an example implementation, has M = 20 taps. The time envelope calculation can be expressed, for example, by relation (6):

where the negative-index samples r_HB(k), k = -M/2, ..., -1, are values of the high-band residual signal 214 from the previous frame. In mode-switching scenarios, the high-band residual signal 214 may not have been calculated in the previous frame, so these values are unknown. In that case, the first M/2 values r_HB(k), k = 0, ..., M/2-1, are copied and used as substitutes for the previous-frame values r_HB(k), k = -M/2, ..., -1. Calculator 265 approximates the last M values of the time envelope R_TD(n) 216 in the current frame by IIR (Infinite Impulse Response) filtering, using relation (7):

$$ R_{\mathrm{TD}}(n) = 0.05\, r_{\mathrm{HB}}(n) + 0.95\, R_{\mathrm{TD}}(n-1), \quad n = N-M, \ldots, N-1 \tag{7} $$

Operation 215 of calculating the time envelope R_TD(n) 216 of the high-band residual signal 214 is illustrated in Figure 3.
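The time-envelope computation can be sketched as below. Because relation (6) is not spelled out in the text above, the sketch makes explicit assumptions, repeated in the code: the MA filter averages the magnitudes |r_HB(n)| over the centred window k = -M/2, ..., M/2-1 and is used for n = 0, ..., N-M-1, while the last M values follow the IIR recursion of relation (7), also applied to magnitudes. The function name and the `prev_tail` argument are hypothetical.

```python
M = 20  # number of MA filter taps (from the text)


def time_envelope(r_hb, prev_tail=None, m=M):
    """Sketch of the time envelope R_TD(n) of the high-band residual r_HB(n).

    ASSUMPTIONS (not confirmed by the text):
      * magnitudes |r_HB| are used in both the MA part and the IIR part;
      * the MA window is centred, k = -M/2 .. M/2-1, for n = 0 .. N-M-1;
      * the last M values follow relation (7).
    `prev_tail` holds the last M/2 residual samples of the previous frame.
    Assumes len(r_hb) > m.
    """
    n_len = len(r_hb)
    half = m // 2
    if prev_tail is None:          # e.g. after a mode switch:
        prev_tail = r_hb[:half]    # reuse the first M/2 values as memory
    ext = list(prev_tail)[-half:] + list(r_hb)  # ext[i + half] == r_hb[i]
    env = [0.0] * n_len
    for n in range(n_len - m):     # sliding moving-average part
        window = ext[n:n + m]      # r_HB(n - M/2) .. r_HB(n + M/2 - 1)
        env[n] = sum(abs(v) for v in window) / m
    for n in range(n_len - m, n_len):  # IIR approximation of the tail
        env[n] = 0.05 * abs(r_hb[n]) + 0.95 * env[n - 1]
    return env
```

A constant-magnitude input, such as a residual alternating between +1 and -1, yields a flat envelope equal to 1 in both the MA part and the IIR tail, which is a quick sanity check that the two parts join smoothly.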

Method 200 comprises a time-envelope downsampling operation 217, and device 250 comprises a downsampler 267 to perform operation 217. Downsampler 267 downsamples the time envelope R_TD(n) 216 by a factor of 4 using, for example, relation (8):

$$ R_{4\mathrm{kHz}}(n) = R_{\mathrm{TD}}(4n), \quad n = 0, \ldots, N/4-1 \tag{8} $$

Method 200 comprises a mean value calculation operation 219, and device 250 comprises a calculator 269 to perform operation 219. Calculator 269 divides the downsampled time envelope R_4kHz(n) 218 into four consecutive segments and calculates the mean value 220 of the downsampled time envelope R_4kHz(n) 218 in each segment using, for example, relation (9):

$$ \bar{R}_k = \frac{16}{N} \sum_{n=kN/16}^{(k+1)N/16-1} R_{4\mathrm{kHz}}(n), \quad k = 0, \ldots, 3 \tag{9} $$

where k is the segment index.

Calculator 269 limits all mean values to a maximum of 1.0.
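Relations (8) and (9), together with the 1.0 cap, amount to decimating the envelope and averaging four equal segments. A minimal Python sketch with hypothetical function names, assuming N = 320 so that the downsampled envelope has 80 samples and each segment has 20:

```python
def downsample_envelope(env_td, factor=4):
    """Relation (8): R_4kHz(n) = R_TD(4n), i.e. keep every 4th sample."""
    return list(env_td[::factor])


def segment_means(env_ds, n_segments=4, cap=1.0):
    """Mean of each consecutive segment (relation (9)), capped at 1.0."""
    seg_len = len(env_ds) // n_segments
    means = []
    for k in range(n_segments):
        segment = env_ds[k * seg_len:(k + 1) * seg_len]
        means.append(min(cap, sum(segment) / seg_len))
    return means
```

For an envelope whose four quarters are 0.1, 0.2, 0.3 and 2.0, the segment means are 0.1, 0.2, 0.3 and 1.0, the last one being capped.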

Method 200 comprises a normalization factor calculation operation 221, and device 250 comprises a calculator 271 to perform operation 221. Calculator 271 uses the downsampled time envelope mean values 220 to calculate a segment normalization factor for each segment k using, for example, relation (10):

Calculator 271 then linearly interpolates the segment normalization factors from relation (10) over the entire interval of the current frame to produce interpolated normalization factors 222 using, for example, relation (11):

This interpolation process, performed in operation 221 by calculator 271, is illustrated in Figure 4.

In relation (11), the term η_-1 refers to the last segment normalization factor of the previous frame. Therefore, η_-1 is updated with η_3 after the interpolation process in each frame.

Method 200 includes a normalization operation 223 of the downsampled temporal envelope, and device 250 includes a normalizer 273 for performing operation 223. The normalizer 273 processes the downsampled temporal envelope R_4kHz(n) 218 from the downsampler 267 with the interpolated normalization factor γ(n) 222 using, for example, the following relation (12):

$$R_{\gamma}(n) = R_{4\mathrm{kHz}}(n)\cdot\gamma(n), \qquad n = 0, \ldots, N/4-1 \qquad (12)$$

The normalizer 273 then subtracts the global average of the normalized envelope from the values R_γ(n) of relation (12) to complete, in operation 223, the normalization of the downsampled temporal envelope (R_norm(n) 224 in Figure 2). This can be expressed by relation (13):
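A sketch of relations (12) and (13) as described in the text, assuming the global average is taken over all N/4 samples of R_γ(n) and that the interpolated factors γ(n) of relation (11) are supplied by the caller (relations (10), (11) and (13) are not reproduced here). The function name is hypothetical.

```python
def normalize_envelope(r_4k, gamma):
    """Relation (12), then removal of the global mean as described for (13)."""
    r_gamma = [r * g for r, g in zip(r_4k, gamma)]   # relation (12)
    mean = sum(r_gamma) / len(r_gamma)                # global average of R_gamma(n)
    return [x - mean for x in r_gamma]                # R_norm(n), zero-mean over the frame
```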

It is useful to estimate the tilt of the temporal envelope of the high-band residual signal. To this end, method 200 includes a temporal envelope tilt estimation operation 225, and device 250 includes an estimator 275 for performing operation 225. The tilt can be estimated by fitting a linear curve to the segment averages calculated in relation (9) using the linear least squares (LLS) method. The tilt 226 of the temporal envelope is then the slope of this linear curve. The linear curve calculated using the LLS method is defined as:

According to the LLS method, the goal is to minimize, over all k = 0, ..., 3, the sum of squared differences between the segment averages and the corresponding points on the linear curve. This can be expressed using the following relation (15):

The optimal slope a_LLS (tilt 226) can be calculated by the estimator 275 using relation (16):
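For four segments, the ordinary least-squares slope reduces to a fixed weighting of the segment averages. The worked form below writes the segment averages as m_k and their mean as m̄ (a notation assumed here, since the symbols are not reproduced in this text) and assumes relation (16) is the standard LLS slope estimate:

$$a_{LLS} = \frac{\sum_{k=0}^{3}(k-\bar{k})\,(m_k - \bar{m})}{\sum_{k=0}^{3}(k-\bar{k})^2}, \qquad \bar{k} = 1.5, \qquad \sum_{k=0}^{3}(k-\bar{k})^2 = 5$$

$$a_{LLS} = \frac{-1.5\,m_0 - 0.5\,m_1 + 0.5\,m_2 + 1.5\,m_3}{5} = \frac{-3m_0 - m_1 + m_2 + 3m_3}{10}$$

The mean term drops out because the deviations k - 1.5 sum to zero. A positive a_LLS indicates an envelope that rises over the frame.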

Method 200 includes a high-band autocorrelation function calculation operation 227, and device 250 includes a calculator 277 for performing operation 227. The calculator 277 calculates the high-band autocorrelation function X_corr 228 from the normalized temporal envelope using, for example, relation (17):

where E_f is the energy of the normalized temporal envelope R_norm(n) 224 in the current frame, and the corresponding previous-frame term is the energy of the normalized temporal envelope R_norm(n) 224 in the previous frame. The calculator 277 can calculate the energy using the following relation (18):

In the case of mode switching, the factor preceding the summation in relation (17) is set to 1/E_f, because the energy of the normalized temporal envelope R_norm(n) 224 in the previous frame is unknown.

Method 200 includes a high-band voicing factor calculation operation 229, and device 250 includes a calculator 279 for performing operation 229.

The voicing of the high-band residual signal is closely related to the variance σ_corr of the high-band autocorrelation function X_corr 228. The calculator 279 calculates the variance σ_corr using, for example, the following relation (19):

To improve the discriminative power of the voicing parameter v_mult (voiced/unvoiced decision), the calculator 279 multiplies the variance σ_corr by the maximum value of the high-band autocorrelation function X_corr 228, as expressed by the following relation (20):

The calculator 279 then transforms the voicing parameter v_mult from relation (20) with a sigmoid function to limit its dynamic range, using, for example, the following relation (21), and obtains the high-band voicing factor v_HB 230:

where the factor β is determined experimentally and set, for example, to the constant value 25.0. The high-band voicing factor v_HB 230 calculated from relation (21) is then limited to the range [0.0, 1.0] and sent to the decoder.
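The variance-times-maximum step of relations (19) and (20), and the final limiting of v_HB, can be sketched as below. Using the population variance over all lags of X_corr is an assumption, and the sigmoid of relation (21) with β = 25.0 is deliberately left out because its exact form is not reproduced in this text; function names are hypothetical.

```python
import statistics


def voicing_parameter(x_corr):
    """v_mult as described for relations (19)-(20): variance of X_corr times its maximum.

    Population variance is an assumption; relation (19) is not reproduced here.
    """
    return statistics.pvariance(x_corr) * max(x_corr)


def clamp(value, lo=0.0, hi=1.0):
    """Limit a value (e.g. v_HB after the sigmoid of relation (21)) to [lo, hi]."""
    return max(lo, min(hi, value))
```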

4. Excitation Mixing Factor

Figure 5 is a schematic block diagram showing, at the decoder, both the calculation (method 200) and the calculator (device 250) of the time-domain bandwidth-extended excitation signal.

Section 4 (Excitation Mixing Factor) concerns features of both the encoder and the decoder.

The SWB TBE tool of the 3GPP EVS codec uses the low-band excitation signal 208 (Figure 2) described in Section 1 (Low-Band Excitation Signal) to predict the high-band residual signal 214 (Figure 2) described in Section 2 (High-Band Target Signal). At the lower bit rates of the EVS codec (below 24.4 kbps), the SWB TBE tool uses 19 bits to encode the spectral envelope and energy of the predicted high-band residual signal; with a 20 ms frame, this corresponds to 0.95 kbps. At bit rates above 24.4 kbps, it uses 32 bits per frame for the same purpose, which corresponds to 1.6 kbps. At both SWB TBE bit rates (0.95 kbps and 1.6 kbps), no bits are spent on encoding the high-band residual signal 214 or the high-band target signal 210 themselves.
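The bit rates quoted above follow directly from the per-frame bit budgets. A one-line check, assuming the bits are sent once per 20 ms frame (the function name is hypothetical):

```python
FRAME_MS = 20  # EVS frame length in milliseconds


def side_info_kbps(bits_per_frame, frame_ms=FRAME_MS):
    """Bit rate of a fixed per-frame payload: bits per millisecond equals kbit/s."""
    return bits_per_frame / frame_ms


print(side_info_kbps(19))  # 0.95
print(side_info_kbps(32))  # 1.6
```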

Referring to Figure 5, method 200 includes a pseudo-random noise generation operation 501, and device 250 includes a pseudo-random noise generator 551 for performing operation 501.

The pseudo-random noise generator 551 produces a uniformly distributed random noise excitation signal 502. For example, the pseudo-random number generator of the 3GPP EVS codec described in reference [1] can be used as the pseudo-random noise generator 551. The random noise excitation signal w_rand 502 can be represented by the following relation (22):

$$w_{rand}(n) \in U[-32767;\, 32768], \qquad n = 0, \ldots, N-1 \qquad (22)$$

The random noise excitation signal w_rand 502 has zero mean and a non-zero variance σ_rand = 1.14e+11. Note that this variance is only an approximation and represents an average over 100 frames.
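The quoted σ_rand can be cross-checked numerically. A discrete uniform variable over the 65536 integers in [-32767, 32768] has a per-sample variance of (65536² - 1)/12 ≈ 3.58e+08; summed over a frame of N = 320 samples (20 ms at 16 kHz, an assumption here), this gives ≈ 1.145e+11, which suggests the stated value is a per-frame sum of squares. The sketch below draws 100 frames with Python's `random` module, standing in for the EVS generator of reference [1], and averages their energies; function names are hypothetical.

```python
import random

N = 320  # assumed frame length: 20 ms at 16 kHz


def noise_frame(rng, n=N):
    """One frame of uniformly distributed integers in [-32767, 32768] (relation (22))."""
    return [rng.randint(-32767, 32768) for _ in range(n)]


def frame_energy(x):
    """Sum of squares over the frame."""
    return sum(v * v for v in x)


rng = random.Random(1234)  # fixed seed for reproducibility
avg_energy = sum(frame_energy(noise_frame(rng)) for _ in range(100)) / 100
```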

Method 200 includes an operation 503 of calculating the power of the low-band excitation signal l_LB(n) 208, and device 250 includes a power calculator 553 for performing operation 503.

The power calculator 553 calculates the power 504 of the low-band excitation signal l_LB(n) 208 sent from the encoder using, for example, the following relation (23):

Method 200 includes an operation 505 of normalizing the power of the random noise excitation signal 502, and device 250 includes a power normalizer 555 for performing operation 505.

The power normalizer 555 normalizes the power of the random noise excitation signal 502 to the power 504 of the low-band excitation signal 208 using, for example, the following relation (24):

Although the true variance of the random noise excitation signal 502 varies from frame to frame, power normalization does not require an exact value. The approximate variance defined above is therefore used in relation (24) to save computational resources.
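A sketch of the power calculation and normalization, assuming relation (23) is a frame energy (sum of squares) and relation (24) scales w_rand(n) by sqrt(E_LB / σ_rand), with the fixed approximation σ_rand used in place of the true per-frame noise energy as the text describes. Neither relation is reproduced in this text, so both forms and the function names are hypothetical.

```python
import math

SIGMA_RAND = 1.14e11  # approximate noise "variance" quoted in the text


def lowband_power(l_lb):
    """Assumed form of relation (23): sum of squares of l_LB(n) over the frame."""
    return sum(x * x for x in l_lb)


def normalize_noise(w_rand, e_lb, sigma_rand=SIGMA_RAND):
    """Assumed form of relation (24): scale the noise so its energy is about e_lb,
    using the fixed approximation sigma_rand instead of the measured energy."""
    scale = math.sqrt(e_lb / sigma_rand)
    return [scale * w for w in w_rand]
```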

Method 200 includes an operation 507 of mixing the low-band excitation signal l_LB(n) 208 with the power-normalized random noise excitation signal w_white(n) 506, and device 250 includes a mixer 557 for performing operation 507.

The mixer 557 produces the time-domain bandwidth-extended excitation signal 508 by mixing the low-band excitation signal l_LB(n) 208 with the power-normalized random noise excitation signal w_white(n) 506 using a high-band mixing factor described later in this disclosure.

Figure 6 is a schematic block diagram showing, at the encoder, both the calculation (method) and the calculator (device) of the high-band mixing factor, formed/represented by a quantized normalized gain, in the method and device for time-domain bandwidth extension of an excitation signal.

Referring to Figure 6, at the encoder:

- method 200 includes an operation 602 of calculating the temporal envelope of the power-normalized random noise excitation signal w_white(n) 506, an operation 604 of calculating the temporal envelope of the low-band excitation signal l_LB(n) 208, a mean squared error (MSE) minimization operation 601, and a gain quantization operation 607;

and

- device 250 includes a temporal envelope calculator 652 for performing operation 602, a temporal envelope calculator 654 for performing operation 604, an MSE minimizer 651 for performing operation 601, and a gain quantizer 657 for performing operation 607.

As shown in Figure 6, to save computational resources, the optimal gain pair is calculated by a mean squared error (MSE) minimization process based on the temporal envelopes of the signals in the downsampled domain. Another advantage of this approach is its higher robustness to background noise.

In the same way as the temporal envelope of the high-band residual signal 214 is calculated (operation 215 and calculator 265 in Figure 2) and downsampled (operation 217 and downsampler 267 in Figure 2), the calculator 652 uses the same algorithm described in Section 3 (High-Band Autocorrelation Function and Voicing Factor) to calculate the downsampled temporal envelope W_4kHz(n) 606 of the power-normalized random noise excitation signal w_white(n) 506 (which is also calculated at the encoder, as shown in Figure 5 and the corresponding description). The downsampling factor used is, for example, 4. The downsampled temporal envelope of the power-normalized random noise excitation signal can be expressed by the following relation (25):

$$W_{4\mathrm{kHz}}(n), \qquad n = 0, \ldots, N/4-1 \qquad (25)$$

Similarly, the calculator 654 again uses the same algorithm described in Section 3 (High-Band Autocorrelation Function and Voicing Factor) to calculate the temporal envelope L_4kHz(n) 605 of the low-band excitation signal l_LB(n) 208, downsampled to 4 kHz. The downsampled temporal envelope 605 of the low-band excitation signal l_LB(n) 208 can be expressed as follows:

$$L_{4\mathrm{kHz}}(n), \qquad n = 0, \ldots, N/4-1 \qquad (26)$$

The purpose of the MSE minimization operation 601 is to find the optimal gain pair that minimizes the energy of the error between (a) the combination of the temporal envelopes L_4kHz(n) and W_4kHz(n) and (b) the temporal envelope R_4kHz(n) of the high-band residual signal r_HB(n) 214. Mathematically, this can be expressed by relation (27):

To this end, the MSE minimizer 651 solves a system of linear equations, whose solution can be found in the scientific literature. For example, the optimal gain pair can be calculated using relation (28):

where the values c_0, ..., c_5 are given by:

The MSE minimizer 651 then calculates the minimum MSE error energy (excess error) using, for example, the following relation (30):
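The two-gain MSE fit is an ordinary linear least-squares problem. The sketch below assumes relation (27) is the error energy between R_4kHz(n) and g_l·L_4kHz(n) + g_w·W_4kHz(n) and solves the resulting 2×2 normal equations by Cramer's rule. The correlations it computes play the role of the coefficients c_0, ..., c_5, but their exact mapping, and any normalization applied in relations (28) to (30), are not reproduced here; names are hypothetical.

```python
def optimal_gain_pair(r, l, w):
    """Minimize sum((r - g_l*l - g_w*w)**2) via the 2x2 normal equations.

    Returns (g_l, g_w, err_energy). Assumes l and w are not collinear (det != 0).
    """
    ll = sum(a * a for a in l)
    ww = sum(b * b for b in w)
    lw = sum(a * b for a, b in zip(l, w))
    rl = sum(c * a for c, a in zip(r, l))
    rw = sum(c * b for c, b in zip(r, w))
    det = ll * ww - lw * lw
    g_l = (rl * ww - rw * lw) / det   # Cramer's rule, first unknown
    g_w = (rw * ll - rl * lw) / det   # Cramer's rule, second unknown
    err = sum((c - g_l * a - g_w * b) ** 2 for c, a, b in zip(r, l, w))
    return g_l, g_w, err
```

The returned error energy is the plain residual of the fit; relation (30) may scale it differently before it is compared with the threshold of 5.0 in relation (51).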

For further processing, the gain quantizer 657 scales the optimal gain pair so that the gain g_ln associated with the temporal envelope L_4kHz(n) 605 of the low-band excitation signal l_LB(n) becomes unity; the gain g_wn associated with the temporal envelope W_4kHz(n) 606 of the power-normalized random noise excitation signal w_white(n) 506 thereby becomes a gain normalized relative to g_ln, as given, for example, by the following relation (31):

The result, and advantage, of the rescaling in relation (31) is that only one parameter, the normalized gain g_wn, instead of two, needs to be coded and sent from the encoder to the decoder in the bitstream. Scaling the gains using relation (31) therefore reduces bit consumption and simplifies the quantization process. On the other hand, the energy of the combined temporal envelope (L_4kHz(n) and W_4kHz(n)) will no longer match the energy of the temporal envelope R_4kHz(n) of the high-band residual signal 214. This is not a problem, because the SWB TBE tool uses a global gain and subframe gains that carry information about the energy of the high-band residual signal. The calculation of the subframe gains and the global gain is described in Section 6 (Gain/Shape Estimation) of this disclosure.

The gain quantizer 657 limits the normalized gain g_wn to between a minimum threshold of 0.0 and a maximum threshold of 1.0. The gain quantizer 657 quantizes the normalized gain g_wn using a 3-bit uniform scalar quantizer described, for example, by the following relation (32):

The resulting index idx_g 610 is limited to the interval [0, 7], forms/represents the high-band mixing factor, and is transmitted in the SWB TBE bitstream together with the existing SWB TBE encoder indices at 0.95 kbps or 1.6 kbps.
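The rescaling of relation (31) and a 3-bit uniform quantizer can be sketched as follows. The rounding rule idx = round(7·g_wn) and the matching decoder f_mix = idx/7 are assumptions about relations (32) and (33), which are not reproduced in this text; function names are hypothetical, and g_l is assumed positive.

```python
def normalized_gain(g_l, g_w):
    """Rescale the gain pair so the low-band gain becomes 1 (relation (31) as described)."""
    return g_w / g_l


def quantize_gain(g_wn, bits=3):
    """Hypothetical uniform quantizer over [0, 1] with 2**bits levels."""
    levels = (1 << bits) - 1          # 7 for 3 bits
    g = max(0.0, min(1.0, g_wn))      # limit to [0.0, 1.0]
    idx = int(g * levels + 0.5)       # round half up to the nearest level
    return min(max(idx, 0), levels)   # index in [0, 7]


def dequantize_gain(idx, bits=3):
    """Matching hypothetical decoder: f_mix = idx / 7."""
    return idx / ((1 << bits) - 1)
```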

Referring back to Figure 5, method 200 includes a mixing factor decoding operation 509 at the decoder, and device 250 includes a mixing factor decoder 559 for performing operation 509.

The mixing factor decoder 559 produces the decoded gain from the received index idx_g 610 using, for example, the following relation (33):

The decoded gain from relation (33) forms the high-band mixing factor f_mix 510.

The low-band excitation signal l_LB(n) 208, sampled for example at 16 kHz, and the normalized random noise excitation signal w_white(n) 506, also sampled for example at 16 kHz, are mixed together in the mixer 557. However, the energies of both the low-band excitation signal l_LB(n) 208 and the random noise excitation signal w_rand 502 vary from frame to frame. If l_LB(n) 208 and w_rand 502 were mixed directly using the high-band mixing factor f_mix 510 obtained from relation (33), the energy fluctuations could produce audible artifacts at frame boundaries. To ensure a smooth transition, the energy of the random noise excitation signal w_rand 502 is linearly interpolated between the previous frame and the current frame in the generator 551. This is done by scaling w_rand 502 in the first half of the current frame with the following interpolation factor:

where E_LB is the energy of the low-band excitation signal l_LB(n) 208 in the current frame, and the second energy term is the energy of the low-band excitation signal l_LB(n) 208 in the previous frame.

To further smooth the transition between the previous and current frames, the decoder 559 also linearly interpolates the high-band mixing factor f_mix 510. This is done by introducing a scaling factor β_mix(n) calculated, for example, using the following relation:

where the previous-frame term denotes the value of the high-band mixing factor in the previous frame. Note that the interpolation factor ζ_w(n) calculated in relation (34) and the scaling factor β_mix(n) calculated in relation (35) are defined for n = 0, ..., N/2-1.

The mixer 557 finally mixes the low-band excitation signal l_LB(n) 208 and the random noise excitation signal w_white(n) 506 using, for example, relation (36), to obtain the time-domain bandwidth-extended excitation signal u(n) 508.
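A sketch of the mixer with the mixing factor ramped over the first half of the frame. The additive form u(n) = l_LB(n) + f(n)·w_white(n), which mirrors the unit low-band gain of relation (31), and the linear ramp from the previous-frame factor to f_mix over n = 0, ..., N/2-1 are assumptions standing in for relations (35) and (36); the noise-energy interpolation factor ζ_w(n) of relation (34) is omitted. Names are hypothetical.

```python
def mix_excitation(l_lb, w_white, f_mix, f_mix_prev):
    """Hypothetical mixer: u(n) = l_lb(n) + f(n) * w_white(n).

    f(n) ramps linearly from f_mix_prev toward f_mix over the first half of
    the frame (n = 0..N/2-1) and equals f_mix for the second half.
    """
    n_total = len(l_lb)
    half = n_total // 2
    u = []
    for n in range(n_total):
        if n < half:
            f = f_mix_prev + (f_mix - f_mix_prev) * n / half   # interpolated factor
        else:
            f = f_mix
        u.append(l_lb[n] + f * w_white[n])
    return u
```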

5. High-Band Synthesis (LP Synthesis)

In the encoder of the SWB TBE tool, the high-band LP filter coefficients 212, calculated by LP analysis of the high-band input signal s_HB(n) in relation (4), are converted into LSF parameters and quantized. At 0.95 kbps, the SWB TBE encoder uses 8 bits to quantize the LSF indices; at 1.6 kbps, it uses 21 bits.

Referring back to Figure 5, at the encoder:

- method 200 includes a decoding operation 511, and device 250 includes a corresponding decoder 561 for decoding the quantized LSF indices; and

- method 200 includes a conversion operation 513, and device 250 includes a corresponding converter 563 for converting the decoded LSF indices 512 into high-band LP filter coefficients 514.

The decoded high-band LP filter coefficients 514 can be expressed as:

where P = 10 is the order of the LP filter. The first decoded LP filter coefficient in each subframe is unity, i.e.:

Method 200 includes a filtering operation 515, and device 250 includes a corresponding synthesis filter 565, which uses the decoded high-band LP filter coefficients 514 to filter the mixed time-domain bandwidth-extended excitation signal 508 of relation (36) using, for example, the following relation (38), to obtain the LP-filtered high-band signal y_HB 516:
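Relation (38) is not reproduced in this text. Assuming it is the usual all-pole synthesis filter 1/A(z) with A(z) = 1 + Σ a_i z^-i (i = 1, ..., P) and a_0 = 1, the filtering step can be sketched as below; filter memory is carried across calls so that consecutive subframes join without discontinuity. Names are hypothetical.

```python
def lp_synthesis(u, a, memory=None):
    """All-pole LP synthesis filter 1/A(z), assuming A(z) = 1 + sum_i a[i] z^-i.

    y(n) = u(n) - sum_{i=1..P} a[i] * y(n - i), with a[0] == 1.
    `memory` holds the last P outputs of the previous call (oldest first).
    Returns the filtered signal and the updated memory.
    """
    if a[0] != 1.0:
        raise ValueError("a[0] must be 1")
    p = len(a) - 1
    hist = list(memory) if memory is not None else [0.0] * p
    y = []
    for x in u:
        acc = x
        for i in range(1, p + 1):
            acc -= a[i] * hist[-i]   # hist[-i] is y(n - i)
        y.append(acc)
        hist.append(acc)
    return y, (hist[-p:] if p else [])
```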

6. Gain/Shape Estimation (Figure 7)

Gain/shape parameter smoothing is applied at both the encoder and the decoder. Adaptive attenuation of the frame gain is applied only at the encoder.

The spectral shape of the high-band target signal s_HB(n) 210 is encoded using the quantized LSF coefficients. Referring to Figure 7, the SWB TBE tool also includes an estimation operation 701 and a corresponding estimator 751 for estimating the temporal subframe gains 702 of the high-band target signal s_HB(n) 210, as described in reference [1]. The estimator 751 normalizes the estimated temporal subframe gains to unit energy.

The normalized estimated temporal subframe gains 702 from the estimator 751 can be expressed using relation (39):

$$g_k, \qquad k = 0, \ldots, 3 \qquad (39)$$

Method 200 includes a calculation operation 703, and device 250 includes a corresponding calculator 753 for determining the temporal tilt 704 of the normalized estimated temporal subframe gains g_k 702 by linear least squares (LLS) interpolation. As shown in Figure 8, this interpolation can be performed by fitting a linear curve 801 to the true subframe gains 702 in four consecutive subframes (subframes 0 to 3 in Figure 8) and calculating its slope.

The linear curve 801 constructed using the LLS interpolation method can be defined using the following relation (40):

where the parameters c_LLS and d_LLS are found by minimizing the sum of squared differences between the true subframe gains g_k 702 and the corresponding points on the linear curve for all subframes k = 0, ..., 3. This can be expressed by the following relation (41):

By expanding relation (41), the temporal tilt g_tilt of the estimated temporal subframe gains g_k 702 can be expressed. The temporal tilt g_tilt 704 is in fact equal to the optimal slope c_LLS of the linear curve. The calculator 753 can calculate the temporal tilt g_tilt using the following relation (42):
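A sketch of the LLS line fit of relations (40) to (42) for the four subframe gains, assuming the ordinary least-squares solution with k = 0, ..., 3 as abscissa. The slope c corresponds to the temporal tilt g_tilt and the fitted points to the interpolated gains of relation (40); the function name is hypothetical.

```python
def lls_line(gains):
    """Fit gains[k] ~ c*k + d over k = 0..K-1 by linear least squares.

    Returns (c, d, fitted): c is the slope (temporal tilt), d the intercept,
    and fitted the points of the line at each k.
    """
    K = len(gains)
    k_mean = (K - 1) / 2
    g_mean = sum(gains) / K
    num = sum((k - k_mean) * (g - g_mean) for k, g in enumerate(gains))
    den = sum((k - k_mean) ** 2 for k in range(K))
    c = num / den
    d = g_mean - c * k_mean
    fitted = [c * k + d for k in range(K)]
    return c, d, fitted
```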

Method 200 includes a smoothing operation 705, and device 250 includes a corresponding smoother 755 for smoothing the temporal subframe gains g_k 702 with the interpolated (LLS) gains from relation (40) when, for example, the following condition is true:

$$v_{HB} < 0.4 \ \text{AND} \ idx_g \geq 5 \ \text{AND} \ |g_{tilt}| < 0.2 \qquad (43)$$

The smoother 755 then smooths the temporal subframe gains g_k 702 using, for example, the following relation (44):

where the weight κ is proportional to the voicing parameter v_HB 230 (Figure 2) given by relation (21). For example, the weight κ can be calculated using the following relation (45):

and is limited to a maximum of 1.0 and a minimum of 0.0.
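Condition (43) can be written directly as a predicate. The smoothing itself is sketched as a convex blend of the true gains g_k and the LLS gains, weighted by κ limited to [0.0, 1.0]; this blend is an assumption about relation (44), and κ is passed in because relation (45) is not reproduced here. Names are hypothetical.

```python
def should_smooth(v_hb, idx_g, g_tilt):
    """Condition (43): smooth only weakly voiced frames with a large mixing
    index and a small temporal tilt."""
    return v_hb < 0.4 and idx_g >= 5 and abs(g_tilt) < 0.2


def smooth_gains(gains, fitted, kappa):
    """Hypothetical form of (44): kappa * g_k + (1 - kappa) * LLS gain.

    kappa is limited to [0.0, 1.0] as stated in the text.
    """
    kappa = max(0.0, min(1.0, kappa))
    return [kappa * g + (1.0 - kappa) * f for g, f in zip(gains, fitted)]
```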

Method 200 includes a gain shape quantization operation 707, and device 250 includes a corresponding gain shape quantizer 757 for quantizing the smoothed temporal subframe gains 706. For this purpose, the gain shape quantizer of the SWB TBE encoder described in reference [1], which uses for example 5 bits, can be used as the quantizer 757. The quantized temporal subframe gains 708 from the quantizer 757 can be expressed using the following relation (46):

Method 200 includes an interpolation operation 709, and device 250 includes a corresponding interpolator 759 for interpolating the quantized temporal subframe gains 708 again, after the quantization operation 707, using the same LLS interpolation process as described in relations (40) and (41). The interpolated quantized subframe gains 710 in the four consecutive subframes of a frame can be expressed using the following relation (47):

Method 200 includes a tilt calculation operation 711, and device 250 includes a corresponding tilt calculator 761 for calculating the tilt of the interpolated quantized temporal subframe gains 710 using, for example, relation (42). The tilt of the interpolated quantized temporal subframe gains 710 can be expressed as:

The quantized temporal subframe gains 708 are then smoothed when the following condition (48) is true, where idx_g is the index from relation (32):

To this end, method 200 includes a quantized gain smoothing operation 713, and device 250 includes a corresponding smoother 714 for smoothing the quantized temporal subframe gains 708 by averaging them with, for example, the interpolated temporal subframe gains 710 from relation (47). For this purpose, the following relation (49) can be used:

Method 200 includes a frame gain estimation operation 715, and device 250 includes a corresponding frame gain estimator 765. The SWB TBE tool uses the frame gain to control the global energy of the synthesized high-band sound signal. The frame gain is estimated by energy matching between (a) the LP-filtered high-band signal y_HB 516 of relation (38) multiplied by the smoothed quantized temporal subframe gains 714 from relation (49), and (b) the high-band target signal s_HB(n) 210 of relation (3). The LP-filtered high-band signal y_HB 516 of relation (38) is multiplied by the smoothed quantized temporal subframe gains 714 using, for example, the following relation (50):

Details of the frame gain estimation operation 715 are described in reference [1]. The estimated frame gain parameter is denoted g_f (see 716).
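The per-subframe shaping of relation (50) and an energy-matching frame gain can be sketched as below. Equal-length subframes and the form g_f = sqrt(E_target / E_shaped) are assumptions made for illustration; the actual estimation in reference [1] may differ. Names are hypothetical.

```python
import math


def apply_subframe_gains(y_hb, gains):
    """Scale each equal-length subframe of y_HB by its gain (relation (50) as described)."""
    sub_len = len(y_hb) // len(gains)
    return [y * gains[min(n // sub_len, len(gains) - 1)] for n, y in enumerate(y_hb)]


def frame_gain(s_hb, shaped):
    """Hypothetical energy match: g_f = sqrt(E(s_HB) / E(shaped y_HB))."""
    e_target = sum(x * x for x in s_hb)
    e_shaped = sum(x * x for x in shaped)
    return math.sqrt(e_target / e_shaped) if e_shaped > 0.0 else 0.0
```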

Method 200 includes an operation 717 of calculating the synthesized high-band signal 718, and device 250 includes a calculator 767 for performing operation 717. The calculator 767 can modify the estimated frame gain g_f 716 under certain conditions. For example, for given values of the high-band voicing factor v_HB 230 (Figure 2) and the MSE excess error energy E_err, the frame gain g_f can be attenuated according to relation (51):

$$g_f \leftarrow f_{att}\cdot g_f, \quad \text{if } v_{HB} > 0.1 \ \text{AND} \ E_{err} > 5.0 \qquad (51)$$

where E_err is the MSE excess error energy calculated in relation (30), and f_att is an attenuation factor calculated, for example, as:

$$f_{att} = 1.0 - 0.04\,(E_{err} - 5.0) \qquad (52)$$
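Relations (51) and (52) are fully specified above and translate directly into code. As in the text, f_att is not floored, so it would turn negative for E_err > 30.0; the sketch applies the relations as written. The function name is hypothetical.

```python
def attenuate_frame_gain(g_f, v_hb, e_err):
    """Relations (51)-(52): attenuate g_f when v_HB > 0.1 and E_err > 5.0."""
    if v_hb > 0.1 and e_err > 5.0:
        f_att = 1.0 - 0.04 * (e_err - 5.0)   # relation (52)
        return f_att * g_f                   # relation (51)
    return g_f                               # condition false: gain unchanged
```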

Further modifications of the frame gain g_f under certain specific conditions are described in reference [1].

The calculator 767 then quantizes the modified frame gain using the frame gain quantizer of the SWB TBE encoder of reference [1].

Finally, the calculator 767 determines the synthesized high-band sound signal 718 using, for example, the following relation (53):

7. Example Configuration of Hardware Components

Figure 9 is a simplified block diagram of an example configuration of hardware components forming the above-described method 200 and device 250 for time-domain bandwidth extension of an excitation signal during encoding/decoding of a cross-talk signal (hereinafter "method 200 and device 250").

Method 200 and device 250 may be implemented as part of a mobile terminal, as part of a portable media player, or in any similar device. Device 250 (identified as 900 in Figure 9) includes an input 902, an output 904, a processor 906, and a memory 908.

The input 902 is configured to receive an input signal. The output 904 is configured to supply the time-domain bandwidth-extended excitation signal. The input 902 and the output 904 may be implemented in a common module, for example a serial input/output device.

The processor 906 is operatively connected to the input 902, the output 904, and the memory 908. The processor 906 is realized as one or more processors for executing code instructions in support of the functions of the various operations and elements of the above-described method 200 and device 250, as shown in the accompanying figures and/or as described in the present disclosure.

The memory 908 may comprise a non-transitory memory for storing code instructions executable by the processor 906; specifically, a processor-readable memory comprising/storing non-transitory instructions that, when executed, cause the processor to implement the operations and elements of method 200 and device 250. The memory 908 may also comprise a random access memory or buffer(s) to store intermediate processing data from the various functions performed by the processor 906.

Those of ordinary skill in the art will realize that the description of method 200 and device 250 is illustrative only and is not intended to be in any way limiting. Other embodiments will readily suggest themselves to such persons with ordinary skill in the art having the benefit of the present disclosure. Furthermore, the disclosed method 200 and device 250 may be customized to offer valuable solutions to existing needs and problems of encoding and decoding sound.

In the interest of clarity, not all of the routine features of the implementations of method 200 and device 250 are shown and described. It will, of course, be appreciated that in the development of any such actual implementation of method 200 and device 250, numerous implementation-specific decisions may need to be made in order to achieve the developer's specific goals, such as compliance with application-, system-, network-, and business-related constraints, and that these specific goals will vary from one implementation to another and from one developer to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking of engineering for those of ordinary skill in the field of sound processing having the benefit of the present disclosure.

In accordance with the present disclosure, the elements, processing operations, and/or data structures described herein may be implemented using various types of operating systems, computing platforms, network devices, computer programs, and/or general-purpose machines. In addition, those of ordinary skill in the art will recognize that devices of a less general-purpose nature, such as hardwired devices, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), or the like, may also be used. Where a method comprising a series of operations and sub-operations is implemented by a processor, computer, or machine, and those operations and sub-operations may be stored as a series of non-transitory code instructions readable by the processor, computer, or machine, they may be stored on a tangible and/or non-transient medium.

The processing operations and elements of method 200 and device 250 as described herein may comprise software, firmware, hardware, or any combination of software, firmware, or hardware suitable for the purposes described herein.

In method 200 and device 250, the various processing operations and sub-operations may be performed in various orders, and some of the processing operations and sub-operations may be optional.

Although the present disclosure has been described hereinabove by way of non-restrictive, illustrative embodiments thereof, these embodiments may be modified at will within the scope of the appended claims without departing from the spirit and nature of the present disclosure.

8. References

The present disclosure mentions the following references, of which the full content is incorporated herein by reference:

[1] 3GPP TS 26.445, "EVS Codec Detailed Algorithmic Description", 3GPP Technical Specification (Release 12), 2014, Sections 5.2.6.1 and 6.1.3.1.

[2] Bessette, B., Lefebvre, R., Salami, R., et al., "Techniques for high-quality ACELP coding of wideband speech", EUROSPEECH 2001 Scandinavia, 7th European Conference on Speech Communication and Technology, 2nd INTERSPEECH Event, Aalborg, Denmark, September 3-7, 2001.

Claims

1. A method for time-domain bandwidth expansion of an excitation signal during decoding of a crosstalk audio signal, comprising:

decoding a high-frequency band mixing factor received in a bitstream; and

mixing a low-frequency band excitation signal and a random noise excitation signal using the high-frequency band mixing factor to generate a time-domain bandwidth-extended excitation signal.

2. The method of claim 1, wherein decoding the high-frequency band mixing factor comprises decoding a quantized normalized gain received in the bitstream, and using the decoded quantized normalized gain to calculate the high-frequency band mixing factor.

3. The method according to claim 1 or 2, further comprising: interpolating the energy of the random noise excitation signal between a previous frame and a current frame of the audio signal to smooth the transition between the previous frame and the current frame.

4. The method of claim 3, further comprising: scaling a portion of the random noise signal in order to interpolate the energy of the random noise excitation signal.

5. The method according to any one of claims 1 to 4, comprising: interpolating the high-frequency band mixing factor between a previous frame and a current frame of the audio signal to ensure a smooth transition between the previous frame and the current frame.

6. The method according to any one of claims 1 to 4, comprising: estimating the quantized gain/shape parameters.

7. A method for time-domain bandwidth expansion of an excitation signal during the encoding of a crosstalk audio signal, comprising:

(a) calculating a high-frequency band residual signal using the sound signal and (b) calculating a time envelope of the high-frequency band residual signal;

calculating a high-frequency band phonation factor based on the time envelope of the high-frequency band residual signal;

calculating a high-frequency band mixing factor usable to mix a low-frequency band excitation signal and a random noise excitation signal to generate a time-domain bandwidth-extended excitation signal; and

estimating gain/shape parameters using the high-frequency band phonation factor.

8. A method for time-domain bandwidth expansion of an excitation signal during the encoding of a crosstalk audio signal, comprising:

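(a) calculating a high-frequency band residual signal using the sound signal and (b) calculating a time envelope of the high-frequency band residual signal; and

calculating a high-frequency band phonation factor based on the time envelope of the high-frequency band residual signal.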

9. A method for time-domain bandwidth expansion of an excitation signal during the encoding of a crosstalk audio signal, comprising:

calculating a high-frequency band mixing factor usable to mix a low-frequency band excitation signal and a random noise excitation signal to generate a time-domain bandwidth-extended excitation signal.

10. The method of claim 7 or 8, wherein calculating the high-frequency band phonation factor comprises (a) calculating a high-frequency band autocorrelation function based on the time envelope, and (b) using the high-frequency band autocorrelation function to calculate the high-frequency band phonation factor.

11. The method according to any one of claims 7, 8 and 10, wherein calculating the high-frequency band phonation factor comprises downsampling the time envelope of the high-frequency band residual signal by a given factor.

12. The method of claim 11, wherein calculating the high-frequency band phonation factor comprises dividing the downsampled time envelope into multiple segments and calculating the average value of each segment of the downsampled time envelope.

13. The method of claim 12, wherein calculating the high-frequency band phonation factor includes normalizing each segment of the downsampled time envelope of the high-frequency band residual signal.

14. The method of claim 13, wherein the normalization of each segment of the downsampled temporal envelope comprises (a) calculating a segment normalization factor based on a calculated average value, (b) interpolating the segment normalization factor in the current frame, and (c) normalizing the downsampled temporal envelope using the interpolated segment normalization factor.

15. The method according to any one of claims 7, 8 and 10 to 14, comprising: calculating the tilt of the time envelope of the high-frequency band residual signal based on linear least squares.

16. The method of claim 14, wherein calculating the high-frequency band phonation factor comprises (a) calculating a high-frequency band autocorrelation function based on a normalized time envelope, and (b) using the high-frequency band autocorrelation function to calculate the high-frequency band phonation factor.

17. The method according to any one of claims 7 and 9 to 16, wherein calculating the high-frequency band mixing factor comprises calculating and quantizing the gain from which the high-frequency band mixing factor is obtained.

18. The method of claim 17, wherein calculating the high-frequency band mixing factor includes generating a random noise excitation signal.

19. The method of claim 18, wherein generating the random noise excitation signal includes power normalization of the random noise excitation signal.

20. The method of claim 18 or 19, wherein calculating the high-frequency band mixing factor comprises (a) mixing the low-frequency band excitation signal with the random noise excitation signal, and (b) minimizing the mean square error between the mixed excitation signal and the high-frequency band residual signal calculated based on the sound signal.

21. The method according to any one of claims 18 to 20, wherein calculating the high-frequency band mixing factor comprises (a) calculating the time envelope of the random noise excitation signal, (b) calculating the time envelope of the low-frequency band excitation signal, and (c) finding the corresponding gains of the time envelopes of the random noise excitation signal and the low-frequency band excitation signal by means of a mean square error minimization process.

22. The method of claim 21, wherein calculating the high-frequency band mixing factor comprises scaling the gains of the time envelopes of the random noise excitation signal and the low-frequency band excitation signal.

23. The method of claim 22, wherein scaling the gain includes obtaining a single gain parameter, and wherein calculating the high-frequency band mixing factor includes quantizing the single gain parameter to obtain the quantized gain, and obtaining the high-frequency band mixing factor from the quantized gain.

24. The method of claim 7 or 8, further comprising: using the high-frequency band phonation factor to estimate the gain/shape parameter.

25. The method according to any one of claims 7 to 24, wherein the gain/shape parameter is selected from the group consisting of:

- the spectral shape of the high-frequency band target signal;

- subframe gains of the high-frequency band target signal; and

- a frame gain parameter.

26. The method according to any one of claims 7 to 25, wherein estimating the gain/shape parameter includes calculating a time tilt of the gain/shape parameter.

27. The method of claim 26, wherein calculating the time tilt includes interpolating the gain/shape parameter.

28. The method of claim 27, wherein interpolating the gain/shape parameter comprises using linear least squares.

29. The method of any one of claims 7 to 28, wherein estimating the gain/shape parameter includes smoothing the gain/shape parameter using an adaptive weighting parameter.

30. The method of claim 29, further comprising: using the high-frequency band phonation factor to calculate the adaptive weighting parameter.

31. The method of claim 29 or 30, further comprising: smoothing the gain/shape parameters using the adaptive weighting parameters in response to a given condition relating to the high-frequency band phonation factor.

32. The method according to any one of claims 29 to 31, wherein estimating the gain/shape parameter includes quantizing the smoothed gain/shape parameter.

33. The method of claim 32, wherein estimating the gain/shape parameter includes interpolating the quantized gain/shape parameter.

34. The method of claim 32 or 33, wherein estimating the gain/shape parameter includes smoothing the quantized gain/shape parameter.

35. The method of claim 34, wherein smoothing of the quantized gain/shape parameters is performed by averaging the quantized interpolated gain/shape parameters.

36. The method of any one of claims 7 to 35, wherein estimating the gain/shape parameter includes adaptive attenuation of the frame gain parameter using MSE excess error.

37. An apparatus for performing time-domain bandwidth expansion of an excitation signal during decoding of a crosstalk audio signal, comprising:

a decoder of a high-frequency band mixing factor received in a bitstream; and

a mixer using the high-frequency band mixing factor to mix a low-frequency band excitation signal and a random noise excitation signal to generate a time-domain bandwidth-extended excitation signal.

38. The apparatus of claim 37, wherein the decoder of the high-frequency band mixing factor decodes the quantized normalized gain received in the bitstream and uses the decoded quantized normalized gain to calculate the high-frequency band mixing factor.

39. The apparatus of claim 37 or 38, further comprising: a generator of the random noise excitation signal, which interpolates the energy of the random noise excitation signal between a previous frame and a current frame of the audio signal to smooth the transition between the previous frame and the current frame.

40. The device of claim 39, wherein, in order to interpolate the energy of the random noise excitation signal, the generator of the random noise excitation signal scales the random noise signal in a portion of the current frame.

41. The device according to any one of claims 37 to 40, wherein the decoder of the high-frequency band mixing factor interpolates the high-frequency band mixing factor between a previous frame and a current frame of the audio signal to ensure a smooth transition between the previous frame and the current frame.

42. The device according to any one of claims 37 to 40, comprising: an estimator for estimating quantized gain/shape parameters.

43. An apparatus for performing time-domain bandwidth expansion of an excitation signal during the encoding of a crosstalk audio signal, comprising:

(a) a calculator for calculating a high-frequency band residual signal using the sound signal and (b) a calculator for calculating a time envelope of the high-frequency band residual signal;

a calculator for calculating a high-frequency band phonation factor based on the time envelope of the high-frequency band residual signal;

a calculator for calculating a high-frequency band mixing factor usable to mix a low-frequency band excitation signal and a random noise excitation signal to produce a time-domain bandwidth-extended excitation signal; and

an estimator of gain/shape parameters using the high-frequency band phonation factor.

44. An apparatus for performing time-domain bandwidth expansion of an excitation signal during the encoding of a crosstalk audio signal, comprising:

(a) a calculator for calculating the high-frequency band residual signal using the sound signal and (b) a calculator for calculating the time envelope of the high-frequency band residual signal; and

a calculator for calculating a high-frequency band phonation factor based on the time envelope of the high-frequency band residual signal.

45. An apparatus for performing time-domain bandwidth expansion of an excitation signal during the encoding of a crosstalk audio signal, comprising:

a calculator for calculating a high-frequency band mixing factor usable to mix a low-frequency band excitation signal and a random noise excitation signal to produce a time-domain bandwidth-extended excitation signal.

46. The device according to claim 43 or 44, wherein the calculator for the high-frequency band phonation factor calculates the high-frequency band autocorrelation function based on the time envelope, and uses the high-frequency band autocorrelation function to calculate the high-frequency band phonation factor.

47. The device according to any one of claims 43, 44 and 46, wherein the calculator of the high-frequency band phonation factor includes a downsampler that downsamples the time envelope of the high-frequency band residual signal by a given factor.

48. The device of claim 47, wherein the calculator for the high-frequency band phonation factor includes a divider for dividing the downsampled time envelope into multiple segments, and a calculator for calculating the average value of each segment of the downsampled time envelope.

49. The device of claim 48, wherein the calculator of the high-frequency band phonation factor includes a normalizer for each segment of the downsampled time envelope of the high-frequency band residual signal.

50. The apparatus of claim 49, wherein each segment normalizer (a) calculates a segment normalization factor based on the calculated average value, (b) interpolates the segment normalization factor in the current frame, and (c) uses the interpolated segment normalization factor to normalize the downsampled temporal envelope.

51. The device according to any one of claims 43, 44 and 46 to 50, comprising: a calculator for calculating the tilt of the time envelope of the high-frequency band residual signal based on linear least squares method.

52. The device of claim 50, wherein the calculator for the high-frequency band phonation factor comprises a calculator that calculates a high-frequency band autocorrelation function based on a normalized time envelope and uses the high-frequency band autocorrelation function to calculate the high-frequency band phonation factor.

53. The device according to any one of claims 43 and 45 to 52, wherein the calculator of the high-frequency band mixing factor calculates and quantizes the gain forming the high-frequency band mixing factor.

54. The device of claim 53, wherein the calculator of the high-frequency band mixing factor includes a generator of a random noise excitation signal.

55. The apparatus of claim 54, comprising: a normalizer of the power of the random noise excitation signal to the power of the low-frequency band excitation signal.

56. The device according to claim 54 or 55, wherein the calculator of the high-frequency band mixing factor (a) combines the low-frequency band excitation signal with the random noise excitation signal, and (b) minimizes the mean square error between the mixed excitation signal and the high-frequency band residual signal calculated based on the sound signal.

57. The device according to any one of claims 54 to 56, wherein the calculator of the high-frequency band mixing factor (a) includes a calculator of the time envelope of the random noise excitation signal and a calculator of the time envelope of the low-frequency band excitation signal, and (b) finds the corresponding gains of the time envelopes of the random noise excitation signal and the low-frequency band excitation signal by a mean square error minimization process.

58. The device of claim 57, wherein the calculator of the high-frequency band mixing factor scales the gains of the time envelopes of the random noise excitation signal and the low-frequency band excitation signal.

59. The apparatus of claim 58, wherein, in order to scale the gains of the time envelopes of the random noise excitation signal and the low-frequency band excitation signal, the calculator of the high-frequency band mixing factor calculates a single gain parameter and quantizes the single gain parameter to obtain a quantized gain forming the high-frequency band mixing factor.

60. The device of claim 43 or 44, comprising: an estimator for estimating gain/shape parameters using the high-frequency band phonation factor.

61. The device according to any one of claims 43 to 60, wherein the gain/shape parameter is selected from the group consisting of:

- the spectral shape of the high-frequency band target signal;

- subframe gains of the high-frequency band target signal; and

- a frame gain parameter.

62. The device according to any one of claims 43 to 61, wherein the gain/shape parameter includes the subframe gain of the high-frequency band target signal, and wherein the estimator of the gain/shape parameter includes a calculator for calculating the time tilt of the subframe gain.

63. The device of claim 62, wherein the calculator for the time tilt includes an interpolator that interpolates the subframe gain.

64. The device of claim 63, wherein the interpolator that interpolates the subframe gain uses linear least squares.

65. The apparatus of any one of claims 43 to 64, wherein the gain/shape parameter includes the subframe gain of the high-frequency band target signal, and the estimator of the gain/shape parameter includes a smoother that uses adaptive weight parameters to smooth the subframe gain.

66. The device of claim 65, wherein the smoother of the subframe gain uses the high-frequency band phonation factor to calculate the adaptive weight parameters.

67. The device of claim 65 or 66, wherein the smoother of the gain/shape parameter performs smoothing of the gain/shape parameter using the adaptive weighting parameter in response to a given condition relating to the high-frequency band phonation factor.

68. The apparatus according to any one of claims 65 to 67, wherein the estimator of the gain/shape parameter comprises a quantizer that quantizes the subframe gain.

69. The apparatus of claim 68, wherein the estimator of the gain/shape parameter comprises an interpolator that interpolates the quantized subframe gain.

70. The device of claim 68 or 69, wherein the estimator of the gain/shape parameter includes a smoother for smoothing the subframe gain.

71. The apparatus of claim 70, wherein the smoother smoothing the subframe gain smooths the quantized gain/shape parameters by averaging the quantized interpolated gain/shape parameters.

72. The apparatus of any one of claims 43 to 71, wherein the gain/shape parameter includes the subframe gain of the high-frequency band target signal, and wherein the estimator of the gain/shape parameter uses MSE excess error to perform adaptive attenuation of the frame gain parameter.
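Claims 1 to 5 and 37 to 41 describe the decoder-side operations: mixing the low-frequency band excitation with a random noise excitation using a decoded high-frequency band mixing factor, interpolating that factor between the previous and current frames, and ramping the noise energy over part of the current frame. The Python sketch below illustrates one way these operations could fit together. It is not the codec's implementation: the square-root mixing rule, the linear interpolation of the mixing factor across the frame, the linear gain ramp over the first `ramp_len` samples, and all function names (`power`, `normalize_power`, `interpolate_noise_energy`, `mix_excitation`) are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


def power(x: Sequence[float]) -> float:
    """Mean power of a frame (0.0 for an empty frame)."""
    return sum(v * v for v in x) / len(x) if x else 0.0


def normalize_power(x: Sequence[float], target_power: float) -> List[float]:
    """Scale x so that its mean power equals target_power."""
    p = power(x)
    if p <= 0.0:
        return list(x)
    g = math.sqrt(target_power / p)
    return [g * v for v in x]


def interpolate_noise_energy(noise: Sequence[float], prev_energy: float,
                             ramp_len: int) -> List[float]:
    """Hypothetical ramp: scale the first ramp_len samples so the noise gain
    moves linearly from sqrt(prev_energy / current_energy) to 1."""
    out = list(noise)
    cur = power(noise)
    if cur <= 0.0 or prev_energy <= 0.0 or ramp_len <= 0:
        return out
    start = math.sqrt(prev_energy / cur)
    for n in range(min(ramp_len, len(out))):
        w = n / ramp_len
        out[n] *= (1.0 - w) * start + w
    return out


def mix_excitation(exc_lb: Sequence[float], noise: Sequence[float],
                   alpha_prev: float, alpha_cur: float) -> List[float]:
    """Mix low-band excitation and noise with a mixing factor interpolated
    linearly from alpha_prev to alpha_cur over the frame. The square-root
    weights (an assumed rule) keep the output power constant when both inputs
    have equal power and are uncorrelated."""
    n_len = len(exc_lb)
    out = []
    for n in range(n_len):
        a = alpha_prev + (alpha_cur - alpha_prev) * (n + 1) / n_len
        a = min(max(a, 0.0), 1.0)  # guard against rounding outside [0, 1]
        out.append(math.sqrt(a) * exc_lb[n] + math.sqrt(1.0 - a) * noise[n])
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    exc = [math.sin(0.3 * n) for n in range(320)]
    # Noise is first normalized to the low-band excitation power, then ramped
    # from the (assumed) previous-frame energy.
    noise = normalize_power([rng.gauss(0.0, 1.0) for _ in range(320)], power(exc))
    noise = interpolate_noise_energy(noise, prev_energy=0.25 * power(exc), ramp_len=80)
    bwe = mix_excitation(exc, noise, alpha_prev=0.3, alpha_cur=0.6)
    print(f"output power: {power(bwe):.3f}")
```

Normalizing the noise to the low-band excitation power before mixing means the mixing factor alone decides the balance between the harmonic and noise-like parts, which keeps the decoded factor meaningful from frame to frame.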
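Claims 10 to 16 and 46 to 52 derive the high-frequency band phonation factor from the time envelope of the high-frequency band residual signal: the envelope is downsampled by a given factor, split into segments, averaged per segment, normalized with segment factors interpolated across the frame, and then passed to an autocorrelation function. The sketch below follows that sequence under loudly stated assumptions: the envelope is taken as a one-pole-smoothed magnitude, each segment normalization factor is the inverse of the segment mean, and the phonation factor is the largest normalized autocorrelation over a lag range. None of these specific rules, nor the names `temporal_envelope`, `normalized_envelope`, and `phonation_factor`, come from the disclosure.

```python
import math
import random
from typing import List, Sequence


def temporal_envelope(x: Sequence[float], beta: float = 0.9) -> List[float]:
    """Hypothetical envelope: |x| smoothed by a one-pole low-pass filter."""
    env, state = [], 0.0
    for v in x:
        state = beta * state + (1.0 - beta) * abs(v)
        env.append(state)
    return env


def normalized_envelope(env: Sequence[float], decim: int, n_seg: int) -> List[float]:
    """Downsample by decim, average each of n_seg segments, and normalize with
    segment factors (1 / mean) interpolated linearly between segment centres."""
    ds = list(env)[::decim]
    seg_len = len(ds) // n_seg
    means = [sum(ds[i * seg_len:(i + 1) * seg_len]) / seg_len for i in range(n_seg)]
    factors = [1.0 / m if m > 0.0 else 0.0 for m in means]
    centres = [(i + 0.5) * seg_len for i in range(n_seg)]
    out = []
    for n in range(n_seg * seg_len):
        if n <= centres[0]:
            f = factors[0]          # hold the first factor before the first centre
        elif n >= centres[-1]:
            f = factors[-1]         # hold the last factor after the last centre
        else:
            k = int((n - centres[0]) // seg_len)  # centres[k] <= n < centres[k + 1]
            w = (n - centres[k]) / seg_len
            f = (1.0 - w) * factors[k] + w * factors[k + 1]
        out.append(ds[n] * f)
    return out


def phonation_factor(norm_env: Sequence[float], min_lag: int, max_lag: int) -> float:
    """Hypothetical rule: largest normalized autocorrelation of the mean-removed
    envelope over lags min_lag..max_lag, clipped to [0, 1]."""
    mu = sum(norm_env) / len(norm_env)
    y = [v - mu for v in norm_env]
    r0 = sum(v * v for v in y)
    if r0 <= 0.0:
        return 0.0
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(y[n] * y[n - lag] for n in range(lag, len(y)))
        best = max(best, r / r0)
    return min(best, 1.0)


if __name__ == "__main__":
    rng = random.Random(1)
    # Carrier whose envelope repeats every 80 samples, as for voiced speech.
    voiced = [(1.0 if n % 80 < 10 else 0.1) * math.cos(2.5 * n) for n in range(640)]
    noisy = [rng.gauss(0.0, 1.0) for _ in range(640)]
    for name, sig in (("pulse-modulated", voiced), ("noise", noisy)):
        ne = normalized_envelope(temporal_envelope(sig), decim=4, n_seg=4)
        print(name, round(phonation_factor(ne, min_lag=10, max_lag=40), 2))
```

Normalizing each segment before the autocorrelation keeps slow loudness changes within the frame from masquerading as periodicity, so a high value mostly reflects a repeating envelope pattern in the high-frequency band.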
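Claims 21 to 23 and 57 to 59 obtain the high-frequency band mixing factor by finding, through mean square error minimization, the gains that best fit the time envelopes of the low-frequency band excitation and the random noise excitation to the envelope of the high-frequency band residual, then folding those gains into a single parameter that is quantized. The sketch below solves the two-gain least-squares problem through its 2x2 normal equations. The way the two gains are folded into one parameter (`g_lb / (g_lb + g_rn)`), the 3-bit uniform quantizer, and the names `ls_two_gains` and `mixing_factor` are hypothetical choices made for illustration only.

```python
import math
from typing import Sequence, Tuple


def ls_two_gains(target: Sequence[float], a: Sequence[float],
                 b: Sequence[float]) -> Tuple[float, float]:
    """Gains (g_a, g_b) minimizing sum_n (target[n] - g_a*a[n] - g_b*b[n])^2,
    obtained by solving the 2x2 normal equations with Cramer's rule."""
    saa = sum(x * x for x in a)
    sbb = sum(y * y for y in b)
    sab = sum(x * y for x, y in zip(a, b))
    sta = sum(t * x for t, x in zip(target, a))
    stb = sum(t * y for t, y in zip(target, b))
    det = saa * sbb - sab * sab  # >= 0 by the Cauchy-Schwarz inequality
    if det <= 1e-12 * saa * sbb:
        # Collinear or empty envelopes: fall back to fitting `a` alone.
        return (sta / saa if saa > 0.0 else 0.0), 0.0
    g_a = (sta * sbb - stb * sab) / det
    g_b = (stb * saa - sta * sab) / det
    return g_a, g_b


def mixing_factor(env_res: Sequence[float], env_lb: Sequence[float],
                  env_rn: Sequence[float], bits: int = 3) -> Tuple[int, float]:
    """Hypothetical mapping: clip both gains to be non-negative, fold them into
    one parameter in [0, 1], and quantize it uniformly with `bits` bits.
    Returns (index to transmit, dequantized mixing factor)."""
    g_lb, g_rn = ls_two_gains(env_res, env_lb, env_rn)
    g_lb, g_rn = max(g_lb, 0.0), max(g_rn, 0.0)
    total = g_lb + g_rn
    ratio = g_lb / total if total > 0.0 else 0.5
    levels = (1 << bits) - 1
    index = round(ratio * levels)
    return index, index / levels


if __name__ == "__main__":
    env_lb = [1.0 + 0.5 * math.sin(0.2 * n) for n in range(64)]
    env_rn = [1.0 + 0.3 * math.cos(0.7 * n) for n in range(64)]
    env_res = [0.7 * x + 0.3 * y for x, y in zip(env_lb, env_rn)]
    print(ls_two_gains(env_res, env_lb, env_rn), mixing_factor(env_res, env_lb, env_rn))
```

Fitting envelopes rather than waveforms makes the estimate insensitive to the phase mismatch between the synthetic excitations and the actual residual, which a waveform-domain error would heavily penalize.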
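Claims 26 to 31 and 62 to 67 compute a time tilt of the subframe gains by linear least squares and smooth the gains with an adaptive weighting parameter derived from the high-frequency band phonation factor, applied only under a given condition on that factor. The sketch below assumes the condition is "phonation factor above a threshold" and that the weight rises linearly from 0 at the threshold to 1 at a phonation factor of 1, pulling the gains toward their least-squares line. The threshold value, the weighting rule, and the names `ls_line` and `smooth_subframe_gains` are hypothetical.

```python
from typing import List, Sequence, Tuple


def ls_line(y: Sequence[float]) -> Tuple[float, float]:
    """Linear least-squares fit y[k] ~ c0 + c1 * k; returns (c0, c1)."""
    n = len(y)
    mx = (n - 1) / 2.0
    my = sum(y) / n
    sxx = sum((k - mx) ** 2 for k in range(n))
    sxy = sum((k - mx) * (v - my) for k, v in enumerate(y))
    c1 = sxy / sxx if sxx > 0.0 else 0.0
    return my - c1 * mx, c1


def smooth_subframe_gains(gains: Sequence[float], phonation: float,
                          threshold: float = 0.5) -> Tuple[List[float], float]:
    """Return (smoothed gains, time tilt). The gains are pulled toward their
    least-squares line only when the phonation factor exceeds `threshold`
    (hypothetical condition and weighting rule)."""
    c0, tilt = ls_line(gains)
    if phonation <= threshold:
        return list(gains), tilt
    w = min((phonation - threshold) / (1.0 - threshold), 1.0)  # adaptive weight
    fitted = [c0 + tilt * k for k in range(len(gains))]
    return [w * f + (1.0 - w) * g for f, g in zip(fitted, gains)], tilt


if __name__ == "__main__":
    gains = [0.9, 1.4, 1.1, 1.6]
    smoothed, tilt = smooth_subframe_gains(gains, phonation=0.8)
    print([round(g, 3) for g in smoothed], round(tilt, 3))
```

Because a least-squares line passes through the mean of the data, blending the gains toward it leaves the sum (and hence the mean) of the subframe gains unchanged; only their fluctuation around the fitted trend is reduced.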