WO2019007969A1 - Low complexity dense transient events detection and coding

Info

Publication number
WO2019007969A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
value
audio signal
transient events
determining
Prior art date
Application number
PCT/EP2018/067970
Inventor
Arijit Biswas
Michael Schug
Harald Mundt
Original Assignee
Dolby International Ab
Priority date
Filing date
Publication date
Application filed by Dolby International Ab
Priority to CN201880049530.1A (CN110998722B)
Priority to EP18733915.5A (EP3649640A1)
Priority to US16/628,235 (US11232804B2)
Priority to JP2019572693A (JP7257975B2)
Publication of WO2019007969A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/022 Blocking, i.e. grouping of samples in time; Choice of analysis windows; Overlap factoring
    • G10L19/025 Detection of transients or attacks for time/frequency resolution switching
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/03 Spectral prediction for preventing pre-echo; Temporary noise shaping [TNS], e.g. in MPEG2 or MPEG4

Definitions

  • the present disclosure relates to methods of encoding audio signals.
  • the present disclosure further relates to methods of detecting dense transient events in portions of audio signals.
  • the present disclosure also relates to corresponding apparatus, such as encoders, for example.
  • Perceptual or lossy audio codecs (such as MP3, AAC, HE-AAC, AC-4, for example) are known to have problems with compressing audio signals including dense transient events, such as applause, crackling fire, or rain, for example, without loss of perceived audio quality.
  • Conventional efforts to increase compression efficiency typically tend to lead to vastly increased computational complexity at the encoder-side and/or to a loss of perceived audio quality.
  • the present disclosure addresses the above issues related to audio coding of audio signals including dense transient events, such as applause, crackling fire, or rain, for example, and describes methods and apparatus for improved coding of such audio signals.
  • the present disclosure further deals with detecting dense transient events in audio signals to enable appropriate treatment thereof.
  • a method of encoding a portion (e.g., frame) of an audio signal may include obtaining (e.g., determining, calculating, or computing) a value of a first feature relating to a perceptual entropy (PE) of the portion of the audio signal.
  • PE is known in the field of audio coding as a measure of perceptually relevant information contained in a particular audio signal and to represent a theoretical limit on the compressibility of the particular audio signal.
  • the method may further include selecting a quantization mode for quantizing the portion of the audio signal (e.g., for quantizing frequency coefficients of the portion of the audio signal, such as MDCT coefficients, for example) based on the (obtained) value of the first feature.
  • the method may further include quantizing the portion of the audio signal using the selected quantization mode. Selecting the quantization mode may involve determining, based at least in part on the (obtained) value of the first feature, whether a quantization mode that applies (e.g., enforces) a (substantially) constant signal-to-noise ratio (SNR) over frequency (e.g., over frequency bands) shall be used for the portion of the audio signal.
  • This quantization mode may be referred to as constant SNR mode or constant SNR quantization mode.
  • Applying the constant SNR over frequency may involve (e.g., relate to) noise shaping (e.g., quantization noise shaping). This may in turn involve appropriate selection or modification of quantization parameters (e.g., quantization step sizes, masking thresholds).
  • Quantization may be performed on a band-by-band basis. Further, quantization may be performed in accordance with a perceptual model (e.g., psychoacoustic model). In such case, for example, scalefactors for scalefactor bands and/or masking thresholds may be selected or modified in order to attain the substantially constant SNR over frequency when performing the quantization.
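The constant-SNR idea described above can be sketched as follows. This is only an illustration under stated assumptions: per-band signal energies are taken as already available and strictly positive, noise shaping is expressed as an allowed noise energy per band, and the function names and the `target_snr_db` parameter are hypothetical rather than taken from the disclosure. How a codec maps the allowed noise onto scalefactors or masking thresholds is not shown.

```python
import math

def constant_snr_noise_targets(band_energies, target_snr_db):
    """Allowed noise energy per band such that every band has the same SNR.

    Assumes strictly positive band energies (hypothetical input format).
    """
    ratio = 10.0 ** (target_snr_db / 10.0)  # linear signal-to-noise energy ratio
    return [energy / ratio for energy in band_energies]

def snr_per_band_db(band_energies, noise_energies):
    """SNR in dB for each band, used here only to check the result."""
    return [10.0 * math.log10(e / n) for e, n in zip(band_energies, noise_energies)]

# Three bands with very different energies receive the same SNR.
energies = [100.0, 10.0, 1.0]
targets = constant_snr_noise_targets(energies, target_snr_db=20.0)
snrs = snr_per_band_db(energies, targets)
```

In this sketch the noise target simply follows the band energy, in contrast to a conventional perceptual model where the allowed noise follows a masking threshold that varies from band to band.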
  • audio signals containing dense transient events can be encoded in a manner that achieves improved perceived quality of the audio after decoding.
  • Since this constant SNR quantization mode is rather unusual for encoding audio signals and may not be suitable for other types of audio signals, the presence of dense transient events in the audio signal is first detected by referring to the perceptual entropy of the audio signal, and the quantization mode is chosen in accordance with the result of the detection.
  • This avoids degrading audio signals that do not contain, or do not only contain, dense transient events (such as music, speech, or applause mixed with music and/or cheering, for example).
  • Since the perceptual entropy is determined anyway in state-of-the-art audio codecs (such as MP3, AAC, HE-AAC, AC-4, for example) for purposes of quantization, performing the aforementioned detection does not significantly add to computational complexity, delay, and memory footprint.
  • the proposed method improves the perceived quality of audio after decoding without significantly adding to complexity and memory footprint at the encoder-side.
  • the method may further include smoothing the value of the first feature over time to obtain a time-smoothed value of the first feature. Then, said determining may be based on the time-smoothed value of the first feature.
  • said determining may involve comparing the value of the first feature to a predetermined threshold for the value of the first feature.
  • Said quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be conditionally selected in accordance with a result of the comparison.
  • the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be selected if (e.g., only if) the value of the first feature is above the predetermined threshold for the value of the first feature.
  • perceptual entropy above a certain threshold may be indicative of dense transient events in the audio signal.
  • a comparison of the value of the first feature to a threshold offers a simple and reliable determination of whether or not the portion of the audio signal is suitable for quantization using the constant SNR quantization mode.
  • said determining may be (further) based on a variation over time of the value of the first feature.
  • said determining may be based on a temporal variation, such as a standard deviation over time, or a maximum deviation from a mean value over time.
  • said determining may involve comparing the variation over time of the value of the first feature to a predetermined threshold for the variation.
  • Said quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be conditionally selected in accordance with a result of the comparison.
  • the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be selected if (e.g., only if) the variation of the value of the first feature is below the predetermined threshold for the variation.
  • the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be conditionally selected in accordance with a result of the comparison for the value of the first feature and the comparison for the variation of the first feature over time.
  • the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be selected if (e.g., only if) both the value of the first feature is above its respective threshold and the variation of the first feature over time is below its respective threshold.
  • a perceptual entropy that is high on average but has comparatively little temporal variation may be indicative of dense transient events in the audio signal.
  • a comparison of the variation over time of the value of the first feature to a threshold offers a simple and reliable determination of whether or not the portion of the audio signal is suitable for quantization using the constant SNR quantization mode. Combining both decision criteria pertaining to the value of the first feature may result in an even more reliable determination of whether the constant SNR quantization mode shall be applied.
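The two decision criteria on the first feature (high level, low temporal variation) can be sketched as follows. This is a minimal illustration, not the encoder's actual logic: the first-order recursive smoother, the use of the population standard deviation as the variation measure, and all names, thresholds, and example values are assumptions made for this example.

```python
from statistics import pstdev

def smooth(values, alpha=0.9):
    """First-order recursive smoothing over time (assumed smoother)."""
    out = []
    state = None
    for v in values:
        state = v if state is None else alpha * state + (1.0 - alpha) * v
        out.append(state)
    return out

def use_constant_snr(pe_history, pe_threshold, variation_threshold):
    """Decide from recent per-frame feature values (current frame last).

    Selects the constant SNR mode only if the time-smoothed value is above
    its threshold and the variation over time is below its threshold.
    """
    level = smooth(pe_history)[-1]
    variation = pstdev(pe_history)
    return level > pe_threshold and variation < variation_threshold

# Hypothetical feature tracks: steady and high, low on average, high but fluctuating.
steady_high = [1000.0, 1010.0, 990.0, 1005.0, 995.0]
low_average = [200.0, 1500.0, 300.0, 1400.0, 250.0]
high_fluctuating = [1000.0, 2000.0, 900.0, 2100.0, 1000.0]
decisions = [use_constant_snr(track, 800.0, 50.0)
             for track in (steady_high, low_average, high_fluctuating)]
```

Only the steady, high track satisfies both criteria in this example; the other two fail the level test or the variation test.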
  • the first feature may be proportional to the perceptual entropy.
  • the first feature may be proportional to a factor (component) of the perceptual entropy.
  • the value of the first feature may be obtained in the frequency domain (e.g., MDCT domain).
  • the method may further include obtaining a value of a second feature relating to a measure of (spectral) sparsity in the frequency domain (e.g., MDCT domain) of the portion of the audio signal.
  • the measure of sparsity may be given by or relate to the form factor.
  • the measure of sparsity may be proportional to the form factor or the perceptually weighted form factor. Said determining may be (further) based on the value of the second feature.
  • the method may further include smoothing the value of the second feature over time to obtain a time-smoothed value of the second feature. Said determining may be based on the time-smoothed value of the second feature.
  • said determining may involve comparing the value of the second feature to a predetermined threshold for the value of the second feature.
  • Said quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be conditionally selected in accordance with a result of the comparison.
  • the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be selected if (e.g., only if) the value of the second feature is above the predetermined threshold for the value of the second feature.
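  • The above assumes that the second feature is defined such that its value increases with increasing spectral density (as is the case for the form factor, for example); in the reverse case, the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency would be selected if (e.g., only if) the value of the second feature is below the predetermined threshold for the value of the second feature.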
  • A measure of sparsity (such as the form factor, the perceptually weighted form factor, or an estimated number of frequency coefficients (frequency lines) that are not quantized to zero) above a certain threshold may be indicative of dense transient events in the audio signal, and moreover of a case in which applying the constant SNR quantization mode is advantageous.
  • a comparison of the value of the second feature to a threshold offers a simple and reliable confirmation of the determination of whether or not the portion of the audio signal is suitable for quantization using the constant SNR quantization mode.
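The second-feature check can be illustrated with a small sketch. The text does not define the form factor in this section; the definition used below (sum of the square roots of the coefficient magnitudes, as commonly used in AAC-type encoders) is an assumption, as are the function names, the threshold, and the example spectra.

```python
import math

def form_factor(coeffs):
    """Sum of square roots of coefficient magnitudes (assumed definition)."""
    return sum(math.sqrt(abs(c)) for c in coeffs)

def second_feature_passes(coeffs, threshold):
    """True if the feature exceeds its threshold (feature grows with density)."""
    return form_factor(coeffs) > threshold

# Two hypothetical spectra with the same energy (16.0):
sparse = [4.0, 0.0, 0.0, 0.0]   # energy concentrated in one line
dense = [2.0, 2.0, 2.0, 2.0]    # energy spread over all lines
```

For equal energy, the spread-out spectrum yields the larger value under this definition, which matches the stated convention that the feature increases with increasing spectral density.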
  • Another aspect of the disclosure relates to a method of detecting dense transient events (e.g., applause, crackling fire, rain, etc.) in a portion of an audio signal.
  • the method may include obtaining (e.g., determining, calculating, or computing) a value of a first feature relating to a perceptual entropy of the portion of the audio signal.
  • the method may further include determining whether the portion of the audio signal is likely to contain dense transient events based at least in part on the value of the first feature.
  • the portion of the audio signal can be classified as to its content of dense transient events without significantly adding to computational complexity and memory footprint.
  • the method may further include generating metadata for the portion of the audio signal.
  • the metadata may be indicative of whether the portion of the audio signal is likely to contain dense transient events.
  • the method may further include smoothing the value of the first feature over time to obtain a time-smoothed value of the first feature. Then, said determining may be based on the time-smoothed value of the first feature.
  • said determining may involve comparing the value of the first feature to a predetermined threshold for the value of the first feature. Then, it may be determined whether the portion of the audio signal is likely to contain dense transient events in accordance with a result of the comparison. For example, it may be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) the value of the first feature is above the predetermined threshold for the value of the first feature.
  • said determining may be (further) based on a variation over time of the value of the first feature.
  • said determining may be based on a temporal variation, such as a standard deviation over time, or a maximum deviation from a mean value over time.
  • said determining may involve comparing the variation over time of the value of the first feature to a predetermined threshold for the variation. Then, it may be determined whether the portion of the audio signal is likely to contain dense transient events in accordance with a result of the comparison. For example, it may be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) the variation of the value of the first feature is below the predetermined threshold for the variation.
  • it may be determined whether the portion of the audio signal is likely to contain dense transient events in accordance with a result of the comparison for the value of the first feature and the comparison for the variation of the first feature over time. For example, it may be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) both the value of the first feature is above its respective threshold and the variation of the first feature over time is below its respective threshold.
  • the first feature may be proportional to the perceptual entropy.
  • the first feature may be proportional to a factor (component) of the perceptual entropy.
  • the value of the first feature may be obtained in the frequency domain (e.g., MDCT domain).
  • the method may further include obtaining a value of a second feature relating to a measure of (spectral) sparsity in the frequency domain (e.g., MDCT domain) of the portion of the audio signal.
  • the measure of sparsity may be given by or relate to the form factor.
  • the measure of sparsity may be proportional to the form factor or the perceptually weighted form factor. Said determining may be (further) based on the value of the second feature.
  • the method may further include smoothing the value of the second feature over time to obtain a time-smoothed value of the second feature. Said determining may be based on the time-smoothed value of the second feature.
  • said determining may involve comparing the value of the second feature to a predetermined threshold for the value of the second feature. Then, it may be determined whether the portion of the audio signal is likely to contain dense transient events in accordance with a result of the comparison. For example, it may be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) the value of the second feature is above the predetermined threshold for the value of the second feature.
  • the second feature is defined such that its value increases with increasing spectral density (as is the case for the form factor, for example); in the reverse case (i.e., if the second feature is defined such that its value decreases with increasing spectral density), it would be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) the value of the second feature is below the predetermined threshold for the value of the second feature.
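A sketch of how the detection result might be turned into per-portion metadata, including the direction convention for the second feature, is given below. Requiring both features to pass is one possible combination and is an assumption here, as are the function names and the dictionary layout of the metadata.

```python
def likely_dense_transients(first_value, first_threshold,
                            second_value, second_threshold,
                            second_increases_with_density=True):
    """Combine both feature comparisons (the AND-combination is an assumption)."""
    first_ok = first_value > first_threshold
    if second_increases_with_density:
        second_ok = second_value > second_threshold
    else:
        # Reverse convention: the feature decreases with increasing density.
        second_ok = second_value < second_threshold
    return first_ok and second_ok

def make_metadata(frame_index, flag):
    """Hypothetical metadata record for one portion (frame) of the signal."""
    return {"frame": frame_index, "dense_transients": bool(flag)}

flag = likely_dense_transients(1200.0, 800.0, 6.0, 4.0)
metadata = make_metadata(0, flag)
reverse_flag = likely_dense_transients(1200.0, 800.0, 6.0, 4.0,
                                       second_increases_with_density=False)
```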
  • the method may include determining whether the portion of the audio signal is likely to contain dense transient events (e.g., applause, crackling fire, rain, etc.). The method may further include, if (e.g., only if) it is determined that the portion of the audio signal is likely to contain dense transient events, quantizing the portion of the audio signal using a quantization mode that applies (e.g., enforces) a (substantially) constant signal-to-noise ratio over frequency (e.g., over frequency bands) for the portion of the audio signal.
  • audio signals containing dense transient events can be encoded in a manner that achieves improved perceived audio quality of the decoded output audio.
  • conditionally applying the constant SNR quantization mode for portions of the audio signal that are determined to contain dense transient events allows avoiding degradation of other classes of audio signals (such as music and/or speech, for example).
  • the method may further include obtaining (e.g., determining, calculating, or computing) a value of a first feature relating to a perceptual entropy of the portion of the audio signal. Then, said determining may be based at least in part on the (obtained) value of the first feature.
  • the method may further include smoothing the value of the first feature over time to obtain a time-smoothed value of the first feature. Then, said determining may be based on the time-smoothed value of the first feature.
  • said determining may involve comparing the value of the first feature to a predetermined threshold for the value of the first feature. Then, it may be determined whether the portion of the audio signal is likely to contain dense transient events in accordance with a result of the comparison. For example, it may be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) the value of the first feature is above the predetermined threshold for the value of the first feature.
  • said determining may be (further) based on a variation over time of the value of the first feature.
  • said determining may be based on a temporal variation, such as a standard deviation over time, or a maximum deviation from a mean value over time.
  • said determining may involve comparing the variation over time of the value of the first feature to a predetermined threshold for the variation. Then, it may be determined whether the portion of the audio signal is likely to contain dense transient events in accordance with a result of the comparison. For example, it may be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) the variation over time of the value of the first feature is below the predetermined threshold for the variation.
  • the first feature may be proportional to the perceptual entropy.
  • the first feature may be proportional to a factor (component) of the perceptual entropy.
  • the value of the first feature may be obtained in the frequency domain (e.g., MDCT domain).
  • the method may further include obtaining a value of a second feature relating to a measure of (spectral) sparsity in the frequency domain (e.g., MDCT domain) of the portion of the audio signal.
  • the measure of sparsity may be given by or relate to the form factor.
  • the measure of sparsity may be proportional to the form factor or the perceptually weighted form factor. Said determining may be (further) based on the value of the second feature.
  • the method may further include smoothing the value of the second feature over time to obtain a time-smoothed value of the second feature. Said determining may be based on the time-smoothed value of the second feature.
  • said determining may involve comparing the value of the second feature to a predetermined threshold for the value of the second feature. Then, it may be determined whether the portion of the audio signal is likely to contain dense transient events in accordance with a result of the comparison. For example, it may be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) the value of the second feature is above the predetermined threshold for the value of the second feature.
  • the second feature is defined such that its value increases with increasing spectral density (as is the case for the form factor, for example); in the reverse case (i.e., if the second feature is defined such that its value decreases with increasing spectral density), it would be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) the value of the second feature is below the predetermined threshold for the value of the second feature.
  • the apparatus may include a processor.
  • the apparatus may further include a memory coupled to the processor and storing instructions for execution by the processor.
  • the processor may be adapted to perform the method of any one of the aforementioned aspects and embodiments.
  • the software program may be adapted for execution on a processor and for performing the method steps outlined in the present disclosure when carried out on a computing device.
  • the storage medium may include a software program adapted for execution on a processor and for performing the method steps outlined in the present disclosure when carried out on a computing device.
  • the computer program may include executable instructions for performing the method steps outlined in the present disclosure when executed on a computer.
  • Fig. 1 is a block diagram schematically illustrating an encoder to which embodiments of the disclosure may be applied;
  • Fig. 2 is a block diagram schematically illustrating a decoder corresponding to the encoder of Fig. 1;
  • Fig. 3 is a flow chart illustrating an example of a method of encoding a portion of an audio signal according to embodiments of the disclosure;
  • Fig. 4 is a flow chart illustrating an example of a variation of the method of Fig. 3;
  • Fig. 5 is a flow chart illustrating an example of a method of detecting dense transient events in a portion of an audio signal according to embodiments of the disclosure;
  • Fig. 6 is a flow chart illustrating an example of a variation of the method of Fig. 5;
  • Fig. 7 is a flow chart illustrating an example of another method of encoding a portion of an audio signal according to embodiments of the disclosure.
  • Figs. 8, 9, 10, and 11 are histograms illustrating feasibility of methods according to embodiments of the disclosure.
  • Figs. 12A, 12B, 13A, and 13B are graphs illustrating feasibility of methods according to embodiments of the disclosure.
  • the present disclosure describes two schemes (methods) for addressing the above issues. These schemes, directed to detecting dense transient events and encoding of portions of audio signals comprising dense transient events, respectively, may be employed individually or in conjunction with each other.
  • the present disclosure relates to improving audio quality of dense transient event audio signals (such as applause, crackling fire, rain, etc.), without negatively impacting audio quality of other classes of audio signals.
  • the present disclosure further seeks to achieve this goal at low complexity at the encoder-side, with negligible memory footprint and delay.
  • the present disclosure describes methods for detecting dense transient events in (portions of) audio signals, using features that are already computed in a perceptual audio encoder.
  • the present disclosure further describes methods for quantizing dense transient event audio signals using a special constant signal-to-noise ratio quantization noise shaping mode to improve the audio quality of these dense transient audio signals.
  • the present disclosure further proposes to conditionally apply this special constant signal-to-noise ratio quantization noise shaping mode in accordance with a result of the detection of dense transient events in the audio signal.
  • the present disclosure is particularly, though not exclusively, applicable to the AC-4 audio codec.
  • a portion of an audio signal shall mean a section of certain length (e.g., in the time domain or in the frequency domain) of an audio signal.
  • a portion may relate to a certain number of samples (e.g., Pulse Code Modulation, PCM, samples), to a certain number of frames, may be defined to extend over a certain amount of time (e.g., over a certain number of ms), or may relate to a certain number of frequency coefficients (e.g., MDCT coefficients).
  • the portion of the audio signal may indicate a frame of the audio signal or a sub-frame of the audio signal.
  • the audio signal may include more than one channel (e.g., two channels in a stereo configuration, or 5.1 channels, 7.1 channels, etc.).
  • the portion of the audio signal shall mean a section of certain length, as described above, of the audio signal in a given one of the channels of the audio signal.
  • the present disclosure is applicable to any or each of the channels of a multi-channel audio signal. Multiple channels may be processed in parallel or sequentially. Further, the present disclosure may be applied to a sequence of portions, and respective portions may be processed sequentially by the proposed methods and apparatus.
  • dense transient events shall mean a series of individual, brief (measurable) events (e.g., hand claps of applause, fire crackles, splashes of rain) which persist as (e.g., impulsive) noise bursts.
  • Dense transient signals (signals of dense transient events) within the meaning of the present disclosure (and for which the proposed detector for dense transient events would turn ON) shall include 20 to 60 measurable transient events per second, e.g. 30 to 50, or typically 40 measurable events per second. Time intervals between subsequent transient events in dense transient events may vary. Dense transient events are distinct from tonal audio signals (such as music), speech, and sparse transient events (such as castanets, for example).
  • dense transient events may be noisy (i.e., without strong, stable periodic components) and rough (i.e., with an amplitude modulated in the 20-60 Hz range). Dense transient events may also be referred to as sound textures. Examples of dense transient events include applause, crackling fire, rain, running water, babble, machinery, etc.
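The event-rate range in the definition above (20 to 60 measurable events per second) can be illustrated numerically. This is not the detector proposed in the disclosure, which relies on perceptual entropy rather than on counting onsets; the onset timestamps, function names, and example signals below are hypothetical.

```python
def events_per_second(onset_times_s, duration_s):
    """Average number of transient events per second over the given duration."""
    return len(onset_times_s) / duration_s

def in_dense_range(onset_times_s, duration_s, low=20.0, high=60.0):
    """True if the event rate falls within the 20 to 60 events/s range."""
    rate = events_per_second(onset_times_s, duration_s)
    return low <= rate <= high

# Hypothetical onset lists over one second of audio.
applause_like = [i / 40.0 for i in range(40)]   # 40 events per second
castanets_like = [0.0, 0.25, 0.5, 0.75]         # 4 events per second (sparse)
```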
  • Fig. 1 is a block diagram of an encoder 100 (e.g., AC-4 encoder) to which embodiments of the disclosure may be applied.
  • Fig. 2 is a block diagram of a corresponding decoder 200 (e.g., AC-4 decoder).
  • the encoder 100 comprises a filterbank analysis block 110, a parametric coding block 120, a filterbank synthesis block 130, a time-frequency transform block 140, a quantization block 150, a coding block 160, a psychoacoustic modeling block 170, and a bit allocation block 190.
  • the parametric coding block 120 may comprise (not shown) parametric bandwidth extension coding tools (A-SPX), parametric multi-channel coding tools, and a companding tool for temporal noise shaping.
  • the time-frequency transform block 140, the quantization block 150, the psychoacoustic modeling block 170, and the bit allocation block 190 may be said to form an audio spectral frontend (ASF) of the encoder 100.
  • the present disclosure may be said to relate to an implementation (modification) of the ASF of the encoder 100.
  • the present disclosure may be said to relate to modifying the psychoacoustic model in the ASF (e.g., of AC-4) to enforce a different noise shaping guided by an additional detector located in the ASF for detecting dense transient events.
  • the present disclosure is not so limited and may be likewise applied to other encoders.
  • the encoder 100 receives an input audio signal 10 (e.g., samples of an audio signal, such as PCM samples, for example) as an input.
  • the input audio signal 10 may have one or more channels, e.g. may be a stereo signal with a pair of channels, or a 5.1 channel signal. However, the present disclosure shall not be limited to any particular number of channels.
  • the input audio signal 10 (e.g., the samples of the audio signal) is subjected to a filterbank analysis (e.g., a QMF analysis) at the filterbank analysis block 110.
  • parametric coding, which may involve bandwidth extension and/or channel extension, is performed at the parametric coding block 120.
  • filterbank synthesis (e.g., QMF synthesis) is performed at the filterbank synthesis block 130.
  • the audio signal is provided to the time-frequency transform block 140, at which a time-frequency analysis (e.g., MDCT analysis) is performed.
  • The time-frequency analysis yields a sequence of blocks of frequency coefficients (MDCT coefficients). Each block of frequency coefficients corresponds to a block of samples of the audio signal. The number of samples in each block of samples of the audio signal is given by the transform length that is used by the MDCT.
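The block-wise transform can be sketched with a direct, unwindowed MDCT. This is a slow reference form for illustration only; it assumes the standard MDCT property that each block of 2N samples yields N coefficients with 50% overlap between consecutive blocks. How the codec's "transform length" maps onto block size and overlap, and which window it applies, is codec-specific and not modeled here. The function names are hypothetical.

```python
import math

def mdct_block(samples):
    """MDCT of one block of 2N samples, returning N coefficients (no window)."""
    two_n = len(samples)
    if two_n % 2:
        raise ValueError("block length must be even")
    n_half = two_n // 2
    return [
        sum(x * math.cos(math.pi / n_half * (n + 0.5 + n_half / 2.0) * (k + 0.5))
            for n, x in enumerate(samples))
        for k in range(n_half)
    ]

def mdct_blocks(signal, num_coeffs):
    """Split a signal into 50%-overlapping blocks and transform each one."""
    block_len = 2 * num_coeffs
    return [mdct_block(signal[start:start + block_len])
            for start in range(0, len(signal) - block_len + 1, num_coeffs)]

# 16 samples with 4 coefficients per block give 3 overlapping blocks.
blocks = mdct_blocks([0.0] * 16, num_coeffs=4)
```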
  • a psychoacoustic model is applied to the MDCT coefficients at the psychoacoustic modeling block 170.
  • the psychoacoustic model may group the MDCT coefficients into frequency bands (e.g., scalefactor bands), the respective bandwidths of which may depend on the sensitivity of the human auditory system at the frequency bands' center frequency.
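Grouping coefficients into bands can be sketched as follows. The band boundaries in the example are made up for illustration; real scalefactor band tables are defined by the codec and are not reproduced here. The function name and the offset-list format are assumptions.

```python
def band_energies(coeffs, band_offsets):
    """Energy per band, where band i spans band_offsets[i]:band_offsets[i + 1]."""
    return [sum(c * c for c in coeffs[lo:hi])
            for lo, hi in zip(band_offsets[:-1], band_offsets[1:])]

# Hypothetical block of 8 coefficients split into bands of width 2, 2, and 4,
# mimicking bands that get wider towards higher frequencies.
coeffs = [1.0, -1.0, 2.0, 0.5, 0.5, 0.0, 0.0, 3.0]
offsets = [0, 2, 4, 8]
energies = band_energies(coeffs, offsets)
```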
  • a masking threshold 180 e.g., psychoacoustic threshold
  • the number of allocated bits for a frequency band may translate into a quantization step size (e.g., scalefactor).
  • the (masked) MDCT coefficients in each frequency band are quantized at the quantization block 150 in accordance with the determined bit allocation for the respective frequency band, i.e., the MDCT coefficients are quantized in accordance with the psychoacoustic model.
  • the quantized MDCT coefficients are then encoded at the coding block 160.
  • the encoder 100 outputs a bitstream (e.g., AC-4 bitstream) 20 that can be used for storing or for transmission to a decoder.
  • the above-described operations at each block may be performed for each of the channels of the audio signal.
  • the corresponding decoder 200 (e.g., AC-4 decoder) is shown in Fig. 2 and comprises an inverse coding block 260, an inverse quantization block 250, a stereo and multi-channel (MC) audio processing block 245, an inverse time-frequency transform block 240, a filterbank analysis block 230, an inverse parametric coding block 220, and a filterbank synthesis block 210.
  • the inverse parametric coding block 220 comprises a companding block 222, an A-SPX block 224, and a parametric multi-channel coding block 226.
  • the decoder 200 receives an input bitstream (e.g., AC-4 bitstream) 20 and outputs an output audio signal (e.g., PCM samples) 30 for one or more channels.
  • the blocks of the decoder 200 reverse respective operations of the blocks of the encoder 100.
  • any of the methods described below may also comprise applying a time-frequency transform to the portion of the audio signal.
  • an MDCT is applied to the (portion of the) audio signal.
  • the time-frequency transform e.g. MDCT
  • a sequence of blocks of frequency coefficients e.g., MDCT coefficients.
  • Each block of frequency coefficients in said sequence corresponds to a respective block of samples, wherein the number of samples in each block of samples is given by the transform length.
  • the blocks of samples corresponding to the sequence of blocks of frequency coefficients may correspond to a frame or a half-frame, depending on the relevant audio codec.
  • a psychoacoustic model may be calculated for frequency bands (e.g., for the so-called scalefactor bands, which are groups of frequency sub-bands, e.g., groups of MDCT lines).
  • all frequency coefficients (e.g., MDCT coefficients) of a frequency band may be quantized with the same scalefactor, wherein the scalefactor determines the quantizer step size (quantization step size).
  • a masking threshold may be applied to the frequency bands to determine how the frequency coefficients in a given frequency band shall be quantized. For example, the masking threshold may determine, possibly together with other factors, the quantization step size for quantization. At least part of the methods described below relate to selecting or modifying quantization parameters (e.g., masking thresholds and scalefactors) for quantization.
  • Fig. 3 is a flow chart illustrating an example of a method 300 of encoding a portion (e.g., frame) of an audio signal according to embodiments of the disclosure. This method may be advantageously applied for encoding portions of an audio signal that contain dense transient events, such as applause, crackling fire, or rain, for example.
  • a value of a first feature relating to a perceptual entropy of the portion of the audio signal is obtained.
  • the value of the first feature may be determined, computed, or calculated, possibly following analysis of the portion of the audio signal.
  • the value of the first feature may be obtained in the frequency domain (e.g., in the MDCT domain).
  • the portion of the audio signal may be analyzed in the frequency domain (e.g., MDCT domain).
  • the value of the first feature may also be obtained in the time domain.
  • speech codecs are typically time-domain codecs based on linear prediction. Linear prediction filter coefficients model the signal spectrum and also the masking model in speech codecs is derived from the linear prediction coefficients, so that features relating to perceptual entropy can be derived also in time-domain codecs.
  • the first feature may be given by or may be proportional to the perceptual entropy of the given portion of the audio signal.
  • the perceptual entropy is a measure of the amount of perceptually relevant information contained in a (portion of a) given audio signal. It represents a theoretical limit on the compressibility of the given audio signal (provided that a perceivable loss in audio quality is to be avoided).
  • the perceptual entropy may be determined for each frequency band in an MDCT representation of the portion of the audio signal and may be generally said to depend, for a given frequency band (e.g., scalefactor band) on a ratio between the energy spectrum (energy) of the given frequency band and a psychoacoustic threshold in an applicable psychoacoustic model for the given frequency band.
  • the value of the first feature may be calculated in a psychoacoustic model, for example in the manner described in document 3GPP TS 26.403 (V1.0.0), section 5.6.1.1.3, which section is hereby incorporated by reference in its entirety.
  • the perceptual entropy is determined as follows.
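  • the perceptual entropy is determined for each scalefactor band (as an example of a frequency band) via $pe(n) = nl \cdot \log_2\left(\frac{en(n)}{thr(n)}\right)$ if $\log_2\left(\frac{en(n)}{thr(n)}\right) \geq c_1$, and $pe(n) = nl \cdot \left(c_2 + c_3 \cdot \log_2\left(\frac{en(n)}{thr(n)}\right)\right)$ otherwise, with $c_1 = \log_2(8)$, $c_2 = \log_2(2.5)$, and $c_3 = 1 - c_2/c_1$.
  • the energy en(n) of the n-th scalefactor band is given by $en(n) = \sum_{k=kOffset(n)}^{kOffset(n+1)-1} |X(k)|^2$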
  • n denotes the index of the respective scalefactor band
  • X(k) is the value of the frequency coefficient (e.g., MDCT line) for index k
  • kOffset(n) is the index of the lowest-frequency (i.e., first) MDCT line of the n-th scalefactor band.
  • nl denotes the estimate of the number of lines in the scalefactor band that will not be zero after quantization. This number can be derived from the form factor ffac(n) via $nl = \frac{ffac(n)}{\left(en(n) / \left(kOffset(n+1) - kOffset(n)\right)\right)^{0.25}}$, where en(n) denotes the energy of the n-th scalefactor band.
  • the form factor ffac(n) is defined as $ffac(n) = \sum_{k=kOffset(n)}^{kOffset(n+1)-1} \sqrt{|X(k)|}$
  • thr(n) denotes the psychoacoustic threshold for the n-th scalefactor band.
  • One way to determine the psychoacoustic threshold thr is described in section 5.4.2 of document 3GPP TS 26.403 (V1.0.0), which section is hereby incorporated by reference in its entirety.
  • peOffset is a constant value (that may be zero in some implementations) that can be added to the sum of the per-band perceptual entropies to achieve a more linear relationship between perceptual entropy and the number of bits needed for encoding the portion (e.g., frame) of the audio signal.
  • the above expression for the perceptual entropy can be split into several components (e.g., terms and/or factors). It is considered that a combination of any, some, or all of these components may be used instead of the full expression for the perceptual entropy for obtaining the value of the first feature.
  • the perceptual entropy of a given frequency band in the context of this disclosure can be said to depend on a ratio between the energy spectrum (energy) en of the given frequency band and the psychoacoustic threshold thr for the given frequency band.
  • the first feature may be said to depend on the ratio between the energy spectrum (energy) en of the given frequency band and the psychoacoustic threshold thr for the given frequency band.
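  • The per-band computation described above can be sketched in Python as follows. This is a minimal illustration, assuming the piecewise expression of 3GPP TS 26.403, section 5.6.1.1.3. The function names, the list-of-bands input layout, and the choice to assign zero perceptual entropy to bands whose energy does not exceed the threshold are assumptions made for this example and are not taken from the disclosure.

```python
import math

# Constants of the piecewise perceptual-entropy expression.
C1 = math.log2(8.0)
C2 = math.log2(2.5)
C3 = 1.0 - C2 / C1


def band_energy(coeffs):
    """Energy en(n): sum of squared frequency coefficients of one band."""
    return sum(x * x for x in coeffs)


def form_factor(coeffs):
    """Form factor ffac(n): sum of square roots of coefficient magnitudes."""
    return sum(math.sqrt(abs(x)) for x in coeffs)


def nonzero_line_estimate(coeffs):
    """Estimate nl of the number of lines not quantized to zero."""
    en = band_energy(coeffs)
    if en <= 0.0:
        return 0.0
    return form_factor(coeffs) / (en / len(coeffs)) ** 0.25


def band_perceptual_entropy(coeffs, thr):
    """Perceptual entropy pe(n) of one band for a threshold thr > 0."""
    if thr <= 0.0:
        raise ValueError("thr must be positive")
    en = band_energy(coeffs)
    if en <= thr:
        # Assumption of this sketch: fully masked bands contribute nothing.
        return 0.0
    nl = nonzero_line_estimate(coeffs)
    log_ratio = math.log2(en / thr)
    if log_ratio >= C1:
        return nl * log_ratio
    return nl * (C2 + C3 * log_ratio)


def total_perceptual_entropy(bands, thresholds, pe_offset=0.0):
    """Sum of the per-band values plus the constant peOffset."""
    return pe_offset + sum(
        band_perceptual_entropy(b, t) for b, t in zip(bands, thresholds)
    )
```

  • In this sketch, each band is a list of MDCT coefficients and `thresholds` holds one psychoacoustic threshold per band; the result grows with the ratio between band energy and threshold, as stated above.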
  • a quantization mode for quantizing the portion of the audio signal is selected based on the value of the first feature.
  • the quantization mode may be said to be selected based on the first feature. This may involve a determination of, based at least in part on the value of the first feature, whether a quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency (e.g., for all frequency bands) shall be used for the portion of the audio signal (e.g., for the frequency coefficients, such as MDCT coefficients, for example, of a frequency domain representation of the portion of the audio signal).
  • This quantization mode may be referred to as constant SNR mode, constant SNR quantization mode, or constant SNR quantization noise shaping mode.
  • Applying the constant SNR quantization mode may be referred to as applying a dense transient event improvement (e.g., applause improvement), or simply, as applying an improvement, to the portion of the audio signal. Without intended limitation, applying this improvement may also be referred to as applying a fix in the remainder of this disclosure, without this term implying that the improvement is only of a temporary nature.
  • the constant SNR quantization mode is suitable for quantizing portions of dense transient events and may produce a pleasant auditory result for such audio signals.
  • the constant SNR quantization mode may degrade other audio signals, such as music and speech, or combinations of dense transient events with music or speech, which typically require non-constant SNR for best perceptual quality. This issue is addressed by the selection process for the quantization mode at step S320.
  • Selection of the quantization mode at step S320 may be said to correspond to modifying the psychoacoustic model that is used for quantizing the audio signal (e.g., modifying the frequency coefficients, or MDCT coefficients) to apply (e.g., enforce) a different noise shaping in the quantization process.
  • the obtained value of the first feature may be smoothed over time, in order to avoid unnecessary toggling of the selection at step S320.
  • frame-to-frame switching of the selection can be avoided by considering a time-smoothed version of the value of the first feature.
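  • The disclosure does not specify how the value of the first feature is smoothed over time. As one possible illustration, the following Python sketch uses a first-order recursive (exponential) average; the function name `smooth_feature` and the smoothing weight `alpha` are assumptions made for this example.

```python
def smooth_feature(values, alpha=0.9):
    """Return a recursively smoothed copy of a per-frame feature track.

    alpha is the weight given to the previous smoothed value; larger
    values react more slowly and therefore toggle less often.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    smoothed = []
    state = None
    for value in values:
        # The first frame initializes the smoother with the raw value.
        state = value if state is None else alpha * state + (1.0 - alpha) * value
        smoothed.append(state)
    return smoothed
```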
  • the perceptual entropy is a suitable feature for discriminating portions of an audio signal that contain dense transient events (e.g., applause, crackling fire, rain, etc.) from portions that contain speech or music.
  • This histogram is normalized so that bar heights add to one and uniform bin width is used.
  • the horizontal axis indicates a (time-smoothed) measure of perceptual entropy
  • the vertical axis indicates a (normalized) count of items per bin of the measure of perceptual entropy.
  • Bin counts 810 (dark grey) in the histogram relate to a set of audio items that have been manually classified as applause items (in particular, applause items that are improved by the fix), whereas bin counts 820 (white) relate to a set of audio items that have been manually classified as non-applause items (e.g., speech or music).
  • the perceptual entropy is consistently higher for the applause items than for the non-applause items, so that the perceptual entropy can provide for suitable discrimination between the two classes of audio items.
  • the perceptual entropy is also a suitable feature for discriminating between portions of an audio signal that contain dense transient events and are improved by the fix, and portions of an audio signal that contain dense transient events but that may not be improved by the fix (e.g., portions that contain dense transient events, but that also contain speech and/or music).
  • This is illustrated in the histogram of Fig. 9, in which the horizontal axis indicates a (time-smoothed) measure of perceptual entropy, and the vertical axis indicates a (normalized) count of items per bin of the measure of perceptual entropy.
  • Bin counts 910 (dark grey) in the histogram relate to a set of audio items that have been manually classified as applause items that are improved by the fix, whereas bin counts 920 (white) relate to a set of audio items that have been manually classified as applause items that are not improved by the fix.
  • the perceptual entropy is consistently higher for the applause items that are improved by the fix than for applause items that are not improved by the fix, so that the perceptual entropy can provide for suitable discrimination between the two classes of audio items.
  • the (time-smoothed) perceptual entropy can also be used to sub-classify audio items relating to dense transient events (such as applause, crackling fire, rain, etc.).
  • the determination of whether a quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency shall be used for the portion of the audio signal may involve comparing the value of the first feature (or, if available, the time-smoothed value of the first feature) to a predetermined threshold for the value of the first feature.
  • This threshold may be determined manually, for example, to have a value that ensures reliable classification of audio items into applause items (or applause items that are improved by the fix) and non-applause items.
  • the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency may be conditionally selected in accordance with (e.g., depending on) a result of this comparison.
  • the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency may be selected if (e.g., only if) the value of the first feature (or the time-smoothed value of the first feature) is above the predetermined threshold for the first feature.
  • reference to applause, as an example of an audio item containing dense transient events, is made without intended limitation, and the present disclosure shall not be construed to be in any way limited by this reference.
  • the determination may be based on a variation over time of the value of the first feature (notably, the variation over time would be determined from the un-smoothed version of the value of the first feature).
  • This variation over time may be the standard deviation over time or a maximum deviation from the mean over time, for example.
  • the time variation may indicate a temporal variation or temporal peakedness of the value of the first feature.
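  • The two example measures of variation over time named above (standard deviation and maximum deviation from the mean) can be sketched in Python as follows. The input is assumed to be a window of un-smoothed per-frame values of the first feature; the window length and the function names are assumptions made for this example.

```python
import statistics


def std_deviation_over_time(values):
    """Population standard deviation of the feature over the window."""
    return statistics.pstdev(values)


def max_deviation_from_mean(values):
    """Largest absolute distance of any frame value from the window mean."""
    mean = statistics.mean(values)
    return max(abs(v - mean) for v in values)
```

  • A steady feature track yields small values for both measures, whereas a track with isolated bursts yields a large maximum deviation in particular.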
  • the time variation of the perceptual entropy is suitable for discriminating portions of an audio signal that contain dense transient events (e.g., applause, crackling fire, rain, etc.) from portions that contain speech and/or music. This is illustrated in the graphs of Figs. 12A and 12B and Figs. 13A and 13B.
  • Fig. 12A illustrates the broad band energy (in dB) for different channels of an applause audio signal (as an example of an audio signal of dense transient events) as a function of time
  • Fig. 12B illustrates the perceptual entropy for the different channels of the applause audio signal as a function of time
  • Fig. 13A illustrates the broad band energy (in dB) for different channels of a music audio signal as a function of time
  • Fig. 13B illustrates the perceptual entropy for the different channels of the music audio signal as a function of time.
  • dense transient event signals (e.g., applause signals) show little variation of perceptual entropy over time at a high average perceptual entropy, whereas non-dense transient event signals may have high bursts of perceptual entropy, but at lower average perceptual entropy. Therefore, any features derived from perceptual entropy that indicate temporal variation or temporal peakedness of perceptual entropy may also be used to detect dense transient events and to discriminate dense transient events from, for example, music and/or speech.
  • the determination of whether the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency shall be used for the portion of the audio signal may involve comparing the variation over time of the value of the first feature to a predetermined threshold for the variation over time of the value of the first feature. This threshold may also be determined manually, for example, in line with the criteria set out above for the threshold for the value of the first feature. Then, the decision of whether or not to select the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency may be made in accordance with (e.g., depending on) a result of this comparison.
  • the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency may be selected if (e.g., only if) the variation over time of the value of the first feature is below the predetermined threshold for the variation over time of the value of the first feature.
  • either or both of the (time-smoothed) value of the first feature and the variation over time of the value of the first feature may be referred to for determining whether to use the constant SNR quantization mode. If both are referred to, the decision of whether or not to select the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency may be made in accordance with (e.g., depending on) the results of both the aforementioned comparisons to respective thresholds.
  • the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency may be selected if (e.g., only if) the (time-smoothed) value of the first feature is above the predetermined threshold for the value of the first feature and the variation over time of the value of the first feature is below the predetermined threshold for the variation over time of the value of the first feature.
  • a quantization mode that does not apply a substantially constant SNR over frequency i.e., that applies different SNRs to different frequencies or frequency bands
  • the constant SNR quantization mode is conditionally applied depending on whether the aforementioned criteria of the determination are met.
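  • The selection logic of step S320 can be sketched in Python as follows. The function name, the mode labels, and the convention that the variation criterion is skipped when no variation value is supplied are assumptions made for this example; the thresholds themselves would be determined manually, as described above.

```python
CONSTANT_SNR = "constant_snr"  # hypothetical label for the constant SNR mode
DEFAULT = "default"            # hypothetical label for the regular mode


def select_quantization_mode(smoothed_pe, pe_threshold,
                             pe_variation=None, variation_threshold=None):
    """Select the quantization mode from the first feature.

    The constant SNR mode is chosen only if the (time-smoothed) value is
    above its threshold and, when a variation over time is supplied, that
    variation is below its threshold.
    """
    high_entropy = smoothed_pe > pe_threshold
    if pe_variation is None or variation_threshold is None:
        return CONSTANT_SNR if high_entropy else DEFAULT
    steady = pe_variation < variation_threshold
    return CONSTANT_SNR if (high_entropy and steady) else DEFAULT
```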
  • the portion of the audio signal is quantized using the selected quantization mode. More specifically, frequency coefficients (e.g., MDCT coefficients) of the portion of the audio signal may be quantized at this step. Quantization may be performed in accordance with the psychoacoustic model. Further, quantization may involve noise shaping (i.e., shaping of quantization noise).
  • if the selected quantization mode is the quantization mode that applies (e.g., enforces) a (substantially) constant SNR over frequency (e.g., over frequency bands), this may involve selecting appropriate quantization parameters, such as masking thresholds and/or quantization step sizes (e.g., scalefactors), or appropriately modifying the quantization parameters, to achieve the substantially constant SNR over frequency (e.g., over frequency bands, such as scalefactor bands).
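  • One conceivable way of modifying masking thresholds so that the SNR is the same in every frequency band is to set each band's threshold to a fixed fraction of that band's energy. The Python sketch below illustrates this idea only; the disclosure does not prescribe this particular rule, and the function name and the target value `snr_db` are assumptions made for this example.

```python
def constant_snr_thresholds(band_energies, snr_db):
    """Return per-band thresholds that give the same SNR in every band.

    Each threshold is the band energy divided by the linear SNR, so the
    ratio of energy to allowed noise is identical across bands.
    """
    linear_snr = 10.0 ** (snr_db / 10.0)
    return [energy / linear_snr for energy in band_energies]
```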
  • the perceptual entropy of (a portion of) an audio signal is computed during normal encoding operation of state-of-the-art audio encoders, such as AC-4, for example.
  • Fig. 4 is a flow chart illustrating an example of a variation 400 of the method 300 of Fig. 3.
  • Step S410 in variation 400 corresponds to step S310 of method 300 in Fig. 3 and any statements made above with respect to this step apply also here.
  • a value of a second feature relating to a measure of sparsity (e.g., spectral sparsity) in the frequency domain of the portion of the audio signal is obtained.
  • the value of the second feature may be determined, computed, or calculated, possibly following analysis of the portion of the audio signal.
  • the value of the second feature may be obtained in the frequency domain (e.g., in the MDCT domain).
  • the portion of the audio signal may be analyzed in the frequency domain (e.g., MDCT domain).
  • the value of the second feature may also be obtained in the time domain.
  • the measure of sparsity may be given by or relate to the form factor. That is, the value of the second feature may be given by or relate to the form factor (in the frequency domain) for the portion of the audio signal. For example, the value of the second feature may be proportional to the form factor or the perceptually weighted form factor.
  • the perceptually weighted form factor may be said to be an estimate of a number of frequency coefficients (e.g., per frequency band) that are (expected to be) not quantized to zero.
  • the form factor depends on a sum of the square root of the absolute values of the frequency coefficients of a frequency-domain representation of a portion of an audio signal, e.g., for each frequency band.
  • An overall form factor may be obtained by summing the form factors for all frequency bands.
  • a prescription for calculating the form factor in the context of the perceptual model of AC-4 has been given above in the context of the discussion of step S310.
  • a perceptually weighted form factor may be used as the measure of sparsity (e.g., as the second feature).
  • An example for a perceptually weighted form factor is given by the number nl that has been discussed above in the context of S310.
  • An overall perceptually weighted form factor may be obtained by summing perceptually weighted form factors for all frequency bands.
  • the second feature is assumed to have a higher value for a spectrally denser representation of the (portion of the) audio signal, and to have a lower value for a spectrally sparser representation of the (portion of the) audio signal.
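  • The measure of sparsity described above can be sketched in Python as follows, using either the form factor or the perceptually weighted form factor (the number nl) per frequency band and summing over all bands. The function names and the list-of-bands input layout are assumptions made for this example.

```python
import math


def form_factor(coeffs):
    """Sum of square roots of the coefficient magnitudes of one band."""
    return sum(math.sqrt(abs(x)) for x in coeffs)


def perceptually_weighted_form_factor(coeffs):
    """Estimate nl of the number of lines not quantized to zero."""
    energy = sum(x * x for x in coeffs)
    if energy <= 0.0:
        return 0.0
    return form_factor(coeffs) / (energy / len(coeffs)) ** 0.25


def overall_sparsity_measure(bands, weighted=True):
    """Sum the per-band measure over all frequency bands."""
    per_band = perceptually_weighted_form_factor if weighted else form_factor
    return sum(per_band(band) for band in bands)
```

  • For two bands of equal energy, the band whose energy is spread over all lines yields a higher value than the band whose energy sits in a single line, in line with the convention that a spectrally denser representation gives a higher value of the second feature.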
  • a quantization mode for quantizing the portion of the audio signal is selected based (at least in part) on the value of the first feature and the value of the second feature.
  • the quantization mode may be said to be selected based on the first feature and the second feature. This may involve a determination of, based (at least in part) on the value of the first feature and the value of the second feature, whether the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency (e.g., for all frequency bands) shall be used for the portion of the audio signal (e.g., for the frequency coefficients, such as MDCT coefficients, for example, of a frequency domain representation of the portion of the audio signal).
  • Selection of the quantization mode at step S420 may be said to correspond to modifying the psychoacoustic model that is used for quantizing the audio signal (e.g., modifying the frequency coefficients, or MDCT coefficients) to apply (e.g., enforce) a different noise shaping in the quantization process.
  • the obtained value of the second feature may be smoothed over time, in order to avoid unnecessary toggling of the selection at step S420.
  • frame-to-frame switching of the selection can be avoided by considering a time-smoothed version of the value of the second feature.
  • the selection would be based, at least in part, on the (time-smoothed, if available) value of the first feature and the time-smoothed value of the second feature.
  • the reason for considering also the value of the second feature is the following.
  • the (time-smoothed) perceptual entropy alone may not under all circumstances be sufficient for distinguishing between dense transient event audio items (such as applause items, for example) that are improved by the fix and audio items that contain dense transient events together with speech (including cheering) and/or music (and that may not be improved by the fix).
  • This is illustrated in the histogram of Fig. 10, in which the horizontal axis indicates a (time-smoothed) measure of perceptual entropy, and the vertical axis indicates a (normalized) count of items per bin of the measure of perceptual entropy.
  • Bin counts 1010 (dark grey) in the histogram relate to a set of audio items that have been manually classified as applause items that are improved by the fix, whereas bin counts 1020 (white) relate to a set of audio items that have been manually classified as applause containing speech (including cheering) and/or music. As can be seen from the histogram, distinguishing between these two classes of audio items may be difficult, depending on circumstances.
  • the sparsity in the frequency domain is a suitable feature for discriminating portions of an audio signal that contain dense transient events (e.g., applause, crackling fire, rain, etc.) and that are improved by the fix from portions that contain dense transient events together with speech (including cheering) or music (and that may not be improved by the fix).
  • the horizontal axis indicates a (time-smoothed) measure of sparsity in the frequency domain
  • the vertical axis indicates a (normalized) count of items per bin of the measure of sparsity in the frequency domain.
  • Bin counts 1110 (dark grey)
  • bin counts 1120 (white)
  • the measure of sparsity in the frequency domain is consistently higher for the applause items than for the items relating to applause containing speech (including cheering) and/or music, so that the sparsity in the frequency domain can provide for suitable discrimination between the two classes of audio items.
  • the determination of whether the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency shall be used for the portion of the audio signal may involve, in addition to the determination based on the value of the first feature (see, e.g., step S320 described above), comparing the value of the second feature (or, if available, the time-smoothed value of the second feature) to a predetermined threshold for the value of the second feature.
  • This threshold may be determined manually, for example, to have a value that ensures reliable classification of audio items into applause items that are improved by the fix and items relating to applause containing speech (including cheering) and/or music.
  • the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be conditionally selected in accordance with (e.g., depending on) a result of the comparison.
  • the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be selected if (e.g., only if) the value of the second feature (or the time-smoothed value of the second feature) is above the predetermined threshold for the second feature.
  • reference to applause, as an example of an audio item containing dense transient events, is made without intended limitation, and the present disclosure shall not be construed to be in any way limited by this reference.
  • the decision of whether or not to select the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be based on the result of the comparison of the (time-smoothed) value of the first feature to its respective threshold and/or the result of the comparison of the time variation of the value of the first feature to its respective threshold, and the result of the comparison of the (time-smoothed) value of the second feature to its respective threshold.
  • the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency shall be used for the portion of the audio signal if (e.g., only if) the (time-smoothed) value of the first feature is above the predetermined threshold for the value of the first feature and/or the time variation of the value of the first feature is below the predetermined threshold for the time variation of the value of the first feature, and the (time-smoothed) value of the second feature is above the predetermined threshold for the value of the second feature.
  • a quantization mode that does not apply a substantially constant SNR over frequency i.e., that applies different SNRs to different frequencies or frequency bands
  • the constant SNR quantization mode is conditionally applied depending on whether the aforementioned criteria of the determination are met.
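  • The combined decision of step S420 can be sketched in Python as follows. The "and/or" between the two criteria on the first feature is represented by a switch; the function name, the parameter names, and the switch `require_both_pe_criteria` are assumptions made for this example.

```python
def use_constant_snr_mode(smoothed_pe, pe_variation, smoothed_sparsity,
                          pe_threshold, variation_threshold,
                          sparsity_threshold, require_both_pe_criteria=True):
    """Decide whether the constant SNR quantization mode shall be used.

    The first feature must pass its value criterion and/or its variation
    criterion, and the second feature must exceed its own threshold.
    """
    high_entropy = smoothed_pe > pe_threshold
    steady_entropy = pe_variation < variation_threshold
    if require_both_pe_criteria:
        first_feature_ok = high_entropy and steady_entropy
    else:
        first_feature_ok = high_entropy or steady_entropy
    dense_spectrum = smoothed_sparsity > sparsity_threshold
    return first_feature_ok and dense_spectrum
```

  • When the function returns False, a quantization mode that does not apply a substantially constant SNR over frequency would be used instead.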
  • step S420 may nevertheless produce an auditory result that is perceived overall as an improvement over conventional techniques for encoding dense transient events.
  • Step S430 in variation 400 corresponds to step S330 of method 300 in Fig. 3 and any statements made above with respect to this step apply also here.
  • the form factor and the perceptually weighted form factor of (a portion of) an audio signal are computed during normal encoding operation of state-of-the-art audio encoders, such as AC-4, for example.
  • a method 500 for detecting dense transient events (e.g., applause, crackling fire, rain, etc.) in a portion of an audio signal (e.g., for classifying a portion of an audio signal as to whether the portion is likely to contain dense transient events)
  • the portion is classified as likely to contain dense transient events if (e.g., only if) a probability that the portion contains dense transient events is found to exceed a predetermined probability threshold.
  • Step S510 in method 500 corresponds to step S310 of method 300 in Fig. 3 and any statements made above with respect to this step apply also here.
  • step S520 it is determined whether the portion of the audio signal is likely to contain dense transient events based at least in part on the value of the first feature.
  • This step corresponds to the determination of, based at least in part on the value of the first feature, whether the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency (e.g., for all frequency bands) shall be used for the portion of the audio signal in step S320 of method 300 in Fig. 3, except that this determination is replaced by the determination of whether the portion of the audio signal is likely to contain dense transient events based at least in part on the value of the first feature. Otherwise, the details of the determination, in particular the determination criteria, are the same as in step S320 of method 300 in Fig. 3 and any statements made above with respect to this step apply also here.
  • An apparatus or module performing steps S510 and S520 may be referred to as a detector for detecting dense transient events.
  • metadata is generated for the portion of the audio signal.
  • the metadata may be indicative of whether the portion of the audio signal is likely to contain dense transient events (e.g., whether the portion of the audio signal is determined at step S520 to be likely to contain dense transient events).
  • the metadata may include a binary decision bit (e.g., flag) for each portion of the audio signal, which may be set if the portion of the audio signal is (determined to be) likely to contain dense transient events.
  • Providing this kind of metadata enables downstream devices to perform more efficient and/or improved post processing with regard to dense transient events. For example, specific post processing for dense transient events may be performed for a given portion of the audio signal if (e.g., only if, or if and only if) the metadata indicates that the portion of the audio signal is likely to contain dense transient events.
  • step S520 may also be used for other purposes apart from generating metadata, and the present disclosure shall not be construed as being limited to generating metadata that is indicative of the result of the determination (classification).
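  • The generation of per-portion metadata with a binary decision flag, and its use by a downstream device, can be sketched in Python as follows. The class name `FrameMetadata`, its field names, and the helper functions are assumptions made for this example and do not describe any actual bitstream syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameMetadata:
    """Hypothetical per-frame metadata carrying one decision flag."""
    frame_index: int
    dense_transient_flag: bool


def generate_metadata(decisions):
    """Create one metadata entry per frame from the detector decisions."""
    return [FrameMetadata(index, bool(decision))
            for index, decision in enumerate(decisions)]


def frames_for_post_processing(metadata):
    """Return indices of frames flagged as likely dense transient events."""
    return [entry.frame_index for entry in metadata
            if entry.dense_transient_flag]
```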
  • Fig. 6 is a flow chart illustrating an example of a variation 600 of the method 500 of Fig. 5.
  • Step S610 in variation 600 corresponds to step S510 of method 500 in Fig. 5 (and thereby to step S310 of method 300 in Fig. 3 and step S410 of variation 400 in Fig. 4) and any statements made above with respect to this step (or these steps) apply also here.
  • Step S615 in variation 600 corresponds to step S415 of variation 400 of Fig. 4 and any statements made above with respect to this step apply also here.
  • step S620 it is determined whether the portion of the audio signal is likely to contain dense transient events based (at least in part) on the value of the first feature and the value of the second feature.
  • This step corresponds to the determination of, based at least in part on the value of the first feature and the value of the second feature, whether the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency (e.g., for all frequency bands) shall be used for the portion of the audio signal in step S420 of variation 400 in Fig. 4, except that this determination is replaced by the determination of whether the portion of the audio signal is likely to contain dense transient events based (at least in part) on the value of the first feature and the value of the second feature. Otherwise, the details of the determination, in particular the determination criteria, are the same as in step S420 of variation 400 in Fig. 4 and any statements made above with respect to this step apply also here.
  • Step S630 in variation 600 corresponds to step S530 in Fig. 5 and any statements made above with respect to this step apply also here.
  • step S710 it is determined whether the portion of the audio signal is likely to contain dense transient events (e.g., applause, crackling fire, rain, etc.). This determination may involve the same criteria and decisions as the determination of, based at least in part on the value of the first feature, whether a quantization mode that applies a substantially constant signal-to-noise ratio over frequency shall be used for the portion of the audio signal in step S320 of method 300 in Fig. 3 or the determination of, based at least in part on the value of the first feature and the value of the second feature, whether a quantization mode that applies a substantially constant signal-to-noise ratio over frequency shall be used for the portion of the audio signal in step S420 of variation 400 in Fig. 4.
  • this step may comprise obtaining the value of the first feature (e.g., in the manner described with reference to step S310 of method 300 in Fig. 3) and/or obtaining the value of the second feature (e.g., in the manner described with reference to step S415 of variation 400 in Fig. 4).
  • the present disclosure is not limited to these determinations, and other processes for determining whether the portion of the audio signal is likely to contain dense transient events are feasible as well.
  • the portion of the audio signal is quantized using a quantization mode that applies a (substantially) constant signal-to-noise ratio over frequency for the portion of the audio signal.
  • the constant SNR quantization mode is conditionally applied depending on whether the portion of the audio signal is determined to be likely to contain dense transient events.
  • the quantization mode that applies the (substantially) constant SNR has been described above, for example with reference to step S330 of method 300 in Fig. 3.
  • the quantization mode that applies the (substantially) constant signal-to-noise ratio over frequency for the portion of the audio signal is particularly suitable for encoding portions of an audio signal that contain dense transient events.
  • the determination at step S710 ensures that portions of the audio signal for which the constant SNR quantization mode is not suitable are not quantized using this quantization mode, thereby avoiding degradation of such portions.
  • Such apparatus may comprise respective units adapted to carry out respective steps described above.
  • Such apparatus for performing method 300 may comprise a first feature determination unit adapted to perform aforementioned step S310 (and likewise aforementioned steps S410, S510, and S610), a quantization mode selection unit adapted to perform aforementioned step S320, and a quantization unit adapted to perform aforementioned step S330 (and likewise aforementioned steps S430 and S720).
  • an apparatus for performing variation 400 of method 300 may comprise the first feature determination unit, a second feature determination unit adapted to perform aforementioned step S415, a modified quantization mode selection unit adapted to perform aforementioned step S420, and the quantization unit.
  • An apparatus for performing method 500 may comprise the first feature determination unit, an audio content determination unit adapted to perform aforementioned step S520, and optionally a metadata generation unit adapted to perform aforementioned step S530 (and likewise aforementioned step S630).
  • An apparatus for performing variation 600 of method 500 may comprise the first feature determination unit, the second feature determination unit, a modified audio content determination unit adapted to perform aforementioned step S620, and optionally the metadata generation unit.
  • An apparatus for performing method 700 may comprise a dense transient event detection unit adapted to perform aforementioned step S710, and the quantization unit. It is further understood that the respective units of such apparatus (e.g., encoder) may be embodied by a processor of a computing device that is adapted to perform the processing carried out by each of said respective units, i.e. that is adapted to carry out each of the aforementioned steps. This processor may be coupled to a memory that stores respective instructions for the processor.
  • the methods and apparatus described in the present disclosure may be implemented as software, firmware and/or hardware. Certain components may e.g. be implemented as software running on a digital signal processor or microprocessor. Other components may e.g. be implemented as hardware and/or as application specific integrated circuits.
  • the signals encountered in the described methods and apparatus may be stored on media such as random access memory or optical storage media. They may be transferred via networks, such as radio networks, satellite networks, wireless networks or wireline networks, e.g. the Internet.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present disclosure relates to methods and apparatus for audio coding. A method of encoding a portion of an audio signal comprises determining whether the portion of the audio signal is likely to contain dense transient events, and if it is determined that the portion of the audio signal is likely to contain dense transient events, quantizing the portion of the audio signal using a quantization mode that applies a substantially constant signal-to-noise ratio over frequency for the portion of the audio signal. The present disclosure further relates to a method of detecting dense transient events in a portion of an audio signal.

Description

LOW COMPLEXITY DENSE TRANSIENT EVENTS DETECTION AND CODING
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority of the following priority applications: US provisional application 62/528,198 (reference: D17046USP1), filed 03 July 2017 and EP application 17179316.9 (reference: D17046EP), filed 03 July 2017, which are hereby incorporated by reference.
TECHNICAL FIELD
The present disclosure relates to methods of encoding audio signals. The present disclosure further relates to methods of detecting dense transient events in portions of audio signals. The present disclosure also relates to corresponding apparatus, such as encoders, for example.
BACKGROUND
Perceptual or lossy audio codecs (such as MP3, AAC, HE-AAC, AC-4, for example) are known to have problems with compressing audio signals including dense transient events, such as applause, crackling fire, or rain, for example, without loss of perceived audio quality. Conventional efforts to increase compression efficiency typically tend to lead to vastly increased computational complexity at the encoder-side and/or to a loss of perceived audio quality.
The present disclosure addresses the above issues related to audio coding of audio signals including dense transient events, such as applause, crackling fire, or rain, for example, and describes methods and apparatus for improved coding of such audio signals. The present disclosure further deals with detecting dense transient events in audio signals to enable appropriate treatment thereof.
SUMMARY
According to an aspect of the disclosure, a method of encoding a portion (e.g., frame) of an audio signal is described. The method may include obtaining (e.g., determining, calculating, or computing) a value of a first feature relating to a perceptual entropy (PE) of the portion of the audio signal. PE is known in the field of audio coding as a measure of perceptually relevant information contained in a particular audio signal and to represent a theoretical limit on the compressibility of the particular audio signal. The method may further include selecting a quantization mode for quantizing the portion of the audio signal (e.g., for quantizing frequency coefficients of the portion of the audio signal, such as MDCT coefficients, for example) based on the (obtained) value of the first feature. The method may further include quantizing the portion of the audio signal using the selected quantization mode. Selecting the quantization mode may involve determining, based at least in part on the (obtained) value of the first feature, whether a quantization mode that applies (e.g., enforces) a (substantially) constant signal-to-noise ratio (SNR) over frequency (e.g., over frequency bands) shall be used for the portion of the audio signal. This quantization mode may be referred to as constant SNR mode or constant SNR quantization mode. Applying the constant SNR over frequency may involve (e.g., relate to) noise shaping (e.g., quantization noise shaping). This may in turn involve appropriate selection or modification of quantization parameters (e.g., quantization step sizes, masking thresholds). Quantization may be performed on a band-by-band basis. Further, quantization may be performed in accordance with a perceptual model (e.g., psychoacoustic model). In such case, for example, scalefactors for scalefactor bands and/or masking thresholds may be selected or modified in order to attain the substantially constant SNR over frequency when performing the quantization.
By enforcing constant SNR over frequency in quantization, audio signals containing dense transient events (e.g., applause, crackling fire, rain, etc.) can be encoded in a manner that achieves improved perceived quality of the audio after decoding. Since this constant SNR quantization mode is rather unusual for encoding audio signals and may not be suitable for other types of audio signals, the presence of dense transient events in the audio signal is first detected by referring to the perceptual entropy of the audio signal, and the quantization mode is chosen in accordance with the result of the detection. Thereby, degradation of audio signals that do not contain or that do not only contain dense transient events (such as music, speech, applause mixed with music and/or cheering, for example) can be reliably avoided. Since the perceptual entropy is determined anyway in state-of-the-art audio codecs (such as MP3, AAC, HE-AAC, AC-4, for example) for purposes of quantization, performing the aforementioned detection does not significantly add to computational complexity, delay, or memory footprint. Overall, the proposed method improves the perceived quality of audio after decoding without significantly adding to complexity and memory footprint at the encoder-side.
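As a rough illustration of what a constant SNR over frequency bands implies for quantization noise shaping, the following Python sketch derives an allowed noise energy per band from the band energies and a single target SNR. The function name `constant_snr_noise_targets`, the band layout, and the target SNR value are assumptions made for illustration only and are not taken from the disclosure; an actual encoder would map such targets to scalefactors or masking thresholds within its own rate control.

```python
import numpy as np

def constant_snr_noise_targets(coeffs, band_edges, target_snr_db):
    """Allowed quantization noise energy per band for one common SNR.

    Hypothetical helper: coeffs are frequency coefficients (e.g., MDCT lines),
    band_edges are the start indices of the bands plus the final end index.
    """
    coeffs = np.asarray(coeffs, dtype=float)
    snr_linear = 10.0 ** (target_snr_db / 10.0)
    targets = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band_energy = float(np.sum(coeffs[lo:hi] ** 2))
        # Same ratio of signal energy to noise energy in every band.
        targets.append(band_energy / snr_linear)
    return np.array(targets)

# Example: two bands with energies 2 and 8 and a 10 dB target SNR.
noise = constant_snr_noise_targets([1.0, 1.0, 2.0, 2.0], [0, 2, 4], 10.0)
```

In this sketch the noise target scales with the band energy, so louder bands tolerate proportionally more noise and the signal-to-noise ratio stays the same in each band.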
In embodiments, the method may further include smoothing the value of the first feature over time to obtain a time-smoothed value of the first feature. Then, said determining may be based on the time-smoothed value of the first feature.
Thereby, unnecessary toggling of the decision as to which quantization mode to use, where toggling might result in audible artifacts, can be avoided. Accordingly, perceptual quality of the audio output can be further improved.
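The effect of smoothing can be sketched with a simple one-pole recursive filter applied to the per-frame feature values. The disclosure does not prescribe this particular filter here; the function name `smooth_feature` and the smoothing constant `alpha` are assumptions chosen for illustration.

```python
def smooth_feature(values, alpha=0.9):
    """One-pole recursive smoothing: y[n] = alpha * y[n-1] + (1 - alpha) * x[n].

    Hypothetical helper; the first output equals the first input.
    """
    smoothed = []
    state = None
    for x in values:
        state = x if state is None else alpha * state + (1.0 - alpha) * x
        smoothed.append(state)
    return smoothed

# A single outlier frame is attenuated instead of flipping a threshold decision.
pe_per_frame = [100.0, 100.0, 300.0, 100.0]
pe_smoothed = smooth_feature(pe_per_frame, alpha=0.9)
```

A larger `alpha` reacts more slowly to changes, which reduces toggling at the cost of a delayed response when dense transient events begin or end.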
In embodiments, said determining may involve comparing the value of the first feature to a predetermined threshold for the value of the first feature. Said quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be conditionally selected in accordance with a result of the comparison. For example, the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be selected if (e.g., only if) the value of the first feature is above the predetermined threshold for the value of the first feature.
As has been found, perceptual entropy above a certain threshold may be indicative of dense transient events in the audio signal. Thus, a comparison of the value of the first feature to a threshold offers a simple and reliable determination of whether or not the portion of the audio signal is suitable for quantization using the constant SNR quantization mode.
In embodiments, said determining may be (further) based on a variation over time of the value of the first feature. For example, said determining may be based on a temporal variation, such as a standard deviation over time, or a maximum deviation from a mean value over time. For example, said determining may involve comparing the variation over time of the value of the first feature to a predetermined threshold for the variation. Said quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be conditionally selected in accordance with a result of the comparison. For example, the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be selected if (e.g., only if) the variation of the value of the first feature is below the predetermined threshold for the variation. In certain implementations, the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be conditionally selected in accordance with a result of the comparison for the value of the first feature and the comparison for the variation of the value of the first feature over time. For example, the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be selected if (e.g., only if) both the value of the first feature is above its respective threshold and the variation of the first feature over time is below its respective threshold.
As has been found, a perceptual entropy that is high on average but has comparatively little temporal variation may be indicative of dense transient events in the audio signal. Thus, a comparison of the variation over time of the value of the first feature to a threshold offers a simple and reliable determination of whether or not the portion of the audio signal is suitable for quantization using the constant SNR quantization mode. Combining both decision criteria pertaining to the value of the first feature may result in an even more reliable determination of whether the constant SNR quantization mode shall be applied.
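The combined criterion (high value, low temporal variation) can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `DenseTransientDetector`, the window length, the use of the window mean as the time-smoothed value, and the use of the population standard deviation as the variation measure are all choices made here and are not specified by the disclosure. The perceptual entropy values are assumed to be supplied per frame by the encoder.

```python
from collections import deque
import statistics

class DenseTransientDetector:
    """Hypothetical detector combining a level test and a variation test."""

    def __init__(self, pe_threshold, variation_threshold, window=8):
        self.pe_threshold = pe_threshold
        self.variation_threshold = variation_threshold
        self.history = deque(maxlen=window)

    def update(self, pe_value):
        """Add the PE of the current frame and return the decision for it."""
        self.history.append(pe_value)
        if len(self.history) < self.history.maxlen:
            return False  # not enough history yet to judge the variation
        mean_pe = statistics.fmean(self.history)      # stands in for a smoothed value
        variation = statistics.pstdev(self.history)   # standard deviation over time
        return mean_pe > self.pe_threshold and variation < self.variation_threshold

detector = DenseTransientDetector(pe_threshold=100.0, variation_threshold=5.0, window=4)
decisions = [detector.update(pe) for pe in [120.0, 121.0, 119.0, 120.0]]
```

With these illustrative thresholds, a steadily high perceptual entropy triggers the decision once the window is full, whereas a sequence with the same average but large frame-to-frame swings does not.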
In embodiments, the first feature may be proportional to the perceptual entropy. Alternatively, the first feature may be proportional to a factor (component) of the perceptual entropy. The value of the first feature may be obtained in the frequency domain (e.g., MDCT domain).
Since state-of-the-art codecs calculate the perceptual entropy anyway, referring to the perceptual entropy as the first feature allows calculation results to be re-used, thereby avoiding a significant increase of complexity and memory footprint for the proposed determination of whether the constant SNR quantization mode shall be applied or not.
In embodiments, the method may further include obtaining a value of a second feature relating to a measure of (spectral) sparsity in the frequency domain (e.g., MDCT domain) of the portion of the audio signal. The measure of sparsity may be given by or relate to the form factor. For example, the measure of sparsity may be proportional to the form factor or the perceptually weighted form factor. Said determining may be (further) based on the value of the second feature.
Referring also to a measure of sparsity allows for an even further improved distinction of cases in which applying the constant SNR quantization mode is advantageous, and cases in which it is not.
In embodiments, the method may further include smoothing the value of the second feature over time to obtain a time-smoothed value of the second feature. Said determining may be based on the time-smoothed value of the second feature.
Thereby, unnecessary toggling of the decision as to which quantization mode to use, where toggling might result in audible artifacts, can be avoided. Accordingly, perceptual quality of the audio output can be further improved.
In embodiments, said determining may involve comparing the value of the second feature to a predetermined threshold for the value of the second feature. Said quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be conditionally selected in accordance with a result of the comparison. For example, the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be selected if (e.g., only if) the value of the second feature is above the predetermined threshold for the value of the second feature. Notably, referring to the condition of whether the value of the second feature is above (i.e., exceeds) its threshold in the above determination assumes that the second feature is defined such that its value increases with increasing spectral density (as is the case for the form factor, for example); in the reverse case (i.e., if the second feature is defined such that its value decreases with increasing spectral density), the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency would be selected if (e.g., only if) the value of the second feature is below the predetermined threshold for the value of the second feature.
As has been found, a measure of sparsity (such as the form factor, the perceptually weighted form factor, or an estimated number of frequency coefficients (frequency lines) that are not quantized to zero) above a certain threshold may be indicative of dense transient events in the audio signal, and moreover of a case in which applying the constant SNR quantization mode is advantageous. Thus, a comparison of the value of the second feature to a threshold offers a simple and reliable confirmation of the determination of whether or not the portion of the audio signal is suitable for quantization using the constant SNR quantization mode.
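The two-feature decision, including the polarity note on the second feature, might be sketched as below. The function name `select_constant_snr_mode`, its parameter names, and the example thresholds are hypothetical; the feature values themselves (e.g., perceptual entropy and form factor) are assumed to be computed elsewhere in the encoder.

```python
def select_constant_snr_mode(first_value, second_value,
                             first_threshold, second_threshold,
                             second_increases_with_density=True):
    """Return True if the constant SNR quantization mode would be selected.

    Hypothetical helper. The comparison direction for the second feature
    depends on whether its value grows or shrinks with spectral density.
    """
    first_ok = first_value > first_threshold
    if second_increases_with_density:
        second_ok = second_value > second_threshold
    else:
        second_ok = second_value < second_threshold
    return first_ok and second_ok

# Illustrative values only: both criteria met, so the mode is selected.
use_constant_snr = select_constant_snr_mode(120.0, 0.8, 100.0, 0.5)
```

Keeping the polarity as an explicit parameter avoids silently inverting the test when a differently defined sparsity measure is substituted.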
Another aspect of the disclosure relates to a method of detecting dense transient events (e.g., applause, crackling fire, rain, etc.) in a portion of an audio signal. The method may include obtaining (e.g., determining, calculating, or computing) a value of a first feature relating to a perceptual entropy of the portion of the audio signal. The method may further include determining whether the portion of the audio signal is likely to contain dense transient events based at least in part on the value of the first feature.
Thereby, the portion of the audio signal can be classified as to its content of dense transient events without significantly adding to computational complexity and memory footprint.
In embodiments, the method may further include generating metadata for the portion of the audio signal. The metadata may be indicative of whether the portion of the audio signal is likely to contain dense transient events.
Providing such metadata enables more efficient and improved post processing of audio signals.
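A per-portion binary flag and its use by a downstream device could look like the following sketch. The function names `build_metadata` and `frames_for_post_processing` and the list-of-integers representation are assumptions for illustration; the disclosure does not define a metadata syntax here.

```python
def build_metadata(decisions):
    """One flag per portion: 1 if dense transient events are likely, else 0."""
    return [1 if decision else 0 for decision in decisions]

def frames_for_post_processing(flags):
    """Indices of the portions a downstream device would treat specially."""
    return [index for index, flag in enumerate(flags) if flag == 1]

flags = build_metadata([False, True, True, False])
selected = frames_for_post_processing(flags)
```

Here the downstream step only inspects the flags and never repeats the detection, which is the efficiency gain the metadata is meant to provide.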
In embodiments, the method may further include smoothing the value of the first feature over time to obtain a time-smoothed value of the first feature. Then, said determining may be based on the time-smoothed value of the first feature.
In embodiments, said determining may involve comparing the value of the first feature to a predetermined threshold for the value of the first feature. Then, it may be determined whether the portion of the audio signal is likely to contain dense transient events in accordance with a result of the comparison. For example, it may be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) the value of the first feature is above the predetermined threshold for the value of the first feature.
In embodiments, said determining may be (further) based on a variation over time of the value of the first feature. For example, said determining may be based on a temporal variation, such as a standard deviation over time, or a maximum deviation from a mean value over time. For example, said determining may involve comparing the variation over time of the value of the first feature to a predetermined threshold for the variation. Then, it may be determined whether the portion of the audio signal is likely to contain dense transient events in accordance with a result of the comparison. For example, it may be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) the variation of the value of the first feature is below the predetermined threshold for the variation. In certain implementations, it may be determined whether the portion of the audio signal is likely to contain dense transient events in accordance with a result of the comparison for the value of the first feature and the comparison for the variation of the value of the first feature over time. For example, it may be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) both the value of the first feature is above its respective threshold and the variation of the first feature over time is below its respective threshold.
In embodiments, the first feature may be proportional to the perceptual entropy. Alternatively, the first feature may be proportional to a factor (component) of the perceptual entropy. The value of the first feature may be obtained in the frequency domain (e.g., MDCT domain).
In embodiments, the method may further include obtaining a value of a second feature relating to a measure of (spectral) sparsity in the frequency domain (e.g., MDCT domain) of the portion of the audio signal. The measure of sparsity may be given by or relate to the form factor. For example, the measure of sparsity may be proportional to the form factor or the perceptually weighted form factor. Said determining may be (further) based on the value of the second feature.
In embodiments, the method may further include smoothing the value of the second feature over time to obtain a time-smoothed value of the second feature. Said determining may be based on the time-smoothed value of the second feature.
In embodiments, said determining may involve comparing the value of the second feature to a predetermined threshold for the value of the second feature. Then, it may be determined whether the portion of the audio signal is likely to contain dense transient events in accordance with a result of the comparison. For example, it may be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) the value of the second feature is above the predetermined threshold for the value of the second feature. Notably, referring to the condition of whether the value of the second feature is above (i.e., exceeds) its threshold in the above determination assumes that the second feature is defined such that its value increases with increasing spectral density (as is the case for the form factor, for example); in the reverse case (i.e., if the second feature is defined such that its value decreases with increasing spectral density), it would be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) the value of the second feature is below the predetermined threshold for the value of the second feature.
Another aspect of the disclosure relates to a method of encoding a portion of an audio signal. The method may include determining whether the portion of the audio signal is likely to contain dense transient events (e.g., applause, crackling fire, rain, etc.). The method may further include, if (e.g., only if) it is determined that the portion of the audio signal is likely to contain dense transient events, quantizing the portion of the audio signal using a quantization mode that applies (e.g., enforces) a (substantially) constant signal-to-noise ratio over frequency (e.g., over frequency bands) for the portion of the audio signal.
By using this constant SNR quantization mode, audio signals containing dense transient events can be encoded in a manner that achieves improved perceived audio quality of the decoded output audio. On the other hand, conditionally applying the constant SNR quantization mode for portions of the audio signal that are determined to contain dense transient events (i.e., in which dense transient events are detected) allows avoiding degradation of other classes of audio signals (such as music and/or speech, for example).
In embodiments, the method may further include obtaining (e.g., determining, calculating, or computing) a value of a first feature relating to a perceptual entropy of the portion of the audio signal. Then, said determining may be based at least in part on the (obtained) value of the first feature.
In embodiments, the method may further include smoothing the value of the first feature over time to obtain a time-smoothed value of the first feature. Then, said determining may be based on the time-smoothed value of the first feature.
In embodiments, said determining may involve comparing the value of the first feature to a predetermined threshold for the value of the first feature. Then, it may be determined whether the portion of the audio signal is likely to contain dense transient events in accordance with a result of the comparison. For example, it may be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) the value of the first feature is above the predetermined threshold for the value of the first feature.
In embodiments, said determining may be (further) based on a variation over time of the value of the first feature. For example, said determining may be based on a temporal variation, such as a standard deviation over time, or a maximum deviation from a mean value over time. For example, said determining may involve comparing the variation over time of the value of the first feature to a predetermined threshold for the variation. Then, it may be determined whether the portion of the audio signal is likely to contain dense transient events in accordance with a result of the comparison. For example, it may be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) the variation over time of the value of the first feature is below the predetermined threshold for the variation. In certain implementations, it may be determined whether the portion of the audio signal is likely to contain dense transient events in accordance with a result of the comparison for the value of the first feature and the comparison for the variation of the value of the first feature over time. For example, it may be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) both the value of the first feature is above its respective threshold and the variation of the first feature over time is below its respective threshold.
In embodiments, the first feature may be proportional to the perceptual entropy. Alternatively, the first feature may be proportional to a factor (component) of the perceptual entropy. The value of the first feature may be obtained in the frequency domain (e.g., MDCT domain).
In embodiments, the method may further include obtaining a value of a second feature relating to a measure of (spectral) sparsity in the frequency domain (e.g., MDCT domain) of the portion of the audio signal. The measure of sparsity may be given by or relate to the form factor. For example, the measure of sparsity may be proportional to the form factor or the perceptually weighted form factor. Said determining may be (further) based on the value of the second feature.
In embodiments, the method may further include smoothing the value of the second feature over time to obtain a time-smoothed value of the second feature. Said determining may be based on the time-smoothed value of the second feature.
In embodiments, said determining may involve comparing the value of the second feature to a predetermined threshold for the value of the second feature. Then, it may be determined whether the portion of the audio signal is likely to contain dense transient events in accordance with a result of the comparison. For example, it may be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) the value of the second feature is above the predetermined threshold for the value of the second feature. Notably, referring to the condition of whether the value of the second feature is above (i.e., exceeds) its threshold in the above determination assumes that the second feature is defined such that its value increases with increasing spectral density (as is the case for the form factor, for example); in the reverse case (i.e., if the second feature is defined such that its value decreases with increasing spectral density), it would be determined that the portion of the audio signal is likely to contain dense transient events if (e.g., only if) the value of the second feature is below the predetermined threshold for the value of the second feature.
Another aspect relates to an apparatus (e.g., an encoder for encoding a portion of an audio signal). The apparatus (e.g., encoder) may include a processor. The apparatus may further include a memory coupled to the processor and storing instructions for execution by the processor. The processor may be adapted to perform the method of any one of the aforementioned aspects and embodiments.
Another aspect relates to a software program. The software program may be adapted for execution on a processor and for performing the method steps outlined in the present disclosure when carried out on a computing device.
Another aspect relates to a storage medium. The storage medium may include a software program adapted for execution on a processor and for performing the method steps outlined in the present disclosure when carried out on a computing device.
Yet another aspect relates to a computer program product. The computer program may include executable instructions for performing the method steps outlined in the present disclosure when executed on a computer.
It should be noted that the methods and apparatus, including their preferred embodiments, as outlined in the present disclosure may be used stand-alone or in combination with the other methods and systems disclosed in this disclosure. Furthermore, all aspects of the methods and apparatus outlined in the present disclosure may be arbitrarily combined. In particular, the features of the claims may be combined with one another in an arbitrary manner.
BRIEF DESCRIPTION OF THE DRAWINGS
Example embodiments of the present disclosure are explained below with reference to the accompanying drawings, wherein:
Fig. 1 is a block diagram schematically illustrating an encoder to which embodiments of the disclosure may be applied;
Fig. 2 is a block diagram schematically illustrating a decoder corresponding to the encoder of Fig. 1;
Fig. 3 is a flow chart illustrating an example of a method of encoding a portion of an audio signal according to embodiments of the disclosure;
Fig. 4 is a flow chart illustrating an example of a variation of the method of Fig. 3;
Fig. 5 is a flow chart illustrating an example of a method of detecting dense transient events in a portion of an audio signal according to embodiments of the disclosure;
Fig. 6 is a flow chart illustrating an example of a variation of the method of Fig. 5;
Fig. 7 is a flow chart illustrating an example of another method of encoding a portion of an audio signal according to embodiments of the disclosure;
Figs. 8, 9, 10, and 11 are histograms illustrating feasibility of methods according to embodiments of the disclosure; and
Figs. 12A, 12B, 13A, and 13B are graphs illustrating feasibility of methods according to embodiments of the disclosure.
DETAILED DESCRIPTION
The present disclosure describes two schemes (methods) for addressing the above issues. These schemes, directed to detecting dense transient events and encoding of portions of audio signals comprising dense transient events, respectively, may be employed individually or in conjunction with each other.
Broadly speaking, the present disclosure relates to improving audio quality of dense transient event audio signals (such as applause, crackling fire, rain, etc.), without negatively impacting audio quality of other classes of audio signals. The present disclosure further seeks to achieve this goal at low complexity at the encoder-side, with negligible memory footprint and delay. To this end, the present disclosure describes methods for detecting dense transient events in (portions of) audio signals, using features that are already computed in a perceptual audio encoder. The present disclosure further describes methods for quantizing dense transient event audio signals using a special constant signal-to-noise ratio quantization noise shaping mode to improve the audio quality of these dense transient audio signals. In order to avoid degradation of other classes of audio signals, the present disclosure further proposes to conditionally apply this special constant signal-to-noise ratio quantization noise shaping mode in accordance with a result of the detection of dense transient events in the audio signal. The present disclosure is particularly, though not exclusively, applicable to the AC-4 audio codec.
Throughout this disclosure a portion of an audio signal shall mean a section of certain length (e.g., in the time domain or in the frequency domain) of an audio signal. A portion may relate to a certain number of samples (e.g., Pulse Code Modulation, PCM, samples), to a certain number of frames, may be defined to extend over a certain amount of time (e.g., over a certain number of ms), or may relate to a certain number of frequency coefficients (e.g., MDCT coefficients). For example, the portion of the audio signal may indicate a frame of the audio signal or a sub-frame of the audio signal. Further, the audio signal may include more than one channel (e.g., two channels in a stereo configuration, or 5.1 channels, 7.1 channels, etc.). In this case, the portion of the audio signal shall mean a section of certain length, as described above, of the audio signal in a given one of the channels of the audio signal. Notably, the present disclosure is applicable to any or each of the channels of a multi-channel audio signal. Multiple channels may be processed in parallel or sequentially. Further, the present disclosure may be applied to a sequence of portions, and respective portions may be processed sequentially by the proposed methods and apparatus.
Further, throughout this disclosure dense transient events shall mean a series of individual, brief (measurable) events (e.g., hand claps of applause, fire crackles, splashes of rain) which persist as (e.g., impulsive) noise bursts. Dense transient signals (signals of dense transient events) within the meaning of the present disclosure (and for which the proposed detector for dense transient events would turn ON) shall include 20 to 60 measurable transient events per second, e.g. 30 to 50, or typically 40 measurable events per second. Time intervals between subsequent transient events in dense transient events may vary. Dense transient events are distinct from tonal audio signals (such as music), speech, and sparse transient events (such as castanets, for example). Further, dense transient events may be noisy (i.e., without strong, stable periodic components) and rough (i.e., with an amplitude modulated in the 20-60 Hz range). Dense transient events may also be referred to as sound textures. Examples of dense transient events include applause, crackling fire, rain, running water, babble, and machinery.
Fig. 1 is a block diagram of an encoder 100 (e.g., AC-4 encoder) to which embodiments of the disclosure may be applied. Fig. 2 is a block diagram of a corresponding decoder 200 (e.g., AC-4 decoder).
The encoder 100 comprises a filterbank analysis block 110, a parametric coding block 120, a filterbank synthesis block 130, a time-frequency transform block 140, a quantization block 150, a coding block 160, a psychoacoustic modeling block 170, and a bit allocation block 190. The parametric coding block 120 may comprise (not shown) parametric bandwidth extension coding tools (A-SPX), parametric multi-channel coding tools, and a companding tool for temporal noise shaping. The time-frequency transform block 140, the quantization block 150, the psychoacoustic modeling block 170, and the bit allocation block 190 may be said to form an audio spectral frontend (ASF) of the encoder 100. The present disclosure may be said to relate to an implementation (modification) of the ASF of the encoder 100. In particular, the present disclosure may be said to relate to modifying the psychoacoustic model in the ASF (e.g., of AC-4) to enforce a different noise shaping guided by an additional detector located in the ASF for detecting dense transient events. However, the present disclosure is not so limited and may be likewise applied to other encoders.
The encoder 100 receives an input audio signal 10 (e.g., samples of an audio signal, such as PCM samples, for example) as an input. The input audio signal 10 may have one or more channels, e.g. may be a stereo signal with a pair of channels, or a 5.1 channel signal. However, the present disclosure shall not be limited to any particular number of channels. The input audio signal 10 (e.g., the samples of the audio signal) is subjected to a filterbank analysis, e.g. a QMF analysis, at the filterbank analysis block 110 to obtain a filterbank representation of the audio signal. Without intended limitation, reference will be made to a QMF filterbank in the remainder of this disclosure. Then, parametric coding, which may involve bandwidth extension and/or channel extension, is performed at the parametric coding block 120. After filterbank synthesis (e.g., QMF synthesis) at the filterbank synthesis block 130, the audio signal is provided to the time-frequency transform block 140, at which a time-frequency analysis (e.g., MDCT analysis) is performed. Without intended limitation, reference will be made to an MDCT as an example of a time-frequency transform in the remainder of this disclosure. The MDCT yields a sequence of blocks of frequency coefficients (MDCT coefficients). Each block of frequency coefficients corresponds to a block of samples of the audio signal. The number of samples in each block of samples of the audio signal is given by the transform length that is used by the MDCT.
Then, a psychoacoustic model is applied to the MDCT coefficients at the psychoacoustic modeling block 170. The psychoacoustic model may group the MDCT coefficients into frequency bands (e.g., scalefactor bands), the respective bandwidths of which may depend on the sensitivity of the human auditory system at the respective frequency band's center frequency. A masking threshold 180 (e.g., psychoacoustic threshold) is applied to the MDCT coefficients after psychoacoustic modeling, and a bit allocation for each frequency band is determined at the bit allocation block 190. The number of allocated bits for a frequency band may translate into a quantization step size (e.g., scalefactor). Then, the (masked) MDCT coefficients in each frequency band are quantized at the quantization block 150 in accordance with the determined bit allocation for the respective frequency band, i.e., the MDCT coefficients are quantized in accordance with the psychoacoustic model. The quantized MDCT coefficients are then encoded at the coding block 160. Eventually the encoder 100 outputs a bitstream (e.g., AC-4 bitstream) 20 that can be used for storing or for transmission to a decoder. Notably, the above-described operations at each block may be performed for each of the channels of the audio signal.
The corresponding decoder 200 (e.g., AC-4 decoder) is shown in Fig. 2 and comprises an inverse coding block 260, an inverse quantization block 250, a stereo and multi-channel (MC) audio processing block 245, an inverse time-frequency transform block 240, a filterbank analysis block 230, an inverse parametric coding block 220, and a filterbank synthesis block 210. The inverse parametric coding block 220 comprises a companding block 222, an A-SPX block 224, and a parametric multi-channel coding block 226. The decoder 200 receives an input bitstream (e.g., AC-4 bitstream) 20 and outputs an output audio signal (e.g., PCM samples) 30 for one or more channels. The blocks of the decoder 200 reverse respective operations of the blocks of the encoder 100.
Notably, any of the methods described below may also comprise applying a time-frequency transform to the portion of the audio signal. In the example of the AC-4 audio codec, an MDCT is applied to the (portion of the) audio signal. The time-frequency transform (e.g. MDCT) may be applied to the (samples of the) (portion of the) audio signal in accordance with a (pre-)selected transform length (e.g., using an analysis window determined by the transform length; for the case of MDCT, the analysis window is determined by the transform length of the previous, the current and the next MDCT). As an output, this yields a sequence of blocks of frequency coefficients (e.g., MDCT coefficients). Each block of frequency coefficients in said sequence corresponds to a respective block of samples, wherein the number of samples in each block of samples is given by the transform length. Further, the blocks of samples corresponding to the sequence of blocks of frequency coefficients may correspond to a frame or a half-frame, depending on the relevant audio codec. Further, in any of the methods described below, a psychoacoustic model may be calculated for frequency bands (e.g., for the so-called scalefactor bands, which are groups of frequency sub-bands, e.g., groups of MDCT lines). According to the psychoacoustic model, all frequency coefficients (e.g., MDCT coefficients) of a frequency band (e.g., scalefactor band) may be quantized with the same scalefactor, wherein the scalefactor determines the quantizer step size (quantization step size). Before actual quantization, a masking threshold may be applied to the frequency bands to determine how the frequency coefficients in a given frequency band shall be quantized. For example, the masking threshold may determine, possibly together with other factors, the quantization step size for quantization.
At least part of the methods described below relate to selecting or modifying quantization parameters (e.g., masking thresholds and scalefactors) for quantization. If certain conditions are met, the quantization parameters may be selected or modified such that a specific noise shaping scheme is applied (e.g., so that a constant SNR over frequency is enforced).
Fig. 3 is a flow chart illustrating an example of a method 300 of encoding a portion (e.g., frame) of an audio signal according to embodiments of the disclosure. This method may be advantageously applied for encoding portions of an audio signal that contain dense transient events, such as applause, crackling fire, or rain, for example.
At step S310, a value of a first feature relating to a perceptual entropy of the portion of the audio signal is obtained. For example, the value of the first feature may be determined, computed, or calculated, possibly following analysis of the portion of the audio signal. The value of the first feature may be obtained in the frequency domain (e.g., in the MDCT domain). For example, the portion of the audio signal may be analyzed in the frequency domain (e.g., MDCT domain). Alternatively, the value of the first feature may also be obtained in the time domain. For example, speech codecs are typically time-domain codecs based on linear prediction. Linear prediction filter coefficients model the signal spectrum, and the masking model in speech codecs is also derived from the linear prediction coefficients, so that features relating to perceptual entropy can be derived also in time-domain codecs.
Approaches for determining measures of perceptual entropy are described in James D. Johnston, Estimation of perceptual entropy using noise masking criteria, ICASSP, 1988, which is hereby incorporated by reference in its entirety. Any of the approaches described therein may be used for the present purpose. However, the present disclosure shall not be limited to these approaches, and also other approaches are feasible.
The first feature may be given by or may be proportional to the perceptual entropy of the given portion of the audio signal.
In general, the perceptual entropy is a measure of the amount of perceptually relevant information contained in a (portion of a) given audio signal. It represents a theoretical limit on the compressibility of the given audio signal (provided that a perceivable loss in audio quality is to be avoided). As will be detailed below, the perceptual entropy may be determined for each frequency band in an MDCT representation of the portion of the audio signal and may be generally said to depend, for a given frequency band (e.g., scalefactor band) on a ratio between the energy spectrum (energy) of the given frequency band and a psychoacoustic threshold in an applicable psychoacoustic model for the given frequency band.
In more detail, the value of the first feature may be calculated in a psychoacoustic model, for example in the manner described in document 3GPP TS 26.403 (V1.0.0), section 5.6.1.1.3, which section is hereby incorporated by reference in its entirety. In this psychoacoustic model, the perceptual entropy is determined as follows.
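First, the perceptual entropy is determined for each scalefactor band (as an example of a frequency band) via
$$pe(n) = \begin{cases} nl \cdot \log_2\left(\frac{en(n)}{thr(n)}\right) & \text{for } \log_2\left(\frac{en(n)}{thr(n)}\right) \geq c1 \\ nl \cdot \left(c2 + c3 \cdot \log_2\left(\frac{en(n)}{thr(n)}\right)\right) & \text{for } \log_2\left(\frac{en(n)}{thr(n)}\right) < c1 \end{cases}$$
with c1 = log2(8), c2 = log2(2.5), c3 = 1 - c2/c1. The energy spectrum (or energy) en for the n-th scalefactor band is given by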
$$en(n) = \sum_{k=kOffset(n)}^{kOffset(n+1)-1} X(k) \cdot X(k)$$
where n denotes the index of the respective scalefactor band, X(k) is the value of the frequency coefficient (e.g., MDCT line) for index k, and kOffset(n) is the index of the lowest-frequency (i.e., first) MDCT line of the n-th scalefactor band. The number nl denotes the estimate of the number of lines in the scalefactor band that will not be zero after quantization. This number can be derived from the form factor ffac(n) via
$$nl = \frac{ffac(n)}{\left(\frac{en(n)}{kOffset(n+1) - kOffset(n)}\right)^{0.25}}$$
The form factor ffac(n) is defined as
$$ffac(n) = \sum_{k=kOffset(n)}^{kOffset(n+1)-1} \sqrt{|X(k)|}$$
In the above, thr(n) denotes the psychoacoustic threshold for the n-th scalefactor band. One way to determine the psychoacoustic threshold thr is described in section 5.4.2 of document 3GPP TS 26.403 (V1.0.0), which section is hereby incorporated by reference in its entirety.
The total perceptual entropy of a given portion (e.g., frame) of the audio signal is the sum of the scalefactor band perceptual entropies,
$$pe = peOffset + \sum_{n} pe(n)$$
where peOffset is a constant value (that may be zero in some implementations) that can be added to achieve a more linear relationship between perceptual entropy and the number of bits needed for encoding the portion (e.g., frame) of the audio signal.
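Purely as a non-limiting illustration, the per-band computation described above may be sketched as follows. The piecewise form of pe(n) and the exponent 0.25 in the estimate nl follow the cited section of 3GPP TS 26.403 as understood here. The function names, the list-based representation of kOffset, the requirement that thresholds be strictly positive, and the treatment of bands whose energy does not exceed the threshold (zero contribution) are assumptions made for this sketch.

```python
import math

# Constants as defined in the text: c1 = log2(8), c2 = log2(2.5), c3 = 1 - c2/c1.
C1 = math.log2(8.0)
C2 = math.log2(2.5)
C3 = 1.0 - C2 / C1


def band_energy(X, lo, hi):
    """en(n): sum of X(k) * X(k) over the lines lo..hi-1 of one band."""
    return sum(x * x for x in X[lo:hi])


def form_factor(X, lo, hi):
    """ffac(n): sum of sqrt(|X(k)|) over the lines lo..hi-1 of one band."""
    return sum(math.sqrt(abs(x)) for x in X[lo:hi])


def band_pe(X, lo, hi, thr):
    """pe(n) for one scalefactor band; thr is assumed to be strictly positive."""
    en = band_energy(X, lo, hi)
    # Assumption of this sketch: a band at or below its threshold contributes nothing.
    if en <= thr:
        return 0.0
    # nl: estimated number of lines that are not quantized to zero.
    nl = form_factor(X, lo, hi) / (en / (hi - lo)) ** 0.25
    ratio = math.log2(en / thr)
    if ratio >= C1:
        return nl * ratio
    return nl * (C2 + C3 * ratio)


def total_pe(X, k_offset, thr, pe_offset=0.0):
    """pe = peOffset + sum over all bands of pe(n)."""
    return pe_offset + sum(
        band_pe(X, k_offset[n], k_offset[n + 1], thr[n])
        for n in range(len(k_offset) - 1)
    )
```

For example, a single band of four lines with X(k) = 4 and thr = 1 gives en = 64, ffac = 8, nl = 4 and log2(en/thr) = 6, so that pe = 24.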
It is understood that the above expression for the perceptual entropy can be split into several components (e.g., terms and/or factors). It is considered that a combination of any, some, or all of these components may be used instead of the full expression for the perceptual entropy for obtaining the value of the first feature.
In general, the perceptual entropy of a given frequency band (e.g., scalefactor band) in the context of this disclosure can be said to depend on a ratio between the energy spectrum (energy) en of the given frequency band and the psychoacoustic threshold thr for the given frequency band. Accordingly, the first feature may be said to depend on the ratio between the energy spectrum (energy) en of the given frequency band and the psychoacoustic threshold thr for the given frequency band.
At step S320, a quantization mode for quantizing the portion of the audio signal is selected based on the value of the first feature. In general, the quantization mode may be said to be selected based on the first feature. This may involve a determination of, based at least in part on the value of the first feature, whether a quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency (e.g., for all frequency bands) shall be used for the portion of the audio signal (e.g., for the frequency coefficients, such as MDCT coefficients, for example, of a frequency domain representation of the portion of the audio signal). This quantization mode may be referred to as constant SNR mode, constant SNR quantization mode, or constant SNR quantization noise shaping mode. Applying the constant SNR quantization mode may be referred to as applying a dense transient event improvement (e.g., applause improvement), or simply, as applying an improvement, to the portion of the audio signal. Without intended limitation, applying this improvement may also be referred to as applying a fix in the remainder of this disclosure, without this term implying that the improvement is only of a temporary nature.
Notably, applying the constant SNR quantization mode is a rather unusual choice for encoding an audio signal. As has been found, the constant SNR quantization mode is suitable for quantizing portions of dense transient events and may produce a pleasant auditory result for such audio signals. However, applying the constant SNR quantization mode may, depending on the circumstances, degrade other audio signals, such as music and speech, or combinations of dense transient events with music or speech, which typically require non-constant SNR for best perceptual quality. This issue is addressed by the selection process for the quantization mode at step S320.
Selection of the quantization mode at step S320 may be said to correspond to modifying the psychoacoustic model that is used for quantizing the audio signal (e.g., modifying the frequency coefficients, or MDCT coefficients) to apply (e.g., enforce) a different noise shaping in the quantization process.
Optionally at this step, the obtained value of the first feature may be smoothed over time, in order to avoid unnecessary toggling of the selection at step S320. In particular, frame-to-frame switching of the selection can be avoided by considering a time-smoothed version of the value of the first feature. In this case, the selection (e.g., the determination) would be based, at least in part, on the time-smoothed value of the first feature.
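Purely as a non-limiting illustration, the time smoothing may be sketched with a first-order recursive (exponential) smoother. The disclosure does not prescribe a particular smoother; the choice of smoother, the function name, and the smoothing coefficient alpha are assumptions made for this sketch.

```python
def smooth_feature(values, alpha=0.9):
    """Return a time-smoothed copy of a per-frame feature sequence.

    alpha is a hypothetical smoothing coefficient in [0, 1); larger values
    smooth more strongly. The first frame initializes the smoother state.
    """
    smoothed = []
    state = None
    for v in values:
        if state is None:
            state = float(v)
        else:
            state = alpha * state + (1.0 - alpha) * v
        smoothed.append(state)
    return smoothed


# Raw values alternate around a hypothetical threshold of 120, which would
# toggle the selection every frame; the smoothed values stay on one side.
raw = [100.0, 140.0, 100.0, 140.0, 100.0, 140.0]
smoothed = smooth_feature(raw)
```

In this example, a frame-wise comparison of the raw values against the threshold changes its result in every frame, whereas the smoothed values all remain below the threshold.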
As has been found, the perceptual entropy is a suitable feature for discriminating portions of an audio signal that contain dense transient events (e.g., applause, crackling fire, rain, etc.) from portions that contain speech or music. This is illustrated in the histogram of Fig. 8. This histogram, as well as the remaining histograms that are discussed in this disclosure, is normalized so that bar heights add to one and uniform bin width is used. In this histogram, the horizontal axis indicates a (time-smoothed) measure of perceptual entropy, and the vertical axis indicates a (normalized) count of items per bin of the measure of perceptual entropy. For this histogram, as well as for the remaining histograms concerning perceptual entropy in this disclosure, the estimated total number of bits per (encoded) AC-4 frame is used as the measure of perceptual entropy. However, methods according to this disclosure are not limited to considering such measure of perceptual entropy, and other measures of perceptual entropy are feasible as well. Bin counts 810 (dark grey) in the histogram relate to a set of audio items that have been manually classified as applause items (in particular, applause items that are improved by the fix), whereas bin counts 820 (white) relate to a set of audio items that have been manually classified as non-applause items (e.g., speech or music). As can be seen from the histogram, the perceptual entropy is consistently higher for the applause items than for the non-applause items, so that the perceptual entropy can provide for suitable discrimination between the two classes of audio items.
Further, the perceptual entropy is also a suitable feature for discriminating between portions of an audio signal that contain dense transient events and are improved by the fix, and portions of an audio signal that contain dense transient events but that may not be improved by the fix (e.g., portions that contain dense transient events, but that also contain speech and/or music). This is illustrated in the histogram of Fig. 9, in which the horizontal axis indicates a (time-smoothed) measure of perceptual entropy, and the vertical axis indicates a (normalized) count of items per bin of the measure of perceptual entropy. Bin counts 910 (dark grey) in the histogram relate to a set of audio items that have been manually classified as applause items that are improved by the fix, whereas bin counts 920 (white) relate to a set of audio items that have been manually classified as applause items that are not improved by the fix. As can be seen from the histogram, the perceptual entropy is consistently higher for the applause items that are improved by the fix than for applause items that are not improved by the fix, so that the perceptual entropy can provide for suitable discrimination between the two classes of audio items. In other words, the (time-smoothed) perceptual entropy can also be used to sub-classify audio items relating to dense transient events (such as applause, crackling fire, rain, etc.).
Accordingly, the determination of whether a quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency shall be used for the portion of the audio signal may involve comparing the value of the first feature (or, if available, the time-smoothed value of the first feature) to a predetermined threshold for the value of the first feature. This threshold may be determined manually, for example, to have a value that ensures reliable classification of audio items into applause items (or applause items that are improved by the fix) and non-applause items. The quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency may be conditionally selected in accordance with (e.g., depending on) a result of this comparison. For example, the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency may be selected if (e.g., only if) the value of the first feature (or the time-smoothed value of the first feature) is above the predetermined threshold for the first feature. Notably, reference to applause, as an example of an audio item containing dense transient events is made without intended limitation, and the present disclosure shall not be construed to be in any way limited by this reference.
Alternatively or additionally, the determination may be based on a variation over time of the value of the first feature (notably, the variation over time would be determined from the un-smoothed version of the value of the first feature). This variation over time may be the standard deviation over time or a maximum deviation from the mean over time, for example. In general, the time variation may indicate a temporal variation or temporal peakedness of the value of the first feature.
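Purely as a non-limiting illustration, the two example measures of variation over time named above may be sketched as follows. The use of a window of recent per-frame (un-smoothed) values, the choice of the population standard deviation, and the function names are assumptions made for this sketch.

```python
import statistics


def std_over_time(values):
    """Standard deviation of the un-smoothed feature values in a window."""
    return statistics.pstdev(values)


def max_deviation_from_mean(values):
    """Largest absolute deviation of any value in the window from the mean."""
    mean = statistics.mean(values)
    return max(abs(v - mean) for v in values)


# Hypothetical window of un-smoothed per-frame feature values.
window = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
spread = std_over_time(window)
peak = max_deviation_from_mean(window)
```

Both measures are small for a feature that stays close to its mean and grow when the feature exhibits isolated bursts.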
As has been found, also the time variation of the perceptual entropy is suitable for discriminating portions of an audio signal that contain dense transient events (e.g., applause, crackling fire, rain, etc.) from portions that contain speech and/or music. This is illustrated in the graphs of Figs. 12A and 12B and Figs. 13A and 13B.
Fig. 12A illustrates the broad band energy (in dB) for different channels of an applause audio signal (as an example of an audio signal of dense transient events) as a function of time, Fig. 12B illustrates the perceptual entropy for the different channels of the applause audio signal as a function of time, Fig. 13A illustrates the broad band energy (in dB) for different channels of a music audio signal as a function of time, and Fig. 13B illustrates the perceptual entropy for the different channels of the music audio signal as a function of time. As can be seen from these graphs, dense transient event signals (e.g., applause signals) have consistently very low standard deviation (with respect to time) of the perceptual entropy at a high average perceptual entropy, whereas non-dense transient event signals may have high bursts of perceptual entropy, but at lower average perceptual entropy. Therefore, any features derived from perceptual entropy that indicate temporal variation or temporal peakedness of perceptual entropy may also be used to detect dense transient events and to discriminate dense transient events from, for example, music and/or speech.
Accordingly, the determination of whether the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency shall be used for the portion of the audio signal may involve comparing the variation over time of the value of the first feature to a predetermined threshold for the variation over time of the value of the first feature. This threshold may also be determined manually, for example, in line with the criteria set out above for the threshold for the value of the first feature. Then, the decision of whether or not to select the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency may be made in accordance with (e.g., depending on) a result of this comparison. For example, the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency may be selected if (e.g., only if) the variation over time of the value of the first feature is below the predetermined threshold for the variation over time of the value of the first feature.
As indicated above, either or both of the (time-smoothed) value of the first feature and the variation over time of the value of the first feature may be referred to for determining whether to use the constant SNR quantization mode. If both are referred to, the decision of whether or not to select the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency may be made in accordance with (e.g., depending on) the results of both the aforementioned comparisons to respective thresholds. For example, the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency may be selected for the portion of the audio signal if (e.g., only if) the (time-smoothed) value of the first feature is above the predetermined threshold for the value of the first feature and the variation over time of the value of the first feature is below the predetermined threshold for the variation over time of the value of the first feature.
On the other hand, if the aforementioned criteria of the determination are not met, a quantization mode that does not apply a substantially constant SNR over frequency (i.e., that applies different SNRs to different frequencies or frequency bands) may be selected at this point. In other words, the constant SNR quantization mode is conditionally applied depending on whether the aforementioned criteria of the determination are met.
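Purely as a non-limiting illustration, the conditional selection may be sketched as follows. The function name, the convention that a threshold set to None means the corresponding criterion is not used, and the numerical values in the usage example are assumptions made for this sketch.

```python
from typing import Optional


def select_constant_snr_mode(
    smoothed_pe: float,
    pe_variation: float,
    pe_threshold: Optional[float] = None,
    variation_threshold: Optional[float] = None,
) -> bool:
    """Return True if the constant SNR quantization mode is selected.

    A criterion is evaluated only if its threshold is provided. If both
    thresholds are provided, both criteria must be met.
    """
    if pe_threshold is None and variation_threshold is None:
        return False  # no criterion configured: keep the regular mode
    if pe_threshold is not None and not smoothed_pe > pe_threshold:
        return False  # perceptual entropy not above its threshold
    if variation_threshold is not None and not pe_variation < variation_threshold:
        return False  # variation over time not below its threshold
    return True


# Hypothetical values: high, steady perceptual entropy versus bursty, lower one.
applause_like = select_constant_snr_mode(2500.0, 40.0, 2000.0, 100.0)
music_like = select_constant_snr_mode(1500.0, 300.0, 2000.0, 100.0)
```

If the function returns False, a quantization mode that does not apply a substantially constant SNR over frequency would be used for the portion.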
At step S330, the portion of the audio signal is quantized using the selected quantization mode. More specifically, frequency coefficients (e.g., MDCT coefficients) of the portion of the audio signal may be quantized at this step. Quantization may be performed in accordance with the psychoacoustic model. Further, quantization may involve noise shaping (i.e., shaping of quantization noise). If the selected quantization mode is the quantization mode that applies (e.g., enforces) a (substantially) constant SNR over frequency (e.g., over frequency bands), this may involve selecting appropriate quantization parameters, such as masking thresholds and/or quantization step sizes (e.g., scalefactors) or appropriately modifying the quantization parameters, to achieve the substantially constant SNR over frequency (e.g., over frequency bands, such as scalefactor bands).
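Purely as a non-limiting illustration, one conceivable way of choosing quantization parameters for a constant SNR over frequency bands is to set the threshold of each band to the band energy divided by a common target SNR. The function names, the target SNR parameter, and the mapping from threshold to step size (which assumes a uniform quantizer with a noise power of about step squared over 12 per coefficient) are assumptions made for this sketch. Actual codecs may use non-uniform quantizers and iterative scalefactor selection.

```python
import math


def constant_snr_thresholds(band_energies, snr_db):
    """Per-band thresholds thr(n) such that en(n) / thr(n) is the same for all bands."""
    snr_linear = 10.0 ** (snr_db / 10.0)
    return [en / snr_linear for en in band_energies]


def uniform_step_sizes(thresholds, band_widths):
    """Step sizes for a uniform quantizer (assumption of this sketch).

    The allowed noise energy thr(n) is spread over the lines of the band,
    and the noise power per line is approximated by step**2 / 12.
    """
    return [
        math.sqrt(12.0 * thr / width)
        for thr, width in zip(thresholds, band_widths)
    ]


# Hypothetical band energies and a hypothetical target SNR of 10 dB.
energies = [100.0, 10.0, 1.0]
thresholds = constant_snr_thresholds(energies, snr_db=10.0)
steps = uniform_step_sizes(thresholds, band_widths=[4, 4, 8])
```

In this sketch the ratio between band energy and threshold is identical in all bands, in contrast to a regular psychoacoustic model in which the ratio varies from band to band.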
Notably, the perceptual entropy of (a portion of) an audio signal is computed during normal encoding operation of state-of-the-art audio encoders, such as AC-4, for example. Thus, relying on the perceptual entropy for purposes of selecting a quantization mode does not significantly add to complexity, delay, and memory footprint of the encoding process.
Fig. 4 is a flow chart illustrating an example of a variation 400 of the method 300 of Fig. 3.
Step S410 in variation 400 corresponds to step S310 of method 300 in Fig. 3 and any statements made above with respect to this step apply also here.
At step S415, a value of a second feature relating to a measure of sparsity (e.g., spectral sparsity) in the frequency domain of the portion of the audio signal is obtained. For example, the value of the second feature may be determined, computed, or calculated, possibly following analysis of the portion of the audio signal. The value of the second feature may be obtained in the frequency domain (e.g., in the MDCT domain). For example, the portion of the audio signal may be analyzed in the frequency domain (e.g., MDCT domain). Alternatively, the value of the second feature may also be obtained in the time domain. Several measures of sparsity are described in Niall P. Hurley and Scott T. Rickard, Comparing Measures of Sparsity, http://ieeexplore.ieee.org/xpl/Recentlssue.jsp?punumber=18, vol. 55, issue 10, 2009, which is hereby incorporated by reference in its entirety. Any of the measures of sparsity described therein may be used for the present purpose. However, the present disclosure shall not be limited to these measures of sparsity, and also other measures of sparsity are feasible.
The measure of sparsity may be given by or relate to the form factor. That is, the value of the second feature may be given by or relate to the form factor (in the frequency domain) for the portion of the audio signal. For example, the value of the second feature may be proportional to the form factor or the perceptually weighted form factor. The perceptually weighted form factor may be said to be an estimate of a number of frequency coefficients (e.g., per frequency band) that are (expected to be) not quantized to zero.
In general, the form factor depends on a sum of the square roots of the absolute values of the frequency coefficients of a frequency-domain representation of a portion of an audio signal, e.g., for each frequency band. An overall form factor may be obtained by summing the form factors for all frequency bands. A prescription for calculating the form factor in the context of the perceptual model of AC-4 has been given above in the context of the discussion of step S310. Alternatively, a perceptually weighted form factor may be used as the measure of sparsity (e.g., as the second feature). An example of a perceptually weighted form factor is given by the number nl that has been discussed above in the context of S310. An overall perceptually weighted form factor may be obtained by summing perceptually weighted form factors for all frequency bands. Notably, for the remainder of the disclosure, the second feature is assumed to have a higher value for a spectrally denser representation of the (portion of the) audio signal, and to have a lower value for a spectrally sparser representation of the (portion of the) audio signal.
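The per-band and overall quantities described above can be sketched in Python as follows. The form factor follows the description directly (a sum of square roots of absolute coefficient values). The normalization used for the perceptually weighted form factor, i.e. division by the fourth root of the mean band energy, is an assumption modeled on the AAC encoder description in 3GPP TS 26.403; the prescription given in the discussion of step S310 takes precedence where it differs. All function names and the band_offsets layout are hypothetical.

```python
import math

def form_factor(band):
    """Sum of the square roots of the absolute coefficient values of one band."""
    return sum(math.sqrt(abs(c)) for c in band)

def weighted_form_factor(band):
    """Estimate of the number of coefficients in the band that are not
    quantized to zero (assumed formula: ffac / (energy / width) ** 0.25)."""
    energy = sum(c * c for c in band)
    if energy == 0.0:
        return 0.0
    return form_factor(band) / (energy / len(band)) ** 0.25

def overall(per_band_fn, coeffs, band_offsets):
    """Sum a per-band measure over all frequency bands."""
    return sum(per_band_fn(coeffs[start:stop])
               for start, stop in zip(band_offsets[:-1], band_offsets[1:]))
```

With this assumed normalization, a flat band such as [1, 1, 1, 1] yields a value equal to its width (4), whereas a band with the same energy concentrated in one coefficient, [2, 0, 0, 0], yields only about 1.41. This matches the convention stated above that a spectrally denser representation gives a higher value.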
At step S420, a quantization mode for quantizing the portion of the audio signal is selected based (at least in part) on the value of the first feature and the value of the second feature. In general, the quantization mode may be said to be selected based on the first feature and the second feature. This may involve a determination of, based (at least in part) on the value of the first feature and the value of the second feature, whether the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency (e.g., for all frequency bands) shall be used for the portion of the audio signal (e.g., for the frequency coefficients, such as MDCT coefficients, for example, of a frequency domain representation of the portion of the audio signal).
Selection of the quantization mode at step S420 may be said to correspond to modifying the psychoacoustic model that is used for quantizing the audio signal (e.g., modifying the frequency coefficients, or MDCT coefficients) to apply (e.g., enforce) a different noise shaping in the quantization process.
Optionally at this step, the obtained value of the second feature may be smoothed over time, in order to avoid unnecessary toggling of the selection at step S420. In particular, frame-to-frame switching of the selection can be avoided by considering a time-smoothed version of the value of the second feature. In this case, the selection (e.g., the determination) would be based, at least in part, on the (time-smoothed, if available) value of the first feature and the time-smoothed value of the second feature.
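The disclosure does not prescribe a particular smoothing filter at this point. As one possible illustration, the following Python sketch uses a first-order recursive (exponential) smoother; the function name and the smoothing constant alpha are assumptions.

```python
def smooth_over_time(values, alpha=0.9):
    """First-order recursive smoothing of a per-frame feature value.
    alpha close to 1 gives heavy smoothing (assumed, illustrative choice)."""
    smoothed = []
    state = None
    for v in values:
        # The first frame initializes the smoother state.
        state = v if state is None else alpha * state + (1.0 - alpha) * v
        smoothed.append(state)
    return smoothed
```

For example, a raw feature value that alternates between 12 and 8 from frame to frame would cross a threshold of 10 on every frame, whereas the smoothed sequence produced with alpha = 0.9 stays above 10 throughout, so the selection does not toggle.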
The reason for also considering the value of the second feature is the following. As has been found, the (time-smoothed) perceptual entropy alone may not under all circumstances be sufficient for distinguishing between dense transient event audio items (such as applause items, for example) that are improved by the fix and audio items that contain dense transient events together with speech (including cheering) and/or music (and that may not be improved by the fix). This is illustrated in the histogram of Fig. 10 in which the horizontal axis indicates a (time-smoothed) measure of perceptual entropy, and the vertical axis indicates a (normalized) count of items per bin of the measure of perceptual entropy. Bin counts 1010 (dark grey) in the histogram relate to a set of audio items that have been manually classified as applause items that are improved by the fix, whereas bin counts 1020 (white) relate to a set of audio items that have been manually classified as applause containing speech (including cheering) and/or music. As can be seen from the histogram, distinguishing between these two classes of audio items may be difficult, depending on circumstances.
However, as has further been found, the sparsity in the frequency domain (spectral sparsity) is a suitable feature for discriminating portions of an audio signal that contain dense transient events (e.g., applause, crackling fire, rain, etc.) and that are improved by the fix from portions that contain dense transient events together with speech (including cheering) or music (and that may not be improved by the fix). This is illustrated in the histogram of Fig. 11 in which the horizontal axis indicates a (time-smoothed) measure of sparsity in the frequency domain, and the vertical axis indicates a (normalized) count of items per bin of the measure of sparsity in the frequency domain. For this histogram, the estimated number of frequency coefficients (e.g., MDCT lines) that are not quantized to zero is used as the measure of sparsity in the frequency domain. However, methods according to this disclosure are not limited to considering such measure of sparsity in the frequency domain, and other measures of sparsity in the frequency domain are feasible as well. Bin counts 1110 (dark grey) in the histogram relate to a set of audio items that have been manually classified as applause items that are improved by the fix, whereas bin counts 1120 (white) relate to a set of audio items that have been manually classified as applause containing speech (including cheering) and/or music. As can be seen from the histogram, the measure of sparsity in the frequency domain is consistently higher for the applause items than for the items relating to applause containing speech (including cheering) and/or music, so that the sparsity in the frequency domain can provide for suitable discrimination between the two classes of audio items.
Accordingly, the determination of whether the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency shall be used for the portion of the audio signal may involve, in addition to the determination based on the value of the first feature (see, e.g., step S320 described above), comparing the value of the second feature (or, if available, the time-smoothed value of the second feature) to a predetermined threshold for the value of the second feature. This threshold may be determined manually, for example, to have a value that ensures reliable classification of audio items into applause items that are improved by the fix and items relating to applause containing speech (including cheering) and/or music. The quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be conditionally selected in accordance with (e.g., depending on) a result of the comparison. For example, the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be selected if (e.g., only if) the value of the second feature (or the time-smoothed value of the second feature) is above the predetermined threshold for the second feature. Notably, reference to applause, as an example of an audio item containing dense transient events, is made without intended limitation, and the present disclosure shall not be construed to be in any way limited by this reference.
In other words, in certain implementations, the decision of whether or not to select the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency may be based on the result of the comparison of the (time-smoothed) value of the first feature to its respective threshold and/or the result of the comparison of the time variation of the value of the first feature to its respective threshold, and the result of the comparison of the (time-smoothed) value of the second feature to its respective threshold. For example, it may be determined that the quantization mode that applies (e.g., enforces) a substantially constant signal-to-noise ratio over frequency shall be used for the portion of the audio signal if (e.g., only if) the (time-smoothed) value of the first feature is above the predetermined threshold for the value of the first feature and/or the time variation of the value of the first feature is below the predetermined threshold for the time variation of the value of the first feature, and the (time-smoothed) value of the second feature is above the predetermined threshold for the value of the second feature.
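The combined decision described above can be sketched in Python as follows. The threshold names and the numeric values in the usage note are hypothetical, and the "and/or" between the two criteria on the first feature is exposed as a parameter because the text leaves both readings open.

```python
from dataclasses import dataclass

@dataclass
class Thresholds:
    """Predetermined thresholds (hypothetical names; values are tuned offline)."""
    pe_min: float            # first feature must be above this value
    pe_variation_max: float  # time variation of first feature must be below this
    sparsity_min: float      # second feature must be above this value

def use_constant_snr_mode(pe, pe_variation, sparsity, thr,
                          require_both_pe_criteria=True):
    """Decide whether the constant-SNR quantization mode shall be used.
    All inputs are per-portion values, time-smoothed where available."""
    pe_high = pe > thr.pe_min
    pe_steady = pe_variation < thr.pe_variation_max
    if require_both_pe_criteria:
        first_feature_ok = pe_high and pe_steady
    else:
        first_feature_ok = pe_high or pe_steady
    # The second feature is higher for spectrally denser portions.
    return first_feature_ok and sparsity > thr.sparsity_min
```

For instance, with the made-up thresholds Thresholds(pe_min=100.0, pe_variation_max=5.0, sparsity_min=50.0), a portion with pe=120, pe_variation=2, and sparsity=60 selects the constant-SNR mode, while the same portion with sparsity=40 does not.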
On the other hand, if the aforementioned criteria of the determination are not met, a quantization mode that does not apply a substantially constant SNR over frequency (i.e., that applies different SNRs to different frequencies or frequency bands) may be selected at this point. In other words, the constant SNR quantization mode is conditionally applied depending on whether the aforementioned criteria of the determination are met.
Notwithstanding the above, relying on the value of the first feature alone in step S420 (as is done in step S320 in method 300, for example) may nevertheless produce an auditory result that is perceived overall as an improvement over conventional techniques for encoding dense transient events.
Step S430 in variation 400 corresponds to step S330 of method 300 in Fig. 3 and any statements made above with respect to this step apply also here.
Notably, also the form factor and the perceptually weighted form factor of (a portion of) an audio signal are computed during normal encoding operation of state-of-the-art audio encoders, such as AC-4, for example. Thus, relying on these features as a measure of sparsity in the frequency domain for purposes of selecting a quantization mode does not significantly add to complexity, delay, and memory footprint of the encoding process.
Next, a method 500 for detecting dense transient events (e.g., applause, crackling fire, rain, etc.) in a portion of an audio signal (e.g., for classifying a portion of an audio signal as to whether the portion is likely to contain dense transient events) according to embodiments of the disclosure will be described with reference to Fig. 5. Herein, it is understood that the portion is classified as likely to contain dense transient events if (e.g., only if) a probability that the portion contains dense transient events is found to exceed a predetermined probability threshold.
Step S510 in method 500 corresponds to step S310 of method 300 in Fig. 3 and any statements made above with respect to this step apply also here.
At step S520, it is determined whether the portion of the audio signal is likely to contain dense transient events based at least in part on the value of the first feature. This step corresponds to the determination of, based at least in part on the value of the first feature, whether the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency (e.g., for all frequency bands) shall be used for the portion of the audio signal in step S320 of method 300 in Fig. 3, except that this determination is replaced by the determination of whether the portion of the audio signal is likely to contain dense transient events based at least in part on the value of the first feature. Otherwise, the details of the determination, in particular the determination criteria, are the same as in step S320 of method 300 in Fig. 3 and any statements made above with respect to this step apply also here.
An apparatus or module performing steps S510 and S520 may be referred to as a detector for detecting dense transient events.
At optional step S530, metadata is generated for the portion of the audio signal. The metadata may be indicative of whether the portion of the audio signal is likely to contain dense transient events (e.g., whether the portion of the audio signal is determined at step S520 to be likely to contain dense transient events). To this end, the metadata may include a binary decision bit (e.g., flag) for each portion of the audio signal, which may be set if the portion of the audio signal is (determined to be) likely to contain dense transient events.
Providing this kind of metadata enables downstream devices to perform more efficient and/or improved post processing with regard to dense transient events. For example, specific post processing for dense transient events may be performed for a given portion of the audio signal if (e.g., only if, or if and only if) the metadata indicates that the portion of the audio signal is likely to contain dense transient events.
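A minimal Python sketch of generating such a per-portion flag and of using it in a downstream device is given below. The function names and the representation of the metadata as a plain list of 0/1 values are assumptions for illustration; the actual bitstream syntax is not specified here.

```python
def make_metadata(decisions):
    """One binary flag per portion; 1 means the portion was determined to be
    likely to contain dense transient events (e.g., at step S520)."""
    return [1 if d else 0 for d in decisions]

def post_process(portions, flags, dense_transient_post_processing):
    """Downstream side: apply the dedicated post processing only to
    portions whose flag is set, and pass all other portions through."""
    return [dense_transient_post_processing(p) if f == 1 else p
            for p, f in zip(portions, flags)]
```

Because the flag is carried alongside the audio, the downstream device does not need to repeat the detection itself.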
However, the result of the determination (classification) of step S520 may also be used for other purposes apart from generating metadata, and the present disclosure shall not be construed as being limited to generating metadata that is indicative of the result of the determination (classification).
Fig. 6 is a flow chart illustrating an example of a variation 600 of the method 500 of Fig. 5.
Step S610 in variation 600 corresponds to step S510 of method 500 in Fig. 5 (and thereby to step S310 of method 300 in Fig. 3 and step S410 of variation 400 in Fig. 4) and any statements made above with respect to this step (or these steps) apply also here.
Step S615 in variation 600 corresponds to step S415 of variation 400 of Fig. 4 and any statements made above with respect to this step apply also here.
At step S620, it is determined whether the portion of the audio signal is likely to contain dense transient events based (at least in part) on the value of the first feature and the value of the second feature. This step corresponds to the determination of, based at least in part on the value of the first feature and the value of the second feature, whether the quantization mode that applies (e.g., enforces) the substantially constant signal-to-noise ratio over frequency (e.g., for all frequency bands) shall be used for the portion of the audio signal in step S420 of variation 400 in Fig. 4, except that this determination is replaced by the determination of whether the portion of the audio signal is likely to contain dense transient events based (at least in part) on the value of the first feature and the value of the second feature. Otherwise, the details of the determination, in particular the determination criteria, are the same as in step S420 of variation 400 in Fig. 4 and any statements made above with respect to this step apply also here.
Step S630 in variation 600 corresponds to step S530 in Fig. 5 and any statements made above with respect to this step apply also here.
Next, an example of another method 700 of encoding a portion (e.g., frame) of an audio signal according to embodiments of the disclosure will be described with reference to the flow chart of Fig. 7. This method may be advantageously applied for encoding portions of an audio signal that contain dense transient events, such as applause, crackling fire, or rain, for example.
At step S710, it is determined whether the portion of the audio signal is likely to contain dense transient events (e.g., applause, crackling fire, rain, etc.). This determination may involve the same criteria and decisions as the determination of, based at least in part on the value of the first feature, whether a quantization mode that applies a substantially constant signal-to-noise ratio over frequency shall be used for the portion of the audio signal in step S320 of method 300 in Fig. 3 or the determination of, based at least in part on the value of the first feature and the value of the second feature, whether a quantization mode that applies a substantially constant signal-to-noise ratio over frequency shall be used for the portion of the audio signal in step S420 of variation 400 in Fig. 4. Accordingly, this step may comprise obtaining the value of the first feature (e.g., in the manner described with reference to step S310 of method 300 in Fig. 3) and/or obtaining the value of the second feature (e.g., in the manner described with reference to step S415 of variation 400 in Fig. 4). However, the present disclosure is not limited to these determinations, and other processes for determining whether the portion of the audio signal is likely to contain dense transient events are feasible as well.
At step S720, if (e.g., only if) it is determined that the portion of the audio signal is likely to contain dense transient events, the portion of the audio signal is quantized using a quantization mode that applies a (substantially) constant signal-to-noise ratio over frequency for the portion of the audio signal. In other words, the constant SNR quantization mode is conditionally applied depending on whether the portion of the audio signal is determined to be likely to contain dense transient events. The quantization mode that applies the (substantially) constant SNR has been described above, for example with reference to step S330 of method 300 in Fig. 3.
As indicated above, the quantization mode that applies the (substantially) constant signal-to-noise ratio over frequency for the portion of the audio signal (constant SNR quantization mode) is particularly suitable for encoding portions of an audio signal that contain dense transient events. The determination at step S710 ensures that portions of the audio signal for which the constant SNR quantization mode is not suitable are not quantized using this quantization mode, thereby avoiding degradation of such portions.
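The control flow of method 700 (steps S710 and S720) can be summarized by the following Python sketch. The detector and the two quantizers are passed in as callables, since the disclosure allows different detection processes; all names are hypothetical.

```python
def encode_portion(coeffs, is_likely_dense_transient,
                   quantize_constant_snr, quantize_default):
    """Step S710: classify the portion. Step S720: apply the constant-SNR
    quantization mode only if dense transient events are likely."""
    if is_likely_dense_transient(coeffs):
        return "constant_snr", quantize_constant_snr(coeffs)
    # Portions for which the constant-SNR mode is unsuitable keep the
    # regular, psychoacoustically shaped quantization.
    return "default", quantize_default(coeffs)
```

Keeping the detector separate from the quantizers mirrors the split into a dense transient event detection unit and a quantization unit.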
It is understood that the proposed methods of encoding a portion of an audio signal and of detecting dense transient events in a portion of an audio signal may be implemented by respective suitable apparatus (e.g., encoders for encoding a portion of an audio signal). Such apparatus (e.g., encoder) may comprise respective units adapted to carry out respective steps described above. For instance, such apparatus for performing method 300 may comprise a first feature determination unit adapted to perform aforementioned step S310 (and likewise aforementioned steps S410, S510, and S610), a quantization mode selection unit adapted to perform aforementioned step S320, and a quantization unit adapted to perform aforementioned step S330 (and likewise aforementioned steps S430 and S720). Likewise, an apparatus for performing variation 400 of method 300 may comprise the first feature determination unit, a second feature determination unit adapted to perform aforementioned step S415, a modified quantization mode selection unit adapted to perform aforementioned step S420, and the quantization unit. An apparatus for performing method 500 may comprise the first feature determination unit, an audio content determination unit adapted to perform aforementioned step S520, and optionally a metadata generation unit adapted to perform aforementioned step S530 (and likewise aforementioned step S630). An apparatus for performing variation 600 of method 500 may comprise the first feature determination unit, the second feature determination unit, a modified audio content determination unit adapted to perform aforementioned step S620, and optionally the metadata generation unit. An apparatus for performing method 700 may comprise a dense transient event detection unit adapted to perform aforementioned step S710, and the quantization unit. 
It is further understood that the respective units of such apparatus (e.g., encoder) may be embodied by a processor of a computing device that is adapted to perform the processing carried out by each of said respective units, i.e. that is adapted to carry out each of the aforementioned steps. This processor may be coupled to a memory that stores respective instructions for the processor.
It should be noted that the description and drawings merely illustrate the principles of the proposed methods and apparatus. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the invention and are included within its spirit and scope. Furthermore, all examples recited herein are principally intended expressly to be only for pedagogical purposes to aid the reader in understanding the principles of the proposed methods and apparatus and the concepts contributed by the inventors to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention, as well as specific examples thereof, are intended to encompass equivalents thereof.
The methods and apparatus described in the present disclosure may be implemented as software, firmware and/or hardware. Certain components may e.g. be implemented as software running on a digital signal processor or microprocessor. Other components may e.g. be implemented as hardware and/or as application specific integrated circuits. The signals encountered in the described methods and apparatus may be stored on media such as random access memory or optical storage media. They may be transferred via networks, such as radio networks, satellite networks, wireless networks or wireline networks, e.g. the Internet.

Claims

1. A method of encoding a portion of an audio signal, the method comprising:
determining whether the portion of the audio signal is likely to contain dense transient events; and
if it is determined that the portion of the audio signal is likely to contain dense transient events, quantizing the portion of the audio signal using a quantization mode that applies a substantially constant signal-to-noise ratio over frequency for the portion of the audio signal.
2. The method of Claim 1, further comprising obtaining a value of a first feature relating to a perceptual entropy of the portion of the audio signal, wherein said determining is based at least in part on the value of the first feature.
3. The method of Claim 1 or 2, further comprising obtaining a value of a second feature relating to a measure of sparsity in the frequency domain of the portion of the audio signal, wherein said determining is further based on the value of the second feature.
4. The method of Claim 2 or Claim 3 in its dependence on Claim 2, further comprising smoothing the value of the first feature over time to obtain a time-smoothed value of the first feature, wherein said determining is based on the time-smoothed value of the first feature.
5. The method of Claim 2 or any one of Claims 3 to 4 in their dependence on Claim 2, wherein said determining involves comparing the value of the first feature to a predetermined threshold for the value of the first feature; and
said quantization mode that applies the substantially constant signal-to-noise ratio over frequency is selected if the value of the first feature is above the predetermined threshold for the value of the first feature.
6. The method of Claim 2 or any one of Claims 3 to 5 in their dependence on Claim 2, wherein said determining is based on a variation over time of the value of the first feature.
7. The method of Claim 6,
wherein said determining involves comparing the variation over time of the value of the first feature to a predetermined threshold for the variation; and
said quantization mode that applies the substantially constant signal-to-noise ratio over frequency is selected if the variation of the value of the first feature is below the predetermined threshold for the variation.
8. The method of Claim 3 or any one of Claims 4 to 7 in their dependence on Claim 3, further comprising smoothing the value of the second feature over time to obtain a time-smoothed value of the second feature, wherein said determining is based on the time-smoothed value of the second feature.
9. The method of Claim 3 or any one of Claims 4 to 8 in their dependence on Claim 3, wherein said determining involves comparing the value of the second feature to a predetermined threshold for the value of the second feature; and
said quantization mode that applies the substantially constant signal-to-noise ratio over frequency is selected if the value of the second feature is above the predetermined threshold for the value of the second feature.
10. The method of Claim 2 or any one of Claims 3 to 9 in their dependence on Claim 2, wherein the first feature is proportional to the perceptual entropy; and optionally
the value of the first feature is obtained in the frequency domain.
11. A method of encoding a portion of an audio signal, the method comprising:
obtaining a value of a first feature relating to a perceptual entropy of the portion of the audio signal;
selecting a quantization mode for quantizing the portion of the audio signal based on the value of the first feature; and
quantizing the portion of the audio signal using the selected quantization mode, wherein selecting the quantization mode involves determining, based at least in part on the value of the first feature, whether a quantization mode that applies a substantially constant signal-to-noise ratio over frequency shall be used for the portion of the audio signal.
12. The method of Claim 11, further comprising obtaining a value of a second feature relating to a measure of sparsity in the frequency domain of the portion of the audio signal, wherein said determining is further based on the value of the second feature.
13. The method of Claim 11 or 12, wherein said determining is based on a variation over time of the value of the first feature.
14. The method of any one of Claims 11 to 13, wherein the first feature is proportional to the perceptual entropy; and optionally
the value of the first feature is obtained in the frequency domain.
15. A method of detecting dense transient events in a portion of an audio signal, the method comprising:
obtaining a value of a first feature relating to a perceptual entropy of the portion of the audio signal; and
determining whether the portion of the audio signal is likely to contain dense transient events based at least in part on the value of the first feature.
16. The method of Claim 15, further comprising generating metadata for the portion of the audio signal, wherein the metadata is indicative of whether the portion of the audio signal is likely to contain dense transient events.
17. The method of Claim 15 or 16, further comprising obtaining a value of a second feature relating to a measure of sparsity in the frequency domain of the portion of the audio signal,
wherein said determining is further based on the value of the second feature.
18. The method of any one of Claims 15 to 17, wherein said determining is based on a variation over time of the value of the first feature.
19. The method of any one of Claims 15 to 18, wherein the first feature is proportional to the perceptual entropy; and optionally
the value of the first feature is obtained in the frequency domain.
20. An apparatus comprising a processor and a memory coupled to the processor and storing instructions for execution by the processor, wherein the processor is adapted to perform the method of any one of Claims 1 to 19.
PCT/EP2018/067970 2017-07-03 2018-07-03 Low complexity dense transient events detection and coding WO2019007969A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN201880049530.1A CN110998722B (en) 2017-07-03 2018-07-03 Low complexity dense transient event detection and decoding
EP18733915.5A EP3649640A1 (en) 2017-07-03 2018-07-03 Low complexity dense transient events detection and coding
US16/628,235 US11232804B2 (en) 2017-07-03 2018-07-03 Low complexity dense transient events detection and coding
JP2019572693A JP7257975B2 (en) 2017-07-03 2018-07-03 Reduced congestion transient detection and coding complexity

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201762528198P 2017-07-03 2017-07-03
EP17179316 2017-07-03
US62/528,198 2017-07-03
EP17179316.9 2017-07-03

Publications (1)

Publication Number Publication Date
WO2019007969A1 (en) 2019-01-10

Family

ID=59276592

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/067970 WO2019007969A1 (en) 2017-07-03 2018-07-03 Low complexity dense transient events detection and coding

Country Status (1)

Country Link
WO (1) WO2019007969A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1074976A2 (en) * 1999-08-05 2001-02-07 Ricoh Company, Ltd. Block switching based subband audio coder

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"3rd Generation Partnership Project; Technical Specification Group Services and System Aspects; General audio codec audio processing functions; Enhanced aacPlus general audio codec; Encoder specification AAC part (Release 6); 3GPP TS 26.403 V6.0.0", 3RD GENERATION PARTNERSHIP PROJECT (3GPP); TECHNICALSPECIFICATION (TS), XX, XX, vol. 26.403, no. V6.0.0, 1 September 2004 (2004-09-01), pages 1 - 23, XP002410983 *
HERRE J ET AL: "Continuously signal-adaptive filterbank for high-quality perceptual audio coding", APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, 1997. 1997 I EEE ASSP WORKSHOP ON NEW PALTZ, NY, USA 19-22 OCT. 1997, NEW YORK, NY, USA,IEEE, US, 19 October 1997 (1997-10-19), pages 1 - 4, XP002102336, ISBN: 978-0-7803-3908-8, DOI: 10.1109/ASPAA.1997.625588 *
NIALL P. HURLEY; SCOTT T. RICKARD, COMPARING MEASURES OF SPARSITY, vol. 55, no. 10, 2009, Retrieved from the Internet <URL:http://ieeexplore.ieee.org/xpl/Recentlssue.jsp?punumber=18>
PRINCEN J ET AL: "Audio coding with signal adaptive filterbanks", 1995 INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING; 9-12 MAY ,1995 ; DETROIT, MI, IEEE, NEW YORK, NY, USA, vol. 5, 9 May 1995 (1995-05-09), pages 3071 - 3074, XP010151993, ISBN: 978-0-7803-2431-2, DOI: 10.1109/ICASSP.1995.479494 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18733915

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019572693

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018733915

Country of ref document: EP

Effective date: 20200203