WO2009081315A1 - Encoding and decoding of an audio or speech signal - Google Patents

Encoding and decoding of an audio or speech signal

Info

Publication number
WO2009081315A1
WO2009081315A1 (PCT/IB2008/055250)
Authority
WO
WIPO (PCT)
Prior art keywords
signal
encoding
encoded
input signal
predetermined frequency
Prior art date
Application number
PCT/IB2008/055250
Other languages
English (en)
Inventor
Albertus C. Den Brinker
Steven L. J. D. E. Van De Par
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V. filed Critical Koninklijke Philips Electronics N.V.
Publication of WO2009081315A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/0204 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders using subband decomposition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/18 Vocoders using multiple modes
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band

Definitions

  • the invention relates to an encoding method for encoding an input signal comprising at least one of audio or speech into an output data stream. Further the invention relates to a decoding method for decoding an input data stream into an output signal, an encoder for encoding an input signal comprising at least one of audio or speech into an output data stream, a decoder for decoding an input data stream into an output signal, and a computer program product.
  • Audio coders predominantly use subband or transform coding as the basic technique for lossy compression. The result is that good-quality audio can be attained at compression ratios of up to 15 for 16-bit, 44.1 kHz sampled audio signals. Extending the subband coding with a parametric representation of the high frequency band (Spectral Band Replication, SBR) and stereo or multichannel data improves the efficiency even further.
  • SBR: Spectral Band Replication
  • Speech coders target accurate and low bit rate coding of speech signals. Speech signals can be compressed to extremely low bit rates giving a good speech quality. Taking audio as input for a speech coder results in poor quality audio, mainly due to the restrictive signal model that is used in the speech coder which gives the speech coder its high efficiency for compression of speech signals.
  • the AMR-WB+ standard (R. Salami, R. Lefebvre, A. Lakaniemi, K. Kontrola, S. Bruhn and A. Taleb, "Extended AMR-WB for high-quality audio on mobile devices", IEEE Communications Magazine, May 2006, pp. 90-97) describes a system that has both a transform-based audio coder (frequency domain coder, FDC) and a CELP-based speech coder (time domain coder, TDC).
  • FDC: frequency domain coder
  • TDC: CELP-based speech coder (time domain coder)
  • it is an object of the invention to provide an enhanced encoding method for encoding an input signal comprising at least one of audio or speech into an output data stream that does not require hard switching, and therefore does not lead to suboptimal decisions or to audible artifacts resulting from hard switching.
  • This object is achieved by an encoding method for encoding an input signal comprising at least one of audio or speech into an output data stream which comprises the following steps.
  • The first step comprises encoding at least a part of the input signal using a frequency domain encoding to produce a first encoded signal.
  • Said frequency domain encoding comprises encoding only frequencies from the part that are below a predetermined frequency.
  • The second step comprises encoding at least some of a remaining part of the input signal using a time domain encoding to produce a second encoded signal.
  • Said remaining part of the input signal comprises those parts of the input signal that are not being encoded with the frequency domain encoding.
  • Said time domain encoding encodes at least some parts of said remaining part of the input signal, e.g. a part of the input signal comprising the frequencies above the predetermined frequency, or a part comprising both the frequencies above the predetermined frequency and the residual frequencies that are below the predetermined frequency and were not encoded by the frequency domain encoding.
  • The third step comprises combining the first encoded signal and the second encoded signal to produce the output data stream.
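  • As an illustration only, the three steps can be sketched in Python, with hypothetical frequency_domain_encode and time_domain_encode stubs standing in for the actual transform-based and CELP-based coders; the brick-wall FFT split and the stub payloads are assumptions, not part of the original description.

```python
import numpy as np

def split_at_frequency(x, fs, f_c):
    """Brick-wall split of x into the band below f_c and the remaining part (FFT masking)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    low = np.fft.irfft(np.where(freqs < f_c, spectrum, 0.0), n=len(x))
    return low, x - low                                      # remaining part of the input signal

def frequency_domain_encode(low_band):                       # hypothetical FD coder stub
    return {"type": "FD", "payload": np.round(low_band, 3)}

def time_domain_encode(remaining):                           # hypothetical TD coder stub
    return {"type": "TD", "payload": np.round(remaining, 3)}

def encode(x, fs, f_c):
    low, remaining = split_at_frequency(x, fs, f_c)
    first = frequency_domain_encode(low)                     # first step: FD encoding below f_c
    second = time_domain_encode(remaining)                   # second step: TD encoding of the rest
    return {"f_c": f_c, "first": first, "second": second}    # third step: combine into one stream

fs = 44100
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(fs)
stream = encode(x, fs, f_c=1000.0)
```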
  • the proposed solution reduces the artifacts of hard switching as used in the prior art. It actually eliminates the need for hard switching, as the invention proposes a soft decision by allowing different encoding methods to encode different parts of the input signal. Said combination of different encodings is motivated by perceptual properties and aims at an encoding method that better balances the strengths of the separate coders.
  • the basic perceptual property of the human auditory system which is exploited in the present invention is that auditory frequency resolution is better at low frequencies, while temporal resolution is better at high frequencies. This motivates the configuration of using the frequency domain encoding predominantly for the low frequencies and the time domain encoding predominantly for the remaining frequencies.
  • the frequency domain encoding has to be run predominantly using long frames in order to obtain efficiency and good quality for audio.
  • the time domain encoding always produces a roughness on the stable tones typically present in music.
  • a frequency domain encoding using long frames produces strong artefacts for highly dynamic input signals, such as speech. In other words, long frames result in insufficient time resolution for speech signals. This creates pre-echo, echo, and double-speech artifacts and reduces the temporal peakedness of voiced parts. It can also create spectral holes in the absence of sufficient bit rate; these are typically created in the higher frequency ranges.
  • a time domain encoding hardly ever creates spectral holes.
  • the inventors have noticed that it is advantageous to use the frequency domain encoding for (very) low frequencies because the human hearing system is insensitive to their specific temporal structure and because a time domain encoding is incapable of handling these low frequencies adequately due to its limited frequency resolution (short frame sizes). Therefore it is advantageous according to the invention to apply the time domain encoding to the higher frequency ranges if the input signal exhibits temporal structure in these ranges that needs to be preserved. Using the time domain encoding automatically prevents the occurrence of spectral holes.
  • the predetermined frequency is indicated in the output data stream. It allows a decoder to properly decode the output data stream produced with the encoding method according to the invention. It is especially advantageous when the predetermined frequency varies over time.
  • the predetermined frequency is user-configurable. In a further embodiment, the predetermined frequency is varying across the duration of the input signal and is derived based on properties of the input signal. The character of the input signal changes over time. Therefore, it is advantageous to vary the predetermined frequency accordingly to obtain the best coding results.
  • the predetermined frequency is varying across the duration of the input signal and is derived based on a spectral peakedness and a temporal peakedness of the input signal.
  • the spectral and temporal peakedness measures essentially determine the number of samples deviating largely from the zero or mean value relative to the number of samples close to the zero or mean value.
  • the spectral peakedness measure determines this from a frequency-domain representation of the input signal or a preprocessed version thereof.
  • the temporal peakedness measure determines this from the time-domain representation of the input signal or a pre-processed version thereof.
  • said spectral and/or temporal peakedness measures are derived for slices of the input signal.
  • the constituent encoders operate by segmenting the input signal and encoding the resulting signal segments. It is efficient to determine the predetermined frequency by considering slices (segments) of the input signal and calculating the optimal predetermined frequency for each segment.
  • the remaining part of the input signal to be encoded using the time domain encoding comprises frequencies above the predetermined frequency.
  • specific parts, i.e. the low-frequency range and the high-frequency range, of the input signal are encoded with appropriate techniques, resulting in a high-quality, efficient representation of the input signal.
  • the remaining part of the input signal to be encoded using the time domain encoding comprises a difference between the input signal and the signal corresponding to the first encoded signal.
  • common information related to a spectral envelope of the input signal is shared by frequency domain encoding and time domain encoding. This means that said common information needs to be encoded in the output data stream only once, thus reducing the bitrate and improving the efficiency of encoding.
  • the common information comprises scale factors for the frequency domain encoding or flattening filter parameters for the time domain encoding. Both scale factors and flattening filter parameters are strongly related to the spectral envelope of the input signal. This common basis means that these two types of parameters, corresponding to the frequency domain and the time domain coding respectively, are related, and therefore the efficiency of the encoding can be increased by encoding these parameters jointly or relative to each other.
  • a difference between the actual scale factors and said scale factors inferred from the flattening filter parameters is encoded. From the flattening filter parameters, an estimate of the spectral envelope can be calculated. Consequently, scale factors associated with this spectral envelope can be obtained. These inferred scale factors are very similar to those that are directly derived from the input signal. Therefore, by inferring scale factors from the flattening filter parameters and encoding a difference with respect to the actual scale factors, the efficiency of the code is improved.
  • a difference between the actual coefficients of the flattening filter and said parameters of the flattening filter inferred from the scale factors is encoded. From the scale factors a flattening filter can be inferred.
  • the input signal is adaptively pre-processed by means of a linear prediction analysis filter before encoding into the first and second encoded signals.
  • the linear prediction flattens the spectral envelope of the input signal. This means that the spectral character of the input to the time- and frequency-domain encoding is normalized.
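  • A minimal sketch of such pre-processing, using an autocorrelation-method linear prediction analysis filter (frame handling is omitted and the filter order is an illustrative assumption):

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_analysis_filter(x, order=16):
    """Autocorrelation-method linear prediction: returns A(z) = 1 - sum_k a_k z^-k."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))

def whiten(x, order=16):
    """Flatten the spectral envelope of x; also return the filter parameters for the data stream."""
    a = lpc_analysis_filter(x, order)
    return lfilter(a, [1.0], x), a
```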
  • the invention further provides decoding method, encoder, and decoder claims as well as a computer program product enabling a programmable device to perform the encoding and/or decoding method according to the invention, and a data stream as produced by the encoding method according to the invention.
  • Fig. 1 shows a flow chart for an encoding method for encoding an input signal comprising at least one of audio or speech into an output data stream in accordance with the invention
  • Fig. 2 shows a flow chart for an encoding method wherein the input signal is adaptively pre-processed by means of a linear prediction analysis filter before encoding into the first and the second signals;
  • Fig. 3 shows a flow chart for a decoding method for decoding an input data stream into an output signal in accordance with the invention;
  • Fig. 4 shows a flow chart for a decoding method whereby the output signal is post-processed in dependence of the data in the input data stream by means of a linear prediction
  • Fig. 5 schematically shows an example of an encoder comprising a control unit, a frequency domain encoder, a time domain encoder, and a merger
  • Fig. 6 schematically shows an example of the control unit that determines a predetermined frequency based on a spectral peakedness and temporal peakedness of the input signal
  • Fig. 7 schematically shows an encoder wherein the remaining part of the input signal to be encoded using the time domain encoding comprises a difference between the input signal and the signal corresponding to the first encoded signal;
  • Fig. 8A schematically shows a parallel encoder wherein the input signal is adaptively pre-processed by a linear prediction analysis filter before it is fed into the frequency domain encoder and the time domain encoder;
  • Fig. 8B schematically shows a cascaded encoder wherein the input signal is adaptively pre-processed by a linear prediction analysis filter before it is fed into the frequency domain encoder and the time domain encoder;
  • Fig. 9 schematically shows an example of a decoder comprising a decomposer, a frequency domain decoder, a time domain decoder, and an adder.
  • Fig. 1 shows a flow chart for an encoding method for encoding an input signal comprising at least one of audio or speech into an output data stream in accordance with the invention.
  • the proposed encoding method comprises the following steps.
  • The first step 101 comprises encoding at least a part of the input signal using a frequency domain encoding to produce a first encoded signal.
  • Said frequency domain encoding comprises encoding only frequencies from the part that are below a predetermined frequency.
  • The second step 102 comprises encoding at least some of a remaining part of the input signal using a time domain encoding to produce a second encoded signal.
  • The third step 103 comprises combining the first encoded signal and the second encoded signal to produce the output data stream.
  • the predetermined frequency is indicated in the output data stream.
  • the predetermined frequency or its indicator is combined with the first and second encoded signals into an output data stream.
  • the predetermined frequency indicator is for example a code for a specific value of the predetermined frequency, or an address of a device, e.g. a server on the Internet, wherefrom said value can be retrieved. Having the predetermined frequency or its indicator explicitly in the output data stream is convenient in case the frequency or time domain encoders do not include this predetermined frequency in the first or second encoded signals themselves. In case the predetermined frequency is included in the first or second encoded signal it is not necessary to include the predetermined frequency once more in the output data stream.
  • the predetermined frequency is transmitted implicitly, e.g. the data produced by the frequency domain encoder may comprise an indicator for the highest frequency band comprising signal information, which may serve as the predetermined frequency.
  • the predetermined frequency is used in the time domain encoding to select appropriate codebooks or excitation sequences. Further, said predetermined frequency can also be used to adapt the weighting filter that is typically used in the optimization procedure to determine the optimal excitation signal.
  • when the predetermined frequency does not vary over time it does not need to be transmitted, as the predetermined frequency can then be hard-wired in the decoder to enable an appropriate decoding.
  • the predetermined frequency is user-configurable.
  • the predetermined frequency for specific audio/speech pieces is determined by the user.
  • the user chooses a value of the predetermined frequency from e.g. a predetermined set of values such as 100, 250, 500, 1000, or 2000 Hz.
  • Such a set of predetermined values is determined using a linear or logarithmic division of the full frequency band.
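  • For illustration, such candidate sets could be generated by a linear or logarithmic division of a band; the band limits and the number of values below are arbitrary examples, not the ones prescribed above.

```python
import numpy as np

linear_set = np.linspace(100.0, 2000.0, 5)        # 100, 575, 1050, 1525, 2000 Hz
logarithmic_set = np.geomspace(100.0, 2000.0, 5)  # approx. 100, 211, 447, 946, 2000 Hz
```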
  • Another way of determining the predetermined frequency is to use refined distortion measures as in rate-distortion control, standard methods such as those proposed in S. A. Ramprashad, "The multimode transform predictive coding paradigm", IEEE Trans. Speech Audio Process., 11(2):117-129, March 2003, or US 2007/0106502, or some heuristic rules based upon insights into perception and the coding strengths of the frequency and time domain coding.
  • said spectral and/or temporal peakedness measures are derived for slices of the input signal.
  • the constituent encoders operate by segmenting the input signal and encoding the resulting signal segments. It is efficient to determine the predetermined frequency by considering slices (segments) of the input signal and calculating the optimal predetermined frequency for each segment.
  • the decision about the value of the predetermined frequency is preferably taken at regular intervals.
  • the regular intervals allow the dynamics of the input signal to be followed closely. From the point of view of implementation it is preferable that the regular intervals coincide with the shortest update interval that occurs in the frequency and time domain encoding. However, the intervals over which the peakedness is calculated could be even smaller than the shortest frames used by the time and frequency domain encoding.
  • Peakedness refers here to any measure that correlates with the degree of presence of peaks in a signal (spectral or temporal). Various measures are known to be used for this purpose. As an example, the normalized fourth moment, which is used to measure the degree of fluctuation in an envelope signal (cf. Hartmann and Pumplin, "Noise power fluctuations and masking of sine signals", J. Acoust. Soc. Am., Vol. 83, pp. 2277-2289, 1988), is the basis for a measure of the peakedness P:
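  • as an illustration (the exact expression of the original document is not reproduced in this text, so the normalization below is an assumption), the normalized fourth moment leads to

    P = \frac{\frac{1}{N}\sum_{n=0}^{N-1} x^{4}(n)}{\left( \frac{1}{N}\sum_{n=0}^{N-1} x^{2}(n) \right)^{2}}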
  • where x(n) is the signal for which the peakedness is calculated.
  • the kurtosis measure can be used as a measure of peakedness.
  • a common spectral flatness measure is the geometric mean of the magnitude-spectrum bins divided by the arithmetic mean of the same bins.
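  • A sketch of both statistics in Python; the exact normalizations and the regularization constants are assumptions, not taken from the original text.

```python
import numpy as np

def peakedness(x):
    """Normalized fourth moment; larger values indicate stronger peaks."""
    x = np.asarray(x, dtype=float)
    return np.mean(x ** 4) / (np.mean(x ** 2) ** 2 + 1e-12)

def spectral_flatness(x):
    """Geometric mean of the magnitude-spectrum bins divided by their arithmetic mean."""
    mag = np.abs(np.fft.rfft(x)) + 1e-12
    return np.exp(np.mean(np.log(mag))) / np.mean(mag)

noise = np.random.default_rng(0).standard_normal(4096)    # flat spectrum, no stable tone
tone = np.sin(2 * np.pi * 0.05 * np.arange(4096))          # single stable tone
print(peakedness(noise), peakedness(tone))                  # approx. 3.0 vs 1.5
print(spectral_flatness(noise), spectral_flatness(tone))    # close to 1 vs close to 0
```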
  • the remaining part of the input signal to be encoded using the time domain encoding comprises frequencies above the predetermined frequency. In this way, specific parts of the input signal are encoded with appropriate techniques resulting in a high-quality efficient representation of the input signal.
  • the remaining part of the input signal to be encoded using the time domain encoding comprises a difference between the input signal and the signal corresponding to the first encoded signal.
  • signal components in the low-frequency range which are not adequately represented in the output of the frequency-domain encoding can still be incorporated in the representation generated by the time-domain encoder. This in turn allows a high-quality, efficient encoding.
  • the content of the remaining part of the input signal that is to be encoded by the time domain encoding influences the architecture of the encoder that uses the encoding method according to the invention. If the remaining part of the input signal is determined independently of the first encoded signal, the encoder has a parallel structure, in which the time and frequency domain encoders operate in parallel on the same input signal.
  • otherwise, the encoder has a cascaded structure and the time domain encoding can commence only after the frequency domain encoding.
  • the encoder architecture options will be discussed when Fig. 5 and Fig. 7 are described.
  • common information related to a spectral envelope of the input signal is shared by frequency domain encoding and time domain encoding. This means that said common information needs to be encoded in the output data stream only once, thus reducing the bitrate and improving the efficiency of encoding.
  • the predetermined frequency can be sent in combination with other data or can be inferred from other data.
  • the predetermined frequency could be inferred from the frequency domain encoded data stream by observing which parts of the spectrum are not encoded by the frequency domain encoding.
  • the data produced by the frequency domain encoding may contain an indicator for the highest band containing signal information and this in turn is used to determine the predetermined frequency.
  • the common information comprises scale factors for the frequency domain encoding or flattening filter parameters for the time domain encoding. Both scale factors and flattening filter parameters are strongly related to the spectral envelope of the input signal.
  • the frequency domain encoding usually generates so-called scale factors while the time domain encoding usually generates a flattening filter. Both types of information reflect a spectral envelope of the input signal.
  • the sub-encoders and sub-decoders, namely those related to the frequency and time domain encoding/decoding respectively, have to communicate the common information to each other.
  • a difference between the actual scale factors and said scale factors inferred from the flattening filter parameters is encoded. From the flattening filter parameters, an estimate of the spectral envelope can be calculated. Consequently, scale factors associated with this spectral envelope can be obtained. These inferred scale factors are very similar to those that are directly derived from the input signal. Therefore, by inferring scale factors from the flattening filter parameters and encoding a difference with respect to the actual scale factors, the efficiency of the code is improved.
  • a difference between the actual coefficients of the flattening filter and said parameters of the flattening filter inferred from the scale factors is encoded. From the scale factors a flattening filter can be inferred.
  • the parameters coincide for a large part with those of a flattening filter calculated from the input signal. Therefore, encoding the difference increases the efficiency of the code.
  • the scale factors are used to calculate an approximation of the flattening filter and its corresponding parameters, e.g. reflection coefficients or line spectral frequencies, and only the differences between the actual parameters of the flattening filter and those calculated from the scale factors are to be comprised in the output data stream.
  • a simple procedure to obtain the parameters of a flattening filter from the scale factors is to first construct the scale factors as a function of frequency. This yields a piece-wise constant characteristic. It may be advantageous to smooth this function before further processing. Squaring the values of this function and performing an inverse Fourier transform yields an auto-correlation function.
  • This function can be used as input to a flattening filter design algorithm (e.g. Levinson-Durbin) which yields the parameters of a flattening filter corresponding to the scale factors.
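  • A sketch of this procedure in Python; the band edges, the smoothing length, and the filter order are illustrative assumptions.

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: autocorrelation r[0..order] -> filter coefficients a[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def flattening_params_from_scale_factors(scale_factors, band_edges, n_bins=257, order=10):
    """Piece-wise constant scale-factor characteristic -> autocorrelation -> flattening filter."""
    characteristic = np.ones(n_bins)
    for sf, (lo, hi) in zip(scale_factors, band_edges):
        characteristic[lo:hi] = sf                                   # scale factor vs. frequency
    characteristic = np.convolve(characteristic, np.ones(5) / 5.0, mode="same")  # optional smoothing
    autocorr = np.fft.irfft(characteristic ** 2)[:order + 1]         # square, then inverse Fourier
    return levinson_durbin(autocorr, order)

# Only the difference between the actual flattening-filter parameters and the inferred ones
# would then be placed in the output data stream, e.g.:
# delta = actual_params - flattening_params_from_scale_factors(scale_factors, band_edges)
```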
  • Yet another alternative is, for example, to split the flattening filter into a cascade of two filters.
  • the first filter parameters are then derived from the scale factors and the second filter parameters are a refinement to attain flattening.
  • Fig. 2 shows a flow chart for an encoding method wherein the input signal is adaptively pre-processed by means of a linear prediction analysis filter before encoding into the first and the second signals. This is illustrated by step 104 preceding the steps of the frequency domain encoding 101, the time domain encoding 102, and combining 103 the outcome of the previous steps together with the predetermined frequency into the output data stream.
  • the linear prediction flattens the spectral envelope of the input signal. This means that the spectral character of the input to the time- and frequency-domain encoding is normalized. This a priori knowledge allows simplified encoding and reduces the effects of smearing due to windowing. The quality of the encoded signal is thus improved.
  • Fig. 3 shows a flow chart for a decoding method for decoding an input data stream into an output signal in accordance with the invention.
  • a decoding method for decoding an input data stream into an output signal comprises the following steps.
  • The first step 201 comprises decomposing the input data stream into a first encoded signal and a second encoded signal.
  • decoding of the first encoded signal is performed by using a frequency domain decoding to produce a first decoded signal.
  • the first decoded signal comprises a reconstructed part of the output signal comprising frequencies that are below a predetermined frequency.
  • decoding of the second encoded signal is performed by using a time domain decoding to produce a second decoded signal.
  • Said second decoded signal comprises a reconstructed remaining part of the output signal.
  • adding of the first decoded signal and the second decoded signal is performed to produce the output signal.
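  • A minimal sketch of these decoding steps, with hypothetical frequency_domain_decode and time_domain_decode stubs mirroring the encoder sketch given earlier (assumptions, not the actual codecs):

```python
import numpy as np

def frequency_domain_decode(first_encoded):     # hypothetical FD decoder stub
    return np.asarray(first_encoded["payload"], dtype=float)

def time_domain_decode(second_encoded):         # hypothetical TD decoder stub
    return np.asarray(second_encoded["payload"], dtype=float)

def decode(stream):
    first, second = stream["first"], stream["second"]   # decompose the input data stream
    low = frequency_domain_decode(first)                 # reconstructed part below the predetermined frequency
    remaining = time_domain_decode(second)               # reconstructed remaining part
    return low + remaining                               # add to form the output signal
```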
  • the input data stream comprises e.g. a file or a stream retrieved from storage or from a network (e.g. the Internet).
  • Said input data stream comprises data generated by the encoding method according to the invention.
  • an indicator for the predetermined frequency is derived from the input data stream.
  • Said indicator comprises e.g. the predetermined frequency value itself, or a code which can be mapped directly to the predetermined frequency value, or a device address from which the predetermined frequency value can be retrieved.
  • common information related to a spectral envelope of the output signal to be reconstructed from the first encoded signal and the second encoded signal is shared between the frequency domain decoding and the time domain decoding.
  • the common information comprises scale factors for the frequency domain decoding or flattening filter coefficients for the time domain decoding. The decoder performs the reverse of the operation done in the encoder in order to retrieve the appropriate information for the decoding steps 202 and 203.
  • a difference between the actual scale factors and the scale factors inferred from the flattening filter coefficients is decoded.
  • a difference between the actual coefficients of the flattening filter and the coefficients of the flattening filter inferred from the scale factors is decoded.
  • Fig. 4 shows a flow chart for a decoding method whereby the output signal is post-processed in dependence of the data in the input data stream by means of a linear prediction.
  • the encoding might comprise a step of adaptive pre-processing of the input signal by means of a linear prediction analysis filter before the actual encoding into the first and the second signals. This is done in order to improve the quality of the encoding.
  • if the parameters of the linear prediction filter are comprised in the input data stream, it is advantageous to use these parameters to post-process the output signal in order to make it resemble the originally encoded input signal even more closely.
  • Fig. 5 schematically shows an example of an encoder 300 comprising a control unit 310, a frequency domain encoder 330, a time domain encoder 320, and a merger 340.
  • the encoder 300 encodes the input signal 301 comprising at least one of audio or speech into an output data stream 305.
  • the input signal 301 is fed into the control unit 310, and to the time domain encoder 320 and frequency encoder 330 as well.
  • the control unit 310 determines a predetermined frequency 302 based on the input signal 301. Said predetermined frequency is provided to the encoders 320 and 330 as well as to the merger 340.
  • the frequency domain encoder 330 encodes the input signal 301 into a first encoded signal 303.
  • the first encoded signal 303 comprises encoded frequencies from the low frequency part of the input signal 301, whereby said frequencies are below the predetermined frequency 302 as derived by the control unit 310.
  • the time domain encoder 320 encodes a remaining part of the input signal 301 into a second encoded signal 304.
  • the second encoded signal 304 comprises frequencies from a remaining part of the input signal 301.
  • Said first encoded signal and second encoded signal are fed into the merger 340, which combines the first encoded signal 303, the second encoded signal 304 and the predetermined frequency 302 to produce the output data stream 305.
  • Including the predetermined frequency 302 in the output data stream 305 is optional for the reasons discussed before.
  • Said architecture of the encoder is referred to as a parallel encoder, since the frequency domain encoder 330 and the time domain encoder 320 operate in parallel on the input signal.
  • Said encoders 320 and 330 operate independently from each other. However, the control unit 310 provides them with the predetermined frequency 302 to ensure that they operate on complementary frequency parts.
  • the encoder 300 is parallel; this means that the frequency domain encoder and the time domain encoder operate in parallel, independently of each other.
  • the control unit determines a predetermined frequency based on properties of the input signal.
  • Fig. 6 schematically shows an example of the control unit 310 that determines a predetermined frequency based on a spectral peakedness and temporal peakedness of the input signal 301.
  • the input signal 301 is fed into a spectral whitening unit 311 that applies linear predictive filtering to whiten the spectrum of the input signal.
  • a gain normalization can be applied in the spectral whitening unit 311. This results in flattening of the signal on a coarse temporal scale.
  • the input signal 301 is fed into a transform unit 313 that transforms chunks of the time-domain signal into a frequency-domain signal. The output of this unit could be used by the frequency domain encoder 330.
  • Units 312 and 314 derive a peakedness measure for the signals provided from the units 311 and 313, respectively.
  • the unit 312 calculates the temporal peakedness measure. This type of peakedness is based on the spectrally whitened signal x_w(n) across the interval [0, ..., N-1] and is expressed, for example, as:
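  • one possible form, assuming the normalized fourth moment introduced earlier (the original expression is not reproduced in this text), is

    P_t = \frac{\frac{1}{N}\sum_{n=0}^{N-1} x_w^{4}(n)}{\left( \frac{1}{N}\sum_{n=0}^{N-1} x_w^{2}(n) \right)^{2}}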
  • the size of the interval can vary. However, it is advantageous to determine the peakedness measure across a number of short, overlapping intervals within a long frame rather than calculating the peakedness measure across the long frame.
  • the unit 314 calculates the spectral peakedness measure. This type of peakedness is based on the spectral-domain representation X(k) of the input signal for a frequency interval [0, ..., F-1] and is expressed, for example, as:
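  • analogously, a possible form (again an assumption based on the normalized fourth moment) is

    P_s = \frac{\frac{1}{F}\sum_{k=0}^{F-1} X^{4}(k)}{\left( \frac{1}{F}\sum_{k=0}^{F-1} X^{2}(k) \right)^{2}}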
  • Said spectral-domain representation is real valued and comprises e.g. MDCT coefficients or absolute values of the complex amplitudes resulting from a Discrete Fourier Transform.
  • the temporal and spectral peakedness measures obtained from units 312 and 314 are fed into a unit 315, which combines these two measures to make a decision about the value of the predetermined frequency.
  • F_s = max(F_min, F_sp), with F_min a minimum frequency and F_sp the frequency derived from the spectral peakedness measure.
  • This frequency indicates the largest frequency region, starting from zero, up to the frequency that has a spectral peakedness larger than 1.65 and which is at least 500 Hz wide.
  • This frequency region is consequently encoded with the frequency domain coder 330.
  • Said frequency domain encoder is efficient for encoding signals with high spectral peakedness and also provides a good quality for spectrally peaked signals.
  • the temporal peakedness measure is incorporated to account for the fact that spectrally peaked signals can have high temporal peakedness.
  • although a frequency domain coder 330 can be efficient for encoding such signals, the quality associated with coding of such signals is lower because the perceptual system is highly sensitive to modifications of the temporal structure of the signal.
  • This final predetermined frequency F_c is used as the output of the control unit 310 and is used to control the frequency partitioning between the frequency and time domain encoders.
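  • A hypothetical sketch of such a decision rule; the 500 Hz lower bound and the 1.65 spectral-peakedness threshold come from the description above, while the temporal-peakedness threshold and the fallback behaviour are assumptions.

```python
def predetermined_frequency(spectral_peak, band_upper_freqs, temporal_peak,
                            f_min=500.0, s_thresh=1.65, t_thresh=3.5):
    """spectral_peak[i] is the spectral peakedness of the band ending at band_upper_freqs[i] (Hz)."""
    f_s = 0.0
    for p, f in zip(spectral_peak, band_upper_freqs):   # widest region starting at 0 Hz
        if p > s_thresh:
            f_s = f
        else:
            break
    f_c = max(f_min, f_s)                  # enforce a minimum width for the region
    if temporal_peak > t_thresh:           # pronounced temporal structure:
        f_c = f_min                        # leave more of the signal to the time domain coder
    return f_c
```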
  • the remaining part of the input signal to be encoded using the time domain encoding comprises frequencies above the predetermined frequency 302, said predetermined frequency obtained from the control unit 310.
  • the frequency range of the input signal to be encoded is strictly divided into two non-overlapping ranges.
  • the frequency domain encoding encodes the lower frequency range of the input signal, while the higher frequency range above the predetermined frequency is encoded by the time domain encoding.
  • Fig. 7 schematically shows an encoder 400 wherein the remaining part of the input signal to be encoded using the time domain encoding comprises a difference between the input signal 301 and the signal corresponding to the first encoded signal 303 as obtained from the frequency domain decoder 350.
  • the structure of the encoder 400 is cascaded.
  • the time domain encoder functions as a residual encoder.
  • the frequency domain encoder performs encoding on the low-frequency range of the input signal, whereby only selected frequencies that are below the predetermined frequency are encoded.
  • the part of the low-frequency range of the input signal that is not encoded by the frequency domain encoder is encoded by the time domain encoder. This is achieved by providing a residual input signal, which is obtained by subtracting 410 from the input signal 301 the signal corresponding to the first encoded signal 303, as obtained from the frequency domain decoder 350 decoding said first encoded signal 303.
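  • A sketch of this cascaded structure, reusing the hypothetical split/encode/decode stubs from the earlier sketches (assumptions, not the actual coders of Fig. 7):

```python
def encode_cascaded(x, fs, f_c):
    low, _ = split_at_frequency(x, fs, f_c)           # frequencies below the predetermined frequency
    first = frequency_domain_encode(low)              # first encoded signal 303
    reconstructed = frequency_domain_decode(first)    # local frequency domain decoder 350
    residual = x - reconstructed                      # subtraction 410: residual input signal
    second = time_domain_encode(residual)             # the time domain encoder acts as a residual encoder
    return {"f_c": f_c, "first": first, "second": second}
```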
  • Fig. 8A schematically shows a parallel encoder 300 wherein the input signal 301 is adaptively pre-processed by a linear prediction analysis filter 360 before it is fed into the frequency domain encoder 330 and the time domain encoder 320 as signal 306.
  • the linear prediction analysis filter parameters 307 are provided to the merger 340 in order to be incorporated in the output data stream 305. Availability of the linear prediction analysis filter parameters 307 in the output data stream allows applying the linear prediction synthesis filter during decoding. This improves the quality of the reconstructed signal.
  • the linear prediction flattens the spectral envelope of the input signal. This means that the spectral character of the input to the time- and frequency-domain encoding is normalized. This a priori knowledge allows simplified encoding and reduces the effects of smearing due to windowing. The quality of the encoded signal is thus improved.
  • the linear prediction analysis filter 360 has an impact on the complexity of the encoder.
  • the time domain encoder 320 is substantially simplified, as the whitening filter used in this encoder 320 can be totally or largely removed.
  • the linear prediction analysis filter could be configured to be controlled by the control unit 310, which could also have an impact on the complexity of the encoder, as components having redundant functionality could be eliminated.
  • Fig. 8B schematically shows a cascaded encoder 400 wherein the input signal 301 is adaptively pre-processed by a linear prediction analysis filter 360 before it is fed into the frequency domain encoder 330 and the time domain encoder 320 as signal 306.
  • the linear prediction analysis filter parameters 307 are provided to the merger 340 in order to be incorporated in the output data stream 305. Availability of the linear prediction analysis filter parameters 307 in the output data stream allows applying the linear prediction synthesis filter during decoding. This improves the quality of the reconstructed signal.
  • the advantages of the encoder presented in Fig. 8A also apply to the cascaded encoder with the input signal being adaptively pre-processed by a linear prediction filter.
  • Fig. 9 schematically shows an example of a decoder 500 comprising a decomposer 510, a frequency domain decoder 530, a time domain decoder 520, and an adder 540.
  • the decoder 500 decodes an input data stream 305 into an output signal 506.
  • the input data stream for the decoder 500 and the output data stream for encoders 300 and 400 have the same reference number to indicate that the format of the data stream fed into the decoder 500 is compliant with the data stream format as generated by the encoders 300 or 400.
  • the input data stream 305 is first provided to the decomposer 510, which decomposes the input data stream 305 into a first encoded signal 501, a second encoded signal 502, and a predetermined frequency 505.
  • the frequency domain decoder 530 decodes the first encoded signal 501 to produce a first decoded signal 503.
  • the first decoded signal 503 comprises reconstructed frequencies of the first encoded signal 501 that are below the predetermined frequency 505.
  • the time domain decoder 520 decodes the second encoded signal 502 to produce a second decoded signal 504.
  • the second decoded signal 504 comprises reconstructed frequencies pertaining to the remaining part of the output signal 506.
  • the adder 540 subsequently adds the first decoded signal 503 and the second decoded signal 504 to produce the output signal 506.
  • the output signal 506 could be post-processed by means of a linear prediction synthesis filter in order to improve the quality of the signal reconstruction.
  • the structure of the decoder discussed above is suitable for decoding input data streams obtained using both parallel and cascaded encoders.
  • Said encoder and decoder in accordance with the embodiments of the invention can be used in a transmission system for communication of an audio signal.
  • Such a transmission system comprises a transmitter, which is coupled to a receiver through a network.
  • the network could be e.g. the Internet.
  • the transmitter is for example a signal recording device and the receiver is for example a signal player device.
  • Said recording device comprises an encoder according to the invention.
  • Said signal player device comprises a decoder according to the invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to an encoding method for encoding an input signal comprising at least one of audio or speech into an output data stream, which comprises the following steps. The first step comprises encoding at least a part of the input signal using a frequency domain encoding to produce a first encoded signal. Said frequency domain encoding comprises encoding only frequencies from the part that are below a predetermined frequency. The second step comprises encoding at least some of a remaining part of the input signal using a time domain encoding to produce a second encoded signal. The third step comprises combining the first encoded signal and the second encoded signal to produce the output data stream.
PCT/IB2008/055250 2007-12-18 2008-12-12 Encoding and decoding of an audio or speech signal WO2009081315A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP07123512.1 2007-12-18
EP07123512 2007-12-18

Publications (1)

Publication Number Publication Date
WO2009081315A1 (fr) 2009-07-02

Family

ID=40466862

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2008/055250 WO2009081315A1 (fr) 2007-12-18 2008-12-12 Encoding and decoding of an audio or speech signal

Country Status (1)

Country Link
WO (1) WO2009081315A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1852849A1 (fr) * 2006-05-05 2007-11-07 Deutsche Thomson-Brandt Gmbh Procédé et appareil d'encodage sans perte d'un signal source utilisant un courant de données encodées avec perte et un courant d'extension de données encodées sans perte

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1852849A1 (fr) * 2006-05-05 2007-11-07 Deutsche Thomson-Brandt Gmbh Procédé et appareil d'encodage sans perte d'un signal source utilisant un courant de données encodées avec perte et un courant d'extension de données encodées sans perte

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NAKHAI, M. R. ET AL.: "Split Band CELP (SB-CELP) Speech Coder", Proc. 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Phoenix, AZ, 15-19 March 1999, pp. 461-464, XP000900157, ISBN 978-0-7803-5042-7 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010003545A1 (fr) * 2008-07-11 2010-01-14 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e. V. Apparatus and method for decoding an encoded audio signal
US8275626B2 (en) 2008-07-11 2012-09-25 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Apparatus and a method for decoding an encoded audio signal
US8296159B2 (en) 2008-07-11 2012-10-23 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Apparatus and a method for calculating a number of spectral envelopes
US8612214B2 (en) 2008-07-11 2013-12-17 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Apparatus and a method for generating bandwidth extension output data
WO2011062536A1 (fr) * 2009-11-19 2011-05-26 Telefonaktiebolaget Lm Ericsson (Publ) Improved excitation signal bandwidth extension
US8856011B2 (en) 2009-11-19 2014-10-07 Telefonaktiebolaget L M Ericsson (Publ) Excitation signal bandwidth extension
CN115381467A (zh) * 2022-10-31 2022-11-25 浙江浙大西投脑机智能科技有限公司 Attention-mechanism-based time-frequency information dynamic fusion decoding method and device
CN115381467B (zh) * 2022-10-31 2023-03-10 浙江浙大西投脑机智能科技有限公司 Attention-mechanism-based time-frequency information dynamic fusion decoding method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08864583

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08864583

Country of ref document: EP

Kind code of ref document: A1