WO2006048824A1 - Efficient audio coding using signal properties - Google Patents

Efficient audio coding using signal properties

Info

Publication number
WO2006048824A1
Authority
WO
WIPO (PCT)
Prior art keywords
encoding
audio signal
properties
audio
oet
Prior art date
Application number
PCT/IB2005/053570
Other languages
French (fr)
Inventor
Tor J. F. Norden
Sören V. ANDERSEN
Sören H. JENSEN
Willem B. Kleijn
Nicolle H. Van Schijndel
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V.
Priority to EP05797846A (EP1815463A1)
Priority to US11/718,242 (US20090063158A1)
Priority to JP2007539679A (JP2008519308A)
Publication of WO2006048824A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/18 Vocoders using multiple modes
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Definitions

  • the invention relates to high efficiency, high quality audio signal coding. More specifically, the invention relates to the class of audio codecs which are adaptive to an input signal, i.e. having a number of encoding settings to be optimised so that the encoded signal is optimal in terms of a rate-distortion criterion.
  • the invention provides an audio encoder and a method of optimising audio encoder settings.
  • a crucial problem within encoding is to find the most efficient representation for each input signal. Since audio signals can exhibit a wide range of characteristics and, for different signal characteristics, different encoding methods are most efficient, it is desirable to use flexible codecs, e.g. codecs that combine different encoding methods. For example, audio signals are split and encoded as a sinusoidal part and a residual. Usually, tonal signals are coded with a specific coding method aimed at signals made up out of sinusoids and the residual signal is encoded with a waveform or noise encoder. Consequently, within such codecs it has to be decided which settings (or which encoding template) to use, e.g. which part of the signal to encode by which encoding method.
  • Such decision can be based on the full input signal, i.e. the input signal itself, and after trying many encoding possibilities, calculating for each possibility the resulting (perceptual) distortion.
  • the decision about encoding settings becomes a problem regarding complexity.
  • Patent application US 2004/0006644 describes a method of transcoding an input signal. Different transcoding methods can be selected depending on the input signal to be transcoded. In US 2004/0006644 it is proposed to select between different methods based on prior established properties of the input signal to be transcoded. However, US 2004/0006644 does not disclose any method for optimising encoder settings.
  • it is an object of the present invention to provide an audio encoder and an audio encoding method capable of low-complexity optimization of an encoding template while still providing an encoded signal which is efficient in terms of a rate-distortion criterion.
  • the invention provides an audio encoder adapted to encode an audio signal according to an encoding template, the audio encoder comprising: optimizing means adapted to generate an optimized encoding template based on a predetermined set of properties of the audio signal, the optimized encoding template being optimized with respect to a predetermined encoding efficiency criterion, and encoding means adapted to generate an encoded audio signal in accordance with the optimized encoding template.
  • by 'encoding template' is understood the set of parameters, i.e. settings, that has to be selected for a specific encoder.
  • by 'optimized encoding template' is to be construed an encoding template wherein some or all parameters are selected or modified in response to the predetermined set of properties of the audio signal so as to result in an encoded output signal which is closer to optimal in terms of the predetermined encoding efficiency criterion.
  • by 'predetermined set of properties of the audio signal' is understood a parametric description of the audio signal comprising one or more parameters descriptive of signal properties of the audio signal.
  • the predetermined set of properties of the audio signal may e.g. be in the form of a property vector with scalar values representing each parameter.
  • the audio encoder is capable of optimizing the encoding template to be used for the encoding process by using prior knowledge of relevant properties of the audio signal to be encoded.
  • the audio encoder estimates a rate and/or distortion measure based on the predetermined set of properties of the audio signal and hereby provides an optimized encoding template without actually encoding the audio signal.
  • decisions regarding optimal encoder settings can be made without the need for trying a large number of possible settings and monitoring the resulting encoded output signal with respect to rate and distortion before a final decision on an optimal encoding template can be made.
  • the audio encoder may comprise analysis means adapted to analyze the audio signal and generate the set of properties of the audio signal in response thereto. However, the set of properties of the audio signal may be established outside the audio encoder. The audio encoder is then adapted to receive as input the audio signal together with the predetermined set of properties of the audio signal.
  • the optimizing means comprises means adapted to predict a perceptual distortion associated with the encoding template based on the predetermined set of properties of the audio signal.
  • by 'distortion associated with the encoding template' is understood the difference between the encoded audio signal and the audio signal itself that results from encoding the audio signal according to the encoding template.
  • by 'perceptual distortion' is understood a measure of distortion relevant with respect to what is perceived by the human auditory system, i.e. a measure of distortion that reflects a perceived sound quality.
  • the perceptual distortion measure is based on a perceptual model, such as a representation of the human masking curve etc.
  • the optimizing means comprises means adapted to predict a bit rate associated with the encoding template based on the predetermined set of properties of the audio signal.
  • the optimizing means is adapted to predict both a perceptual distortion and a bit rate associated with the encoding template based on the predetermined set of properties of the audio signal.
  • the encoder is capable of optimizing the encoding template according to a criterion being the best sound quality at a given maximum target bit rate or the lowest possible bit rate at a predetermined minimum sound quality in terms of perceptual distortion.
  • the set of properties of the audio signal comprises at least one property selected from the group consisting of: tonality, noisiness, harmonicity, stationarity, linear prediction gain, long-term prediction gain, spectral flatness, low-frequency spectral flatness, high-frequency spectral flatness, zero crossing rate, loudness, voicing ratio, spectral centroid, spectral bandwidth, a Mel cepstrum, frame energy, spectral flatness for ERB bands 1-10, spectral flatness for ERB bands 10-20, spectral flatness for ERB bands 20-30, and spectral flatness for ERB bands 30-37.
  • the predetermined set of properties of the audio signal comprises a property vector with scalars representing one or more of the mentioned parameters.
  • the predetermined set of properties of the audio signal comprise perceptually relevant properties, i.e. properties that are relevant with respect to what is perceived by the human auditory system.
  • the predetermined set of properties of the audio signal may comprise properties that can be determined by standard definitions known in the art.
  • the set of audio signal properties is specifically designed to take into account relevant properties for a specific encoder in question.
  • E.g. tonality and noisiness parameters may be included in case of a combined encoder having a sinusoidal encoder part and a noise encoder part.
  • the bit rate distribution task becomes simple, and the distribution is easily determined from the tonality and noisiness parameters.
  • a very simple decision criterion may be to select the sinusoidal encoder part in case the tonality parameter exceeds a certain value, otherwise the noise encoder part is selected.
  • the audio encoder is adapted to optimize the encoding template for each segment of the audio signal.
  • the encoder is thereby able to track rapid changes in the audio signal, such as transients, and adapt its encoding template accordingly.
  • the optimizing means may be adapted to optimize a segmentation of the audio signal based on the set of properties of the audio signal. Apart from optimizing the encoding template, adaptive segmentation has proven to improve encoding efficiency. With an up-front adaptive segmentation based on signal properties of the audio signal, such adaptive segmentation becomes even more efficient, since in prior art encoders adaptive segmentation adds an extra and complex optimizing task on top of optimizing the encoding template.
  • the optimizing means may be adapted to select the optimized encoding template from a set of predefined encoding templates. In order to further facilitate the encoding template optimizing process, it may be preferred that the predefined set of encoding templates covers the majority of the entire encoder parameter space. The optimizing task may then be to evaluate the predefined set of encoding parameters and select the best one in terms of the predetermined encoding efficiency criterion.
  • the encoding means comprises first and second sub-encoders, while the optimizing means is adapted to optimize first and second encoding templates for the first and second sub-encoders in response to the predetermined set of properties of the audio signal.
  • the audio encoder may comprise three, four, five, ten or even more separate sub-encoders and be adapted to optimize encoding templates for all sub-encoders based on the predetermined set of properties of the audio signal.
  • this embodiment covers combined codecs.
  • the invention provides a method of encoding an audio signal, the method comprising the steps of: generating an optimized encoding template based on a predetermined set of properties of the audio signal, the optimized encoding template being optimized with respect to a predetermined encoding efficiency criterion, and generating an encoded audio signal in accordance with the optimized encoding template.
  • the invention provides a method of optimizing an encoding template of an audio encoder adapted to encode an audio signal, the method comprising the steps of: receiving a predetermined set of properties of the audio signal, and optimizing the encoding template with respect to a predetermined encoding efficiency criterion, based on the predetermined set of properties of the audio signal.
  • Optimizing the encoding template for the encoder based on the predetermined set of properties of the audio signal makes the optimizing considerably less complex than prior art methods of optimizing encoding templates.
  • prior art methods of optimizing encoding efficiency are based on necessary bit rate and a resulting distortion obtained for an actually encoded audio signal.
  • prior art methods involve the encoding process.
  • with an optimizing method based on a predetermined set of properties of the audio signal, the encoding process in the optimizing method is eliminated. This is especially advantageous in encoders with a large number of settings to be optimized. Instead, the optimizing may be based on a prediction of a perceptual distortion measure and a prediction of a bit rate for a given encoding template.
  • prediction accuracy can be improved by carefully considering e.g. which data to include in the predetermined set of properties of the audio signal and establishing a precise model of the encoder(s) in question.
  • prior art methods may provide poor results, as it may not be possible to actually test the entire parameter space but only to cover it very coarsely.
  • predictions may prove to be fast enough to cover the entire parameter space and thus end up with an encoding template closer to the theoretical optimum, given the computation power available.
  • the method according to the third aspect may comprise an initial step of analyzing the audio signal and generating the set of predetermined properties of the audio signal in accordance therewith.
  • the optimizing step comprises predicting a perceptual distortion measure (see the above definitions).
  • the optimizing step comprises predicting a bit rate.
  • the optimizing step comprises predicting both a perceptual distortion and a bit rate so as to enable an optimization of the encoding template according to a criterion being the best sound quality at a given maximum target bit rate or the lowest possible bit rate at a predetermined minimum sound quality in terms of perceptual distortion.
  • the optimizing method is performed for each segment of the audio signal.
  • the optimizing method comprises optimizing segmentation of the audio signal based on the predetermined set of properties of the audio signal.
  • the invention provides a device comprising an audio encoder according to the first aspect.
  • Such a device is preferably an audio device such as a solid state audio device, a CD player, a CD recorder, a DVD player, a DVD recorder, a harddisk recorder, a mobile communication device, a (portable) computer, etc.
  • the device may also be devices other than audio devices.
  • the invention provides a computer readable program code adapted to encode an audio signal according to the method of the second aspect.
  • the invention provides a computer readable program code adapted to optimize an encoding template according to the method of the third aspect.
  • the computer readable program code according to the fifth and sixth aspects may comprise software algorithms adapted for a signal processor, personal computers etc. It may be present on a portable medium such as a disk or memory card or memory stick, or it may be present in a ROM chip or otherwise stored in a device.
  • Fig. 1 illustrates a prior art encoder where encoding settings are either fixed or iteratively adjusted based on a resulting distortion of the encoded signal
  • Fig. 2 illustrates an encoder according to the invention, where a decision of encoder settings is based on a prior analysis of an input signal
  • Fig. 3 illustrates a preferred Gaussian mixture based minimum mean square error (MMSE) estimator for estimating encoding distortion
  • Fig. 4 illustrates a prior art combined encoder where bit rate distribution between two sub encoders is decided upon by evaluating distortion of the encoded signal
  • Fig. 5 illustrates a combined encoder according to the invention, where bit rate distribution between two sub encoders is decided upon based on properties of the input signal
  • Fig. 6 illustrates an encoder according to the invention, where an adaptive segmentation of the input signal is decided upon based on properties of the input signal.
  • Fig. 1 illustrates a prior art encoder ENC that receives an input signal IN and generates an encoded output signal OUT in response thereto.
  • in the encoder ENC, the encoder settings, or encoding template, are either fixed or determined by an optimising algorithm that involves encoding the input signal.
  • Different encoding templates are tried, each involving an encoding of the input audio signal IN; for each encoding template, e.g. the associated distortion and bit rate are monitored, and finally the most efficient encoding template is selected and used to generate the output signal OUT.
  • Fig. 2 illustrates the principle of the invention by means of a preferred audio encoder embodiment.
  • An input audio signal IN is received and analysed by signal analysing means AN.
  • the analysing means AN generates in response a property vector PV comprising a set of properties of the audio signal IN.
  • This property vector PV is then received by an encoding template optimising unit ET OPT that generates an optimised encoding template OET based on the received property vector PV.
  • the optimised encoding template OET and the input audio signal IN are then used by an encoder means ENC to generate an encoded output signal OUT being an encoded version of the input audio signal IN.
  • in the audio encoder of Fig. 2, the property vector PV and a mathematical model of the different encoding configurations, for example of their rate-distortion performance, are used to generate the optimised encoding template OET. It is then not necessary to try all possible encoding templates, because the property vector PV already indicates the input-type-dependent performance of the encoding templates.
  • the audio encoder according to the invention is capable of optimising an encoding template for the encoder means without having to encode the input audio signal IN but is capable of deciding upon an optimal encoding template using properties of the input audio signal IN only.
  • the analysing means AN shown in the diagram of Fig. 2 is optional.
  • an audio encoder according to the invention may be adapted to receive as inputs the input audio signal IN and a property vector PV.
  • a disadvantage of the use of a property vector PV may be that encoding becomes (slightly) sub-optimal.
  • the ad-hoc methods currently in use in audio coding are most likely much further from an optimal solution.
  • a predetermined set of properties of an input audio signal can be used in several ways, which can be used simultaneously. They will be further described in the following. For simplicity reasons a predetermined set of properties of an input audio signal is denoted a property vector in the following.
  • a property vector is used to estimate distortions, such as perceptual distortions, for different encoding templates, e.g. for the combination of different encoding methods or for different settings within one encoding method. This has two advantages in terms of complexity: 1) no actual encoding is necessary, 2) there is no need to calculate the (perceptual) distortion. In other words, the property vector is used to obtain (perceptual) distortions without actual encodings and calculations of the corresponding distortion.
  • a property vector is used to determine directly which part of an input signal to code by which encoding method in a hybrid encoder, i.e. in an encoder comprising a combination of several encoding methods or sub-encoders. This goes one step further than the previous item: in this case, the property vector does not only indicate the input-type-dependent performance of the coding methods, but also indicates which one(s) to use.
  • the property vector indicates that the signal contains a prominent sinusoid and thus, it is sufficient to check which encoding method can efficiently encode sinusoids, such as a sinusoidal encoder, and then start with that one.
  • the property vector can also be used to estimate potential interactions between the coding methods. Knowledge about these interactions is also important for efficient configuration of the codec.
  • a property vector is used to estimate an optimal time-variant adaptive segmentation of codecs.
  • the adaptive segmentation can be set up-front based on the time-varying characteristics of the input signal, which leads to lower complexity compared to methods that explore the effect of several segmentation possibilities.
  • the first embodiment is a property vector based scheme for instantaneous distortion estimation.
  • the framework is based on a property vector extracted from the frame to be encoded, from which the distortion estimation is to be performed.
  • the task of estimating the incurred coding distortion, $\theta$, for a coder $Q(\cdot)$ is addressed.
  • the incurred distortion is expressed as
  • the estimation is separated into a property extraction, $f(\cdot)$, and an estimation, $g(\cdot)$.
  • the random input vector $X$ is processed into a dimension-reduced random vector $P$, from which an estimate, $\hat{\theta}$, of the coding distortion, $\theta$, is to be found.
  • the aim of the scheme is to perform an unbiased estimate, and to minimise the estimation error variance,
  • the minimum mean square error (MMSE) estimator for this task, i.e., the one minimising the estimation error variance, is the conditional mean estimator,
  • Fig. 3 illustrates the chosen implementation using a model-based approach as described in J. Lindblom, J. Samuelsson, and P. Hedelin, "Model based spectrum prediction," in Proc. IEEE Workshop Speech Coding, (Delavan, WI, USA), 2000, pp. 117-119.
  • T O-L indicates that the joint pdf, $f_{\Theta,P}(\theta, p)$, is trained off-line.
  • this estimator calculates a weighted sum of conditional means
  • the complexity reduction obtained by distortion estimation instead of encoding and distortion calculation depends on 3 factors: the complexity of the distortion estimation using a property vector, the complexity of the encoding method, and the complexity of distortion calculation.
  • the complexity of the distortion estimation obviously depends on the model that is used. For the embodiment presented above, assuming each RD point is estimated independently, the complexity can be stated as $N_{\theta} \cdot N_{\mathrm{mixt}} \cdot (C_{\mathrm{product}} + C_{\mathrm{pdf}})$, in which $N_{\theta}$ is the number of RD points, $N_{\mathrm{mixt}}$ is the number of mixtures, $C_{\mathrm{product}}$ is the complexity of the matrix vector product, and $C_{\mathrm{pdf}}$ is the complexity of the Gaussian pdf evaluation.
  • the matrix vector product has the 'dimension' of the employed property vector, but the matrix is symmetric and the complexity can thus be reduced to approximately half of that.
  • the complexity of the encoding method obviously depends on the method that is used and widely varies from codec to codec. Nevertheless, this complexity is expected to be higher than that of the distortion estimation.
  • the implemented estimation scheme has been evaluated for a Code-Excited Linear Prediction (CELP) like encoder, $Q(\cdot)$, using the incurred Signal to Noise Ratio (SNR) as the distortion to be estimated, $\theta$. It has been tested for six different property vectors: the 10th order linear prediction gain ($G_{LPC}$), the long-term prediction gain ($G_{LTP}$), spectral flatness (G ), low-frequency spectral flatness ($G_{low}$), high-frequency spectral flatness ($G_{high}$), and the combination of LPC and LTP gain ($G_{LPC}$ $G_{LTP}$). All estimators were based on
  • the property vector scheme has also been evaluated for a sinusoidal encoder, using 30 sinusoids per frame.
  • the encoder is based on psycho-acoustical matching pursuit as found in R. Heusdens and S. van de Par, "Rate-distortion optimal sinusoidal modeling of audio and speech using psychoacoustical matching pursuits," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Proc. (Orlando, FL, USA), 2002, vol. 2, pp. 1809-1812, using a perceptual spectral distortion measure as found in S. van de Par, A. Kohlrausch, G. Charestan, and R. Heusdens, "A new psychoacoustical masking model for audio coding applications," in Proc.
  • the hybrid encoder of the embodiment comprises two encoding methods: a sinusoidal encoder followed by a transform encoder.
  • the sinusoidal encoder is similar to the one described in connection with the first embodiment.
  • the transform encoder is based on an MDCT filter bank, such as found in R. D. Koilpillai and P. P. Vaidyanathan, "Cosine-modulated FIR filter banks satisfying perfect reconstruction," IEEE Trans. Signal Processing, vol. 40, no. 4, pp. 770-783, April 1992, and codes the residual of the sinusoidal encoder.
  • the key question is which signal component to encode by the sinusoidal encoder and which component by the transform encoder. In this embodiment, this question translates to which part of the available bit budget to spend by the sinusoidal encoder and which part by the transform encoder.
  • Fig. 4 illustrates a prior art approach.
  • An input signal IN is applied to a sinusoidal encoder SENC that delivers a residual signal res to a transform encoder TENC, which is thus intended to encode what the sinusoidal encoder SENC cannot encode.
  • a rate-distortion optimising unit R-D OPT distributes bit rates R-SE and R-TE to the two encoders SENC and TENC, respectively.
  • the optimising unit R-D OPT receives a resulting distortion D from the last encoder TENC.
  • Several different bit distributions R-SE, R-TE are tried and the optimal one is then chosen by the rate-distortion optimising unit R-D OPT, i.e. the one resulting in the lowest distortion D, and this distribution R-SE, R-TE is then used to generate an encoded output signal OUT.
  • the following bit distributions are tried: 100% to the sinusoidal encoder (SENC) and 0% to the transform encoder (TENC), 75% SENC and 25% TENC, 50% SENC and 50% TENC, 25% SENC and 75% TENC, 0% SENC and 100% TENC.
  • the signal is encoded using the different bit distributions, and from the resulting parameters a signal is synthesised to determine the corresponding perceptual distortion.
  • the perceptually-relevant distortion measure found in S. van de Par, A. Kohlrausch, G. Charestan and R. Heusdens, "A new psychoacoustical masking model for audio coding applications," in Proc. IEEE Int. Conf.
  • Fig. 5 illustrates an approach according to the invention.
  • a property vector PV as described above, is input to a bit rate optimising unit R-OPT that determines optimal bit distributions R-SE, R-TE to the two encoders SENC, TENC.
  • R-OPT bit rate optimising unit
  • an analysing unit AN analyses the input signal IN and generates the property vector PV in response thereto. Instead of trying different bit distributions, the optimal distribution R-SE, R-TE is estimated using this property vector PV.
  • Examples of the latter are: using more mixtures; limiting the possible outcomes of the estimator to between 0 and 100 % (the current estimator is based on Gaussians, and a Gaussian can take any value); and changing the task of the model (instead of estimating percentages between 0 and 100 %, one could classify frames into classes: 0, 25, 50, 75, 100 %). Alternatively, another model can be used instead of the Gaussian mixture model.
  • Fig. 6 illustrates the third embodiment, a property vector PV based scheme to determine an up-front optimised segmentation OSEG adapted to the input signal IN.
  • the decisions of a segmentation optimising unit SEG OPT with respect to the adaptive segmentation OSEG are based on the property vector PV and on a model of the different segmentations, for example of their rate-distortion performance.
  • the optimised segmentation OSEG is then applied to the encoder ENC together with the input signal IN, and an encoded output signal OUT can be generated. Then it is not necessary to encode all different segmentation possibilities, because the property vector PV already indicates the input-type-dependent performance of the segmentations.
  • the use of a property vector for up-front segmentation is similar to that of rate-distortion estimation.
  • the property vector can be used to estimate the rate-distortion performance of different segmentation possibilities, choosing the one with the best performance.
  • using a property vector for up-front adaptive time segmentation reduces computational complexity significantly compared to segmentation by means of full rate-distortion optimisation. Complexity is reduced by a factor about equal to the number of different segment lengths allowed (ignoring the extra complexity introduced by the property vector). For example, assume that in a sinusoidal encoder with adaptive segmentation 4 different segment lengths are allowed: 10.7, 16.0, 21.3 and 26.8 ms. Then complexity is reduced by a factor of 4 by up-front segmentation.
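
The Gaussian-mixture MMSE estimator of the first embodiment, which forms a weighted sum of per-mixture conditional means, can be sketched as follows. The sketch assumes a joint Gaussian mixture over the scalar coding distortion (called theta here) and the property vector p. The function names and the numeric model parameters are made up for illustration; in the scheme described above the joint pdf would be trained off-line.

```python
import numpy as np

def gaussian_pdf(p, mean, cov):
    """Multivariate Gaussian density N(p; mean, cov)."""
    d = p - mean
    k = len(mean)
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm)

def gmm_mmse_estimate(p, weights, means, covs):
    """Conditional mean E[theta | P = p] under a joint GMM over (theta, P).

    Index 0 of every mean/covariance refers to theta, the rest to P.
    """
    num, den = 0.0, 0.0
    for w, m, c in zip(weights, means, covs):
        m_t, m_p = m[0], m[1:]
        c_tp, c_pp = c[0, 1:], c[1:, 1:]
        # Conditional mean of theta given p for this mixture component.
        cond_mean = m_t + c_tp @ np.linalg.solve(c_pp, p - m_p)
        # Unnormalised weight: prior times marginal likelihood of p.
        resp = w * gaussian_pdf(p, m_p, c_pp)
        num += resp * cond_mean
        den += resp
    return num / den

# Hypothetical two-mixture model with a one-dimensional property vector.
weights = [0.5, 0.5]
means = [np.array([10.0, 0.0]), np.array([20.0, 5.0])]
covs = [np.array([[4.0, 1.0], [1.0, 1.0]])] * 2
theta_hat = gmm_mmse_estimate(np.array([0.0]), weights, means, covs)
print(theta_hat)
```

Each mixture contributes one matrix-vector product (the conditional mean) and one Gaussian pdf evaluation (the weight), which matches the per-mixture cost terms in the complexity expression given earlier.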

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

An audio encoder comprising optimizing means ET OPT adapted to generate an optimized encoding template OET based on properties PV of an input audio signal IN, such as in the form of a property vector. The optimized encoding template OET is optimized with respect to a predetermined encoding efficiency criterion. Encoding means ENC then generates an encoded audio signal OUT in accordance with the optimized encoding template OET. The audio encoder may comprise analyzing means AN adapted to generate the set of input signal properties PV based on the input signal IN. In a preferred embodiment the optimizing means ET OPT is adapted to estimate a resulting distortion associated with an encoding template. The optimizing means ET OPT may further be able to estimate a bit rate associated with an encoding template. In one embodiment the optimizing means ET OPT is adapted to optimize a bit rate distribution to a number of sub-encoders based on the input signal properties PV. In another embodiment, the optimizing means ET OPT is adapted to decide up-front on an adaptive segmentation based on the input signal properties PV. The encoders according to the invention are advantageous in that the complex process of performing a plurality of encodings prior to deciding upon an optimized encoding template OET can be avoided, since the optimal encoding template OET is found based on the input signal properties PV.
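
The signal flow summarised in this abstract (analysis AN, property vector PV, template optimisation ET OPT) can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names, the use of spectral flatness as a stand-in for a tonality measure, and the threshold of 0.1 are all assumptions made for the example, and the actual encoding step ENC is omitted.

```python
import numpy as np

def property_vector(frame):
    """Analysis step (AN): return [spectral flatness, zero-crossing rate]."""
    power = np.abs(np.fft.rfft(frame)) ** 2 + 1e-12  # small floor avoids log(0)
    flatness = np.exp(np.mean(np.log(power))) / np.mean(power)
    zcr = np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:]))
    return np.array([flatness, zcr])

def optimise_template(pv, flatness_threshold=0.1):
    """Optimising step (ET OPT): choose a template from the property vector only.

    The threshold is a hypothetical value chosen for this example.
    """
    # Low spectral flatness means a peaky, tonal spectrum.
    if pv[0] < flatness_threshold:
        return {"sub_encoder": "sinusoidal"}
    return {"sub_encoder": "noise"}

fs = 8000
t = np.arange(512) / fs
tone = np.sin(2 * np.pi * 440 * t)
noise = np.random.default_rng(0).standard_normal(512)

for name, frame in [("tone", tone), ("noise", noise)]:
    pv = property_vector(frame)
    print(name, optimise_template(pv))
```

The point of the sketch is that the sub-encoder decision is taken from the property vector alone; the frame is never encoded during the decision.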

Description

Efficient audio coding using signal properties
The invention relates to high efficiency, high quality audio signal coding. More specifically, the invention relates to the class of audio codecs which are adaptive to an input signal, i.e. having a number of encoding settings to be optimised so as to obtain an encoded signal that is optimal in terms of a rate-distortion criterion. The invention provides an audio encoder and a method of optimising audio encoder settings.
A crucial problem within encoding is to find the most efficient representation for each input signal. Since audio signals can exhibit a wide range of characteristics and, for different signal characteristics, different encoding methods are most efficient, it is desirable to use flexible codecs, e.g. codecs that combine different encoding methods. For example, audio signals are split and encoded as a sinusoidal part and a residual. Usually, tonal signals are coded with a specific coding method aimed at signals made up out of sinusoids and the residual signal is encoded with a waveform or noise encoder. Consequently, within such codecs it has to be decided which settings (or which encoding template) to use, e.g. which part of the signal to encode by which encoding method. Such a decision can be based on the full input signal, i.e. the input signal itself, by trying many encoding possibilities and calculating for each possibility the resulting (perceptual) distortion. However, with the emergence of flexible and adaptive codecs that combine many different encoding methods and therefore have a large number of possible settings, the decision about encoding settings becomes a problem regarding complexity.
Also in most codecs with only one coding method decisions have to be made, such as with respect to the encoder settings that may be different for different parts of the input signal. This is for example the case in codecs with adaptive time segmentation. Segmentation can be adapted by means of rate-distortion optimisation, but this increases complexity significantly. Another example can be found in parametric, sinusoidal coding. There it has to be decided how many sinusoids to allocate to a particular segment, the optimal number depending on the input signal. Also in transform or sub-band codecs decisions must be made with respect to the quantisation levels and scale factor bands (a group of frequency bands coded with the same quantisation levels). These decisions are based on the full input signal, considering the corresponding coding errors in the different frequency bands.
Patent application US 2004/0006644 describes a method of transcoding an input signal. Different transcoding methods can be selected depending on the input signal to be transcoded. In US 2004/0006644 it is proposed to select between different methods based on prior established properties of the input signal to be transcoded. However, US 2004/0006644 does not disclose any method for optimising encoder settings.
In conclusion, the state of the art does not satisfactorily answer how to determine the optimum encoder settings or which encoding method can best code which part of the input signal. Therefore, within the field of high quality audio coding there is a need for a method of efficiently optimising an encoding template (or encoder settings) so as to adapt the encoding to an input signal.
Thus, it may be seen as an object of the present invention to provide an audio encoder and an audio encoding method capable of providing low-complexity optimization of an encoding template while still providing an encoded signal which is efficient in terms of a rate-distortion criterion.
According to a first aspect the invention provides an audio encoder adapted to encode an audio signal according to an encoding template, the audio encoder comprising: optimizing means adapted to generate an optimized encoding template based on a predetermined set of properties of the audio signal, the optimized encoding template being optimized with respect to a predetermined encoding efficiency criterion, and encoding means adapted to generate an encoded audio signal in accordance with the optimized encoding template.
By the term 'encoding template' is understood the set of parameters, i.e. settings, that has to be selected for a specific encoder. By 'optimized encoding template' is to be construed an encoding template wherein some or all parameters are selected or modified in response to the predetermined set of properties of the audio signal so as to result in an encoded output signal which is closer to optimal in terms of the predetermined encoding efficiency criterion. By 'predetermined set of properties of the audio signal' is understood a parametric description of the audio signal comprising one or more parameters descriptive of signal properties of the audio signal. The predetermined set of properties of the audio signal may e.g. be in the form of a property vector with scalar values representing each parameter. By using a predetermined set of properties of the audio signal, e.g. by means of a property vector, the audio encoder is capable of optimizing the encoding template to be used for the encoding process by using prior knowledge of relevant properties of the audio signal to be encoded. Thus, preferably the audio encoder estimates a rate and/or distortion measure based on the predetermined set of properties of the audio signal and hereby provides an optimized encoding template without actually encoding the audio signal. In other words, using e.g. an input signal property vector, decisions regarding optimal encoder settings can be made without the need for trying a large number of possible settings and monitoring the resulting encoded output signal with respect to rate and distortion before a final decision on an optimal encoding template can be made.
This enables an encoder with a low complexity for encoding template optimizing compared with traditional encoders. This is especially advantageous for encoding schemes which have encoding templates comprising a large set of parameters to be optimized in order to achieve an optimum rate-distortion efficiency. An example is the class of encoders comprising two or more sub encoders and where at least one task is to decide about a bit rate distribution between the sub encoders in order to obtain an optimal rate-distortion efficiency. Although an exhaustive search among all possible encoding templates using the full input signal and a (perceptual) distortion measure would be optimal, this is probably inefficient and far too complex to be realisable with a limited amount of processing power available. It is to be understood that data representing the set of properties of the audio signal can be arranged in any convenient fashion, such as property vector or property matrix.
The audio encoder may comprise analysis means adapted to analyze the audio signal and generate the set of properties of the audio signal in response thereto. However, the set of properties of the audio signal may be established outside the audio encoder. The audio encoder is then adapted to receive as input the audio signal together with the predetermined set of properties of the audio signal.
Preferably, the optimizing means comprises means adapted to predict a perceptual distortion associated with the encoding template based on the predetermined set of properties of the audio signal. By 'distortion associated with the encoding template' is understood a resulting difference between the encoded audio signal and the audio signal itself by encoding the audio signal according to the encoding template. By 'perceptual distortion' is understood a measure of distortion relevant with respect to what is perceived by the human auditory system, i.e. a measure of distortion that reflects a perceived sound quality. Preferably, the perceptual distortion measure is based on a perceptual model, such as a representation of the human masking curve etc.
Preferably, the optimizing means comprises means adapted to predict a bit rate associated with the encoding template based on the predetermined set of properties of the audio signal.
Most preferably, the optimizing means is adapted to predict both a perceptual distortion and a bit rate associated with the encoding template based on the predetermined set of properties of the audio signal. Hereby the encoder is capable of optimizing the encoding template according to a criterion being the best sound quality at a given maximum target bit rate or the lowest possible bit rate at a predetermined minimum sound quality in terms of perceptual distortion.
Preferably the set of properties of the audio signal comprises at least one property selected from the group consisting of: tonality, noisiness, harmonicity, stationarity, linear prediction gain, long-term prediction gain, spectral flatness, low-frequency spectral flatness, high-frequency spectral flatness, zero crossing rate, loudness, voicing ratio, spectral centroid, spectral bandwidth, a Mel cepstrum, frame energy, spectral flatness for ERB bands 1-10, spectral flatness for ERB bands 10-20, spectral flatness for ERB bands 20-30, and spectral flatness for ERB bands 30-37. Preferably, the predetermined set of properties of the audio signal comprises a property vector with scalars representing one or more of the mentioned parameters. It is to be understood that several other types of parameters may be used, however. In principle any signal-describing parameter may be selected. However, preferably the predetermined set of properties of the audio signal comprises perceptually relevant properties, i.e. properties that are relevant with respect to what is perceived by the human auditory system. The predetermined set of properties of the audio signal may comprise properties that can be determined by standard definitions known in the art.
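As a purely illustrative sketch (not taken from the described encoder), three of the listed properties can be computed per frame using common textbook definitions. The function names, the naive DFT and the choice of these three properties are assumptions made for this example only.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def frame_energy(frame):
    """Mean squared sample value of the frame."""
    return sum(s * s for s in frame) / len(frame)

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 1 .. N/2 - 1 (O(N^2), adequate for a sketch)."""
    n = len(frame)
    spec = []
    for k in range(1, n // 2):
        re = sum(s * math.cos(2 * math.pi * k * t / n) for t, s in enumerate(frame))
        im = -sum(s * math.sin(2 * math.pi * k * t / n) for t, s in enumerate(frame))
        spec.append(re * re + im * im)
    return spec

def spectral_flatness(spec, eps=1e-12):
    """Geometric mean divided by arithmetic mean of the power spectrum."""
    log_mean = sum(math.log(v + eps) for v in spec) / len(spec)
    return math.exp(log_mean) / (sum(spec) / len(spec) + eps)

def property_vector(frame):
    """Hypothetical 3-dimensional property vector: [ZCR, frame energy, spectral flatness]."""
    return [zero_crossing_rate(frame), frame_energy(frame),
            spectral_flatness(power_spectrum(frame))]
```

With these definitions a pure sinusoid yields a spectral flatness close to 0, whereas a single impulse (flat spectrum) yields a value close to 1.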
It may be preferred that the set of audio signal properties is specifically designed to take into account relevant properties for a specific encoder in question. E.g. tonality and noisiness parameters may be included in case of a combined encoder having a sinusoidal encoder part and a noise encoder part. Hereby a bit rate distribution task becomes simple and is easily determined from the tonality and noisiness parameter. E.g. a very simple decision criterion may be to select the sinusoidal encoder part in case the tonality parameter exceeds a certain value, otherwise the noise encoder part is selected. However, it is to be understood that based on prior knowledge of the specific encoder in question it is possible to precisely predict encoding behavior even with only one, two or a few parameters to describe the audio signal.
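The very simple decision criterion mentioned above can be sketched as a single threshold test. The tonality scale (0 to 1) and the default threshold of 0.5 are hypothetical values chosen for illustration; the text does not specify them.

```python
def choose_sub_encoder(tonality, threshold=0.5):
    """Select a sub-encoder for a segment from a tonality score.

    Assumption: tonality is normalised to [0, 1]; the threshold is illustrative.
    """
    return "sinusoidal" if tonality > threshold else "noise"
```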
Preferably, the audio encoder is adapted to optimize the encoding template for each segment of the audio signal. Thus, the encoder is able to track rapid changes in the audio signal, such as transients, and adapt its encoding template accordingly.
The optimizing means may be adapted to optimize a segmentation of the audio signal based on the set of properties of the audio signal. Apart from optimizing the encoding template, adaptive segmentation has proven to improve encoding efficiency. With an up-front adaptive segmentation based on signal properties of the audio signal, such adaptive segmentation becomes even more efficient, since in prior art encoders adaptive segmentation adds an extra and complex optimizing task on top of optimizing the encoding template.
The optimizing means may be adapted to select the optimized encoding template from a set of predefined encoding templates. In order to further facilitate the encoding template optimizing process, it may be preferred that the predefined set of encoding templates covers the majority of the entire encoder parameter space. The optimizing task may then be to evaluate the predefined set of encoding parameters and select the best one in terms of the predetermined encoding efficiency criterion.
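One way to picture the selection from a predefined set is the sketch below, which picks the template with the lowest predicted distortion among those whose predicted bit rate fits a target. The template layout (`n_sinusoids`) and both toy predictors are hypothetical stand-ins; in the described encoder the predictions would come from a trained model of the encoder.

```python
def select_template(templates, predict_rate, predict_distortion, prop_vec, max_rate):
    """Return the feasible template with the lowest predicted distortion, or None."""
    feasible = [t for t in templates if predict_rate(t, prop_vec) <= max_rate]
    if not feasible:
        return None
    return min(feasible, key=lambda t: predict_distortion(t, prop_vec))

# Toy predictors, assumptions for illustration only.
def toy_rate(template, prop_vec):
    # Pretend each sinusoid costs 1.5 rate units.
    return 1.5 * template["n_sinusoids"]

def toy_distortion(template, prop_vec):
    # Pretend distortion falls faster with more sinusoids when tonality is high.
    tonality = prop_vec[0]
    return 1.0 / (1.0 + tonality * template["n_sinusoids"])

templates = [{"n_sinusoids": n} for n in (10, 20, 30, 40)]
best = select_template(templates, toy_rate, toy_distortion, [0.8], max_rate=48)
```

No signal is encoded during the selection; only the predictors are evaluated, which is the source of the complexity saving discussed in the text.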
In a preferred embodiment the encoding means comprises first and second sub-encoders, while the optimizing means is adapted to optimize first and second encoding templates for the first and second sub-encoders in response to the predetermined set of properties of the audio signal. If preferred, the audio encoder may comprise three, four, five, ten or even more separate sub-encoders and be adapted to optimize encoding templates for all sub-encoders based on the predetermined set of properties of the audio signal. Thus, this embodiment covers combined codecs.
In a second aspect the invention provides a method of encoding an audio signal, the method comprising the steps of: generating an optimized encoding template based on a predetermined set of properties of the audio signal, the optimized encoding template being optimized with respect to a predetermined encoding efficiency criterion, and generating an encoded audio signal in accordance with the optimized encoding template.
The same explanation and preferred variants as described above for the first aspect of the invention apply for the second aspect as well.
In a third aspect the invention provides a method of optimizing an encoding template of an audio encoder adapted to encode an audio signal, the method comprising the steps of: receiving a predetermined set of properties of the audio signal, and optimizing the encoding template with respect to a predetermined encoding efficiency criterion, based on the predetermined set of properties of the audio signal.
Optimizing the encoding template for the encoder based on the predetermined set of properties of the audio signal, such as using a property vector, makes the optimizing considerably less complex than prior art methods of optimizing encoding templates. The reason is that prior art methods of optimizing encoding efficiency are based on the necessary bit rate and the resulting distortion obtained for an actually encoded audio signal. Thus, such prior art methods involve the encoding process. By an optimizing method based on a predetermined set of properties of the audio signal the encoding process in the optimizing method is eliminated. This is especially advantageous in encoders with a large number of settings to be optimized. Instead the optimizing may be based on a prediction of a perceptual distortion measure and a prediction of a bit rate for a given encoding template.
Although prediction is not as accurate as actually encoding a signal according to the encoding template, prediction accuracy can be improved by carefully considering e.g. which data to include in the predetermined set of properties of the audio signal and by establishing a precise model of the encoder(s) in question. For a complex set of combined encoders each having a large number of possible settings, prior art methods may provide poor results, as it may not be possible to actually test the entire parameter space but only to cover it very coarsely. In contrast, predictions may prove to be fast enough to cover the entire parameter space and thus end up with an encoding template closer to the theoretical optimum for a given available computation power.
The method according to the third aspect may comprise an initial step of analyzing the audio signal and generating the set of predetermined properties of the audio signal in accordance therewith.
Preferably, the optimizing step comprises predicting a perceptual distortion measure (see the above definitions).
Preferably, the optimizing step comprises predicting a bit rate. Preferably, the optimizing step comprises predicting both a perceptual distortion and a bit rate so as to enable an optimization of the encoding template according to a criterion being the best sound quality at a given maximum target bit rate or the lowest possible bit rate at a predetermined minimum sound quality in terms of perceptual distortion.
Preferably, the optimizing method is performed for each segment of the audio signal. Preferably, the optimizing method comprises optimizing segmentation of the audio signal based on the predetermined set of properties of the audio signal.
In a fourth aspect the invention provides a device comprising an audio encoder according to the first aspect. Such device is preferably an audio device such as a solid state audio device, a CD player, a CD recorder, a DVD player, a DVD recorder, a harddisk recorder, a mobile communication device, (portable) computers etc. However, the device may also be devices other than audio devices.
In a fifth aspect the invention provides a computer readable program code adapted to encode an audio signal according to the method of the second aspect.
In a sixth aspect the invention provides a computer readable program code adapted to optimize an encoding template according to the method of the third aspect.
The computer readable program code according to the fifth and sixth aspects may comprise software algorithms adapted for a signal processor, personal computers etc. It may be present on a portable medium such as a disk or memory card or memory stick, or it may be present in a ROM chip or in other way stored in a device.
In the following the invention is described in more detail with reference to the accompanying Figures, of which
Fig. 1 illustrates a prior art encoder where encoding settings are either fixed or iteratively adjusted based on a resulting distortion of the encoded signal,
Fig. 2 illustrates an encoder according to the invention, where a decision of encoder settings is based on a prior analysis of an input signal,
Fig. 3 illustrates a preferred Gaussian mixture based minimum mean square error (MMSE) estimator for estimating encoding distortion,
Fig. 4 illustrates a prior art combined encoder where bit rate distribution between two sub encoders is decided upon by evaluating distortion of the encoded signal,
Fig. 5 illustrates a combined encoder according to the invention, where bit rate distribution between two sub encoders is decided upon based on properties of the input signal, and
Fig. 6 illustrates an encoder according to the invention, where an adaptive segmentation of the input signal is decided upon based on properties of the input signal.
While the invention is susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and will be described in detail herein. It should be understood, however, that the invention is not intended to be limited to the particular forms disclosed. Rather, the invention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.
Fig. 1 illustrates a prior art encoder ENC that receives an input signal IN and generates an encoded output signal OUT in response thereto. In the prior art encoder ENC encoder settings or an encoding template is either fixed or based on an optimising algorithm involving an encoding of the input signal. Different encoding templates are tried, each involving an encoding of the input audio signal IN, and for each encoding template e.g. distortion and bit rate associated with each encoding template is monitored, and finally the most efficient encoding template is selected to be used to generate the output signal OUT.
Fig. 2 illustrates the principle of the invention by means of a preferred audio encoder embodiment. An input audio signal IN is received and analysed by signal analysing means AN. The analysing means AN generates in response a property vector PV comprising a set of properties of the audio signal IN. This property vector PV is then received by an encoding template optimising unit ET OPT that generates an optimised encoding template OET based on the received property vector PV. The optimised encoding template OET and the input audio signal IN are then used by an encoder means ENC to generate an encoded output signal OUT being an encoded version of the input audio signal IN.
Thus, in the audio encoder of Fig. 2 the property vector PV and a mathematical model of the different encoding configurations, for example of their rate-distortion performance, are used to generate the optimised encoding template OET. It is then not necessary to try all possible encoding templates, because the property vector PV already indicates the input-type-dependent performance of the encoding templates. In contrast to the prior art encoder of Fig. 1, the audio encoder according to the invention is capable of optimising an encoding template for the encoder means without having to encode the input audio signal IN; it decides upon an optimal encoding template using properties of the input audio signal IN only. It is to be understood that the analysing means AN shown in the diagram of Fig. 2 is optional. Thus, an audio encoder according to the invention may be adapted to receive as inputs the input audio signal IN and a property vector PV.
The application of a property vector PV is efficient and reduces complexity in the optimising process. A disadvantage of the use of a property vector PV may be that encoding becomes (slightly) sub-optimal. However, the ad-hoc methods currently in use in audio coding are most likely much further from an optimal solution.
The application of a predetermined set of properties of an input audio signal can be used in several ways, which can be used simultaneously. They will be further described in the following. For simplicity reasons a predetermined set of properties of an input audio signal is denoted a property vector in the following.
In a first embodiment, a property vector is used to estimate distortions, such as perceptual distortions, for different encoding templates, e.g. the combination of different encoding methods or different settings within one encoding method. This has two advantages in terms of complexity: 1) no actual encoding is necessary, 2) there is no need to calculate the (perceptual) distortion. In other words, the property vector is used to obtain (perceptual) distortions without actual encodings and calculations of the corresponding distortion.
In a second embodiment, a property vector is used to determine directly which part of an input signal to code by which encoding method in a hybrid encoder, i.e. in an encoder comprising a combination of several encoding methods or sub-encoders. This goes one step further than the previous item: in this case, the property vector does not only indicate the input-type-dependent performance of the coding methods, but also indicates which one(s) to use.
For example, if the input signal has a prominent sinusoid, it is not necessary to encode this with all encoding methods and choose the most efficient one. In contrast, the property vector indicates that the signal contains a prominent sinusoid and thus, it is sufficient to check which encoding method can efficiently encode sinusoids, such as a sinusoidal encoder, and then start with that one. Thus, looking at the property vector, it is immediately clear, without actually encoding, which encoding method can most efficiently encode (parts of) the input signal. The property vector can also be used to estimate potential interactions between the coding methods. Knowledge about these interactions is also important for efficient configuration of the codec.
In a third embodiment, a property vector is used to estimate an optimal time-variant adaptive segmentation of codecs. By means of a property vector the adaptive segmentation can be set up-front based on the time-varying characteristics of the input signal, which leads to lower complexity compared to methods that explore the effect of several segmentation possibilities.
The three mentioned embodiments will now be described in more detail. The first embodiment is a property vector based scheme for instantaneous distortion estimation. The framework is based on a property vector extracted from the frame to be encoded, from which the distortion estimation is to be performed. In more detail, the task of estimating the incurred coding distortion, $\theta$, for a coder $Q(\cdot)$ is addressed. For a given frame $x$, the incurred distortion is expressed as
$$\theta = \delta(x, \hat{x}) = \delta(x, Q(x)), \quad (1)$$
where $\delta(\cdot,\cdot)$ is an appropriate distortion measure.
The estimation is separated into a property extraction, $f(\cdot)$, and an estimation, $g(\cdot)$. The random input vector $X$ is processed into a dimension-reduced random vector $P$, from which an estimate, $\hat{\Theta}$, of the coding distortion, $\Theta$, is to be found. The aim of the scheme is to produce an unbiased estimate, and to minimise the estimation error variance,
$$\sigma_\varepsilon^2 = E\left[(\Theta - \hat{\Theta})^2\right]. \quad (2)$$
The performance of such a scheme is highly dependent on the choice of property vector. Thus, the basic task for the property extractor, $f(\cdot)$, is to extract properties, $P$, that contain sufficient information about $\Theta$ for a required estimator accuracy, $\sigma_\varepsilon^2$, i.e. sufficiently high mutual information, $I(\Theta; P)$, such as found in T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, 1991. The aim of the estimator, $g(\cdot)$, is to find an estimate, $\hat{\theta}$, of the incurred distortion, $\theta$, based on an observation of the property vector $P = p$. The minimum mean square error (MMSE) estimator for this task, i.e. the one minimising $\sigma_\varepsilon^2$, is the conditional mean estimator,
$$\hat{\theta}_{\mathrm{MMSE}} = E[\Theta \mid P = p] = \int \theta\, f_{\Theta|P}(\theta \mid P = p)\, d\theta. \quad (3)$$
Fig. 3 illustrates the chosen implementation using a model-based approach as described in J. Lindblom, J. Samuelsson, and P. Hedelin, "Model based spectrum prediction," in Proc. IEEE Workshop Speech Coding, (Delavan, WI, USA), 2000, pp. 117-119. In Fig. 3, T O-L indicates that the joint model pdf, $f^{(M)}_{\Theta,P}(\theta, p)$, is trained off-line.
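Employing a Gaussian mixture model (GMM) for the joint pdf,
$$f^{(M)}_{\Theta,P}(\theta, p) = \sum_{i=1}^{M} \rho_i\, \mathcal{N}\!\left(\begin{bmatrix}\theta \\ p\end{bmatrix};\, m_i,\, C_i\right),$$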
the MMSE estimate at each coding instant is approximated as
$$\hat{\theta} = g(p) = \int \theta\, f^{(M)}_{\Theta|P}(\theta \mid P = p)\, d\theta, \quad (4)$$
where $f^{(M)}_{\Theta|P}(\theta \mid P = p)$ is the conditional model pdf, which can be shown to be a mixture of Gaussian densities, and is easily derived from the joint model pdf, $f^{(M)}_{\Theta,P}(\theta, p)$. In practice, this estimator calculates a weighted sum of conditional means,
$$\hat{\theta} = \sum_{i=1}^{M} \rho_i'\, m_{i,\Theta|P=p}, \quad (5)$$
where $M$ is the number of mixture components, and $\{\rho_i'\}$ and $\{m_{i,\Theta|P=p}\}$ represent the weights and the means of the conditional model pdf, $f^{(M)}_{\Theta|P}(\theta \mid P = p)$, respectively. The estimator output will approach the true conditional mean, cf. Eq. (3), as the model pdf approaches the true pdf.
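The weighted sum of conditional means described above can be sketched in a few lines of Python. For brevity the sketch assumes a scalar property $p$, so that each mixture component is a two-dimensional Gaussian over $(\theta, p)$; the tuple layout and function names are assumptions of this example, and a real property vector would require the matrix form of the same expressions.

```python
import math

def gauss_pdf(x, mean, var):
    """Density of a scalar Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gmm_mmse_estimate(p, components):
    """Conditional-mean estimate of the distortion theta given a scalar property p.

    Each component is a tuple (weight, m_theta, m_p, c_theta_p, c_pp):
    mixture weight, means of theta and p, their covariance, and the variance of p.
    """
    # Conditional weights: proportional to weight_i * N(p; m_p, c_pp), then normalised.
    likelihoods = [w * gauss_pdf(p, m_p, c_pp) for (w, _, m_p, _, c_pp) in components]
    total = sum(likelihoods)
    cond_weights = [l / total for l in likelihoods]
    # Conditional mean of each component: m_theta + (c_theta_p / c_pp) * (p - m_p).
    cond_means = [m_t + c_tp / c_pp * (p - m_p)
                  for (_, m_t, m_p, c_tp, c_pp) in components]
    # Weighted sum of the conditional means.
    return sum(w * m for w, m in zip(cond_weights, cond_means))
```

In this sketch the mixture parameters are supplied by the caller; in the scheme of Fig. 3 they would be obtained by off-line training of the joint model pdf.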
The complexity reduction obtained by distortion estimation instead of encoding and distortion calculation depends on three factors: the complexity of the distortion estimation using a property vector, the complexity of the encoding method, and the complexity of distortion calculation. The complexity of the distortion estimation obviously depends on the model that is used. For the embodiment presented above, assuming each RD point is estimated independently, the complexity can be stated as $N_{RD} \cdot N_{mixt} \cdot (C_{product} + C_{pdf})$, in which $N_{RD}$ is the number of RD points, $N_{mixt}$ is the number of mixtures, $C_{product}$ is the complexity of the matrix vector product, and $C_{pdf}$ is the complexity of the Gaussian pdf evaluation. The matrix vector product has the 'dimension' of the employed property vector, but the matrix is symmetric and the complexity can thus be reduced to approximately half of that.
The complexity of the encoding method obviously depends on the method that is used and widely varies from codec to codec. Nevertheless, this complexity is expected to be higher than that of the distortion estimation.
The implemented estimation scheme has been evaluated for a Code-Excited Linear Prediction (CELP) like encoder, $Q(\cdot)$, using the incurred Signal to Noise Ratio (SNR) as the distortion to be estimated, $\Theta$. It has been tested for six different property vectors: the 10th order linear prediction gain (GLPC), the long-term prediction gain (GLTP), spectral flatness (G), low-frequency spectral flatness (Glow), high-frequency spectral flatness (Ghigh), and the combination of LPC and LTP gain (GLPCGLTP). All estimators were based on 32-mixture models, and the results were evaluated on the TIMIT speech database, using separate evaluation and training sets.
The results were that the estimation error variance decreased as the mutual information, $I(\Theta; P)$, of the employed property vector, $P$, increased. Thus, closeness to the true distortion increased with the mutual information, $I(\Theta; P)$, of the employed property vector. The results showed that a high accuracy estimation can be performed, given a property vector with sufficiently high mutual information, $I(\Theta; P)$. The results confirmed the feasibility of using a property vector to indicate the input-type-dependent performance of encoding configurations, thereby reducing complexity.
The property vector scheme has also been evaluated for a sinusoidal encoder, using 30 sinusoids per frame. The encoder is based on psycho-acoustical matching pursuit as found in R. Heusdens and S. van de Par, "Rate-distortion optimal sinusoidal modeling of audio and speech using psychoacoustical matching pursuits," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Proc., (Orlando, FL, USA), 2002, vol. 2, pp. 1809-1812, using a perceptual spectral distortion measure as found in S. van de Par, A. Kohlrausch, G. Charestan, and R. Heusdens, "A new psychoacoustical masking model for audio coding applications," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Proc., (Orlando, FL, USA), 2002, vol. 2, pp. 1805-1808, as the distortion to be estimated, $\theta$. It was tested for eight different property vectors: zero crossing rate (ZCR), loudness (L), voicing ratio (V), spectral centroid (SC), spectral bandwidth (BW), spectral flatness (SF), a 12th-order Mel cepstrum (MFCC), and a 4-dimensional property vector based on the combination L+SF+SC+BW. All estimators were based on 16-mixture models, and the results were evaluated on an audio database containing 900,000 frames of 35 ms, separated into an evaluation and a training set. Also for this implementation the results indicated that it is possible to estimate the distortion with a high accuracy, given a property vector with sufficiently high mutual information, $I(\Theta; P)$.
In the following the second embodiment will be described where a property vector is used to determine which part of an input signal to be encoded by which encoding method in a hybrid encoder.
The hybrid encoder of the embodiment comprises two encoding methods: a sinusoidal encoder followed by a transform encoder. The sinusoidal encoder is similar to the one described in connection with the first embodiment. The transform encoder is based on an MDCT filter bank, such as found in R. D. Koilpillai and P. P. Vaidyanathan, "Cosine-modulated FIR filter banks satisfying perfect reconstruction," IEEE Trans. Signal Processing, vol. 40, no. 4, pp. 770-783, April 1992, and codes the residual of the sinusoidal encoder. The key question is which signal component to encode by the sinusoidal encoder and which component by the transform encoder. In this embodiment, this question translates to which part of the available bit budget is to be spent by the sinusoidal encoder and which part by the transform encoder.
Fig. 4 illustrates a prior art approach. An input signal IN is applied to a sinusoidal encoder SENC that delivers a residual signal res to a transform encoder TENC that is thus intended to encode what the sinusoidal encoder SENC cannot encode. A rate-distortion optimising unit R-D OPT distributes bit rates R-SE and R-TE for the two encoders SENC, TENC, respectively. In response, the optimising unit R-D OPT receives a resulting distortion D from the last encoder TENC. Several different bit distributions R-SE, R-TE are tried and the optimal one is then chosen by the rate-distortion optimising unit R-D OPT, i.e. the one resulting in the lowest distortion D, and this distribution R-SE, R-TE is then used to generate an encoded output signal OUT.
In the chosen example the following bit distributions are tried: 100% to the sinusoidal encoder (SENC) and 0% to the transform encoder (TENC), 75% SENC and 25% TENC, 50% SENC and 50% TENC, 25% SENC and 75% TENC, 0% SENC and 100% TENC. The signal is encoded using the different bit distributions, and from the resulting parameters a signal is synthesised to determine the corresponding perceptual distortion. For this, the perceptually relevant distortion measure found in S. van de Par, A. Kohlrausch, G. Charestan and R. Heusdens, "A new psychoacoustical masking model for audio coding applications," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Proc., (Orlando, Florida, USA), 2002, vol. 2, pp. 1805-1808, is used, which utilises the spectral auditory masking properties of the input signal. The optimisation algorithm selects the bit distribution that results in the lowest perceptual distortion.
Fig. 5 illustrates an approach according to the invention. The difference from the prior art approach of Fig. 4 is that a property vector PV, as described above, is input to a bit rate optimising unit R-OPT that determines optimal bit distributions R-SE, R-TE for the two encoders SENC, TENC. In the embodiment shown, an analysing unit AN analyses the input signal IN and generates the property vector PV in response thereto. Instead of trying different bit distributions, the optimal distribution R-SE, R-TE is estimated using this property vector PV.
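The contrast between the two control flows can be sketched as follows. This is only a schematic under stated assumptions: CountingEncoder is a hypothetical stand-in whose "distortion" is a toy quadratic, and the analyse and estimate_share callables are placeholders for the analysing unit AN and the estimator in R-OPT. None of these names or behaviours come from the described encoders; the sketch only shows that the closed-loop search of Fig. 4 encodes once per candidate distribution, while the open-loop scheme of Fig. 5 encodes once.

```python
# Share of the bit budget given to SENC; the rest goes to TENC.
CANDIDATE_SHARES = (1.0, 0.75, 0.5, 0.25, 0.0)


class CountingEncoder:
    """Hypothetical stand-in for SENC + TENC that counts encoding passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, frame, senc_share):
        self.calls += 1
        # Toy distortion: zero when the share matches the frame's ideal share.
        return (senc_share - frame["ideal_share"]) ** 2


def closed_loop(frame, encode):
    """Fig. 4 style: encode with every candidate share, keep the best."""
    best_share, best_distortion = None, float("inf")
    for share in CANDIDATE_SHARES:
        distortion = encode(frame, share)
        if distortion < best_distortion:
            best_share, best_distortion = share, distortion
    return best_share, best_distortion


def open_loop(frame, analyse, estimate_share, encode):
    """Fig. 5 style: estimate the share from the property vector, encode once."""
    property_vector = analyse(frame)
    share = estimate_share(property_vector)
    return share, encode(frame, share)


if __name__ == "__main__":
    frame = {"ideal_share": 0.75}
    enc = CountingEncoder()
    print(closed_loop(frame, enc), "encoding passes:", enc.calls)
    enc = CountingEncoder()
    # Toy analyser and estimator that simply pass the ideal share through.
    print(open_loop(frame, lambda f: (f["ideal_share"],), lambda pv: pv[0], enc),
          "encoding passes:", enc.calls)
```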
To determine which properties are useful for this task, twelve property vectors have been examined: eight 1-dimensional vectors (zero crossing rate, loudness (L), voicing ratio, spectral centroid, spectral bandwidth (BW), spectral flatness, frame energy, LPC flatness), two 4-dimensional vectors (L+BW and SFERB: spectral flatness for ERB bands 1-10, 10-20, 20-30, 30-37), one 8-dimensional vector based on the combination of the two 4-dimensional property vectors, and one 12-dimensional vector (a 12th-order Mel cepstrum). A Gaussian mixture model, such as described above, is used to estimate the bit distributions. All estimators are based on 32-mixture models, which are trained using an audio database containing 6,000 frames of 43 ms. The best results are obtained with the multi-dimensional property vectors. Therefore the 4-dimensional property vector SFERB is used for the evaluation, which uses a different database than the one used for training.
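Two of the listed properties can be computed with a few lines of code. The sketch below uses the common textbook definitions: zero crossing rate as the fraction of adjacent sample pairs that change sign, and spectral flatness as the ratio of the geometric mean to the arithmetic mean of the power spectrum. The band-wise variant mimics the structure of the SFERB vector, but the band edges are given as plain bin indices chosen by the caller; the mapping from ERB bands to spectrum bins is not specified here and is therefore an assumption, as are all function names.

```python
import math


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)


def spectral_flatness(power_spectrum, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum (0..1)."""
    n = len(power_spectrum)
    # eps guards against log(0) and division by zero for empty bins.
    log_mean = sum(math.log(p + eps) for p in power_spectrum) / n
    arithmetic_mean = sum(power_spectrum) / n
    return math.exp(log_mean) / (arithmetic_mean + eps)


def banded_flatness(power_spectrum, band_edges):
    """Flatness per band; band_edges are caller-chosen bin indices (assumed)."""
    return [
        spectral_flatness(power_spectrum[lo:hi])
        for lo, hi in zip(band_edges, band_edges[1:])
    ]


if __name__ == "__main__":
    print(zero_crossing_rate([1.0, -1.0, 1.0, -1.0]))
    print(banded_flatness([1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0], [0, 4, 8]))
```

A flat (noise-like) band gives a value near 1, while a band dominated by a single peak gives a value near 0.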
A comparison of the two approaches of Figs. 4 and 5 has been performed. The resulting perceptual distortions have been determined per frame, using the distortion measure found in S. van de Par, A. Kohlrausch, G. Charestan and R. Heusdens, "A new psychoacoustical masking model for audio coding applications," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Proc., (Orlando, Florida, USA), 2002, vol. 2, pp. 1805-1808. The two approaches result in similar distortions, indicating the feasibility of using a property vector for determining bit distributions. However, the embodiment presented in Fig. 5 may be improved in several ways, for example by using better properties or by improving the Gaussian mixture model illustrated in Fig. 3. Examples of the latter are: using more mixtures; limiting the possible outcomes of the estimator to between 0 and 100 % (the current estimator is based on Gaussians, and a Gaussian can take any value); and changing the task of the model (instead of estimating percentages between 0 and 100 %, one could classify frames into the classes 0, 25, 50, 75 and 100 %). Alternatively, another model can be used instead of the Gaussian mixture model.
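Two of the suggested refinements can be approximated by simple post-processing of the raw estimator output, as sketched below. This is a deliberately simpler stand-in: clamping and nearest-class rounding are applied after estimation, which is not the same as retraining the model with bounded outputs or as a true classifier. The function names are hypothetical.

```python
CLASSES = (0, 25, 50, 75, 100)  # percentage of bits assigned to SENC


def clamp_percentage(raw_estimate):
    """Limit an unbounded estimator output to the range 0..100 %."""
    return min(100.0, max(0.0, raw_estimate))


def nearest_class(raw_estimate):
    """Map an estimate to the closest of the five candidate distributions."""
    return min(CLASSES, key=lambda c: abs(c - raw_estimate))


if __name__ == "__main__":
    for raw in (-12.5, 42.0, 90.0, 130.0):
        print(raw, clamp_percentage(raw), nearest_class(raw))
```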
The use of a property vector PV for estimation of bit distributions R-SE, R-TE among the different codec strategies SENC, TENC reduces computational complexity significantly compared to a codec in which this distribution is determined by means of rate-distortion optimisation. In the mentioned embodiment, complexity is reduced by a factor equal to the number of bit distributions examined in the optimisation, i.e. by a factor of 5 in the mentioned example.
Fig. 6 illustrates the third embodiment, a property vector PV based scheme to determine an up-front optimised segmentation OSEG adapted to the input signal IN.
Decisions in a segmentation optimising unit SEG OPT with respect to the adaptive segmentation OSEG are based on the property vector PV and on a model of the different segmentations, for example their rate-distortion performance. The optimised segmentation OSEG is then applied to the encoder ENC together with the input signal IN, and an encoded output signal OUT can be generated. Then it is not necessary to encode all different segmentation possibilities, because the property vector PV already indicates the input-type-dependent performance of the segmentations.
Actually, the use of a property vector for up-front segmentation is similar to that of rate-distortion estimation. In the same way as described for the first embodiment, the property vector can be used to estimate the rate-distortion performance of different segmentation possibilities, choosing the one with the best performance.
The use of a property vector for up-front adaptive time segmentation reduces computational complexity significantly compared to segmentation determined by means of full rate-distortion optimisation. Complexity is reduced by a factor about equal to the number of different segment lengths allowed (ignoring the extra complexity introduced by the property vector). For example, assume that in a sinusoidal encoder with adaptive segmentation 4 different segment lengths are allowed: 10.7, 16.0, 21.3 and 26.8 ms. Complexity is then reduced by a factor of 4 by up-front segmentation.
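A minimal sketch of up-front segment selection is given below, assuming one estimator per allowed segment length that returns an estimated (rate, distortion) pair from the property vector. The Lagrangian cost D + λR used to rank the candidates is an assumption, since the text only refers to rate-distortion performance, and the toy estimators are invented purely to make the example runnable.

```python
SEGMENT_LENGTHS_MS = (10.7, 16.0, 21.3, 26.8)


def choose_segment_length(property_vector, estimators, lam=1.0):
    """Pick the segment length with the lowest estimated cost D + lam * R.

    estimators maps each length to a callable returning (rate, distortion)
    estimated from the property vector alone, so no trial encoding is needed.
    """
    best_length, best_cost = None, float("inf")
    for length in SEGMENT_LENGTHS_MS:
        rate, distortion = estimators[length](property_vector)
        cost = distortion + lam * rate
        if cost < best_cost:
            best_length, best_cost = length, cost
    return best_length


def make_toy_estimator(length):
    """Hypothetical estimator: rate falls with length, distortion grows
    with the gap between the length and the first property value."""
    def estimate(property_vector):
        return 100.0 / length, abs(length - property_vector[0])
    return estimate


TOY_ESTIMATORS = {length: make_toy_estimator(length) for length in SEGMENT_LENGTHS_MS}

if __name__ == "__main__":
    print(choose_segment_length((21.0,), TOY_ESTIMATORS))
```

Only the chosen segmentation is then passed to the encoder, which is where the saving relative to encoding every candidate comes from.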
As will be understood the encoding principles according to the invention may be applied within a large range of applications, such as solid state audio devices, CD players/recorders, DVD players/recorders, mobile communication devices, (portable) computers, multimedia streaming of audio such as on the internet etc. In the claims reference signs to the Figures are included for clarity reasons only. These references to exemplary embodiments in the Figures should not in any way be construed as limiting the scope of the claims.

Claims

1. An audio encoder adapted to encode an audio signal (IN) according to an encoding template, the audio encoder comprising: optimizing means (ET OPT) adapted to generate an optimized encoding template (OET) based on a predetermined set of properties (PV) of the audio signal (IN), the optimized encoding template (OET) being optimized with respect to a predetermined encoding efficiency criterion, and encoding means (ENC) adapted to generate an encoded audio signal (OUT) in accordance with the optimized encoding template (OET).
2. An audio encoder according to claim 1, further comprising analysis means (AN) adapted to analyze the audio signal (IN) and generate the set of properties (PV) of the audio signal (IN) in response thereto.
3. An audio encoder according to claim 1, wherein the optimizing means (ET OPT) comprises means adapted to predict a perceptual distortion associated with the encoding template based on the predetermined set of properties (PV) of the audio signal (IN).
4. An audio encoder according to claim 1, wherein the set of properties (PV) of the audio signal (IN) comprises at least one property selected from the group consisting of: tonality, noisiness, harmonicity, stationarity, linear prediction gain, long-term prediction gain, spectral flatness, low-frequency spectral flatness, high-frequency spectral flatness, zero crossing rate, loudness, voicing ratio, spectral centroid, spectral bandwidth, a Mel cepstrum, frame energy, spectral flatness for ERB bands 1-10, spectral flatness for ERB bands 10-20, spectral flatness for ERB bands 20-30, and spectral flatness for ERB bands 30-37.
5. An audio encoder according to claim 1, adapted to optimize the encoding template for each segment of the audio signal.
6. An audio encoder according to claim 1, wherein the predicting means (ET OPT) further comprises means adapted to predict a resulting bit rate associated with the encoding template, based on the set of properties (PV) of the audio signal (IN).
7. An audio encoder according to claim 1, wherein the optimizing means (ET OPT) is adapted to optimize a segmentation of the audio signal based on the set of properties (PV) of the audio signal.
8. An audio encoder according to claim 1, wherein the optimizing means (ET OPT) is adapted to select the optimized encoding template (OET) from a set of predefined encoding templates.
9. An audio encoder according to claim 1, wherein the encoding means comprises first (SENC) and second (TENC) sub-encoders, and wherein the optimizing means (R-OPT) is adapted to generate optimized first (R-SE) and second (R-TE) encoding templates for the first (SENC) and second (TENC) sub-encoders in response to the predetermined set of properties (PV) of the audio signal (IN).
10. A method of encoding an audio signal (IN), the method comprising the steps of: generating an optimized encoding template (OET) based on a predetermined set of properties (PV) of the audio signal (IN), the optimized encoding template (OET) being optimized with respect to a predetermined encoding efficiency criterion, and generating an encoded audio signal (OUT) in accordance with the optimized encoding template (OET).
11. A method of optimizing an encoding template (OET) of an audio encoder adapted to encode an audio signal (IN), the method comprising the steps of: receiving a predetermined set of properties (PV) of the audio signal (IN), and optimizing the encoding template (OET) with respect to a predetermined encoding efficiency criterion, based on the predetermined set of properties (PV) of the audio signal (IN).
12. A device comprising an audio encoder according to claim 1.
13. A computer readable program code adapted to encode an audio signal according to the method of claim 10.
PCT/IB2005/053570 2004-11-05 2005-11-02 Efficient audio coding using signal properties WO2006048824A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP05797846A EP1815463A1 (en) 2004-11-05 2005-11-02 Efficient audio coding using signal properties
US11/718,242 US20090063158A1 (en) 2004-11-05 2005-11-02 Efficient audio coding using signal properties
JP2007539679A JP2008519308A (en) 2004-11-05 2005-11-02 Efficient audio coding using signal characteristics

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP04105545 2004-11-05
EP04105545.0 2004-11-05

Publications (1)

Publication Number Publication Date
WO2006048824A1 true WO2006048824A1 (en) 2006-05-11

Family

ID=35965990

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2005/053570 WO2006048824A1 (en) 2004-11-05 2005-11-02 Efficient audio coding using signal properties

Country Status (6)

Country Link
US (1) US20090063158A1 (en)
EP (1) EP1815463A1 (en)
JP (1) JP2008519308A (en)
KR (1) KR20070085788A (en)
CN (1) CN101053020A (en)
WO (1) WO2006048824A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7818168B1 (en) * 2006-12-01 2010-10-19 The United States Of America As Represented By The Director, National Security Agency Method of measuring degree of enhancement to voice signal
KR101411900B1 (en) * 2007-05-08 2014-06-26 삼성전자주식회사 Method and apparatus for encoding and decoding audio signal
CN101221766B (en) * 2008-01-23 2011-01-05 清华大学 Method for switching audio encoder
GB0915766D0 (en) * 2009-09-09 2009-10-07 Apt Licensing Ltd Apparatus and method for multidimensional adaptive audio coding
US9047875B2 (en) * 2010-07-19 2015-06-02 Futurewei Technologies, Inc. Spectrum flatness control for bandwidth extension
PL2951820T3 (en) * 2013-01-29 2017-06-30 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for selecting one of a first audio encoding algorithm and a second audio encoding algorithm
WO2024194336A1 (en) * 2023-03-21 2024-09-26 Telefonaktiebolaget Lm Ericsson (Publ) Coding of granular synthesis databases

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5341456A (en) * 1992-12-02 1994-08-23 Qualcomm Incorporated Method for determining speech encoding rate in a variable rate vocoder
US20020049585A1 (en) * 2000-09-15 2002-04-25 Yang Gao Coding based on spectral content of a speech signal
US20040006644A1 (en) 2002-03-14 2004-01-08 Canon Kabushiki Kaisha Method and device for selecting a transcoding method among a set of transcoding methods

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0111612B1 (en) * 1982-11-26 1987-06-24 International Business Machines Corporation Speech signal coding method and apparatus
EP0556354B1 (en) * 1991-09-05 2001-10-31 Motorola, Inc. Error protection for multimode speech coders
US6134518A (en) * 1997-03-04 2000-10-17 International Business Machines Corporation Digital audio signal coding using a CELP coder and a transform coder
US6240386B1 (en) * 1998-08-24 2001-05-29 Conexant Systems, Inc. Speech codec employing noise classification for noise compensation
US6691084B2 (en) * 1998-12-21 2004-02-10 Qualcomm Incorporated Multiple mode variable rate speech coding
US6785645B2 (en) * 2001-11-29 2004-08-31 Microsoft Corporation Real-time speech and music classifier
AUPS270902A0 (en) * 2002-05-31 2002-06-20 Canon Kabushiki Kaisha Robust detection and classification of objects in audio using limited training data

Non-Patent Citations (13)

* Cited by examiner, † Cited by third party
Title
CHRISTENSEN M G ET AL: "ARDOR: Adaptive Rate-Distortion Optimized Sound Coder", AALBORG UNIVERSITY. DEPARTMENT OF COMMUNICATION TECHNOLOGY, 3 July 2004 (2004-07-03), XP002361146 *
DAS A ET AL: "Multimode variable bit rate speech coding: an efficient paradigm for high-quality low-rate representation of speech signal", ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 1999. PROCEEDINGS., 1999 IEEE INTERNATIONAL CONFERENCE ON PHOENIX, AZ, USA 15-19 MARCH 1999, PISCATAWAY, NJ, USA,IEEE, US, vol. 4, 15 March 1999 (1999-03-15), pages 2307 - 2310, XP010327890, ISBN: 0-7803-5041-3 *
HEUSDENS R ET AL: "Rate-distortion optimal sinusoidal modeling of audio and speech using psychoacoustical matching pursuits", 2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING. PROCEEDINGS. (ICASSP), vol. VOL. 4 OF 4, 13 May 2002 (2002-05-13) - 17 May 2002 (2002-05-17), ORLANDO, FL, pages II-1809 - II-1812, XP010804247, ISBN: 0-7803-7402-9 *
NORDEN F ET AL: "Open Loop Rate-Distortion Optimized Audio Coding", ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2005. PROCEEDINGS. (ICASSP '05). IEEE INTERNATIONAL CONFERENCE ON PHILADELPHIA, PENNSYLVANIA, USA MARCH 18-23, 2005, PISCATAWAY, NJ, USA,IEEE, 18 March 2005 (2005-03-18), pages 161 - 164, XP010792354, ISBN: 0-7803-8874-7 *
NORDEN F ET AL: "Property vector based distortion estimation", SIGNALS, SYSTEMS AND COMPUTERS, 2004. CONFERENCE RECORD OF THE THIRTY-EIGHTH ASILOMAR CONFERENCE ON PACIFIC GROVE, CA, USA NOV. 7-10, 2004, PISCATAWAY, NJ, USA,IEEE, 7 November 2004 (2004-11-07), pages 2275 - 2279, XP010781123, ISBN: 0-7803-8622-1 *
R. D. KOILPILLAI; P. P. VAIDYANATHAN: "Cosine- modulated fir filter banks satisfying perfect reconstruction", IEEE TRANS. SIGNAL PROCESSING, vol. 40, no. 4, April 1992 (1992-04-01), pages 770 - 783
R. HEUSDENS; S. VAN DE PAR: "Rate-distortion optimal sinusoidal modeling of audio and speech using psychoacoustical matching pursuits", PROC. IEEE INT. CONF. ACOUST., SPEECH, AND SIGNAL PROC. (ORLANDO, FL, USA), vol. 2, 2002, pages 1809 - 1812
S. VAN DE PAR ET AL.: "A new psychoacoustical masking model for audio coding applications", PROC. IEEE INT. CONF. ACOUST., SPEECH, AND SIGNAL PROC. (ORLANDO, FL, USA), vol. 2, 2002, pages 1805 - 1808
T. M. COVER; J. A. THOMAS: "Elements of Information Theory", 1991, JOHN WILEY & SONS
VAFIN R ET AL.: "Towards optimal quantization in multistage audio coding", ACOUSTICS, SPEECH AND SIGNAL PROCESSING 2004, PROCEEDINGS (ICASSP '04), IEEE INTERNATIONAL CONFERENCE ON, MONTREAL, QUEBEC, CANADA, vol. 4, 17 May 2004 (2004-05-17), pages 205 - 208
VAFIN R ET AL: "Towards optimal quantization in multistage audio coding", ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2004. PROCEEDINGS. (ICASSP '04). IEEE INTERNATIONAL CONFERENCE ON MONTREAL, QUEBEC, CANADA 17-21 MAY 2004, PISCATAWAY, NJ, USA,IEEE, vol. 4, 17 May 2004 (2004-05-17), pages 205 - 208, XP010718441, ISBN: 0-7803-8484-9 *

Also Published As

Publication number Publication date
EP1815463A1 (en) 2007-08-08
JP2008519308A (en) 2008-06-05
CN101053020A (en) 2007-10-10
KR20070085788A (en) 2007-08-27
US20090063158A1 (en) 2009-03-05

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KN KP KR KZ LC LK LR LS LT LU LV LY MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2005797846

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 11718242

Country of ref document: US

WWE Wipo information: entry into national phase

Ref document number: 2007539679

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 1949/CHENP/2007

Country of ref document: IN

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 200580037908.9

Country of ref document: CN

WWE Wipo information: entry into national phase

Ref document number: 1020077012691

Country of ref document: KR

WWP Wipo information: published in national office

Ref document number: 2005797846

Country of ref document: EP