WO2018073486A1 - Low-delay audio coding - Google Patents

Low-delay audio coding

Info

Publication number
WO2018073486A1
Authority
WO
WIPO (PCT)
Prior art keywords
samples
vector
quantized
source
zero
Prior art date
Application number
PCT/FI2016/050744
Other languages
French (fr)
Inventor
Adriana Vasilache
Anssi Sakari RÄMÖ
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to PCT/FI2016/050744 priority Critical patent/WO2018073486A1/en
Publication of WO2018073486A1 publication Critical patent/WO2018073486A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/032Quantisation or dequantisation of spectral components
    • G10L19/038Vector quantisation, e.g. TwinVQ audio
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/08Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/10Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a multipulse excitation
    • G10L19/107Sparse pulse excitation, e.g. by using algebraic codebook
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3082Vector coding
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques

Definitions

  • the example and non-limiting embodiments of the present invention relate to low-delay coding of audio signals at high sound quality.
  • some embodiments of the present invention relate to lattice vector quantization of a signal that represents a segment of an audio signal.
  • When such an audio coding technique is applied in an audio processing system that involves e.g. capturing an audio signal and related processing, encoding the captured/processed audio signal, transmitting the encoded audio signal from one entity to another, decoding the received encoded audio signal and reproducing the decoded audio signal, the overall processing delay typically increases clearly beyond the mere coding delay, thereby rendering such audio coding techniques unsuitable for applications that cannot tolerate long latency, such as telephony, wireless microphones or audio co-creation systems.
  • Speech coding techniques such as adaptive multi-rate (AMR), adaptive multi-rate wideband (AMR-WB) and 3GPP enhanced voice services (EVS) employ coding delay in the range of 25 to 32 ms, which makes them somewhat better suited for some latency-critical applications, including conversational applications such as mobile telephony and/or voice over internet protocol (VoIP).
  • these coding techniques are speech coding techniques that make use of some characteristics of human voice and that operate on bandwidth-limited audio signals at relatively low bit-rates, thereby providing an audio quality that is not well suited for applications that require high-quality full-band audio and/or carry audio content different from human voice.
  • There are also speech coding techniques, such as ITU-T G.726, G.728 and G.722, that enable very low coding delay, even in a range below 1 ms, but these coding techniques too operate on voice band (e.g. at 8 or 16 kHz sampling frequency) and provide a rather modest compression ratio.
  • Some recently introduced audio coding techniques such as Opus (in a low-delay mode) and AAC-ULD enable relatively low coding delay in a range from 2.5 to 20 ms for full-band audio at a relatively good sound quality.
  • the AAC-ULD coding technique enables good sound quality using a coding delay of approximately 8 ms at bit-rates around 72 to 96 kilobits per second (kbps) or using a coding delay of approximately 2 ms at bit-rates around 128 to 192 kbps. While such coding delays make these audio coding techniques feasible candidates for many low-latency applications and usage scenarios, there is still a need for a high-quality full-band audio coding technique that enables extremely low coding delay, e.g. one that is around 2.5 ms or below at bit-rates at or close to 128 kbps and below.
  • a method for encoding a source vector of a predefined number of source samples that represent a frame of an input audio signal comprising quantizing the source samples of the source vector into respective quantized samples of an initial quantized vector using at most a predefined number of bits by employing a lattice quantizer restricted to a predefined maximum norm, detecting a sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector, determining, in response to detecting a sequence of non-zero length, a modified maximum norm that is greater than or equal to the predefined maximum norm and determining a shortened source vector by excluding those source samples that are represented by said zero-valued quantized samples of said sequence, and quantizing the source samples of the shortened source vector into respective re-quantized samples of a re-quantized vector using at most the predefined number of bits by employing said lattice quantizer restricted to the modified maximum norm.
  • an apparatus for encoding a source vector of a predefined number of source samples that represent a frame of an input audio signal configured to quantize the source samples of the source vector into respective quantized samples of an initial quantized vector using at most a predefined number of bits by employing a lattice quantizer restricted to a predefined maximum norm, detect a sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector, determine, in response to detecting a sequence of non-zero length, a modified maximum norm that is greater than or equal to the predefined maximum norm and determine a shortened source vector by excluding those source samples that are represented by said zero-valued quantized samples of said sequence, and quantize the source samples of the shortened source vector into respective re-quantized samples of a re-quantized vector using at most the predefined number of bits by employing said lattice quantizer restricted to the modified maximum norm.
  • an apparatus for encoding a source vector of a predefined number of source samples that represent a frame of an input audio signal comprising means for quantizing the source samples of the source vector into respective quantized samples of an initial quantized vector using at most a predefined number of bits by employing a lattice quantizer restricted to a predefined maximum norm, means for detecting a sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector, means for determining, in response to detecting a sequence of non-zero length, a modified maximum norm that is greater than or equal to the predefined maximum norm and determining a shortened source vector by excluding those source samples that are represented by said zero-valued quantized samples of said sequence, and means for quantizing the source samples of the shortened source vector into respective re-quantized samples of a re-quantized vector using at most the predefined number of bits by employing said lattice quantizer restricted to the modified maximum norm.
  • an apparatus for encoding a source vector of a predefined number of source samples that represent a frame of an input audio signal comprises at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: quantize the source samples of the source vector into respective quantized samples of an initial quantized vector using at most a predefined number of bits by employing a lattice quantizer restricted to a predefined maximum norm, detect a sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector, determine, in response to detecting a sequence of non-zero length, a modified maximum norm that is greater than or equal to the predefined maximum norm and determine a shortened source vector by excluding those source samples that are represented by said zero-valued quantized samples of said sequence, and quantize the source samples of the shortened source vector into respective re-quantized samples of a re-quantized vector using at most the predefined number of bits by employing said lattice quantizer restricted to the modified maximum norm.
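The quantize, detect-trailing-zeros, re-quantize flow described in the embodiments above can be sketched in Python. This is only an illustrative sketch: the norm-restricted lattice quantizer is stood in for by rounding with clipping, the rule for choosing the modified maximum norm (grow it by one per excluded sample) is an assumption, and the at-most-N-bits budget is not modeled.

```python
import numpy as np

def quantize_max_norm(src, max_norm):
    """Toy Z-lattice quantizer: round to the nearest integer and clip to
    +/-max_norm. A stand-in for the norm-restricted lattice quantizer."""
    q = np.rint(src).astype(int)
    return np.clip(q, -max_norm, max_norm)

def encode_with_requantization(src, max_norm):
    """Quantize the source vector, detect consecutive zero-valued quantized
    samples at its end and, if any are found, re-quantize the shortened
    source vector with a relaxed (modified) maximum norm."""
    q = quantize_max_norm(src, max_norm)
    # count consecutive zero-valued quantized samples at the end of the vector
    n_zeros = 0
    for v in q[::-1]:
        if v != 0:
            break
        n_zeros += 1
    if n_zeros == 0:
        return q, max_norm
    shortened = src[:len(src) - n_zeros]
    # modified maximum norm >= predefined maximum norm; growing it by one
    # per excluded sample is an illustrative rule, not from the application
    modified_norm = max_norm + n_zeros
    return quantize_max_norm(shortened, modified_norm), modified_norm
```

For example, with a maximum norm of 2, the vector [3.7, -2.2, 0.9, 0.1, -0.2] quantizes to [2, -2, 1, 0, 0]; the two trailing zeros are excluded and the first three samples are re-quantized with the relaxed norm.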
  • a computer program comprising computer readable program code configured to cause performing at least a method according to the example embodiment described in the foregoing when said program code is executed on a computing apparatus.
  • the computer program according to an example embodiment may be embodied on a volatile or a non-volatile computer-readable record medium, for example as a computer program product comprising at least one computer readable non-transitory medium having program code stored thereon, which program, when executed by an apparatus, causes the apparatus at least to perform the operations described hereinbefore for the computer program according to an example embodiment of the invention.
  • Figure 1 illustrates a block diagram of some components and/or entities of an audio processing system within which one or more example embodiments may be implemented.
  • Figure 2 illustrates a block diagram of some components and/or entities of an audio encoder according to an example embodiment.
  • Figure 3 illustrates a method according to an example embodiment.
  • Figure 4 illustrates a method according to an example embodiment.
  • Figure 5 illustrates a mapping table according to an example embodiment.
  • Figure 6 illustrates a block diagram of some components and/or entities of an audio decoder according to an example embodiment.
  • Figure 7 illustrates a block diagram of some components and/or entities of an apparatus for implementing an audio encoder and/or an audio decoder according to an example embodiment.
  • Figure 1 schematically illustrates a block diagram of some components and/or entities of an audio processing system 100.
  • the audio processing system comprises an audio capturing entity 110 for capturing an input audio signal 115 that represents at least one sound, an audio encoding entity 120 for encoding the input audio signal 115 into an encoded audio signal 125, an audio decoding entity 130 for decoding the encoded audio signal 125 obtained from the audio encoding entity into a reconstructed audio signal 135, and an audio reproduction entity 140 for playing back the reconstructed audio signal 135.
  • the audio capturing entity 110 may comprise e.g. a microphone, an arrangement of two or more microphones or a microphone array, each operable for capturing a respective sound signal.
  • the audio capturing entity 110 serves to process one or more sound signals that each represent an aspect of the captured sound into the input audio signal 115 for provision to the audio encoding entity 120 and/or for storage in a storage means for subsequent use.
  • the audio encoding entity 120 employs an audio coding algorithm, referred to herein as an audio encoder, to process the input audio signal 115 into the encoded audio signal 125.
  • the audio encoder may be considered to implement a transform from a signal domain (the input audio signal 115) to the compressed domain (the encoded audio signal 125).
  • the audio encoding entity 120 may further include a pre-processing entity for processing the input audio signal 115 from a format in which it is received from the audio capturing entity 110 into a format suited for the audio encoder.
  • This pre-processing may involve, for example, level control of the input audio signal 115 and/or modification of frequency characteristics of the input audio signal 115 (e.g. low-pass, high-pass or bandpass filtering).
  • the pre-processing may be provided as a pre-processing entity that is separate from the audio encoder, as a sub-entity of the audio encoder or as a processing entity whose functionality is shared between a separate pre-processing entity and the audio encoder.
  • the audio decoding entity 130 employs an audio decoding algorithm, referred to herein as an audio decoder, to process the encoded audio signal 125 into the reconstructed audio signal 135.
  • the audio decoder may be considered to implement a transform from the encoded domain (the encoded audio signal 125) back to the signal domain (the reconstructed audio signal 135).
  • the audio decoding entity 130 may further include a post-processing entity for processing the reconstructed audio signal 135 from a format in which it is received from the audio decoder into a format suited for the audio reproduction entity 140. This post-processing may involve, for example, level control of the reconstructed audio signal 135 and/or modification of frequency characteristics of the reconstructed audio signal 135 (e.g. low-pass, high-pass or bandpass filtering).
  • the post-processing may be provided as a post-processing entity that is separate from the audio decoder, as a sub-entity of the audio decoder or as a processing entity whose functionality is shared between a separate post-processing entity and the audio decoder.
  • the audio reproduction entity 140 may comprise, for example, headphones, a headset, a loudspeaker or an arrangement of one or more loudspeakers.
  • the audio processing system 100 may include a storage means for storing pre-captured or pre-created audio signals, among which the audio input signal for provision to the audio encoding entity 120 can be selected.
  • the audio processing system 100 may comprise a storage means for storing the reconstructed audio signal 135 for subsequent analysis, processing, playback and/or transmission to a further entity.
  • the dotted vertical line in Figure 1 serves to denote that, typically, the audio encoding entity 120 and the audio decoding entity 130 are provided in separate devices that may be connected to each other via a network or via a transmission channel.
  • the network/channel may enable a wireless connection, a wired connection or a combination of the two between the audio encoding entity 120 and the audio decoding entity 130.
  • the audio encoding entity 120 may further comprise a (first) network interface for encapsulating the encoded audio signal 125 into a sequence of protocol data units (PDUs) for transfer to the decoding entity 130 over a network/channel, whereas the audio decoding entity 130 may further comprise a (second) network interface for decapsulating the encoded audio signal 125 from the sequence of PDUs received from the audio encoding entity 120 over the network/channel.
  • the input audio signal 115 may comprise a multi-channel signal (e.g. a stereo signal) that comprises two or more separate audio channels.
  • the following examples outline a few possibilities for applying the processing described in the following for a single-channel input audio signal 115 to an input audio signal 115 provided as a multi-channel signal:
  • the audio encoding entity 120 may separately process each channel of the input audio signal 115 into a respective channel of the encoded audio signal 125, while the channels of the encoded audio signal 125 are processed in the audio decoding entity 130 into respective channels of the reconstructed audio signal 135.
  • the processing of a single channel in the audio encoding means 120 and the audio decoding means 130 may follow the approach according to the respective examples provided in the following for a single-channel input audio signal 115.
  • the audio encoding entity 120 may jointly process one or more channels of the input audio signal 115 into a channel of the encoded audio signal 125, while channels of the encoded audio signal 125 are processed in the audio decoding entity 130 into a desired number of reconstructed audio channels for provision as the reconstructed audio signal 135.
  • the audio encoding means 120 may process one or more derived audio signals that are derived from channels of the input audio signal 115 into respective encoded derived audio signals for provision as the encoded audio signal 125 or as part thereof, whereas the decoding means 130 may process one or more encoded derived audio signals received in the encoded audio signal 125 into one or more channels of the reconstructed audio signal 135.
  • a derived audio signal in the encoding means 120 comprises a downmix signal derived e.g. as a sum or as an average of two or more channels of the input audio signal 115, and the encoding means 120 further derives, for two or more channels, a respective set of (one or more) audio parameters that are descriptive of the difference between the downmix signal and a respective channel of the input audio signal 115 for inclusion in the encoded audio signal 125.
  • the audio decoding means 130 decodes the encoded downmix signal and applies, for the two or more channels, the respective set of audio parameters to reconstruct the respective channel of the reconstructed audio signal 135.
  • Figure 2 illustrates a block diagram of some components and/or entities of an audio encoder 121 that may be provided as part of the audio encoding entity 120 according to an example.
  • the audio encoding entity 120 may include further components or entities in addition to the audio encoder 121, e.g. the pre-processing entity referred to in the foregoing, which pre-processing entity may be arranged to process the input audio signal 115 before passing it to the audio encoder 121.
  • the audio encoder 121 carries out encoding of the input audio signal 115 into the encoded audio signal 125; in other words, the audio encoder 121 implements a transform from the signal domain to the encoded domain.
  • the audio encoder 121 may be arranged to process the input audio signal 115 as a sequence of input frames, each input frame including a digital audio signal at a predefined sampling frequency and comprising a time series of input samples.
  • the audio encoder 121 employs a fixed predefined frame length.
  • the frame length may be a selectable frame length that may be selected from a plurality of predefined frame lengths, or the frame length may be an adjustable frame length that may be selected from a predefined range of frame lengths.
  • a frame length may be defined as the number of samples L included in the frame, which at the predefined sampling frequency maps to a corresponding duration in time.
  • the audio encoder 121 processes the input audio signal 115 through a linear predictive coding (LPC) encoder 122, a long-term prediction (LTP) encoder 124 and a residual encoder 126.
  • The LPC encoder 122 carries out an LPC encoding procedure to process the input audio signal 115 into a first residual signal 123, which is provided as input to the LTP encoder 124.
  • the LTP encoder 124 carries out LTP encoding to process the first residual signal 123 into a second residual signal 127, which is provided as input to the residual encoder 126.
  • the residual encoder 126 carries out a residual encoding procedure to process the second residual signal 127 into the encoded audio signal 125 for provision to the decoding means (and/or for storage by a storage means).
  • LPC encoding in general is a coding technique well known in the art and it makes use of short-term redundancies in the input audio signal 115.
  • LTP encoding in general is a technique known in the art, and it makes use of long(er) term redundancies (e.g. in a range above approximately 2 ms) in the input audio signal 115: while the LPC encoder 122 is typically successful in modeling any short-term redundancies, possible long-term redundancies may still be present in the first residual signal 123, and hence the LTP encoder 124 may provide an improvement for encoding input audio signals 115 that include a periodic or a quasi-periodic signal component whose periodicity falls into the range of long(er) term redundancies.
  • A typical example of an audio signal that includes such a periodic or quasi-periodic signal component is human voice (especially during time periods of voiced sound that typically represent vowel sounds of human speech).
  • the input audio signal 115 is processed into the encoded audio signal 125 frame by frame.
  • the LPC encoder 122 carries out the LPC encoding for a frame of the input audio signal 115 and produces a corresponding frame of the first residual signal 123, which is processed by the LTP encoder 124 into a corresponding frame of the second residual signal 127, which in turn is processed by the residual encoder 126 into a corresponding frame of the encoded audio signal 125.
  • Respective non-limiting examples of operation of the LPC encoder 122, the LTP encoder 124 and the residual encoder 126 outlined above are provided in the following.
  • the LPC encoder 122 carries out an LPC analysis based on past values of the reconstructed audio signal 135 using a backward prediction technique known in the art.
  • a 'local' copy of the reconstructed audio signal 135 may be stored in a past audio buffer, which may be provided e.g. in a memory in the audio encoder 121 or in the LPC encoder 122, thereby making the reconstructed audio signal 135 available for the LPC analysis in the LPC encoder 122.
  • the references to the reconstructed audio signal 135 in the context of the audio encoder 121 refer to the local copy available therein. This aspect will be described in more detail later below.
  • the LPC encoder 122 may determine the LPC filter coefficients e.g. by minimizing the error term

    e = Σ_{t = t′+1}^{t′+N_lpc} ( Σ_{i=0}^{K_LPC} a_i · x(t − i) )²,

    where a_i, i = 0 … K_LPC, denote the LPC filter coefficients (with a_0 = 1), N_lpc denotes the length of the analysis window (in samples) and x(t) denotes samples of the reconstructed audio signal 135 within the analysis window t = t′ + 1 … t′ + N_lpc.
  • the backward prediction computes LPC filter coefficients on basis of past samples of the reconstructed audio signal 135 and carries out LPC analysis filtering for a frame of the input audio signal 115 using the computed LPC filter coefficients to produce a corresponding frame of the first residual signal 123.
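One common way to realize such backward prediction (the application does not fix the method) is the autocorrelation method combined with the Levinson-Durbin recursion, computed over a window of past reconstructed samples. A sketch with illustrative function names, assuming a Hamming analysis window:

```python
import numpy as np

def lpc_backward(reconstructed, order):
    """Compute LPC coefficients from past reconstructed samples using the
    autocorrelation method and the Levinson-Durbin recursion.
    Returns a with a[0] == 1, so the residual is sum_i a[i] * x(t - i)."""
    w = reconstructed * np.hamming(len(reconstructed))  # windowed analysis buffer
    r = np.array([np.dot(w[:len(w) - k], w[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient from the current prediction error
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a
```

Because only past reconstructed samples are used, the decoder can repeat the identical computation and no coefficients need to be transmitted.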
  • the LPC analysis filtering involves processing a time series of input samples into a corresponding time series of first residual samples.
  • the LPC encoder 122 passes the first residual signal 123 to the LTP encoder 124 for computation of the second residual signal 127 therein.
  • the LPC analysis filtering to compute the first residual signal 123 on basis of the input audio signal 115 may be carried out e.g. according to the following equation:

    r₁(t) = Σ_{i=0}^{K_LPC} a_i · x(t − i), t = t′ + 1 … t′ + L,

    where a_i, i = 0 … K_LPC, denote the LPC filter coefficients (with a_0 = 1), L denotes the frame length (in number of samples), x(t) denotes samples of the input audio signal 115 and r₁(t), t = t′ + 1 … t′ + L, denotes a corresponding frame of the first residual signal 123 (i.e. the time series of first residual samples).
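In code, the analysis filtering is a short FIR run over the frame, with the preceding samples serving as filter memory. An illustrative sketch (function name and array layout are assumptions, not from the application):

```python
import numpy as np

def lpc_analysis_filter(a, past, frame):
    """LPC analysis filtering: r1(t) = sum_{i=0..K} a[i] * x(t - i) with
    a[0] == 1; 'past' supplies the K samples preceding the frame."""
    a = np.asarray(a, dtype=float)
    past = np.asarray(past, dtype=float)
    frame = np.asarray(frame, dtype=float)
    K = len(a) - 1
    # prepend the K most recent past samples as filter memory
    buf = np.concatenate([past[len(past) - K:], frame])
    res = np.empty(len(frame))
    for n in range(len(frame)):
        res[n] = sum(a[i] * buf[K + n - i] for i in range(K + 1))
    return res
```

With a = [1, -1] this degenerates to a first-difference filter, which makes the memory handling easy to check by hand.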
  • the backward prediction in the LPC encoder 122 employs a predefined analysis window length, implying that the backward prediction bases the LPC analysis on a predefined number of the most recent samples of the reconstructed audio signal 135.
  • the analysis window covers the 608 most recent samples of the reconstructed audio signal 135, which at the sampling frequency of 48 kHz corresponds to approx. 12.7 ms. This, however, is a non-limiting example and a shorter or longer window may be employed instead, e.g. a window having a duration of 16 ms or a duration selected from the range 12 to 30 ms.
  • a suitable length/duration of the analysis window depends also on the existence and/or characteristics of other encoding components employed in the first audio encoding mode.
  • the analysis window has a predefined shape, which may be selected in view of desired LPC analysis characteristics.
  • Several analysis windows for the LPC analysis applicable for the LPC encoder 122 are known in the art, e.g. a (modified) Hamming window and a (modified) Hanning window, as well as hybrid windows such as one specified in the ITU-T Recommendation G.728 (section 3.3).
  • the LPC encoder 122 employs a predefined LPC model order, denoted as K_LPC, resulting in a corresponding set of LPC filter coefficients.
  • Since the LPC analysis in the LPC encoder 122 relies on past values of the reconstructed audio signal 135, there is no need to transmit parameters that are descriptive of the computed LPC filter coefficients to the decoding entity 130; the decoding entity 130 is able to compute an identical set of LPC filter coefficients for LPC synthesis filtering therein on basis of the reconstructed audio signal 135 available in the audio decoding entity 130. Consequently, a relatively high LPC model order may be employed since it does not have an effect on the resulting bit-rate of the encoded audio signal 125, thereby enabling accurate modeling of the spectral envelope of the input audio signal 115, especially for input audio signals 115 that include a periodic or a quasi-periodic signal component.
  • The LPC model order K_LPC may be selected e.g. as a value between 30 and 60.
  • the zero-input response of the LPC analysis filter derived in the LPC encoder 122 may be removed from the first residual signal 123 before the first residual signal 123 is encoded further.
  • the zero-input response removal may be provided, for example, as part of the LPC encoder 122 (before passing the first residual signal 123 obtained by the LPC analysis filtering to the LTP encoder 124) or in the LTP encoder 124 (before carrying out an encoding procedure therein).
  • the zero input response may be calculated as

    x_zir(t) = − Σ_{i=1}^{K_LPC} a_i · x(t − i), t = t′ + 1 … t′ + L,

    where a_i, i = 0 … K_LPC, denote the LPC filter coefficients, L denotes the frame length (in number of samples) and x(t), t = t′ − K_LPC + 1 … t′, denotes a signal reconstructed on basis of one or more past frames of the encoded audio signal, i.e. the most recent samples of the reconstructed audio signal 135.
  • the computation of the zero input response is a recursive process: for the first sample of the zero input response all x(t) refer to past samples of the reconstructed audio signal 135, whereas the following samples of the zero input response are computed at least in part using signal samples computed for the zero input response.
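The recursion described above can be sketched as follows; the function name and buffer layout are illustrative. The filter is seeded with the most recent reconstructed samples, and each computed output feeds back into the state for the next one:

```python
import numpy as np

def zero_input_response(a, past, L):
    """Zero-input response of the LPC synthesis filter 1/A(z): run the
    filter for L samples with zero excitation, seeded with the K most
    recent reconstructed samples."""
    K = len(a) - 1
    state = list(np.asarray(past, dtype=float)[len(past) - K:])
    zir = []
    for _ in range(L):
        # x(t) = -sum_{i=1..K} a_i * x(t - i); earlier outputs feed later ones
        x = -sum(a[i] * state[-i] for i in range(1, K + 1))
        zir.append(x)
        state.append(x)
    return np.array(zir)
```

For a first-order filter with a = [1, -0.5] and a last reconstructed sample of 8, the response decays geometrically: 4, 2, 1, …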
  • the calculated zero input response is added back to the reconstructed audio signal 135. Consequently, also in the audio decoding entity 130, after reconstructing a frame of the reconstructed audio signal 135 therein, the zero input response is added to the reconstructed audio signal 135, as will be described in the following.
  • the LTP encoder 124 carries out an LTP analysis based on past values of the reconstructed audio signal 135.
  • LTP analysis may be considered to constitute a backward prediction technique.
  • the local copy of the reconstructed audio signal 135 required also for the backward predictive LTP analysis may be employed for this purpose.
  • the search for the LTP parameters may consider values of the LTP lag d in a predefined range from d_min to d_max, searching for the LTP parameters that minimize the error term.
  • the value of the LTP lag d is expressed as a number of samples, and the values d_min and d_max that define the predefined range may be set, in dependence of the applied sampling frequency, such that they cover e.g. a value range that corresponds to LTP lag values d from approximately 2 ms to approximately 20 ms.
  • the value of d min may be set to a value that excludes LTP lag values d that are shorter than the frame length L from consideration.
  • the LTP lag d typically corresponds to the pitch period of the speech signal carried by the input audio signal 115.
  • the respective values of the LTP lag d and the LTP gain g may be applied in the LTP encoder 124 to carry out LTP analysis filtering of a frame of the first residual signal 123 into a corresponding frame of the second residual signal 127.
  • the LTP analysis filtering involves processing a time series of first residual samples into a corresponding time series of second residual samples.
  • the LTP encoder 124 passes the second residual signal 127 to the residual encoder 126 for derivation of the encoded audio signal 125 therein.
  • the LTP analysis filtering to compute the second residual signal 127 on basis of the first residual signal 123 may be carried out e.g. using the LTP lag d and the LTP gain g derived in the LTP parameter search.
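A minimal sketch of a single-tap LTP analysis, combining an exhaustive lag search with a closed-form gain and the filtering step. The exact search and error criterion are not specified by the text; this sketch assumes the history buffer holds at least d_max past first-residual samples and that d_min is at least the frame length (cf. the remark above on excluding lags shorter than L):

```python
import numpy as np

def ltp_encode(residual1, history, d_min, d_max):
    """Search the LTP lag d and gain g minimizing the energy of
    e2(t) = e1(t) - g * e1(t - d), then apply the LTP analysis filter."""
    residual1 = np.asarray(residual1, dtype=float)
    history = np.asarray(history, dtype=float)
    L = len(residual1)
    best_d, best_g, best_err = d_min, 0.0, np.inf
    for d in range(d_min, d_max + 1):
        # with d >= L, the predictor reads only past (already known) samples
        pred = history[len(history) - d:len(history) - d + L]
        denom = np.dot(pred, pred)
        g = np.dot(residual1, pred) / denom if denom > 0.0 else 0.0
        err = np.sum((residual1 - g * pred) ** 2)
        if err < best_err:
            best_d, best_g, best_err = d, g, err
    pred = history[len(history) - best_d:len(history) - best_d + L]
    return best_d, best_g, residual1 - best_g * pred
```

For a periodic residual the search locks onto the period: the gain approaches 1 and the second residual collapses toward zero, which is exactly the redundancy the LTP stage is meant to remove.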
  • the audio encoder 121 may be provided without the LTP encoder 124.
  • the residual encoder 126 may carry out the residual encoding procedure on basis of the first residual signal 123 instead of the second residual signal 127.
  • such scenario may, at least conceptually, involve copying the first residual signal 123 into the second residual signal 127 for use as basis for the residual encoding procedure in the residual encoder 126.
  • the LTP encoder 124 may be applied to carry out the LTP analysis for each frame of the first residual signal 123, while the basis for the residual encoding in the residual encoder 126 for a given frame is selected in dependence of the performance of the LTP encoder 124.
  • the LTP encoder 124 may select one of the first residual signal 123 and the second residual signal 127 on basis of a selected norm, e.g. a Euclidean norm: the LTP encoder 124 may compute a first norm as a norm of (a frame of) the first residual signal 123 and a second norm as a norm of (the corresponding frame of) the second residual signal 127.
  • the second residual signal 127 is selected as basis for the residual encoding in response to the first norm exceeding the second norm, whereas the first residual signal 123 is selected as basis for the residual encoding otherwise.
  • the second residual signal 127 is selected as basis for the residual encoding in response to the first norm multiplied by a weighting factor that is smaller than unity exceeding the second norm, whereas the first residual signal 123 is selected as basis for the residual encoding otherwise.
  • the selection involves selecting whether to apply the LTP encoding for the given frame of the input signal or not.
  • the encoded parameters that are transmitted to the audio decoding entity 130 include an indication of the selection (i.e. whether the LTP encoding has been applied or not) for the given frame.
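The weighted-norm selection described in the bullets above can be sketched as follows; the helper name and the 0.98 weighting factor are illustrative assumptions rather than values taken from this description:

```python
import math

def select_residual(first_residual, second_residual, weight=0.98):
    """Select which residual frame to pass to the residual encoder.

    Returns (use_ltp, selected_frame). A weighting factor below unity biases
    the decision slightly against LTP; 0.98 is an illustrative assumption.
    """
    # Euclidean norms of the two candidate residual frames
    first_norm = math.sqrt(sum(x * x for x in first_residual))
    second_norm = math.sqrt(sum(x * x for x in second_residual))
    if first_norm * weight > second_norm:
        return True, second_residual   # LTP helped: encode the second residual
    return False, first_residual       # LTP did not help: bypass it
```

The boolean flag corresponds to the per-frame indication transmitted to the decoding entity so that the decoder knows whether to run LTP synthesis for the frame.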
  • the residual encoder 126 carries out a residual encoding procedure that involves deriving encoded residual parameters on basis of the second residual signal 127.
  • the residual encoder 126 may be provided e.g. as a gain-shape encoder.
  • the residual encoder 126 may be arranged to convert a frame of the second residual signal 127 from the time domain into a transform domain by using a predefined transform.
  • the predefined transform may comprise discrete cosine transform (DCT).
  • the predefined transform may comprise another energy-compacting transform known in the art, such as modified discrete cosine transform (MDCT), discrete sine transform (DST), etc.
  • the quantized vector has the dimension L and it may be identified by a codeword Idxv, whereas the quantized gain may be denoted as gr and it may be identified by a codeword Idxg.
  • MAXW may be set to value 2 and f may be set to value 0.98.
  • a pyramidally truncated Z48 lattice quantizer may be applied, e.g. one described in the article by Thomas R. Fischer titled "A Pyramid Vector Quantizer", IEEE Transactions on Information Theory, Vol. 32, Issue 4, pp. 568-583, July 1986, ISSN 0018-9448.
  • the number of bits B is a predefined fixed value.
  • the number of bits B may be selected or defined on a frame-by-frame basis. Non-limiting examples for applicable numbers of bits B are provided in the following.
  • the search procedure may also consider a suitable value for the gain gr.
  • the gain gr is the unquantized value
  • the quantized gain gr and the respective codeword Idxg may be derived separately using the scalar quantizer (as already referred to in the foregoing).
  • the candidate scaling factors gsi may be computed using the following equation:
  • the predefined maximum norm K may be defined with respect to a selected norm, e.g. the L1 norm.
  • Application of the predefined maximum norm K implies quantization that is limited to make use of those shells of the pyramidally truncated Z48 lattice that have a norm that is at most K.
  • the procedure continues with detecting the number of zero-valued elements k at the end of the initial quantized vector v1(j), as indicated in block 304. If k equals zero, i.e. if the last element of the initial quantized vector, i.e. v1(L), is non-zero, the initial quantized vector v1(j) is selected to represent the current frame of the second residual signal 127, as indicated in block 308, and a codeword Idx1 that identifies the initial quantized vector v1(j) is computed and included in the encoded parameters as the codeword Idxv.
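The zero-tail detection of block 304 amounts to a simple backward scan over the initial quantized vector; the sketch below is an illustrative rendering of that step (the function name and the example values are assumptions):

```python
def count_trailing_zeros(v):
    """Number of zero-valued samples at the end of the quantized vector
    (block 304 of flowchart 300)."""
    k = 0
    for x in reversed(v):
        if x != 0:
            break
        k += 1
    return k

# If k == 0 the initial quantized vector already uses its full dimension and
# is kept as-is; if k > 0 the first L-k source samples are re-quantized with
# a (typically larger) modified maximum norm K'.
v1 = [3, -1, 0, 2, 0, 0]          # illustrative initial quantized vector
k = count_trailing_zeros(v1)      # trailing zeros detected at the tail
requantize = k > 0                # whether to enter the re-quantization branch
```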
  • the modified maximum norm K' may likewise be defined with respect to the selected norm, e.g. the L1 norm.
  • the re-quantization commences by determining a value of the modified maximum norm K', as indicated in block 314.
  • the selection of the modified maximum norm K' may be provided e.g. by a predefined mapping function that returns a suitable value of the modified maximum norm K' in dependence of the given values of the number of bits B and the vector dimension L-k.
  • a mapping function may be provided via a mapping table that stores the respective number of bits Bm for a plurality of pairs of a maximum norm Km and a vector dimension Lm, and by searching the mapping table in the following manner:
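The table-search procedure itself is not reproduced in this extract; one plausible sketch, under the assumption that the table is scanned for the largest norm whose bit cost still fits the B-bit budget at the given dimension, is the following (all table values are illustrative, not those of Figure 5):

```python
NORMS = [2, 4, 8]   # candidate maximum norms Km (table rows), ascending
DIMS = [4, 6, 8]    # vector dimensions Lm (table columns)
BITS = [            # illustrative bit counts Bm for each (Km, Lm) pair
    [5, 7, 8],
    [8, 11, 13],
    [11, 15, 18],
]

def modified_max_norm(B, dim, norms=NORMS, dims=DIMS, bits=BITS):
    """Return the largest maximum norm K' whose bit cost at vector
    dimension `dim` still fits within the B-bit budget, or None if even
    the smallest tabulated norm does not fit."""
    j = dims.index(dim)
    best = None
    for i, K in enumerate(norms):
        if bits[i][j] <= B:
            best = K    # keep the largest norm that still fits
    return best
```

Because both encoder and decoder know B, L and the received k, the decoder can repeat the same search and recover K' without any extra side information.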
  • the residual encoding procedure, e.g. the one illustrated by the flowchart 300 depicted in Figure 3, results in providing residual encoding parameters including the codeword Idxg that identifies the quantized gain gr, a codeword Idxv that identifies the selected one of the quantized vectors v1(j) and v2(j), and the value of k.
  • the residual encoding parameters are provided for inclusion in the encoded parameters for transmission to the decoding entity 130 for the audio decoding procedure therein.
  • This aspect will be discussed in more detail in the following as part of the description of the decoding entity 130.
  • a non-limiting example of a mapping table referred to in the foregoing is provided in Figure 5. Each row of the mapping table represents a given maximum norm Km, whereas each column of the mapping table represents a given vector dimension Lm.
  • Each cell of the mapping table indicates the number of bits required for lattice quantization using the respective maximum norm Km and vector dimension Lm.
  • the pyramidal shell of norm k of the lattice Zn contains all lattice points having the L1 norm equal to k.
  • a pyramidal lattice truncation to norm k implies truncation of the lattice Zn such that only those pyramidal shells that have a norm that is smaller than or equal to k are considered.
  • the number of lattice points at the shell of the pyramidal lattice Zn that has norm k may be computed based on the following equations:
  • the number of lattice points in a pyramidal truncation of the lattice Zn to norm k may be expressed as
  • the number of bits required to uniquely indicate a lattice point in a pyramidal truncation of the lattice Zn to norm k may be computed as where the symbol ⌈x⌉ denotes rounding to the smallest integer value that is larger than or equal to x.
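The counting equations themselves did not survive extraction; the quantities they describe follow the well-known recurrence from the cited Fischer article (number of Zn lattice points on the L1 shell of norm k), which can be sketched as:

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def shell_points(n, k):
    """Number of points of the integer lattice Z^n whose L1 norm equals k,
    via the standard pyramid-VQ recurrence (Fischer, 1986)."""
    if k == 0:
        return 1          # only the origin has norm 0
    if n == 0:
        return 0          # no point of Z^0 has positive norm
    return (shell_points(n - 1, k)        # last coordinate is 0
            + shell_points(n - 1, k - 1)  # last coordinate is +/-1 (x2 folded in)
            + shell_points(n, k - 1))     # standard recurrence term

def truncation_points(n, K):
    """Points of Z^n in the pyramidal truncation to norm K (shells 0..K)."""
    return sum(shell_points(n, k) for k in range(K + 1))

def bits_needed(n, K):
    """ceil(log2(.)) bits to uniquely index a point of the truncated lattice."""
    return math.ceil(math.log2(truncation_points(n, K)))
```

Tables such as the one in Figure 5 can be precomputed by evaluating `bits_needed` over the relevant (norm, dimension) grid.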
  • the audio encoder 121 stores at least a predefined number of most recent samples of the reconstructed audio signal 135 to enable the backward prediction in the LPC encoder 122. As described in the foregoing, this may be implemented by generating a local copy of the reconstructed audio signal 135 in the audio encoder 121 and storing the local copy of the reconstructed audio signal 135 in the past audio buffer in the LPC encoder 122 or otherwise within the audio encoder 121.
  • the audio encoder 121 may further comprise a local audio synthesis element that is arranged to generate the local copy of the reconstructed audio signal 135 for the current frame and to update the past audio buffer by discarding the L oldest samples therein and inserting the samples that constitute the local copy of the reconstructed audio signal 135 in the past audio buffer to facilitate audio encoder 121 operation for processing of the next frame of the audio input signal 115.
  • the past audio buffer stores at least the most recent samples of the reconstructed audio signal 135 to cover the analysis window applied by the LPC encoder 122.
  • the past audio buffer may store at least the dmax most recent samples of the reconstructed audio signal 135 to enable evaluation of LTP lag values up to dmax.
  • Figure 6 illustrates a block diagram of some components and/or entities of an audio decoder 131 that may be provided as part of the audio decoding entity 130 according to an example.
  • the audio decoder 131 carries out decoding of the encoded audio signal 125 into the reconstructed audio signal 135, thereby serving to implement a transform from the encoded domain (back) to the signal domain and, in a way, reversing the encoding operation carried out in the audio encoder 121.
  • a residual decoder 136 carries out a residual decoding procedure to process the encoded audio signal 125 into a reconstructed second residual signal 137, which is provided as input to a LTP decoder 134.
  • the LTP decoder 134 carries out LTP decoding procedure to generate a reconstructed first residual signal 133 for provision as input to a LPC decoder 132, which in turn carries out LPC synthesis on basis of the reconstructed first residual signal 133 to output the reconstructed audio signal 135.
  • the audio decoder 131 processes the encoded audio signal 125 frame by frame.
  • the residual decoding procedure in the residual decoder 136 involves computing the reconstructed second residual signal 137 on basis of the encoded audio signal 125.
  • a frame of reconstructed second residual signal 137 is provided as a respective time series of reconstructed second residual samples.
  • in order to enable meaningful reconstruction of the residual signal, the residual decoder 136 must employ the same or otherwise matching residual coding technique as employed in the residual encoder 126.
  • the residual decoding procedure involves dequantizing residual encoding parameters received as part of the encoded audio signal 125 and using the dequantized parameters to create the frame of the reconstructed second residual signal 137, i.e. the time series of reconstructed second residual samples.
  • the encoded audio signal 125 includes the residual encoding parameters described in the foregoing, i.e. the codewords Idxg and Idxv and the value of k, where the codeword Idxg identifies the quantized gain gr, the codeword Idxv identifies a vector of the lattice codebook that represents the current frame and k indicates the number of zero-valued elements at the end of the initial quantized vector v1(j) as detected in the audio encoder 121.
  • the residual decoder 136 further has a priori knowledge of the number of bits B available for quantization of a frame of the second residual signal 127 and the length L, as well as access to the predefined mapping function that returns a suitable value of the norm (e.g. the predefined maximum norm K or the modified maximum norm K') in dependence of the given values of the number of bits B and the vector dimension L-k.
  • the residual decoder 136 defines the value of L-k by using the received value of k and may employ the predefined mapping function to derive the modified maximum norm K' employed in the residual encoder 126 in generation of the received codeword Idxv. This can be carried out by using a predefined mapping table as basis for the mapping, for example by using the procedure described in the foregoing in context of the residual encoding procedure.
  • the k zeros are appended at the end of the vector vr(j) before the multiplication by gr.
  • the inverse transform is carried out such that only the first L-k transform domain samples are considered in the procedure (e.g. by considering only the first L-k columns when applying a matrix-based inverse transform).
  • the applied inverse transform is an inverse transform of the transform applied in the residual encoder 126, e.g. inverse DCT, inverse MDCT, inverse DST, etc.
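The decoder-side shape reconstruction described above (append the k signalled zeros, apply the gain, run the inverse transform) can be sketched as follows; the naive orthonormal inverse DCT-II is shown only for illustration, a real decoder would use a fast transform, and the function names are assumptions:

```python
import math

def idct_ortho(X):
    """Naive orthonormal inverse DCT-II (i.e. DCT-III), for illustration."""
    N = len(X)
    out = []
    for n in range(N):
        s = X[0] / math.sqrt(N)
        for k in range(1, N):
            s += (math.sqrt(2.0 / N) * X[k]
                  * math.cos(math.pi * k * (2 * n + 1) / (2 * N)))
        out.append(s)
    return out

def reconstruct_frame(vr, gain, k):
    """Append the k signalled zeros, apply the decoded gain and run the
    inverse transform to obtain the reconstructed residual frame."""
    X = [gain * x for x in list(vr) + [0.0] * k]
    return idct_ortho(X)
```

Because the appended transform-domain samples are zero, applying the full-length inverse transform is equivalent to considering only the first L-k transform-domain samples (or matrix columns), as the description notes.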
  • the reconstructed second residual signal 137 is provided for LTP decoding procedure in the LTP decoder 134, which results in a reconstructed first residual signal 133.
  • a frame of reconstructed first residual signal 133 is provided as a respective time series of reconstructed first residual samples.
  • the LTP decoder 134 carries out LTP analysis to find the LTP lag d and the LTP gain g, for example, by using the procedure described in the foregoing in context of the LTP encoder 124.
  • the LTP decoding procedure involves LTP synthesis filtering to compute the first residual signal 133 on basis of the second residual signal 137 using the derived values of the LTP lag d and the LTP gain g.
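The LTP synthesis filtering can be sketched as the inverse of the analysis filter, i.e. adding back the scaled, lag-delayed past first residual; the function name and buffer convention below are illustrative assumptions:

```python
def ltp_synthesis(second_residual, lag, gain, past_first_residual):
    """LTP synthesis: r1(n) = r2(n) + g * r1(n - d).

    `past_first_residual` holds at least `lag` most recent reconstructed
    first-residual samples, newest last; it is extended as samples are
    reconstructed so the lag can reach back into the current frame.
    """
    buf = list(past_first_residual)
    out = []
    for r2 in second_residual:
        r1 = r2 + gain * buf[-lag]
        out.append(r1)
        buf.append(r1)
    return out
```

The corresponding encoder-side analysis filter would subtract the same term, r2(n) = r1(n) - g * r1(n - d), so that encoder and decoder stay in step.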
  • the audio decoder 131 may be provided without the LTP decoder 134.
  • the residual decoder 136 may provide its output as the reconstructed first residual signal 133 instead of the reconstructed second residual signal 137.
  • such scenario may, at least conceptually, involve copying the reconstructed second residual signal 137 into the reconstructed first residual signal 133 for use as basis for the LPC decoding procedure in the LPC decoder 132.
  • the reconstructed first residual signal 133 is provided for LPC decoding procedure in the LPC decoder 132, which results in the reconstructed audio signal 135.
  • a frame of reconstructed audio signal 135 is provided as a respective time series of reconstructed output samples.
  • the LPC decoding procedure comprises the LPC decoder 132 carrying out the LPC analysis based on past values of the reconstructed audio signal 135 using the same backward prediction technique as applied in the LPC encoder 122. Hence, the backward prediction computes LPC filter coefficients on basis of past samples of the reconstructed audio signal 135.
  • the LPC decoder further carries out LPC synthesis filtering of the reconstructed residual signal 133 by using the LPC filter coefficients derived for the current frame in the LPC decoder 132, thereby generating the reconstructed audio signal 135.
  • the LPC synthesis filtering in the LPC decoder 132 involves processing a time series of reconstructed first residual samples into a corresponding time series of reconstructed output samples that hence constitute a corresponding frame of the reconstructed audio signal 135.
  • the LPC decoder 132 may find the LPC filter coefficients for the LPC synthesis therein, for example, by using the procedure outlined in the foregoing for the LPC encoder 122.
  • the LPC synthesis may be carried out e.g. by using the following equation:
  • L denotes the frame length (in number of samples)
  • the resulting LPC filter coefficients are also the same or similar.
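The LPC synthesis equation referred to above is not reproduced in this extract; a conventional all-pole synthesis filter consistent with the surrounding description can be sketched as follows, where the sign convention of the coefficients is an assumption:

```python
def lpc_synthesis(residual, coeffs, history):
    """All-pole LPC synthesis: each output sample is the residual sample plus
    a prediction from the M previous output samples.

    `coeffs` are the M predictor coefficients a_1..a_M obtained by backward
    prediction from past reconstructed samples; `history` holds the most
    recent past output samples, newest last. Assumed sign convention:
    s(n) = e(n) + sum_i a_i * s(n - i).
    """
    out = []
    past = list(history)
    for e in residual:
        pred = sum(a * past[-i] for i, a in enumerate(coeffs, start=1))
        s = e + pred
        out.append(s)
        past.append(s)     # synthesized samples feed back into the predictor
    return out
```

Since both encoder and decoder derive the coefficients by the same backward prediction from the same past reconstructed samples, no coefficient transmission is needed.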
  • the past values of the reconstructed audio signal 135 required for the LPC analysis in the LPC decoder 132 are stored in a past audio buffer, which may be provided e.g. in a memory in the audio decoder 131 or in the LPC decoder 132.
  • after having derived the reconstructed audio signal 135, the LPC decoder 132 further adds the zero input response of the LPC synthesis filter to the reconstructed audio signal 135 before passing it from the audio decoder 131 for audio playback, storage and/or further processing and before using this signal to update the past audio buffer of the audio decoder 131 (as will be described later in this text).
  • the zero input response may be calculated on basis of the reconstructed audio signal 135, for example, as described in the foregoing for computation of the zero input response in the audio encoder 121.
  • the audio decoder 131 stores at least the most recent samples of the reconstructed audio signal 135 to enable the backward prediction in the LPC decoder 132.
  • the LTP decoder 134 is available in the audio decoder 131
  • at least the dmax most recent samples of the reconstructed audio signal 135 may be stored to enable evaluation of LTP lag values up to dmax. This may be implemented by storing a sufficient number of most recent samples in the past audio buffer of the audio decoder 131.
  • the audio decoder 131 updates the past audio buffer therein by discarding the L oldest samples in the past audio buffer and inserting the samples of the reconstructed audio signal 135 in the past audio buffer to facilitate the audio decoding of the next frame.
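The buffer update described above (discard the L oldest samples, insert the L newest reconstructed samples) maps naturally onto a fixed-length deque; the sizes below are illustrative assumptions, not values from this description:

```python
from collections import deque

D_MAX = 256   # largest supported LTP lag d_max, an illustrative assumption
L = 48        # frame length in samples, an illustrative assumption

# Past audio buffer covering at least the d_max most recent samples.
past_audio = deque([0.0] * D_MAX, maxlen=D_MAX)

def update_past_buffer(reconstructed_frame):
    """Discard the L oldest samples and append the L newest reconstructed
    samples; a deque with maxlen does the discarding implicitly."""
    past_audio.extend(reconstructed_frame)

# One decoded frame arrives: the buffer keeps its size, dropping old samples.
update_past_buffer([float(i) for i in range(L)])
```

The encoder maintains an identical buffer from its local copy of the reconstructed signal, which keeps the backward LPC analysis on both sides in sync.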
  • Figure 7 illustrates a block diagram of some components of an exemplifying apparatus 600.
  • the apparatus 600 may comprise further components, elements or portions that are not depicted in Figure 7.
  • the apparatus 600 may be employed in implementing e.g. the audio encoder 121 and/or the audio decoder 131 .
  • the apparatus 600 comprises a processor 616 and a memory 615 for storing data and computer program code 617.
  • the memory 615 and a portion of the computer program code 617 stored therein may be further arranged to, with the processor 616, implement the function(s) described in the foregoing in context of the audio encoder 121 and/or the audio decoder 131.
  • the apparatus 600 comprises a communication portion 612 for communication with other devices.
  • the communication portion 612 comprises at least one communication apparatus that enables wired or wireless communication with other apparatuses.
  • a communication apparatus of the communication portion 612 may also be referred to as a respective communication means.
  • the apparatus 600 may further comprise user I/O (input/output) components 618 that may be arranged, possibly together with the processor 616 and a portion of the computer program code 617, to provide a user interface for receiving input from a user of the apparatus 600 and/or providing output to the user of the apparatus 600 to control at least some aspects of operation of the audio encoder 121 and/or the audio decoder 131 implemented by the apparatus 600.
  • the user I/O components 618 may comprise hardware components such as a display, a touchscreen, a touchpad, a mouse, a keyboard, and/or an arrangement of one or more keys or buttons, etc.
  • the user I/O components 618 may be also referred to as peripherals.
  • the processor 616 may be arranged to control operation of the apparatus 600 e.g. in accordance with a portion of the computer program code 617 and possibly further in accordance with the user input received via the user I/O components 618 and/or in accordance with information received via the communication portion 612.
  • although the processor 616 is depicted as a single component, it may be implemented as one or more separate processing components.
  • although the memory 615 is depicted as a single component, it may be implemented as one or more separate components, some or all of which may be integrated/removable and/or may provide permanent/semi-permanent/dynamic/cached storage.
  • the computer program code 617 stored in the memory 615 may comprise computer-executable instructions that control one or more aspects of operation of the apparatus 600 when loaded into the processor 616.
  • the computer-executable instructions may be provided as one or more sequences of one or more instructions.
  • the processor 616 is able to load and execute the computer program code 617 by reading the one or more sequences of one or more instructions included therein from the memory 615.
  • the one or more sequences of one or more instructions may be configured to, when executed by the processor 616, cause the apparatus 600 to carry out operations, procedures and/or functions described in the foregoing in context of the audio encoder 121 and/or the audio decoder 131 .
  • the apparatus 600 may comprise at least one processor 616 and at least one memory 615 including the computer program code 617 for one or more programs, the at least one memory 615 and the computer program code 617 configured to, with the at least one processor 616, cause the apparatus 600 to perform operations, procedures and/or functions described in the foregoing in context of the audio encoder 121 and/or the audio decoder 131 .
  • the computer programs stored in the memory 615 may be provided e.g. as a respective computer program product comprising at least one computer-readable non-transitory medium having the computer program code 617 stored thereon, which computer program code, when executed by the apparatus 600, causes the apparatus 600 at least to perform operations, procedures and/or functions described in the foregoing in context of the audio encoder 121 and/or the audio decoder 131.
  • the computer-readable non-transitory medium may comprise a memory device or a record medium such as a CD-ROM, a DVD, a Blu-ray disc or another article of manufacture that tangibly embodies the computer program.
  • the computer program may be provided as a signal configured to reliably transfer the computer program.
  • reference(s) to a processor should not be understood to encompass only programmable processors, but also dedicated circuits such as field-programmable gate arrays (FPGA), application-specific circuits (ASIC), signal processors, etc.

Abstract

According to an example embodiment, a technique for encoding a source vector of a predefined number of source samples that represent a frame of an input audio signal is provided. In an example, the technique comprises quantizing the source samples of the source vector into respective quantized samples of an initial quantized vector using at most a predefined number of bits by employing a lattice quantizer restricted to a predefined maximum norm, detecting a sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector, determining, in response to detecting a sequence of non-zero length, a modified maximum norm that is greater than or equal to the predefined maximum norm and determining a shortened source vector by excluding those source samples that are represented by said zero-valued quantized samples of said sequence, and quantizing the source samples of the shortened source vector into respective re-quantized samples of a re-quantized vector using at most the predefined number of bits by employing said lattice quantizer restricted to the modified maximum norm.

Description

Low-delay audio coding TECHNICAL FIELD
The example and non-limiting embodiments of the present invention relate to low-delay coding of audio signals at high sound quality. In particular, some embodiments of the present invention relate to lattice vector quantization of a signal that represents a segment of an audio signal.
BACKGROUND
Development of speech and audio coding techniques has evolved into solutions that enable a high compression ratio at a good sound quality across input audio signals of various characteristics and across a wide range of encoding bit-rates. Typically, achieving a high compression ratio in an audio coding technique that operates on a full-band audio signal (typically employing a sampling frequency of 48 kHz) requires usage of a relatively long analysis window in a range of 150 milliseconds (ms) or above to ensure sufficient sound quality. Consequently, a coding delay (which is also commonly referred to as an algorithmic delay) of such audio coding techniques is in the range of 150 ms or above. Examples of commonly employed audio coding techniques of this type include e.g. MPEG-1/MPEG-2 audio layer 3 (MP3) and MPEG-2/MPEG-4 advanced audio coding (AAC).
When such an audio coding technique is applied in an audio processing system that involves e.g. capturing an audio signal and related processing, encoding the captured/processed audio signal, transmitting the encoded audio signal from one entity to another, decoding the received encoded audio signal and reproducing the decoded audio signal, the overall processing delay typically increases clearly beyond the mere coding delay, thereby rendering such audio coding techniques unsuitable for applications that cannot tolerate long latency, such as telephony, wireless microphones or audio co-creation systems. Speech coding techniques, such as adaptive multi-rate (AMR), adaptive multi-rate wideband (AMR-WB) and 3GPP enhanced voice services (EVS), employ a coding delay in the range of 25 to 32 ms, which makes them somewhat better suited for some latency-critical applications, including conversational applications such as mobile telephony and/or voice over internet protocol (VoIP). However, although enabling a high compression ratio, these coding techniques are speech coding techniques that make use of some characteristics of human voice and that operate on bandwidth-limited audio signals at relatively low bit-rates, thereby providing an audio quality that is not well-suited for applications that require high-quality full-band audio and/or carry audio content different from human voice. There are also speech coding techniques, such as ITU-T G.726, G.728 and G.722, that enable very low coding delay even in a range below 1 ms, but also these coding techniques operate on voice band (e.g. at 8 or 16 kHz sampling frequency) and provide a rather modest compression ratio. Some recently introduced audio coding techniques such as Opus (in a low-delay mode) and AAC-ULD enable relatively low coding delay in a range from 2.5 to 20 ms for full-band audio at a relatively good sound quality.
As an example, assuming a sampling frequency of 32 kHz, the AAC-ULD coding technique enables good sound quality using a coding delay of approximately 8 ms at bit-rates around 72 to 96 kilobits per second (kbps) or using a coding delay of approximately 2 ms at bit-rates around 128 to 192 kbps. While such coding delays make these audio coding techniques feasible candidates for many low-latency applications and usage scenarios, there is still a need for a high-quality full-band audio coding technique that enables extremely low coding delay, e.g. one that is around 2.5 ms or below at bit rates at or close to 128 kbps and below.
SUMMARY
According to an example embodiment, a method for encoding a source vector of a predefined number of source samples that represent a frame of an input audio signal is provided, the method comprising quantizing the source samples of the source vector into respective quantized samples of an initial quantized vector using at most a predefined number of bits by employing a lattice quantizer restricted to a predefined maximum norm, detecting a sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector, determining, in response to detecting a sequence of non-zero length, a modified maximum norm that is greater than or equal to the predefined maximum norm and determining a shortened source vector by excluding those source samples that are represented by said zero-valued quantized samples of said sequence, and quantizing the source samples of the shortened source vector into respective re-quantized samples of a re-quantized vector using at most the predefined number of bits by employing said lattice quantizer restricted to the modified maximum norm.
According to another example embodiment, an apparatus for encoding a source vector of a predefined number of source samples that represent a frame of an input audio signal is provided, the apparatus configured to quantize the source samples of the source vector into respective quantized samples of an initial quantized vector using at most a predefined number of bits by employing a lattice quantizer restricted to a predefined maximum norm, detect a sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector, determine, in response to detecting a sequence of non-zero length, a modified maximum norm that is greater than or equal to the predefined maximum norm and determining a shortened source vector by excluding those source samples that are represented by said zero- valued quantized samples of said sequence, and quantize the source samples of the shortened source vector into respective re-quantized samples of a re-quantized vector using at most the predefined number of bits by employing said lattice quantizer restricted to the modified maximum norm.
According to another example embodiment, an apparatus for encoding a source vector of a predefined number of source samples that represent a frame of an input audio signal is provided, the apparatus comprising means for quantizing the source samples of the source vector into respective quantized samples of an initial quantized vector using at most a predefined number of bits by employing a lattice quantizer restricted to a predefined maximum norm, means for detecting a sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector, means for determining, in response to detecting a sequence of non-zero length, a modified maximum norm that is greater than or equal to the predefined maximum norm and determining a shortened source vector by excluding those source samples that are represented by said zero-valued quantized samples of said sequence, and means for quantizing the source samples of the shortened source vector into respective re-quantized samples of a re-quantized vector using at most the predefined number of bits by employing said lattice quantizer restricted to the modified maximum norm.
According to another example embodiment, an apparatus for encoding a source vector of a predefined number of source samples that represent a frame of an input audio signal is provided, wherein the apparatus comprises at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: quantize the source samples of the source vector into respective quantized samples of an initial quantized vector using at most a predefined number of bits by employing a lattice quantizer restricted to a predefined maximum norm, detect a sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector, determine, in response to detecting a sequence of non-zero length, a modified maximum norm that is greater than or equal to the predefined maximum norm and determining a shortened source vector by excluding those source samples that are represented by said zero- valued quantized samples of said sequence, and quantize the source samples of the shortened source vector into respective re-quantized samples of a re-quantized vector using at most the predefined number of bits by employing said lattice quantizer restricted to the modified maximum norm.
According to another example embodiment, a computer program is provided, the computer program comprising computer readable program code configured to cause performing at least a method according to the example embodiment described in the foregoing when said program code is executed on a computing apparatus.
The computer program according to an example embodiment may be embodied on a volatile or a non-volatile computer-readable record medium, for example as a computer program product comprising at least one computer readable non-transitory medium having program code stored thereon, which program code, when executed by an apparatus, causes the apparatus at least to perform the operations described hereinbefore for the computer program according to an example embodiment of the invention.

The exemplifying embodiments of the invention presented in this patent application are not to be interpreted to pose limitations to the applicability of the appended claims. The verb "to comprise" and its derivatives are used in this patent application as an open limitation that does not exclude the existence of also unrecited features. The features described hereinafter are mutually freely combinable unless explicitly stated otherwise.
Some features of the invention are set forth in the appended claims. Aspects of the invention, however, both as to its construction and its method of operation, together with additional objects and advantages thereof, will be best understood from the following description of some example embodiments when read in connection with the accompanying drawings.
BRIEF DESCRIPTION OF FIGURES
The embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, where
Figure 1 illustrates a block diagram of some components and/or entities of an audio processing system within which one or more example embodiments may be implemented;
Figure 2 illustrates a block diagram of some components and/or entities of an audio encoder according to an example embodiment;
Figure 3 illustrates a method according to an example embodiment;
Figure 4 illustrates a method according to an example embodiment;
Figure 5 illustrates a mapping table according to an example embodiment;
Figure 6 illustrates a block diagram of some components and/or entities of an audio decoder according to an example embodiment; and
Figure 7 illustrates a block diagram of some components and/or entities of an apparatus for implementing an audio encoder and/or an audio decoder according to an example embodiment.
DESCRIPTION OF SOME EMBODIMENTS
Figure 1 schematically illustrates a block diagram of some components and/or entities of an audio processing system 100. The audio processing system comprises an audio capturing entity 110 for capturing an input audio signal 115 that represents at least one sound, an audio encoding entity 120 for encoding the input audio signal 115 into an encoded audio signal 125, an audio decoding entity 130 for decoding the encoded audio signal 125 obtained from the audio encoding entity into a reconstructed audio signal 135, and an audio reproduction entity 140 for playing back the reconstructed audio signal 135. The audio capturing entity 110 may comprise e.g. a microphone, an arrangement of two or more microphones or a microphone array, each operable for capturing a respective sound signal. The audio capturing entity 110 serves to process one or more sound signals that each represent an aspect of the captured sound into the input audio signal 115 for provision to the audio encoding entity 120 and/or for storage in a storage means for subsequent use. The audio encoding entity 120 employs an audio coding algorithm, referred to herein as an audio encoder, to process the input audio signal 115 into the encoded audio signal 125. In this regard, the audio encoder may be considered to implement a transform from a signal domain (the input audio signal 115) to the compressed domain (the encoded audio signal 125). The audio encoding entity 120 may further include a pre-processing entity for processing the input audio signal 115 from a format in which it is received from the audio capturing entity 110 into a format suited for the audio encoder. This pre-processing may involve, for example, level control of the input audio signal 115 and/or modification of frequency characteristics of the input audio signal 115 (e.g. low-pass, high-pass or bandpass filtering).
The preprocessing may be provided as a pre-processing entity that is separate from the audio encoder, as a sub-entity of the audio encoder or as a processing entity whose functionality is shared between a separate pre-processing and the audio encoder.
The audio decoding entity 130 employs an audio decoding algorithm, referred to herein as an audio decoder, to process the encoded audio signal 125 into the reconstructed audio signal 135. The audio decoder may be considered to implement a transform from an encoded domain (the encoded audio signal 125) back to the signal domain (the reconstructed audio signal 135). The audio decoding entity 130 may further include a post-processing entity for processing the reconstructed audio signal 135 from a format in which it is received from the audio decoder into a format suited for the audio reproduction entity 140. This post-processing may involve, for example, level control of the reconstructed audio signal 135 and/or modification of frequency characteristics of the reconstructed audio signal 135 (e.g. low-pass, high-pass or bandpass filtering). The post-processing may be provided as a post-processing entity that is separate from the audio decoder, as a sub-entity of the audio decoder or as a processing entity whose functionality is shared between a separate post-processing and the audio decoder.
The audio reproduction entity 140 may comprise, for example, headphones, a headset, a loudspeaker or an arrangement of one or more loudspeakers. Instead of using the audio capturing entity 110, the audio processing system 100 may include a storage means for storing pre-captured or pre-created audio signals, among which the input audio signal 115 for provision to the audio encoding entity 120 can be selected. Instead of using the audio reproduction entity 140, the audio processing system 100 may comprise a storage means for storing the reconstructed audio signal 135 for subsequent analysis, processing, playback and/or transmission to a further entity.
The dotted vertical line in Figure 1 serves to denote that, typically, the audio encoding entity 120 and the audio decoding entity 130 are provided in separate devices that may be connected to each other via a network or via a transmission channel. The network/channel may enable a wireless connection, a wired connection or a combination of the two between the audio encoding entity 120 and the audio decoding entity 130. As an example in this regard, the audio encoding entity 120 may further comprise a (first) network interface for encapsulating the encoded audio signal 125 into a sequence of protocol data units (PDUs) for transfer to the decoding entity 130 over a network/channel, whereas the audio decoding entity 130 may further comprise a (second) network interface for decapsulating the encoded audio signal 125 from the sequence of PDUs received from the audio encoding entity 120 over the network/channel. In the following, operation of some elements of the audio processing system 100 is described via more detailed examples by assuming that the input audio signal 115 includes a single audio channel. This, however, is a non-limiting example that has been adopted for clarity and brevity of description, and in other examples the input audio signal 115 may comprise a multi-channel signal (e.g. a stereo signal) that comprises two or more separate audio channels. The following examples outline a few possibilities for applying the single-channel processing described in the following to a multi-channel input audio signal 115:
- The audio encoding entity 120 may separately process each channel of the input audio signal 115 into a respective channel of the encoded audio signal 125, while the channels of the encoded audio signal 125 are processed in the audio decoding entity 130 into respective channels of the reconstructed audio signal 135. In this regard, the processing of a single channel in the audio encoding entity 120 and the audio decoding entity 130 may follow the approach according to the respective examples provided in the following for a single-channel input audio signal 115.
- The audio encoding entity 120 may jointly process one or more channels of the input audio signal 115 into a channel of the encoded audio signal 125, while channels of the encoded audio signal 125 are processed in the audio decoding entity 130 into a desired number of reconstructed audio channels for provision as the reconstructed audio signal 135. As a more detailed example in this regard, the audio encoding entity 120 may process one or more derived audio signals that are derived from channels of the input audio signal 115 into respective encoded derived audio signals for provision as the encoded audio signal 125 or as part thereof, whereas the audio decoding entity 130 may process one or more encoded derived audio signals received in the encoded audio signal 125 into one or more channels of the reconstructed audio signal 135. As a particular example, a derived audio signal in the audio encoding entity 120 comprises a downmix signal derived e.g. as a sum or as an average of two or more channels of the input audio signal 115, and the audio encoding entity 120 further derives, for two or more channels, a respective set of (one or more) audio parameters that are descriptive of the difference between the downmix signal and a respective channel of the input audio signal 115 for inclusion in the encoded audio signal 125. The audio decoding entity 130 decodes the encoded downmix signal and applies, for the two or more channels, the respective set of audio parameters to reconstruct the respective channel of the reconstructed audio signal 135.
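The downmix-based approach above can be sketched as follows. This is an illustrative Python sketch only: the function names, the plain-average downmix rule and the projection-based channel gain are assumptions for illustration, not taken from the embodiment.

```python
def downmix(left, right):
    """Derive a mono downmix signal as the sample-wise average of two channels."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]

def channel_gain(channel, mix):
    """One possible per-channel side parameter: the least-squares gain that
    best reconstructs the channel from the downmix (hypothetical choice)."""
    num = sum(c * m for c, m in zip(channel, mix))
    den = sum(m * m for m in mix) or 1.0
    return num / den
```

A decoder-side sketch would then reconstruct each channel as the decoded downmix scaled by the transmitted gain.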
Figure 2 illustrates a block diagram of some components and/or entities of an audio encoder 121 that may be provided as part of the audio encoding entity 120 according to an example. The audio encoding entity 120 may include further components or entities in addition to the audio encoder 121, e.g. the pre-processing entity referred to in the foregoing, which pre-processing entity may be arranged to process the input audio signal 115 before passing it to the audio encoder 121. The audio encoder 121 carries out encoding of the input audio signal 115 into the encoded audio signal 125, in other words the audio encoder 121 implements a transform from the signal domain to the encoded domain. The audio encoder 121 may be arranged to process the input audio signal 115 as a sequence of input frames, each input frame including a digital audio signal at a predefined sampling frequency and comprising a time series of input samples. Typically, the audio encoder 121 employs a fixed predefined frame length. In other examples, the frame length may be a selectable frame length that may be selected from a plurality of predefined frame lengths, or the frame length may be an adjustable frame length that may be selected from a predefined range of frame lengths. A frame length may be defined as the number of samples L included in the frame, which at the predefined sampling frequency maps to a corresponding duration in time.
As an example in this regard, the audio encoder 121 may employ a fixed frame length of 1 ms and sampling frequency of 48 kHz, resulting in frames of L=48 samples. These values, however, serve as non-limiting examples and different frame length and/or sampling frequency may be employed instead, depending e.g. on the desired audio bandwidth, on desired framing delay and/or on available processing capacity.
The audio encoder 121 processes the input audio signal 115 through a linear predictive coding (LPC) encoder 122, a long-term prediction (LTP) encoder 124 and a residual encoder 126. The LPC encoder 122 carries out an LPC encoding procedure to process the input audio signal 115 into a first residual signal 123, which is provided as input to the LTP encoder 124. The LTP encoder 124 carries out LTP encoding to process the first residual signal 123 into a second residual signal 127, which is provided as input to the residual encoder 126. The residual encoder 126 carries out a residual encoding procedure to process the second residual signal 127 into the encoded audio signal 125 for provision to the decoding means (and/or for storage by a storage means).
LPC encoding in general is a coding technique well known in the art and it makes use of short-term redundancies in the input audio signal 115. Along similar lines, LTP encoding in general is a technique known in the art, and it makes use of long(er) term redundancies (e.g. in a range above approximately 2 ms) in the input audio signal 115: while the LPC encoder 122 is typically successful in modeling any short-term redundancies, possible long-term redundancies may still be present in the first residual signal 123, and hence the LTP encoder 124 may provide an improvement for encoding input audio signals 115 that include a periodic or a quasi-periodic signal component whose periodicity falls into the range of long(er) term redundancies. A typical example of an audio signal that includes such a periodic or quasi-periodic signal component is human voice (especially during time periods of voiced sound that typically represent vowel sounds of human speech).
In a signal path through the LPC encoder 122, the LTP encoder 124 and the residual encoder 126, the input audio signal 115 is processed into the encoded audio signal 125 frame by frame. In other words, in the signal path the LPC encoder 122 carries out the LPC encoding for a frame of the input audio signal 115 and produces a corresponding frame of the first residual signal 123, which is processed by the LTP encoder 124 into a corresponding frame of the second residual signal 127, which in turn is processed by the residual encoder 126 into a corresponding frame of the encoded audio signal 125. Respective non-limiting examples of operation of the LPC encoder 122, the LTP encoder 124 and the residual encoder 126 outlined above are provided in the following.
The LPC encoder 122 carries out an LPC analysis based on past values of the reconstructed audio signal 135 using a backward prediction technique known in the art. To enable access to the past values of the reconstructed audio signal 135, a 'local' copy of the reconstructed audio signal 135 may be stored in a past audio buffer, which may be provided e.g. in a memory in the audio encoder 121 or in the LPC encoder 122, thereby making the reconstructed audio signal 135 available for the LPC analysis in the LPC encoder 122. Hence, the references to the reconstructed audio signal 135 in context of the audio encoder 121 refer to the local copy available therein. This aspect will be described in more detail later below.
In the LPC analysis, the LPC encoder 122 may determine the LPC filter coefficients e.g. by minimizing the error term

e(t) = ||Σ_{i=0..K_LPC} a_i x̂(t − i)||, t = t̄ − N_LPC + 1 : t̄,

where a_i, i = 0 : K_LPC, with a_0 = 1, denote the LPC filter coefficients, N_LPC denotes the analysis window length (in number of samples), x̂(t), t = t̄ − N_LPC : t̄, denotes a signal reconstructed on basis of one or more past frames of the encoded audio signal, i.e. the most recent samples of the reconstructed audio signal 135, and the symbol ||·|| denotes an applied norm, e.g. the Euclidean norm.
The backward prediction computes LPC filter coefficients on basis of past samples of the reconstructed audio signal 135 and carries out LPC analysis filtering for a frame of the input audio signal 1 15 using the computed LPC filter coefficients to produce a corresponding frame of the first residual signal 123. In other words, the LPC analysis filtering involves processing a time series of input samples into a corresponding time series of first residual samples. The LPC encoder 122 passes the first residual signal 123 to the LTP encoder 124 for computation of the second residual signal 127 therein. The LPC analysis filtering to compute the first residual signal 123 on basis of the input audio signal 1 15 may be carried out e.g. according to the following equation:
r1(t) = Σ_{i=0..K_LPC} a_i x(t − i), t = t̄ + 1 : t̄ + L,

where a_i, i = 0 : K_LPC, with a_0 = 1, denote the LPC filter coefficients, L denotes the frame length (in number of samples), x(t), t = t̄ + 1 : t̄ + L, denotes a frame of the input audio signal 115 (i.e. the time series of input samples), and r1(t), t = t̄ + 1 : t̄ + L, denotes a corresponding frame of the first residual signal 123 (i.e. the time series of first residual samples).
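The LPC analysis filtering that derives the first residual samples can be sketched in Python as follows. This is an illustrative sketch only: the coefficient values in the usage example are placeholders, whereas a real encoder would derive them through the backward prediction described above.

```python
def lpc_analysis_filter(x_hist, frame, a):
    """Compute first-residual samples r1(t) = sum_i a[i] * x(t - i), where
    a[0] == 1 and x(t - i) falls back on past samples (x_hist) for the first
    few samples of the frame.  x_hist must hold at least len(a) - 1 samples."""
    x = list(x_hist) + list(frame)          # past samples followed by current frame
    n0 = len(x_hist)
    r1 = []
    for t in range(n0, n0 + len(frame)):
        r1.append(sum(a[i] * x[t - i] for i in range(len(a))))
    return r1
```

For example, with the placeholder filter a = [1, −1] (a first-difference predictor) a constant-slope input produces a constant residual.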
In an example, the backward prediction in the LPC encoder 122 employs a predefined window length N_LPC, implying that the backward prediction bases the LPC analysis on the N_LPC most recent samples x̂(t), t = t̄ − N_LPC : t̄, of the reconstructed audio signal 135. In an example, the analysis window covers the 608 most recent samples of the reconstructed audio signal 135, which at the sampling frequency of 48 kHz corresponds to approx. 12.7 ms. This, however, is a non-limiting example and a shorter or longer window may be employed instead, e.g. a window having a duration of 16 ms or a duration selected from the range 12 to 30 ms. A suitable length/duration of the analysis window depends also on the existence and/or characteristics of other encoding components employed in the first audio encoding mode. The analysis window has a predefined shape, which may be selected in view of desired LPC analysis characteristics. Several analysis windows for the LPC analysis applicable for the LPC encoder 122 are known in the art, e.g. a (modified) Hamming window and a (modified) Hanning window, as well as hybrid windows such as the one specified in the ITU-T Recommendation G.728 (section 3.3). The LPC encoder 122 employs a predefined LPC model order, denoted as K_LPC, resulting in a set of K_LPC LPC filter coefficients. Since the LPC analysis in the LPC encoder 122 relies on past values of the reconstructed audio signal 135, there is no need to transmit parameters that are descriptive of the computed LPC filter coefficients to the decoding entity 130, but the decoding entity 130 is able to compute an identical set of LPC filter coefficients for LPC synthesis filtering therein on basis of the reconstructed audio signal 135 available in the audio decoding entity 130. Consequently, a relatively high LPC model order K_LPC may be employed since it does not have an effect on the resulting bit-rate of the encoded audio signal 125, thereby enabling accurate modeling of the spectral envelope of the input audio signal 115 especially for input audio signals 115 that include a periodic or a quasi-periodic signal component. On the other hand, the required computing capacity increases with increasing LPC model order K_LPC, and hence selection of the most appropriate LPC model order K_LPC for a given use case may involve a trade-off between the desired accuracy of modeling the spectral envelope of the input audio signal 115 and the available computational resources. As a non-limiting example, the LPC model order K_LPC may be selected as a value between 30 and 60.
In an example, the zero-input response of the LPC filter derived in the LPC encoder 122 may be removed from the first residual signal 123 before encoding the first residual signal 123 further in the LTP encoder 124 and the residual encoder 126. The zero-input response removal may be provided, for example, as part of the LPC encoder 122 (before passing the first residual signal 123 obtained by the LPC analysis filtering to the LTP encoder 124) or in the LTP encoder 124 (before carrying out an encoding procedure therein).
The zero input response may be calculated as

z(t) = −Σ_{i=1..K_LPC} a_i x̃(t − i), t = t̄ + 1 : t̄ + L,

where a_i, i = 1 : K_LPC, denote the LPC filter coefficients, L denotes the frame length (in number of samples), and x̃(t), t = t̄ − K_LPC + 1 : t̄, denotes a signal reconstructed on basis of one or more past frames of the encoded audio signal, i.e. the most recent samples of the reconstructed audio signal 135. The computation of the zero input response is a recursive process: for the first sample of the zero input response all x̃(t) refer to past samples of the reconstructed audio signal 135, whereas the following samples of the zero input response are computed at least in part using signal samples computed for the zero input response, i.e. x̃(t) = z(t) for t > t̄.
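The recursive zero-input-response computation can be sketched as follows. A hedged illustration: the coefficient values in the usage example are placeholders, and the formula follows the reconstruction above (synthesis-filter recursion seeded with past reconstructed samples), which is one plausible reading of the garbled original.

```python
def zero_input_response(a, x_past, L):
    """Zero input response z(t) = -sum_{i=1..K} a[i] * z(t - i), seeded with
    the K most recent reconstructed samples x_past; a[0] is assumed to be 1.
    Later samples reuse earlier zero-input-response samples (recursion)."""
    K = len(a) - 1
    buf = list(x_past[-K:])           # filter memory: most recent K samples
    z = []
    for _ in range(L):
        zt = -sum(a[i] * buf[-i] for i in range(1, K + 1))
        z.append(zt)
        buf.append(zt)                # recursion: reuse computed z samples
    return z
```

With the placeholder filter a = [1, −0.5] the response decays geometrically from the last reconstructed sample.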
After encoding a frame of the input audio signal 115 in the audio encoder 121, the calculated zero input response is added back to the reconstructed audio signal 135. Consequently, also in the audio decoding entity 130, after reconstructing a frame of the reconstructed audio signal 135 therein, the zero input response is added to the reconstructed audio signal 135, as will be described in the following.
The LTP encoder 124 carries out an LTP analysis based on past values of the reconstructed audio signal 135. Various approaches for carrying out the LTP analysis are known in the art. Since the computation in this regard is based on information of signal history, also the LTP analysis may be considered to constitute a backward prediction technique. To enable access to the past values of the first residual signal 123, the local copy of the reconstructed audio signal 135 required also for the backward predictive LTP analysis may be employed for this purpose. In the LTP analysis, the LTP encoder 124 may determine the LTP parameters, LTP lag d and LTP gain g, for example by finding values of the LTP lag d and LTP gain g that minimize the error term

e(t) = ||x̂(t) − g x̂(t − d)||, t = t̄ : t̄ + L − 1,

where L denotes the frame length (in number of samples), x̂(t), t = t̄ − d_max : t̄ + L − 1, denotes a signal reconstructed on basis of one or more past frames of the encoded audio signal, i.e. the most recent samples of the reconstructed audio signal 135, and the symbol ||·|| denotes an applied norm, e.g. the Euclidean norm. The determination of the LTP parameters may consider values of d in a predefined range from d_min to d_max in the procedure of searching the LTP parameters that minimize the above error term. In the equation above, the value of the LTP lag d is expressed as a number of samples, and the values d_min and d_max that define the predefined range may be set, in dependence of the applied sampling frequency, such that they cover e.g. a value range that corresponds to LTP lag values d from approximately 2 ms to approximately 20 ms. In order to ensure matching LTP analysis to be carried out in the audio decoding entity 130, the value of d_min may be set to a value that excludes LTP lag values d that are shorter than the frame length L from consideration.
For speech signals and especially for voiced segments thereof, the LTP lag d typically corresponds to the pitch period of the speech signal carried by the input audio signal 115.
Once found, the respective values of the LTP lag d and LTP gain g may be applied in the LTP encoder 124 to carry out LTP analysis filtering of a frame of the first residual signal 123 into a corresponding frame of the second residual signal 127. In other words, the LTP analysis filtering involves processing a time series of first residual samples into a corresponding time series of second residual samples. The LTP encoder 124 passes the second residual signal 127 to the residual encoder 126 for derivation of the encoded audio signal 125 therein. The LTP analysis filtering to compute the second residual signal 127 on basis of the first residual signal 123 may be carried out e.g. according to the following equation:

r2(t) = r1(t) − g r1(t − d), t = t̄ : t̄ + L − 1,

where L denotes the frame length (in number of samples), r1(t), t = t̄ : t̄ + L − 1, denotes a frame of the first residual signal 123 and r2(t), t = t̄ : t̄ + L − 1, denotes a frame of the second residual signal 127.
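The LTP parameter search and analysis filtering described above can be sketched as a simple exhaustive search. This is a hedged illustration only: a least-squares per-lag gain and a squared-error criterion are assumed here, whereas the embodiment leaves the exact search strategy open.

```python
def ltp_analysis(r1_hist, r1_frame, d_min, d_max):
    """Search the lag d and gain g minimizing ||r1(t) - g*r1(t-d)||^2 over
    the frame, then apply the LTP analysis filter r2(t) = r1(t) - g*r1(t-d)."""
    hist = list(r1_hist) + list(r1_frame)
    n0, L = len(r1_hist), len(r1_frame)
    best = (0.0, d_min, float("inf"))            # (gain, lag, error)
    for d in range(d_min, d_max + 1):
        past = [hist[n0 + t - d] for t in range(L)]
        den = sum(p * p for p in past)
        g = sum(p * c for p, c in zip(past, r1_frame)) / den if den else 0.0
        err = sum((c - g * p) ** 2 for c, p in zip(r1_frame, past))
        if err < best[2]:
            best = (g, d, err)
    g, d, _ = best
    r2 = [r1_frame[t] - g * hist[n0 + t - d] for t in range(L)]
    return d, g, r2
```

For a perfectly periodic residual the search recovers the period as the lag, a gain of one, and an all-zero second residual.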
Although described as part of the exemplifying audio encoder 121 depicted in Figure 2, in other examples the audio encoder 121 may be provided without the LTP encoder 124. In such a scenario the residual encoder 126 may carry out the residual encoding procedure on basis of the first residual signal 123 instead of the second residual signal 127. Alternatively, such scenario may, at least conceptually, involve copying the first residual signal 123 into the second residual signal 127 for use as basis for the residual encoding procedure in the residual encoder 126.
In a further example, the LTP encoder 124 is applied to carry out the LTP analysis for each frame of the first residual signal 123, but the basis for the residual encoding in the residual encoder 126 for a given frame is selected in dependence of the performance of the LTP encoder 124. As an example in this regard, the LTP encoder 124 may select one of the first residual signal 123 and the second residual signal 127 on basis of a selected norm, e.g. the Euclidean norm: the LTP encoder 124 may compute a first norm as a norm of (a frame of) the first residual signal 123 and a second norm as a norm of (the corresponding frame of) the second residual signal 127. The second residual signal 127 is selected as basis for the residual encoding in response to the first norm exceeding the second norm, whereas the first residual signal 123 is selected as basis for the residual encoding otherwise. In a variation of such an example, the second residual signal 127 is selected as basis for the residual encoding in response to the first norm multiplied by a weighting factor that is smaller than unity exceeding the second norm, whereas the first residual signal 123 is selected as basis for the residual encoding otherwise. In other words, the selection involves selecting whether or not to apply the LTP encoding for the given frame of the input signal. In such an approach, the encoded parameters that are transmitted to the audio decoding entity 130 include, for the given frame, an indication of the selection (i.e. whether the LTP encoding has been applied or not).
The residual encoder 126 carries out a residual encoding procedure that involves deriving encoded residual parameters on basis of the second residual signal 127. The residual encoding may employ, for example, a gain-shape coding technique (e.g. a gain-shape encoder), wherein relative amplitudes of samples in a source vector v_r(j), j = 1 : L, are encoded separately from a gain g_r of the source vector v_r(j), j = 1 : L, thereby resulting in encoded parameters that include pieces of information that identify a codevector that represents the source vector v_r(j), j = 1 : L, and the gain value g_r, where a reconstructed version of the second residual signal 127 is formed by multiplying each relative amplitude value in the source vector v_r(j), j = 1 : L, by the gain value g_r.
The residual encoder 126 may be arranged to convert a frame of the second residual signal 127 from the time domain into a transform domain by using a predefined transform. In an example, the predefined transform may comprise the discrete cosine transform (DCT). In other examples, the predefined transform may comprise another energy-compacting transform known in the art, such as the modified discrete cosine transform (MDCT), the discrete sine transform (DST), etc. In the following, we refer to the second residual signal 127 converted into a transform domain by DCT (or by another transform known in the art) as the transformed residual signal C, whereas a frame of the transformed residual signal C is referred to as c(j), j = 1 : L. Herein, L (also) denotes the length of the transform, such that a frame of the second residual signal 127 of length L time-domain samples is transformed into L transform domain samples c(j), j = 1 : L, that constitute the frame of the transformed residual signal.
In an example, the gain-shape coding technique applied for encoding a frame of the transformed residual signal c(j) finds the source vector v_r(j), j = 1 : L, and the gain g_r that represent the frame of the transformed residual signal c(j) and makes use of a suitable vector quantizer in finding a quantized version of the source vector v_r(j), j = 1 : L, whereas a quantized value of the gain g_r may be derived separately e.g. by using a suitable scalar quantizer. The quantized source vector may be denoted as v̂_r(j), j = 1 : L, and it may be identified by a codeword Idx_v, whereas the quantized gain may be denoted as ĝ_r and it may be identified by a codeword Idx_g. The quantized source vector v̂_r(j), j = 1 : L, may be also referred to as a reconstructed vector.
In another example, the frame of the transformed residual signal c(j) is weighted to determine a frame of a weighted transformed residual signal c̃(j), and the encoding procedure involves finding a pair of the source vector v_r(j), j = 1 : L, and the gain g_r that represent the frame of the weighted transformed residual signal c̃(j), wherein the weighting may be applied, for example, in the following manner:

c̃(j) = w(j) c(j), j = 1 : L,
w(1) = MAX_W; w(j) = f w(j − 1), j = 2 : L; 0 < f < 1,

where w(j), j = 1 : L, denotes the weights applied to the respective individual transform domain samples c(j), j = 1 : L, where MAX_W denotes the weight applied to the first transform domain sample c(1) and where f denotes the weighting coefficient. As non-limiting illustrative examples, MAX_W may be set to value 2 and f may be set to value 0.98. If weighting of the transformed residual signal C is applied, the residual encoding procedure described in more detail in the following is based on the vector that represents the frame of the weighted transformed residual signal c̃(j) instead of the vector that represents the frame of the transformed residual signal c(j).
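The weighting rule can be sketched as follows. Note that the geometric decay w(j) = f·w(j−1), giving w(j) = MAX_W·f^(j−1), is one plausible reading of the partly garbled formula above; the parameter values match the illustrative MAX_W = 2, f = 0.98 of the text.

```python
def weight_vector(L, max_w=2.0, f=0.98):
    """Build weights w(1) = MAX_W, w(j) = f * w(j-1), i.e. geometric decay."""
    w = [max_w]
    for _ in range(L - 1):
        w.append(w[-1] * f)
    return w

def apply_weighting(c, max_w=2.0, f=0.98):
    """Weight a frame of transform-domain samples sample by sample."""
    return [wj * cj for wj, cj in zip(weight_vector(len(c), max_w, f), c)]
```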
The vector quantizer referred to in the foregoing serves to quantize the L-dimensional source vector v_r(j), j = 1 : L, derived from the vector c(j), j = 1 : L (or from the vector c̃(j), j = 1 : L), that represents the current frame of the second residual signal 127 into the respective quantized source vector v̂_r(j), j = 1 : L, such that quantization distortion according to a predefined criterion is minimized. As an example, the vector quantizer may employ a pyramidally truncated lattice quantizer. In the example case of the frame length of L=48 samples (i.e. 1 ms if assuming 48 kHz sampling frequency) a pyramidally truncated Z48 lattice quantizer may be applied, e.g. one described in the article by Thomas R. Fischer titled "A Pyramid Vector Quantizer", IEEE Transactions on Information Theory, Vol. 32, Issue 4, pp. 568-583, July 1986, ISSN 0018-9448. The frame length L=48 and the pyramidally truncated Z48 lattice serve as non-limiting and illustrative examples and a different frame length and/or a different lattice quantizer may be applied instead. However, for clarity and brevity of description, the frame length L=48 and the pyramidally truncated Z48 lattice are applied in the following as representative examples to illustrate various details and variations of the residual encoding according to some embodiments of the present invention.
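A greatly simplified stand-in for such a pyramidally truncated Z-lattice quantizer can be sketched as follows. This greedy nearest-point search merely illustrates the idea of restricting the quantized vector to an L1 norm of at most K; it is not the codebook search of the cited article nor the embodiment's actual quantizer.

```python
def quantize_l1_truncated(v, K):
    """Round v to the nearest integer vector; if its L1 norm exceeds K,
    repeatedly shrink the component whose reduction increases the squared
    error the least, until the L1 norm is at most K."""
    q = [round(x) for x in v]
    while sum(abs(t) for t in q) > K:
        def delta(i):
            s = 1 if q[i] > 0 else -1
            # squared-error increase from moving q[i] one step toward zero
            return 2 * s * (v[i] - q[i]) + 1
        j = min((i for i in range(len(q)) if q[i] != 0), key=delta)
        q[j] -= 1 if q[j] > 0 else -1
    return q
```

For instance, with K = 2 the vector (2.4, −1.2, 0.3) first rounds to (2, −1, 0) with L1 norm 3, and the cheapest shrink removes the −1 component.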
The residual encoding procedure encodes the source vector v_r(j), j = 1 : L, that represents the shape of the frame of the second residual signal 127 using at most B bits. In an example, the number of bits B is a predefined fixed value. In another example, the number of bits B may be selected or defined on a frame-by-frame basis. Non-limiting examples for an applicable number of bits B are provided in the following.
In an example, a search procedure is carried out to find the quantized source vector v̂_r(j), j = 1 : L, that minimizes the quantization distortion. The search procedure may also consider a suitable value for the gain g_r. In this regard, the search procedure may involve testing a plurality of candidate values for the source vector v_r(j), j = 1 : L, and the gain g_r and selecting the pair of the quantized source vector v̂_r(j), j = 1 : L, and the gain g_r that minimizes the quantization distortion according to the predefined criterion. In this regard, the search procedure may involve testing a plurality of candidate values for a scaling factor g_s,i and for each candidate value computing the respective candidate source vector as v_r,i(j) = g_s,i c(j), j = 1 : L, using the Z48 lattice quantizer to quantize v_r,i(j) into a corresponding candidate quantized source vector v̂_r,i(j), finding a corresponding candidate gain g_r,i and deriving a resulting candidate reconstructed frame of the residual signal as ĉ_i(j) = g_r,i v̂_r,i(j), j = 1 : L.
Consequently, the pair of the candidate quantized source vector v̂_r,i(j) and the corresponding candidate gain g_r,i that result in minimizing the quantization distortion among the tested candidate values for the scaling factor g_s,i are selected as the pair of the quantized source vector v̂_r(j), j = 1 : L, and the gain g_r that represent the current frame of the transformed residual signal c(j) (or the weighted transformed residual signal c̃(j)). Herein, the gain g_r is the unquantized value, whereas the quantized gain ĝ_r and the respective codeword Idx_g may be derived separately using the scalar quantizer (as already referred to in the foregoing). The predefined criterion employed in computing the quantization distortion may comprise, for example, the Euclidean distance between the candidate reconstructed frame of the residual signal ĉ_i(j), j = 1 : L, and the frame of the transformed residual signal c(j) (or the weighted transformed residual signal c̃(j)).
In the example search procedure outlined in the foregoing, the candidate scaling factors gs,i may be computed using the following equation:

gs,i = (i · R) / (imax · ||c(j)||), i = imin : imax,

where imin and imax denote respective predefined minimum and maximum values for i (e.g. such that imin = 1 and imax = 20), R denotes a predefined maximum L1 norm applied in the lattice quantization by the Z48 lattice quantizer, and ||c(j)|| denotes the L1 norm of the frame of the transformed residual signal c(j).
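As an illustration only, the candidate search outlined above may be sketched as follows. Here `lattice_quantize` is a stand-in for the pyramidally truncated Z48 lattice quantizer (any rounding-style quantizer will do for the sketch), the candidate gain is taken as the least-squares optimal gain for each candidate quantized shape, which is one possible realization, and the Euclidean distance serves as the distortion criterion:

```python
import numpy as np

def search_shape_gain(c, lattice_quantize, scale_candidates):
    """Pick the (quantized shape, gain) pair minimizing Euclidean distortion.

    lattice_quantize is a stand-in for the pyramidally truncated Z48
    lattice quantizer; scale_candidates are the trial scaling factors
    gs,i applied to the transformed residual frame c.
    """
    best = None
    for gs in scale_candidates:
        v_hat = lattice_quantize(gs * c)       # candidate quantized shape
        energy = np.dot(v_hat, v_hat)
        if energy == 0:                        # skip degenerate all-zero shapes
            continue
        gr = np.dot(c, v_hat) / energy         # least-squares gain for this shape
        dist = np.sum((c - gr * v_hat) ** 2)   # Euclidean distortion criterion
        if best is None or dist < best[0]:
            best = (dist, v_hat, gr)
    return best[1], best[2]
```

The least-squares gain is used here because, for a fixed shape, it minimizes the Euclidean distortion directly; the source leaves the exact gain derivation open.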
An example of the residual encoding procedure is illustrated by the flowchart 300 depicted in Figure 3. The residual encoding procedure according to this example commences with initial quantization of the source vector vr(j), j = 1:L using the pyramidally truncated Z48 lattice quantizer by applying a predefined maximum norm K (e.g. an L1 norm), as indicated in block 302. Application of the predefined maximum norm K implies quantization that is limited to make use of those shells of the pyramidally truncated Z48 lattice whose norm is at most K. The initial quantization results in an initial quantized vector v1(j), j = 1:L.
The procedure continues with detecting the number of zero-valued elements k at the end of the initial quantized vector v1(j), as indicated in block 304. If k equals zero, i.e. if the last element v1(L) of the initial quantized vector is non-zero, the initial quantized vector v1(j) is selected to represent the current frame of the second residual signal 127, as indicated in block 308, and a codeword Idx1 that identifies the initial quantized vector v1(j) is computed and included in the encoded parameters as the codeword Idxv.
In contrast, if the k last elements of the initial quantized vector v1(j) are zero (where k > 0), the residual encoding procedure proceeds to block 310 to re-quantize the first L-k elements of the source vector vr(j), j = 1:L-k using the pyramidally truncated Z48 lattice quantizer, this time by applying a modified maximum norm K' (e.g. an L1 norm) that is larger than or equal to the predefined maximum norm K (i.e. K' >= K). The first L-k elements of the source vector vr(j), j = 1:L-k may be referred to as a shortened source vector. The re-quantization results in a re-quantized vector v2(j), j = 1:L-k that will be selected to represent the current frame of the second residual signal 127, as indicated in block 312, and a codeword Idx2 that identifies the re-quantized vector v2(j), j = 1:L-k is computed and included in the encoded parameters as the codeword Idxv.
Using the modified maximum norm K' that is larger than or equal to the predefined maximum norm K while performing the quantization on the shortened source vector vr(j), j = 1:L-k enables more accurate modeling of the vector c(j) subject to quantization, while using the same or substantially the same number of bits B for the quantization as the initial quantization of block 302.
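The selection logic of blocks 302 to 312 may be sketched as below. The functions `lattice_quantize(v, K)` and `modified_norm(dim)` are placeholders for, respectively, the pyramidally truncated Z48 lattice quantizer restricted to maximum norm K and the norm mapping function; neither is defined by the source in code form:

```python
def count_trailing_zeros(v):
    # k: number of consecutive zero-valued elements at the end of the vector
    k = 0
    for x in reversed(v):
        if x != 0:
            break
        k += 1
    return k

def encode_residual(vr, lattice_quantize, norm_K, modified_norm):
    # lattice_quantize(v, K) and modified_norm(dim) are stand-ins for the
    # pyramidally truncated Z48 lattice quantizer and the norm mapping.
    v1 = lattice_quantize(vr, norm_K)               # block 302: initial quantization
    k = count_trailing_zeros(v1)                    # block 304: detect trailing zeros
    if k == 0:
        return v1, k                                # block 308: keep initial vector
    K_mod = modified_norm(len(vr) - k)              # block 314: derive K' >= K
    v2 = lattice_quantize(vr[:len(vr) - k], K_mod)  # blocks 310/316: re-quantize
    return v2, k                                    # block 312: shortened vector
```

The value of k returned alongside the selected vector corresponds to the k transmitted to the decoder in the residual encoding parameters.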
Figure 4 depicts a flowchart 400 that provides an example of the re-quantization of block 310, i.e. quantization of the shortened source vector vr(j), j = 1:L-k by applying the modified maximum norm K' while applying the same maximum number of bits B as in the initial quantization. The re-quantization commences by determining a value of the modified maximum norm K', as indicated in block 314. The modified maximum norm K' may be selected in dependence on the number of bits B and the dimension L-k of the shortened source vector vr(j), j = 1:L-k that is subject to quantization. The selection of the modified maximum norm K' may be provided e.g. by a predefined mapping function that returns a suitable value of the modified maximum norm K' in dependence on the given values of the number of bits B and the vector dimension L-k. As an example, such a mapping function may be provided via a mapping table that stores the respective number of bits Bm for a plurality of pairs of a maximum norm Km and a vector dimension Lm, and searching the mapping table in the following manner:
- Find the number of bits Br defined for the predefined maximum norm K and the vector dimension L;
- Identify the highest maximum norm Kr for which the number of bits Bm for vector dimension L-k does not exceed Br; and
- Select Kr as the modified maximum norm K'.
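The table search above may be sketched as follows; `bits_table` is a hypothetical dictionary encoding of a mapping table such as the one of Figure 5, with (Km, Lm) pairs as keys and the bits Bm as values (the entries used in the test follow the worked example of the description, except for the non-selected values, which are assumed):

```python
def modified_max_norm(bits_table, K, L, k):
    """Derive the modified maximum norm K' from a mapping table.

    bits_table[(Km, Lm)] holds the number of bits Bm needed for lattice
    quantization with maximum norm Km and vector dimension Lm (cf. Figure 5).
    """
    Br = bits_table[(K, L)]                  # bits defined for (K, L)
    candidates = [Km for (Km, Lm), Bm in bits_table.items()
                  if Lm == L - k and Bm <= Br]
    return max(candidates)                   # highest norm whose bits do not exceed Br
```

Because the decoder holds the same table and receives k, it can repeat this search and recover K' without any extra signaling.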
Once the modified maximum norm K' has been determined, the shortened source vector vr(j), j = 1:L-k is quantized using the pyramidally truncated Z48 lattice by applying the modified maximum norm K', as indicated in block 316. This results in the re-quantized vector v2(j), j = 1:L-k, which will be selected to represent the current frame of the second residual signal 127 and which is identified by the codeword Idx2. The residual encoding procedure, e.g. the one illustrated by the flowchart 300 depicted in Figure 3, results in providing residual encoding parameters including the codeword Idxg that identifies the quantized gain, a codeword Idxv that identifies the selected one of the quantized vectors v1(j) and v2(j), and the value of k. The residual encoding parameters are provided for inclusion in the encoded parameters for transmission to the decoding entity 130 for the audio decoding procedure therein. In particular, in case k = 0 the codeword Idxv = Idx1 is provided in the residual encoding parameters, whereas for cases where k > 0 the codeword Idxv = Idx2 is provided in the residual encoding parameters. In the decoding entity 130, the codeword Idxv and the value of k, together with a priori knowledge of the number of bits B, the length L of the source vector vr(j), j = 1:L and the predefined maximum norm K, along with access to the above-mentioned predefined mapping function, provide sufficient information to derive the value of the modified maximum norm K' for the decoding procedure. This aspect will be discussed in more detail in the following as part of the description of the decoding entity 130. A non-limiting example of a mapping table referred to in the foregoing is provided in Figure 5. Each row of the mapping table represents a given maximum norm Km, whereas each column of the mapping table represents a given vector dimension Lm.
Each cell of the mapping table indicates the number of bits required for lattice quantization using the respective maximum norm Km and vector dimension Lm. As an example, if there are 92 bits available for quantization of the source vector vr(j), j = 1:L, one sees from the mapping table of Figure 5 that the predefined maximum norm K for vector dimension Lm = 48 is 30 (cf. the cell with light gray background). Assuming that k = 5, i.e. that the five last elements of the initial quantized vector v1(j), j = 1:L are zero-valued, the vector dimension of the shortened source vector vr(j), j = 1:L-k is 48-5 = 43. From the mapping table of Figure 5 one can see that for vector dimension Lm = 43 the highest maximum norm Km for which the number of bits is at most 92 is 32 (cf. the cell with dark gray background), which in this example would hence serve as the modified maximum norm K' for quantization of the shortened source vector vr(j), j = 1:L-k. In the following, some further considerations concerning the maximum norms Km, vector dimensions Lm and respective numbers of bits Bm stored in the mapping table, e.g. the exemplifying mapping table of Figure 5, are provided. Let S(n, k) denote the number of lattice points on a pyramidal shell of norm k of a lattice Zn. The pyramidal shell of norm k of the lattice Zn contains all lattice points having an L1 norm equal to k. A pyramidal lattice truncation to norm k implies truncation of the lattice Zn such that only those pyramidal shells that have a norm smaller than or equal to k are considered.
The number of lattice points at the shell of the pyramidal lattice Zn that has norm k may be computed based on the following equations:
S(n, k) = S(n-1, k) + S(n-1, k-1) + S(n, k-1)
S(n, 0) = 1;  S(1, k) = 2 for k >= 1
Consequently, the number of lattice points in a pyramidal truncation of the lattice Zn to norm k may be expressed as

N(n, k) = Σi=0..k S(n, i).
Moreover, the number of bits required to uniquely indicate a lattice point in a pyramidal truncation of the lattice Zn to norm k may be computed as
B(n, k) = ⌈log2(N(n, k))⌉,

where the symbol ⌈x⌉ denotes rounding to the smallest integer value that is larger than or equal to x. The audio encoder 121 stores at least a predefined number of most recent samples of the reconstructed audio signal 135 to enable the backward prediction in the LPC encoder 122. As described in the foregoing, this may be implemented by generating a local copy of the reconstructed audio signal 135 in the audio encoder 121 and storing the local copy of the reconstructed audio signal 135 in the past audio buffer in the LPC encoder 122 or otherwise within the audio encoder 121. In this regard, the audio encoder 121 may further comprise a local audio synthesis element that is arranged to generate the local copy of the reconstructed audio signal 135 for the current frame and to update the past audio buffer by discarding the L oldest samples therein and inserting the samples that constitute the local copy of the reconstructed audio signal 135 in the past audio buffer to facilitate audio encoder 121 operation for processing of the next frame of the audio input signal 115.
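The shell-count recursion S(n, k), the truncation count N(n, k) and the resulting bit count can be checked directly with a short sketch:

```python
from functools import lru_cache
from math import ceil, log2

@lru_cache(maxsize=None)
def shell_points(n, k):
    # S(n, k): points of Z^n lying on the pyramidal shell of L1 norm exactly k
    if k == 0:
        return 1
    if n == 1:
        return 2  # the two points +k and -k
    return (shell_points(n - 1, k)
            + shell_points(n - 1, k - 1)
            + shell_points(n, k - 1))

def truncation_points(n, k):
    # N(n, k): points of Z^n with L1 norm <= k (pyramidal truncation to norm k)
    return sum(shell_points(n, i) for i in range(k + 1))

def bits_needed(n, k):
    # Smallest number of bits that uniquely indexes any point in the truncation
    return ceil(log2(truncation_points(n, k)))
```

For instance, in Z2 the shell of norm 1 holds the four points (±1, 0) and (0, ±1), so N(2, 1) = 5 including the origin, and 3 bits suffice to index the truncation.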
The past audio buffer stores at least the number of most recent samples of the reconstructed audio signal 135 needed to cover the analysis window applied by the LPC encoder 122. In case the LTP encoder 124 is available in the audio encoder, the past audio buffer may store at least the dmax most recent samples of the reconstructed audio signal 135 to enable evaluation of LTP lag values up to dmax.
Figure 6 illustrates a block diagram of some components and/or entities of an audio decoder 131 that may be provided as part of the audio decoding entity 130 according to an example. The audio decoder 131 carries out decoding of the encoded audio signal 125 into the reconstructed audio signal 135, thereby serving to implement a transform from the encoded domain (back) to the signal domain and, in a way, reversing the encoding operation carried out in the audio encoder 121. In the audio decoder 131, a residual decoder 136 carries out a residual decoding procedure to process the encoded audio signal 125 into a reconstructed second residual signal 137, which is provided as input to an LTP decoder 134. The LTP decoder 134 carries out an LTP decoding procedure to generate a reconstructed first residual signal 133 for provision as input to an LPC decoder 132, which in turn carries out LPC synthesis on basis of the reconstructed first residual signal 133 to output the reconstructed audio signal 135. The audio decoder 131 processes the encoded audio signal 125 frame by frame.
The residual decoding procedure in the residual decoder 136 involves computing the reconstructed second residual signal 137 on basis of the encoded audio signal 125. A frame of the reconstructed second residual signal 137 is provided as a respective time series of reconstructed second residual samples. In order to enable meaningful reconstruction of the residual signal, the residual decoder 136 must employ the same or otherwise matching residual coding technique as employed in the residual encoder 126. The residual decoding procedure involves dequantizing residual encoding parameters received as part of the encoded audio signal 125 and using the dequantized parameters to create the frame of the reconstructed second residual signal 137, i.e. the time series of reconstructed second residual samples.
In an example, the encoded audio signal 125 includes the residual encoding parameters described in the foregoing, i.e. the codewords Idxg and Idxv and the value of k, where the codeword Idxg identifies the quantized gain, the codeword Idxv identifies a vector of the lattice codebook that represents the current frame, and k indicates the number of zero-valued elements at the end of the initial quantized vector v1(j) as detected in the audio encoder 121. In addition to these received residual encoding parameters, the residual decoder 136 further has a priori knowledge of the number of bits B available for quantization of a frame of the second residual signal 127 and the length L, as well as access to the predefined mapping function that returns a suitable value of the norm (e.g. the predefined maximum norm K or the modified maximum norm K') in dependence on the given values of the number of bits B and the vector dimension L-k. In case the received value k equals zero, the residual decoder 136 may directly dequantize the received codeword Idxv into a reconstructed vector vr(j), j = 1:L by using the pyramidally truncated Z48 lattice (de)quantizer (possibly in view of the predefined maximum norm K), the dequantization thereby resulting in the reconstructed vector vr(j) = v1(j), j = 1:L. In case the received value k is larger than zero, the residual decoder 136 defines the value of L-k by using the received value of k and may employ the predefined mapping function to derive the modified maximum norm K' employed in the residual encoder 126 in generation of the received codeword Idxv. This can be carried out by using a predefined mapping table as basis for the mapping, for example by using the procedure described in the foregoing in context of the residual encoding procedure.
Consequently, the residual decoder 136 may dequantize the received codeword Idxv into a reconstructed vector vr(j), j = 1:L-k by using the pyramidally truncated Z48 lattice (de)quantizer (in view of the modified maximum norm K'), the dequantization thereby resulting in the reconstructed vector vr(j) = v2(j), j = 1:L-k.
Once the reconstructed vector vr(j), j = 1:L-k has been derived, the residual decoder 136 proceeds to generating a frame of a reconstructed transform-domain residual signal ĉ(j), j = 1:L, which may be found by multiplying each element of the reconstructed source vector vr(j), j = 1:L-k by the (de)quantized gain value, e.g. as ĉ(j) = gr·vr(j), j = 1:L-k. In case the received value k is larger than zero, the reconstructed source vector vr(j), j = 1:L-k is shorter than L, which needs to be compensated before or during an inverse transform to be applied to the reconstructed transform-domain residual signal ĉ(j), j = 1:L-k in the residual decoder 136. In an example, such a compensation involves appending k zeros at the end of the vector ĉ(j), thereby resulting in the reconstructed transform-domain residual signal ĉ(j), j = 1:L. In another example, the k zeros are appended at the end of the vector vr(j) before the multiplication by the gain. In a further example, the inverse transform is carried out such that only the first L-k transform-domain samples are considered in the procedure (e.g. by considering only the first L-k columns when applying a matrix-based inverse transform).
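The gain scaling and zero-padding compensation described above may be sketched as:

```python
import numpy as np

def reconstruct_transform_residual(v_hat, gr, L):
    """Scale the dequantized shape by the gain and pad back to full length L.

    When k = L - len(v_hat) > 0, k zeros are appended so that the inverse
    transform can be applied over the full frame length L.
    """
    c_hat = gr * np.asarray(v_hat, dtype=float)  # apply (de)quantized gain
    k = L - len(c_hat)                           # number of missing trailing samples
    return np.concatenate([c_hat, np.zeros(k)])  # append k zeros at the end
```

This corresponds to the first compensation variant (padding after the gain multiplication); padding before the multiplication yields the same result since the appended samples are zero.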
The residual decoder 136 further applies an inverse transform to convert the reconstructed transform-domain residual signal ĉ(j), j = 1:L into a corresponding time-domain signal, which serves as a frame of the reconstructed second residual signal 137, denoted herein as r2(t), t = t+1:t+L. In case the weighting of the transformed residual signal c(j) has been applied in the encoder (as described in the foregoing), the corresponding inverse weighting needs to be applied to the reconstructed transform-domain residual signal ĉ(j), j = 1:L before applying the inverse transform in order to compensate the effect of the weighting. The applied inverse transform is an inverse of the transform applied in the residual encoder 126, e.g. inverse DCT, inverse MDCT, inverse DST, etc.
As described in the foregoing, the reconstructed second residual signal 137 is provided for the LTP decoding procedure in the LTP decoder 134, which results in a reconstructed first residual signal 133. A frame of the reconstructed first residual signal 133 is provided as a respective time series of reconstructed first residual samples. In other words, the LTP decoding procedure processes the frame of the reconstructed second residual signal r2(t), t = t+1:t+L into a corresponding frame of the reconstructed first residual signal r1(t), t = t+1:t+L. In this regard, the LTP decoder 134 carries out LTP analysis to find the LTP lag d and the LTP gain g, for example by using the procedure described in the foregoing in context of the LTP encoder 124. Moreover, the LTP decoding procedure involves LTP synthesis filtering to compute the first residual signal 133 on basis of the second residual signal 137 using the derived values of the LTP lag d and the LTP gain g. In this regard, the following equation may be employed:

r1(t) = r2(t) + g·r1(t-d), t = t+1 : t+L,

where L denotes the frame length (in number of samples), r1(t), t = t+1:t+L denotes the frame of the reconstructed first residual signal 133 and r2(t), t = t+1:t+L denotes the frame of the reconstructed second residual signal 137.
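The LTP synthesis recursion may be sketched as follows; the gain term g reflects the LTP gain referred to above (its exact placement in the filter equation is an assumption, as the source formula is garbled), and `history` is assumed to hold at least d previously reconstructed first-residual samples, most recent last:

```python
def ltp_synthesis(r2, history, d, g):
    """r1(t) = r2(t) + g * r1(t - d): long-term prediction synthesis.

    history holds at least d past reconstructed first-residual samples
    (most recent last); r2 is the current frame of the second residual.
    """
    buf = list(history)
    out = []
    for x in r2:
        y = x + g * buf[-d]   # add the scaled sample d positions in the past
        out.append(y)
        buf.append(y)         # the recursion feeds on its own output
    return out
```

Note that the recursion is on r1 itself, so samples synthesized earlier in the frame can serve as the lagged reference when d < L.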
Although described as part of the exemplifying audio decoder 131 depicted in Figure 6, in other examples the audio decoder 131 may be provided without the LTP decoder 134. In such a scenario the residual decoder 136 may provide its output as the reconstructed first residual signal 133 instead of the reconstructed second residual signal 137. Alternatively, such a scenario may, at least conceptually, involve copying the reconstructed second residual signal 137 into the reconstructed first residual signal 133 for use as basis for the LPC decoding procedure in the LPC decoder 132.
In a further example, the LTP decoder 134 is available in the audio decoder 131 for carrying out the LTP decoding procedure therein in accordance with an indication in this regard received in the encoded parameters: if the encoded parameters include an indication that the LTP encoding was applied in the audio encoder 121 in encoding the respective frame, the LTP decoder 134 is employed to process the frame of the reconstructed second residual signal r2(t), t = t+1:t+L into a corresponding frame of the reconstructed first residual signal r1(t), t = t+1:t+L. In contrast, in case the encoded parameters include an indication that the LTP encoding was not applied in the audio encoder 121 in encoding the respective frame, the LTP decoder 134 operation is omitted for the respective frame and the frame of the reconstructed second residual signal r2(t), t = t+1:t+L is provided instead as the reconstructed first residual signal r1(t), t = t+1:t+L for processing by the LPC decoder 132.
As described in the foregoing, the reconstructed first residual signal 133 is provided for the LPC decoding procedure in the LPC decoder 132, which results in the reconstructed audio signal 135. A frame of the reconstructed audio signal 135 is provided as a respective time series of reconstructed output samples. In other words, the LPC decoding procedure processes the frame of the reconstructed first residual signal r1(t), t = t+1:t+L into a corresponding frame of the reconstructed audio signal x(t), t = t+1:t+L.
The LPC decoding procedure comprises the LPC decoder 132 carrying out the LPC analysis based on past values of the reconstructed audio signal 135 using the same backward prediction technique as applied in the LPC encoder 122. Hence, the backward prediction computes LPC filter coefficients on basis of past samples of the reconstructed audio signal 135. The LPC decoder 132 further carries out LPC synthesis filtering of the reconstructed first residual signal 133 by using the LPC filter coefficients derived for the current frame in the LPC decoder 132, thereby generating the reconstructed audio signal 135.
The LPC synthesis filtering in the LPC decoder 132 involves processing a time series of reconstructed first residual samples into a corresponding time series of reconstructed output samples that hence constitute a corresponding frame of the reconstructed audio signal 135. The LPC decoder 132 may find the LPC filter coefficients for the LPC synthesis therein, for example, by using the procedure outlined in the foregoing for the LPC encoder 122. The LPC synthesis may be carried out e.g. by using the following equation:

x(t) = r1(t) - Σi=1..KLPC ai·x(t-i), t = t+1 : t+L,

where ai, i = 1:KLPC denote the LPC filter coefficients, L denotes the frame length (in number of samples), x(t), t = t+1:t+L denotes a frame of the reconstructed audio signal 135 (i.e. the time series of reconstructed output samples), and r1(t), t = t+1:t+L denotes a corresponding frame of the reconstructed first residual signal 133 (i.e. the time series of reconstructed first residual samples).
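The LPC synthesis filtering may be sketched as follows, assuming the common convention in which the analysis filter is A(z) = 1 + a1·z^-1 + ... + aK·z^-K; this sign convention is an assumption, as the source equation is only partially recoverable:

```python
def lpc_synthesis(r1, a, history):
    """x(t) = r1(t) - sum_i a[i] * x(t - i - 1): LPC synthesis filtering.

    a holds the K_LPC filter coefficients; history holds at least K_LPC
    past reconstructed output samples (most recent last).
    """
    buf = list(history)
    out = []
    for r in r1:
        # subtract the short-term prediction formed from past output samples
        x = r - sum(a[i] * buf[-(i + 1)] for i in range(len(a)))
        out.append(x)
        buf.append(x)         # the filter is recursive in its own output
    return out
```

As with the LTP recursion, each output sample immediately becomes available as history for the next one, which is why the past audio buffer must hold at least K_LPC samples at frame boundaries.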
Since the LPC analyses in the LPC encoder 122 and the LPC decoder 132 are carried out using the same approach and they are further performed on the same or similar audio signals, the resulting LPC filter coefficients are also the same or similar. The past values of the reconstructed audio signal 135 required for the LPC analysis in the LPC decoder 132 are stored in a past audio buffer, which may be provided e.g. in a memory in the audio decoder 131 or in the LPC decoder 132.
After having derived the reconstructed audio signal 135, the LPC decoder 132 further adds the zero input response of the LPC synthesis filter to the reconstructed audio signal 135 before passing it from the audio decoder 131 for audio playback, storage and/or further processing, and before using this signal to update the past audio buffer of the audio decoder 131 (as will be described later in this text). The zero input response may be calculated on basis of the reconstructed audio signal 135, for example, as described in the foregoing for computation of the zero input response in the audio encoder 121.
Along the lines described in the foregoing for the audio encoder 121, the audio decoder 131 also stores at least the number of most recent samples of the reconstructed audio signal 135 needed to enable the backward prediction in the LPC decoder 132. In case the LTP decoder 134 is available in the audio decoder 131, at least the dmax most recent samples of the reconstructed audio signal 135 may be stored to enable evaluation of LTP lag values up to dmax. This may be implemented by storing a sufficient number of most recent samples in the past audio buffer of the audio decoder 131. After having carried out the decoding procedure, the audio decoder 131 updates the past audio buffer therein by discarding the L oldest samples in the past audio buffer and inserting the samples of the reconstructed audio signal 135 in the past audio buffer to facilitate the audio decoding of the next frame.

Figure 7 illustrates a block diagram of some components of an exemplifying apparatus 600. The apparatus 600 may comprise further components, elements or portions that are not depicted in Figure 7. The apparatus 600 may be employed in implementing e.g. the audio encoder 121 and/or the audio decoder 131.
The apparatus 600 comprises a processor 616 and a memory 615 for storing data and computer program code 617. The memory 615 and a portion of the computer program code 617 stored therein may be further arranged to, with the processor 616, implement the function(s) described in the foregoing in context of the audio encoder 121 and/or the audio decoder 131.
The apparatus 600 comprises a communication portion 612 for communication with other devices. The communication portion 612 comprises at least one communication apparatus that enables wired or wireless communication with other apparatuses. A communication apparatus of the communication portion 612 may also be referred to as a respective communication means. The apparatus 600 may further comprise user I/O (input/output) components 618 that may be arranged, possibly together with the processor 616 and a portion of the computer program code 617, to provide a user interface for receiving input from a user of the apparatus 600 and/or providing output to the user of the apparatus 600 to control at least some aspects of operation of the audio encoder 121 and/or the audio decoder 131 implemented by the apparatus 600. The user I/O components 618 may comprise hardware components such as a display, a touchscreen, a touchpad, a mouse, a keyboard, and/or an arrangement of one or more keys or buttons, etc. The user I/O components 618 may be also referred to as peripherals. The processor 616 may be arranged to control operation of the apparatus 600 e.g. in accordance with a portion of the computer program code 617 and possibly further in accordance with the user input received via the user I/O components 618 and/or in accordance with information received via the communication portion 612.
Although the processor 616 is depicted as a single component, it may be implemented as one or more separate processing components. Similarly, although the memory 615 is depicted as a single component, it may be implemented as one or more separate components, some or all of which may be integrated/removable and/or may provide permanent / semi-permanent/ dynamic/cached storage.
The computer program code 617 stored in the memory 615, may comprise computer-executable instructions that control one or more aspects of operation of the apparatus 600 when loaded into the processor 616. As an example, the computer-executable instructions may be provided as one or more sequences of one or more instructions. The processor 616 is able to load and execute the computer program code 617 by reading the one or more sequences of one or more instructions included therein from the memory 615. The one or more sequences of one or more instructions may be configured to, when executed by the processor 616, cause the apparatus 600 to carry out operations, procedures and/or functions described in the foregoing in context of the audio encoder 121 and/or the audio decoder 131 . Hence, the apparatus 600 may comprise at least one processor 616 and at least one memory 615 including the computer program code 617 for one or more programs, the at least one memory 615 and the computer program code 617 configured to, with the at least one processor 616, cause the apparatus 600 to perform operations, procedures and/or functions described in the foregoing in context of the audio encoder 121 and/or the audio decoder 131 .
The computer programs stored in the memory 615 may be provided e.g. as a respective computer program product comprising at least one computer-readable non-transitory medium having the computer program code 617 stored thereon, which computer program code, when executed by the apparatus 600, causes the apparatus 600 at least to perform operations, procedures and/or functions described in the foregoing in context of the audio encoder 121 and/or the audio decoder 131. The computer-readable non-transitory medium may comprise a memory device or a record medium such as a CD-ROM, a DVD, a Blu-ray disc or another article of manufacture that tangibly embodies the computer program. As another example, the computer program may be provided as a signal configured to reliably transfer the computer program.
Reference(s) to a processor should not be understood to encompass only programmable processors, but also dedicated circuits such as field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), signal processors, etc. Features described in the preceding description may be used in combinations other than the combinations explicitly described.
Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not. Although features have been described with reference to certain embodiments, those features may also be present in other embodiments whether described or not.

Claims
1. A method for encoding a source vector of a predefined number of source samples that represent a frame of an input audio signal, the method comprising: quantizing the source samples of the source vector into respective quantized samples of an initial quantized vector using at most a predefined number of bits by employing a lattice quantizer restricted to a predefined maximum norm; detecting a sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector; determining, in response to detecting a sequence of non-zero length, a modified maximum norm that is greater than or equal to the predefined maximum norm and determining a shortened source vector by excluding those source samples that are represented by said zero-valued quantized samples of said sequence; and quantizing the source samples of the shortened source vector into respective re-quantized samples of a re-quantized vector using at most the predefined number of bits by employing said lattice quantizer restricted to the modified maximum norm.
2. A method according to claim 1, wherein said lattice quantizer comprises a pyramidally truncated lattice quantizer.
3. A method according to claim 1 or 2, further comprising using a predefined transform to convert a time series of input samples that represent said frame of the input audio signal in time domain into a series of transform domain samples that represent said frame of the input audio signal in a transform domain, which series of transform domain samples serves as basis for said source vector.
4. A method according to claim 3, wherein said predefined transform comprises discrete cosine transform.
5. A method according to claim 3 or 4, further comprising modeling said series of transform domain samples as a shape vector of relative amplitude values and a scalar gain value such that the shape vector multiplied by the scalar gain value matches or substantially matches said series of transform domain samples; and using relative amplitude values of said shape vector as said source vector.
6. A method according to any of claims 3 to 5, further comprising processing a time series of input samples of said frame of the input audio signal using linear predictive filter coefficients computed using a backward prediction into a residual signal that comprises a respective time series of residual samples; and applying said predefined transform to said time series of residual samples.
7. A method according to any of claims 3 to 5, further comprising processing a time series of input samples of said frame of the input audio signal using linear predictive filter coefficients computed using a backward prediction into a residual signal that comprises a respective first time series of residual samples; applying long-term prediction to said first time series of residual samples to derive a respective second time series of residual samples; and applying said predefined transform to said second time series of residual samples.
8. A method according to any of claims 1 to 7, further comprising outputting, in response to a zero-length sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector, the initial quantized vector; and outputting, in response to a non-zero-length sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector, the re-quantized vector.
9. A method according to any of claims 1 to 8, further comprising outputting an indication of the length of said sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector; outputting, in response to a zero-length sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector, a codeword that identifies the initial quantized vector; and outputting, in response to a non-zero-length sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector, a codeword that identifies the re-quantized vector.
10. An apparatus for encoding a source vector of a predefined number of source samples that represent a frame of an input audio signal, the apparatus configured to: quantize the source samples of the source vector into respective quantized samples of an initial quantized vector using at most a predefined number of bits by employing a lattice quantizer restricted to a predefined maximum norm; detect a sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector; determine, in response to detecting a sequence of non-zero length, a modified maximum norm that is greater than or equal to the predefined maximum norm and determine a shortened source vector by excluding those source samples that are represented by said zero-valued quantized samples of said sequence; and quantize the source samples of the shortened source vector into respective re-quantized samples of a re-quantized vector using at most the predefined number of bits by employing said lattice quantizer restricted to the modified maximum norm.
11. An apparatus according to claim 10, wherein said lattice quantizer comprises a pyramidally truncated lattice quantizer.
12. An apparatus according to claim 10 or 11, wherein the apparatus is further configured to use a predefined transform to convert a time series of input samples that represent said frame of the input audio signal in time domain into a series of transform domain samples that represent said frame of the input audio signal in a transform domain, which series of transform domain samples serves as basis for said source vector.
13. An apparatus according to claim 12, wherein said predefined transform comprises discrete cosine transform.
14. An apparatus according to claim 12 or 13, wherein the apparatus is further configured to model said series of transform domain samples as a shape vector of relative amplitude values and a scalar gain value such that the shape vector multiplied by the scalar gain value matches or substantially matches said series of transform domain samples; and use relative amplitude values of said shape vector as said source vector.
15. An apparatus according to any of claims 12 to 14, wherein the apparatus is further configured to process a time series of input samples of said frame of the input audio signal using linear predictive filter coefficients computed using a backward prediction into a residual signal that comprises a respective time series of residual samples; and apply said predefined transform to said time series of residual samples.
16. An apparatus according to any of claims 12 to 14, wherein the apparatus is further configured to process a time series of input samples of said frame of the input audio signal using linear predictive filter coefficients computed using a backward prediction into a residual signal that comprises a respective first time series of residual samples; apply long-term prediction to said first time series of residual samples to derive a respective second time series of residual samples; and apply said predefined transform to said second time series of residual samples.
17. An apparatus according to any of claims 10 to 16, wherein the apparatus is further configured to output, in response to a zero-length sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector, the initial quantized vector; and output, in response to a non-zero-length sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector, the re-quantized vector.
18. An apparatus according to any of claims 10 to 17, wherein the apparatus is further configured to output an indication of the length of said sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector; output, in response to a zero-length sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector, a codeword that identifies the initial quantized vector; and output, in response to a non-zero-length sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector, a codeword that identifies the re-quantized vector.
19. An apparatus for encoding a source vector of a predefined number of source samples that represent a frame of an input audio signal, the apparatus comprising means for quantizing the source samples of the source vector into respective quantized samples of an initial quantized vector using at most a predefined number of bits by employing a lattice quantizer restricted to a predefined maximum norm; means for detecting a sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector; means for determining, in response to detecting a sequence of non-zero length, a modified maximum norm that is greater than or equal to the predefined maximum norm and determining a shortened source vector by excluding those source samples that are represented by said zero-valued quantized samples of said sequence; and means for quantizing the source samples of the shortened source vector into respective re-quantized samples of a re-quantized vector using at most the predefined number of bits by employing said lattice quantizer restricted to the modified maximum norm.
20. An apparatus for encoding a source vector of a predefined number of source samples that represent a frame of an input audio signal, wherein the apparatus comprises at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: quantize the source samples of the source vector into respective quantized samples of an initial quantized vector using at most a predefined number of bits by employing a lattice quantizer restricted to a predefined maximum norm; detect a sequence of consecutive zero-valued quantized samples at the end of the initial quantized vector; determine, in response to detecting a sequence of non-zero length, a modified maximum norm that is greater than or equal to the predefined maximum norm and determine a shortened source vector by excluding those source samples that are represented by said zero-valued quantized samples of said sequence; and quantize the source samples of the shortened source vector into respective re-quantized samples of a re-quantized vector using at most the predefined number of bits by employing said lattice quantizer restricted to the modified maximum norm.
21. A computer program comprising computer readable program code configured to cause performing of the method of any of claims 1 to 9 when said program code is run on a computing apparatus.
22. A computer program product comprising computer readable program code tangibly embodied on a non-transitory computer readable medium, the program code configured to cause performing of the method according to any of claims 1 to 9 when run on a computing apparatus.
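As an illustration of the gain-shape modeling recited in claims 5 and 14, the sketch below splits a transform-domain vector into a scalar gain and a shape vector of relative amplitudes whose product reconstructs the input. This is a hypothetical sketch, not the claimed implementation: the function name and the choice of the Euclidean norm as the gain are assumptions made for the example.

```python
import numpy as np

def gain_shape_split(t):
    """Split a transform-domain vector t into a scalar gain value and a
    shape vector of relative amplitudes, so that gain * shape matches t."""
    gain = float(np.linalg.norm(t))   # scalar gain value
    if gain == 0.0:
        return 0.0, np.zeros_like(t)  # degenerate all-zero frame
    return gain, t / gain             # unit-energy shape vector

t = np.array([3.0, 4.0, 0.0])
g, s = gain_shape_split(t)
# g is 5.0 here, and g * s reconstructs t exactly
```

The shape vector would then serve as the source vector that the lattice quantizer of claims 10 and 19 operates on, with the gain coded separately.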
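The backward prediction of claims 6, 7, 15 and 16 estimates the linear predictive filter coefficients from already-coded past samples rather than from the current frame, which keeps delay low and avoids transmitting the coefficients. The sketch below is an illustrative assumption, not the patented implementation: it uses a plain autocorrelation estimate and the standard Levinson-Durbin recursion, with all function names invented for the example.

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: return the prediction polynomial
    a = [1, a_1, ..., a_p] for the autocorrelation sequence r."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / e                     # reflection coefficient
        a[1:i + 1] += k * a[i - 1::-1]   # RHS is copied first, so safe
        e *= 1.0 - k * k                 # prediction error update
    return a

def backward_lpc_residual(past, frame, order=8):
    """Backward prediction: coefficients come from already-coded past
    samples only, so a decoder can repeat the same computation."""
    past = np.asarray(past, dtype=float)
    frame = np.asarray(frame, dtype=float)
    # autocorrelation of the past (already available) samples
    r = np.array([np.dot(past[:len(past) - l], past[l:])
                  for l in range(order + 1)])
    if r[0] <= 0.0:
        return frame.copy()              # silent history: nothing to predict
    a = levinson_durbin(r, order)
    # residual res[n] = frame[n] + sum_j a_j * x[n-j], with the tail of
    # `past` serving as filter memory across the frame boundary
    ext = np.concatenate([past[-order:], frame])
    return np.array([np.dot(a, ext[n:n + order + 1][::-1])
                     for n in range(len(frame))])
```

In the claimed pipeline the predefined transform (claim 4's discrete cosine transform) would then be applied to this residual, optionally after the long-term prediction stage of claims 7 and 16.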
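The encoding loop common to the method and apparatus claims can be summarized as: quantize with a lattice quantizer restricted to a maximum (L1) norm, detect trailing zero-valued quantized samples, and, when such a run exists, re-quantize the shortened source vector with a modified, larger maximum norm that still fits the same bit budget. The sketch below is a hypothetical illustration under stated assumptions: the greedy pulse-allocation search, the bit-budget rule, and the N(n, k) counting recurrence (from Fischer's pyramid-VQ paper cited above) are simplified stand-ins, and all names are invented.

```python
from functools import lru_cache
from math import ceil, log2
import numpy as np

@lru_cache(maxsize=None)
def pyramid_count(n, k):
    """N(n, k): number of integer vectors of dimension n with L1 norm
    exactly k (points on the pyramid surface)."""
    if k == 0:
        return 1
    if n == 0:
        return 0
    return (pyramid_count(n - 1, k) + pyramid_count(n - 1, k - 1)
            + pyramid_count(n, k - 1))

def bits_needed(n, k_max):
    """Bits to index every vector of dimension n with L1 norm <= k_max
    (a pyramidally truncated lattice, as in claim 11)."""
    size = sum(pyramid_count(n, j) for j in range(k_max + 1))
    return ceil(log2(size))

def pvq_quantize(x, k):
    """Greedy projection of x onto the surface sum(|y_i|) == k.
    A simple heuristic pulse allocation, not a bit-exact search."""
    x = np.asarray(x, dtype=float)
    denom = np.sum(np.abs(x))
    if denom == 0.0:
        y = np.zeros(len(x), dtype=int)
        y[0] = k
        return y
    target = np.abs(x) * (k / denom)     # ideal real-valued allocation
    y = np.floor(target).astype(int)
    while y.sum() < k:                   # hand out remaining pulses
        y[np.argmax(target - y)] += 1
    return np.where(x < 0, -y, y)        # restore signs

def encode(source, k_max, bits):
    """Quantize; if the result ends in zeros, re-quantize the shortened
    vector with a modified (>=) maximum norm in the same bit budget."""
    q = pvq_quantize(source, k_max)
    nz = len(q)
    while nz > 0 and q[nz - 1] == 0:     # detect the trailing zero run
        nz -= 1
    trailing = len(q) - nz
    if trailing == 0:
        return q, trailing               # zero-length run: keep initial vector
    k_mod = k_max                        # modified maximum norm >= k_max
    while bits_needed(nz, k_mod + 1) <= bits:
        k_mod += 1
    return pvq_quantize(source[:nz], k_mod), trailing
```

For a 4-sample source dominated by its first component, initial quantization at maximum norm 4 yields [4, 0, 0, 0]; the three trailing zeros trigger re-quantization of the one-sample shortened vector at a much larger norm, so the unchanged bit budget buys finer amplitude resolution where the signal actually has energy.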
PCT/FI2016/050744 2016-10-21 2016-10-21 Low-delay audio coding WO2018073486A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/FI2016/050744 WO2018073486A1 (en) 2016-10-21 2016-10-21 Low-delay audio coding

Publications (1)

Publication Number Publication Date
WO2018073486A1 (en) 2018-04-26

Family

ID=57286527

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2016/050744 WO2018073486A1 (en) 2016-10-21 2016-10-21 Low-delay audio coding

Country Status (1)

Country Link
WO (1) WO2018073486A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5987407A (en) * 1997-10-28 1999-11-16 America Online, Inc. Soft-clipping postprocessor scaling decoded audio signal frame saturation regions to approximate original waveform shape and maintain continuity
US20080097757A1 (en) * 2006-10-24 2008-04-24 Nokia Corporation Audio coding

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
LEFEBVRE R ET AL: "8 kbit/s coding of speech with 6 ms frame-length", IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 27 April 1993 (1993-04-27), pages 612-615, vol. 2, XP031984230, DOI: 10.1109/ICASSP.1993.319384 *
M. BLAIN ET AL: "Optimum rate allocation in pyramid vector quantizer transform coding of imagery", ICASSP '87. IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, vol. 12, 6 April 1987 (1987-04-06), pages 729 - 732, XP055336157, DOI: 10.1109/ICASSP.1987.1169591 *
MORIYA T ET AL: "TRANSFORM CODING OF SPEECH USING A WEIGHTED VECTOR QUANTIZER", IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, IEEE SERVICE CENTER, PISCATAWAY, US, vol. 6, no. 2, 1 February 1988 (1988-02-01), pages 425 - 431, XP000616836, ISSN: 0733-8716, DOI: 10.1109/49.617 *
TAKEHIRO MORIYA ET AL: "Progress in LPC-based frequency-domain audio coding", APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING, vol. 5, 31 May 2016 (2016-05-31), XP055336101, DOI: 10.1017/ATSIP.2016.11 *
THOMAS R. FISCHER: "A Pyramid Vector Quantizer", IEEE TRANSACTIONS ON INFORMATION THEORY, vol. 32, no. 4, July 1986 (1986-07-01), pages 568-583

Similar Documents

Publication Publication Date Title
JP7244609B2 (en) Method and system for encoding left and right channels of a stereo audio signal that selects between a two-subframe model and a four-subframe model depending on bit budget
JP6692948B2 (en) Method, encoder and decoder for linear predictive coding and decoding of speech signals with transitions between frames having different sampling rates
JP5587501B2 (en) System, method, apparatus, and computer-readable medium for multi-stage shape vector quantization
RU2439718C1 (en) Method and device for sound signal processing
US8392176B2 (en) Processing of excitation in audio coding and decoding
CN106415717B (en) Audio signal classification and coding
CN111968655B (en) Signal encoding method and device and signal decoding method and device
EP3762923A1 (en) Audio coding
JP2009512895A (en) Signal coding and decoding based on spectral dynamics
CN114097028A (en) Method and system for metadata in codec audio streams and for flexible intra-object and inter-object bit rate adaptation
JP5544370B2 (en) Encoding device, decoding device and methods thereof
TW201434033A (en) Systems and methods for determining pitch pulse period signal boundaries
EP2617034B1 (en) Determining pitch cycle energy and scaling an excitation signal
US11176954B2 (en) Encoding and decoding of multichannel or stereo audio signals
US10950251B2 (en) Coding of harmonic signals in transform-based audio codecs
WO2018073486A1 (en) Low-delay audio coding
JP7123911B2 (en) System and method for long-term prediction in audio codecs
EP3252763A1 (en) Low-delay audio coding
JP5774490B2 (en) Encoding device, decoding device and methods thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16794669

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16794669

Country of ref document: EP

Kind code of ref document: A1