EP2005424A2

EP2005424A2 - Method for post-processing a signal in an audio decoder

Info

Publication number: EP2005424A2
Application number: EP07731774A
Authority: EP
Inventors: Stéphane RAGOT; Cyril Guillaume
Original assignee: France Telecom SA
Current assignee: Orange SA
Priority date: 2006-03-20
Filing date: 2007-03-20
Publication date: 2008-12-24
Also published as: KR20080109038A; US20090299755A1; WO2007107670A3; CN101405792B; KR101373207B1; WO2007107670A2; CN101405792A; JP5457171B2; JP2009530679A

Abstract

The invention relates to a method for post-processing, in an audio decoder, a signal reconstructed by the temporal and frequency shaping (805, 807) of an excitation signal obtained on the basis of at least one parameter estimated in a first frequency band, said temporal and frequency shaping being carried out at least on the basis of a temporal envelope and a frequency envelope received and decoded (801, 802) in a second frequency band. The method is such that, once the shaping (805, 807) has been carried out, the amplitude of the reconstructed signal is compared with the received and decoded temporal envelope (σ), and an amplitude compression is applied to the reconstructed signal if at least one threshold that is a function of the temporal envelope is exceeded. The invention also relates to a post-processing module for implementing the inventive method, and to an audio decoder. It is used for transmitting and storing digital signals such as audio-frequency signals: speech, music, etc.

Description

METHOD FOR POST-PROCESSING A SIGNAL IN AN AUDIO DECODER

The present invention relates to a method of post-processing a signal in an audio decoder.

The invention finds a particularly advantageous application in the field of transmission and storage of digital signals such as audio-frequency signals: speech, music, etc.

Various techniques exist for converting an audio-frequency signal, such as speech or music, into digital form. The most common techniques are "waveform coding" methods, such as PCM or ADPCM coding, "analysis-by-synthesis parametric coding" methods such as CELP (Code Excited Linear Prediction) coding, and "perceptual subband or transform coding" methods. These conventional techniques for encoding and quantizing audio-frequency signals are described, for example, in A. Gersho and R. M. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, 1992, and B. Kleijn and K. K. Paliwal (editors), Speech Coding and Synthesis, Elsevier, 1995.

In conventional speech coding, the encoder generates a fixed-rate bit stream. This fixed-rate constraint simplifies the implementation and use of the encoder and decoder (together called a "codec"). Examples of such systems are ITU-T G.711 coding at 64 kbit/s, ITU-T G.729 coding at 8 kbit/s and GSM-EFR coding at 12.2 kbit/s.

In some applications, such as mobile telephony or voice over IP, it is preferable to generate a variable-rate bit stream, the bit rate values being taken from a predefined set. Several multi-rate coding techniques, more flexible than fixed-rate coding, can be distinguished:

- multi-mode coding controlled by the source and/or the channel, as implemented in the AMR-NB, AMR-WB, SMV or VMR-WB systems,

- hierarchical coding, or "scalable" coding, which generates a so-called hierarchical bit stream because it comprises a core rate and one or more enhancement layer(s). The G.722 system at 48, 56 and 64 kbit/s is a simple example of bit-rate scalable coding. The MPEG-4 CELP codec is scalable in terms of bit rate and bandwidth. Other examples of such coders are found in the articles by B. Kovesi, D. Massaloux, A. Sollaud, A Scalable Speech and Audio Coding Scheme with Continuous Bitrate Flexibility, ICASSP 2004, and H. Taddéi et al., A Scalable Three Bitrate (8, 14.2 and 24 kbit/s) Audio Coder, 107th AES Convention, 1999,

- multiple description coding.

The invention is more particularly concerned with hierarchical coding. The basic concept of hierarchical audio coding is illustrated, for example, in the article by Y. Hiwasaki, T. Mori, H. Ohmuro, J. Ikedo, D. Tokumoto, and A. Kataoka, Scalable Speech Coding Technology for High-Quality Ubiquitous Communications, NTT Technical Review, March 2004. The bit stream includes a base layer and one or more enhancement layers. The base layer is generated by a fixed low-rate codec, known as the "core codec", which guarantees the minimum quality of the coding; this layer must be received by the decoder to maintain an acceptable level of quality. The enhancement layers are used to improve the quality; it may happen that they are not all received by the decoder. The main advantage of hierarchical coding is that it allows the bit rate to be adapted by simple truncation of the bit stream. The number of layers, that is to say the number of possible truncations of the bit stream, defines the granularity of the coding: the coding is said to have "coarse granularity" if the bit stream comprises few layers, of the order of 2 to 4, with steps of the order of 4 to 8 kbit/s; "fine granularity" coding allows a large number of layers with a step of the order of 1 kbit/s.

More specifically, the invention relates to bit-rate and bandwidth scalable coding techniques with a CELP-type core coder in the telephone band and one or more wideband enhancement layer(s). Examples of such systems are given in the aforementioned article by H. Taddéi et al., with a coarse granularity of 8, 14.2 and 24 kbit/s, and in the aforementioned article by B. Kovesi et al., with a fine granularity from 6.4 to 32 kbit/s.

In 2004, the ITU-T launched a project for a hierarchical coder with a standardized core. This coder, called G.729EV (EV for Embedded Variable bit rate), is an annex to the known G.729 coder. The objective of the G.729EV standardization is to obtain a hierarchical coder with a G.729 core, producing a signal whose band extends from narrowband (300-3400 Hz) to wideband (50-7000 Hz) at a bit rate of 8 to 32 kbit/s for conversational services. This coder is inherently interoperable with Recommendation G.729, which ensures compatibility with existing voice over IP equipment.

In response to this project, a three-layer coding scheme was proposed, namely cascaded CELP coding at 8-12 kbit/s, followed by parametric band extension at 14 kbit/s, then transform coding from 14 to 32 kbit/s. This coder is known as the ITU-T SG16/WP3 D214 coder (ITU-T, COM 16, D214 (WP 3/16), "High level description of the scalable 8-32 kbit/s algorithm submitted to the Qualification Test by Matsushita, Mindspeed and Siemens," Q.10/16, Study Period 2005-2008, Geneva, 26 July - 5 August 2005).

The notion of band extension refers to the coding of the high band of a signal. In the context of the invention, the input audio signals are sampled at 16 kHz over a useful band of 50 to 7000 Hz. For the aforementioned ITU-T SG16/WP3 D214 coder, the high band typically corresponds to frequencies between 3400 Hz and 7000 Hz. This band is coded according to a band extension technique based on the extraction, at the encoder, of time and frequency envelopes, these envelopes then being applied, at the decoder, to a synthetic excitation signal reconstructed in the high band from parameters estimated in the low band (between 50 and 3400 Hz) sampled at 8 kHz. Hereinafter, the low band will be called the "first frequency band" and the high band the "second frequency band".

This band extension technique is shown schematically in FIG. 1. At the encoder, the high-frequency components of the original signal are isolated by a bandpass filter (100) between 3400 and 7000 Hz. The temporal and frequency envelopes of the signal are then calculated by the modules (101) and (102) respectively. These envelopes are jointly quantized at 2 kbit/s in the block (103).

At the decoder, a synthetic excitation is reconstructed by the reconstruction module (104) from the parameters of the cascaded CELP decoder. The temporal and frequency envelopes are decoded by the inverse quantization block (105). The synthetic excitation from the reconstruction module (104) is then shaped by a scaling module (106) using the time envelope and by a filtering module (107) using the frequency envelope.

The band extension mechanism that has just been described with reference to the ITU-T SG16/WP3 D214 codec is therefore based on the shaping of a synthetic excitation by time and frequency envelopes. However, in the absence of coupling between the excitation and the shaping, applying such a model is delicate and causes artifacts in the form of clearly audible isolated "clicks" due to large amplitude overshoots.

Accordingly, the technical problem to be solved by the object of the present invention is to propose a method of post-processing, in an audio decoder, a signal reconstructed by temporal and frequency shaping of an excitation signal obtained from at least one parameter estimated in a first frequency band, which would make it possible to avoid the artifacts induced by the shaping of the synthetic excitation signal, said temporal and frequency shaping being carried out on the basis of a time envelope and a frequency envelope received and decoded in a second frequency band.

The solution to the technical problem posed consists, according to the present invention, in that said method comprises the steps of comparing the amplitude of said reconstructed signal with said received and decoded time envelope and, if at least one threshold that is a function of said time envelope is exceeded, applying an amplitude compression to said reconstructed signal.

Thus, the method according to the invention compensates for the lack of adequate coupling between the excitation and the shaping functions by means of a post-processing by amplitude compression of the audio signal supplied by the decoder in the second frequency band, or high band.

According to one embodiment, said amplitude compression consists in applying at least one linear attenuation to the amplitude of said signal if said amplitude is greater than at least one trigger threshold that is a function of said received and decoded time envelope.

It will be noted that, in addition to limiting the amplitude of the signal and therefore the artifacts associated with the high amplitudes, the method of the invention has the advantage of being adaptive in the sense that the triggering threshold is variable since it follows the value of the time envelope received and decoded.

The invention also relates to a computer program comprising program code instructions for implementing the post-processing method according to the invention when said program is executed on a computer.

The invention further relates to a module for post-processing, in an audio decoder, a signal reconstructed by temporal and frequency shaping of an excitation signal obtained from at least one parameter estimated in a first frequency band, said temporal and frequency shaping being carried out on the basis of a time envelope and a frequency envelope received and decoded in a second frequency band, the module being remarkable in that it comprises a comparator for comparing the amplitude of said reconstructed signal with said received and decoded time envelope, and amplitude compression means capable, in the event of a positive comparison, of applying an amplitude compression to said reconstructed signal.

Finally, the invention relates to an audio decoder comprising a module for estimating at least one parameter of an excitation signal in a first frequency band, a module for reconstructing an excitation signal from said parameter, a module for decoding a time envelope in a second frequency band, a module (802) for decoding a frequency envelope in a second frequency band, a module (805) for temporally shaping said excitation signal by means of, at least, said decoded time envelope (σ), and a module (807) for shaping said excitation signal in frequency by means of, at least, said decoded frequency envelope, remarkable in that said decoder comprises a post-processing module according to the invention.

The following description, given with reference to the accompanying drawings as non-limiting examples, will make it clear what the invention consists of and how it can be carried out.

FIG. 1 is a diagram of a high band coding/decoding stage in accordance with the prior art.

FIG. 2 is a high-level diagram of a hierarchical audio coder at 8, 12 and 13.65 kbit/s.

FIG. 3 is a diagram of the high band encoder for the 13.65 kbit/s mode of the coder of FIG. 2.

FIG. 4 is a diagram showing the frame division performed by the high band encoder of FIG. 3.

FIG. 5 is a high-level diagram of an 8, 12 and 13.65 kbit/s hierarchical audio decoder associated with the coder of FIG. 2.

FIG. 6 is a diagram of the high band decoder for the 13.65 kbit/s mode of the decoder of FIG. 5.

FIG. 7 is a flowchart of a first embodiment of an amplitude compression function.

FIG. 8 is a graph of the amplitude compression function of FIG. 7.

FIG. 9 is a flowchart of a second embodiment of an amplitude compression function.

FIG. 10 is a graph of the amplitude compression function of FIG. 9.

FIG. 11 is a flowchart of a third embodiment of an amplitude compression function.

FIG. 12 is a graph of the amplitude compression function of FIG. 11.

It will be recalled that the present invention forms part, more particularly, of an overall subband hierarchical audio coding and decoding scheme operating at three possible bit rates: 8, 12 or 13.65 kbit/s. In practice, the encoder always operates at the maximum rate of 13.65 kbit/s, while the decoder may receive the core at 8 kbit/s and one or two enhancement layers at 12 or 13.65 kbit/s.

The hierarchical audio coder is shown schematically in FIG. 2.

The wideband input signal sampled at 16 kHz is first decomposed into two subbands by QMF ("Quadrature Mirror Filterbank") filtering. The first frequency band, or low band, between 0 and 4000 Hz, is obtained by low-pass filtering (L) and decimation 401, and the second frequency band, or high band, between 4000 and 8000 Hz, by high-pass filtering 402 (H) and decimation 403. In a preferred embodiment, the filters L and H are of length 64 and conform to those described in the article by J. Johnston, ICASSP, vol. 5, pp. 291-294, 1980.

The low band is pre-processed by a high-pass filter 404 eliminating the components below 50 Hz before narrowband CELP coding 405 at 8 and 12 kbit/s. This high-pass filtering takes account of the fact that the wideband is defined as covering the interval 50-7000 Hz. According to one embodiment, the narrowband CELP coding corresponds to that of the ITU-T SG16/WP3 D135 coder (ITU-T, COM 16, D135 (WP 3/16), "France Telecom G729EV Candidate: High level description and complexity evaluation," Q.10/16, Study Period 2005-2008, Geneva, 26 July - 5 August 2005); it is a cascaded CELP coding comprising, as a first stage at 8 kbit/s, a modified G.729 coding (ITU-T G.729 Recommendation, Coding of Speech at 8 kbit/s using Conjugate Structure Algebraic Code Excited Linear Prediction (CS-ACELP), March 1996) without a pre-processing filter and, as a second stage at 12 kbit/s, an additional fixed CELP dictionary. The CELP coding makes it possible to determine the parameters of the excitation signal in the low band.

The high band is first spectrally folded 406 to compensate for the folding caused by the high-pass filter 402 combined with the decimation 403. The high band is then pre-processed by a low-pass filter 407 eliminating the components between 3000 and 4000 Hz of the high band, that is to say the components between 7000 and 8000 Hz of the original signal. A band extension 408, or high band coding, at 13.65 kbit/s is then performed.
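As a hedged illustration of the spectral folding step 406: one common way to mirror the spectrum of a critically decimated signal is to flip the sign of every other sample, that is, to multiply sample n by (-1)^n. The text does not say how the folding is realized, so this realization and the name `spectral_fold` are assumptions.

```python
def spectral_fold(samples):
    """Mirror the spectrum around fs/4 by multiplying sample n by (-1)**n.

    Assumption: this is one standard realization of spectral folding; the
    description does not state how step 406 is implemented.
    """
    return [-s if n % 2 else s for n, s in enumerate(samples)]


# A constant input (frequency 0) becomes an alternating signal (frequency fs/2).
dc = [1.0, 1.0, 1.0, 1.0]
folded = spectral_fold(dc)

# Folding twice restores the original signal, which is why the decoder can
# undo the operation with the same kind of step.
restored = spectral_fold(folded)
```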

The different bit streams generated by the coding modules 405 and 408 are multiplexed and structured into a hierarchical bit stream in the multiplexer 409.

The coding is performed on blocks of samples, or frames, of 20 ms, i.e. 320 samples. The hierarchical coding bit rates are 8, 12 and 13.65 kbit/s. The high band encoder 408 is detailed in FIG. 3. Its principle is similar to the parametric band extension of the ITU-T SG16/WP3 D214 coder.
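The frame figures quoted above follow from simple arithmetic, sketched below. The per-frame bit budgets are not quoted in the description; they are derived here as rate times frame duration, and the constant and function names are illustrative.

```python
SAMPLE_RATE_HZ = 16000                   # wideband input sampling rate
FRAME_MS = 20                            # frame duration
LAYER_RATES_BPS = [8000, 12000, 13650]   # cumulative rate after each layer


def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of samples in one frame."""
    return sample_rate_hz * frame_ms // 1000


def bits_per_frame(rate_bps, frame_ms):
    """Number of bits available per frame at a given bit rate."""
    return rate_bps * frame_ms // 1000


frame_len = samples_per_frame(SAMPLE_RATE_HZ, FRAME_MS)
frame_bits = [bits_per_frame(r, FRAME_MS) for r in LAYER_RATES_BPS]
```

Under these figures a decoder that truncates the bit stream after the first or second layer keeps the first 160 or 240 bits of each frame.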

The high band signal $x_{hi}$ is coded in frames of N/2 samples, where N is the number of samples in the original wideband frame and the division by 2 results from the decimation by 2 of the high band. In a preferred embodiment, N/2 = 160 samples, i.e. 20 ms at a sampling rate of 8 kHz. For each frame, i.e. every 20 ms, the frequency and time envelopes are extracted by the modules 600 and 601 respectively, as in the ITU-T SG16/WP3 D214 coder. These envelopes are then jointly quantized in the block 602.

An overview of the operation of the frequency envelope extraction by the module 600 is now presented.

This operation requires future samples, commonly called "lookahead", because the spectral analysis uses a time window centered on the current frame that overlaps the future frame. In a preferred embodiment, the lookahead in the high band is set at L = 16 samples, i.e. 2 ms. The frequency envelope extraction can be carried out, for example, as follows:

- calculation of the short-term spectrum by windowing the current frame and the lookahead, followed by a discrete Fourier transform,

- division of the spectrum into subbands,

- calculation of the short-term energy of each of the subbands and conversion into an effective (r.m.s.) value.

The frequency envelope is thus defined as the r.m.s. value of each of the subbands of the signal $x_{hi}$.
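The three extraction steps listed above can be sketched as follows. The description does not give the window shape, the subband layout or the energy normalization, so the Hann window, the `band_edges` argument and the mean-square normalization are all assumptions made for illustration.

```python
import cmath
import math


def frequency_envelope(frame, band_edges):
    """Return the r.m.s. spectral magnitude of each subband.

    frame      : current frame followed by the lookahead samples
    band_edges : DFT-bin boundaries [b0, b1, ..., bK]; subband i covers
                 bins b_i .. b_(i+1)-1 (hypothetical layout)
    """
    n = len(frame)
    # Step 1a: windowing (Hann window assumed).
    windowed = [s * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    # Step 1b: direct DFT, bins 0 .. n/2 only since the input is real.
    spectrum = [sum(w * cmath.exp(-2j * math.pi * k * i / n)
                    for i, w in enumerate(windowed))
                for k in range(n // 2 + 1)]
    # Steps 2 and 3: per-subband energy, converted to an r.m.s. value.
    envelope = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        energy = sum(abs(spectrum[k]) ** 2 for k in range(lo, hi))
        envelope.append(math.sqrt(energy / (hi - lo)))
    return envelope


# Example: a tone centered on DFT bin 10 of a 64-sample analysis block.
tone = [math.cos(2.0 * math.pi * 10 * i / 64) for i in range(64)]
env = frequency_envelope(tone, [0, 8, 16, 24, 32])
```

In this example the second subband (bins 8 to 15) contains the tone, so its envelope value dominates the others.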

The temporal envelope extraction by the module 601 is now explained with the aid of FIG. 4, which details the temporal division of the signal $x_{hi}$.

Each 20 ms frame consists of 160 samples:

$$x_{hi} = [x_0 \; x_1 \; \dots \; x_{159}]$$

The last 16 samples of $x_{hi}$ actually constitute the lookahead for the current frame.

The time envelope of the current frame is calculated as follows:

- division of $x_{hi}$ into 16 subframes of 10 samples,

- calculation of the energy of each of the subframes and conversion into an effective (r.m.s.) value.

The time envelope is thus defined as the r.m.s. value of each of the 16 subframes of the signal $x_{hi}$.
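A minimal sketch of this time envelope computation is given below. The function and constant names are illustrative, and the quantization of the envelope (block 602) is not modeled.

```python
import math

SUBFRAME_LEN = 10    # samples per subframe
NUM_SUBFRAMES = 16   # subframes per 160-sample frame


def time_envelope(frame):
    """Return the r.m.s. value of each 10-sample subframe of a 160-sample frame."""
    if len(frame) != SUBFRAME_LEN * NUM_SUBFRAMES:
        raise ValueError("expected a 160-sample frame")
    envelope = []
    for start in range(0, len(frame), SUBFRAME_LEN):
        sub = frame[start:start + SUBFRAME_LEN]
        energy = sum(s * s for s in sub)
        envelope.append(math.sqrt(energy / SUBFRAME_LEN))
    return envelope


# Example: a frame whose level drops from 2.0 to 0.5 halfway through.
frame = [2.0] * 80 + [-0.5] * 80
sigma = time_envelope(frame)
```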

FIG. 5 represents a hierarchical audio decoder associated with the encoder which has just been described with reference to FIGS. 2 and 3.

The bits describing each 20 ms frame are demultiplexed by the demultiplexer 500. The bit stream of the 8 and 12 kbit/s layers is used by the CELP decoding module 501 to generate the synthesis parameters of the excitation signal in the low band. The low band synthetic speech signal is then post-filtered by the block 502.

The portion of the bit stream associated with the 13.65 kbit/s layer is decoded by the band extension module 503. The wideband output signal, sampled at 16 kHz, is obtained through the synthesis QMF filter bank 504, 505, 507, 508 and 509, incorporating the inverse spectral folding 506.

The high band decoder 503 of FIG. 5 is described in detail in FIG. 6.

This decoder repeats the principle of synthesis of the high band described for the coder of FIG. 1, with however two modifications: a frequency envelope interpolation module 806 and a post-processing module 808. These two frequency envelope interpolation and post-processing modules are intended to improve the quality of coding in the high band. The module 806 interpolates between the frequency envelope of the preceding frame and the frequency envelope of the current frame so that this envelope evolves every 10 ms, instead of 20 ms.

The high band decoder of FIG. 6 demultiplexes, in the demultiplexer 800, the parameters received in the bit stream and decodes the time and frequency envelope information in the decoding modules 801 and 802. A synthetic excitation signal is generated in a reconstruction module 803 from the CELP excitation parameters received in the 8 and 12 kbit/s layers. This excitation is filtered by the low-pass filter 804 to keep only the frequencies between 0 and 3000 Hz, which correspond to the 4000 to 7000 Hz band of the original signal. As in FIG. 1, the synthetic excitation signal is shaped by the modules 805 and 807:

- the output of the temporal shaping module 805 ideally has an effective (r.m.s.) value per subframe that corresponds to the decoded time envelope; the module 805 therefore amounts to applying a time-adaptive gain,

- the output of the frequency shaping module 807 ideally has an effective (r.m.s.) value per subband that corresponds to the decoded frequency envelope; the module 807 can be realized by means of a filter bank or an overlapped transform.

The signal resulting from the shaping of the excitation is finally processed by the post-processing module 808 to obtain the reconstructed high band y.
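The time-adaptive gain of the temporal shaping module 805 can be sketched as below, under the assumption that each subframe is simply rescaled so that its r.m.s. value equals the decoded envelope value. The description only states the ideal outcome, not the gain computation, so the function name and the handling of silent subframes are hypothetical.

```python
import math

SUBFRAME_LEN = 10  # samples per subframe


def shape_in_time(excitation, sigma):
    """Scale each subframe so that its r.m.s. value equals the decoded sigma."""
    shaped = []
    for i, target in enumerate(sigma):
        sub = excitation[i * SUBFRAME_LEN:(i + 1) * SUBFRAME_LEN]
        rms = math.sqrt(sum(s * s for s in sub) / len(sub))
        gain = target / rms if rms > 0.0 else 0.0  # silent subframe: zero gain
        shaped.extend(gain * s for s in sub)
    return shaped


# Two subframes of excitation with r.m.s. 1.0, shaped to 0.5 and 3.0.
excitation = [1.0, -1.0] * 10
shaped = shape_in_time(excitation, [0.5, 3.0])
```

Matching the r.m.s. value per subframe does not bound individual samples: an excitation with one large sample in an otherwise quiet subframe can still exceed σ by a wide margin after scaling, which is consistent with the amplitude overshoots that the post-processing module 808 is meant to limit.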

The post-processing module 808 will now be described in detail.

The post-processing performed by the module 808 consists in applying an amplitude compression to the signal x coming from the frequency shaping module 807, so as to limit the amplitude of the signal and thus avoid the artifacts that could result from the lack of coupling between excitation and shaping. The output signal y of the post-processing module 808 is written as:

$$y = C(x) = \sigma \, F(x / \sigma)$$

where σ is the decoded time envelope. The properties of the post-processing proposed by the invention are as follows:

- this post-processing acts instantaneously, that is to say sample by sample, without introducing any processing delay,

- the trigger threshold for the amplitude compression is given by the time envelope as decoded by the time envelope decoding module 801. By definition, σ > 0,

- the post-processing is adaptive because the value of σ changes at each subframe of 10 samples, namely every 1.25 ms,

- the decoded time envelope for the current frame corresponds to a temporal support offset by 2 ms, i.e. 16 samples, as illustrated in FIG. 4. The adaptive post-processing therefore keeps in memory the r.m.s. value of the two subframes associated with the lookahead: these two subframes correspond to the two subframes at the beginning of the current frame.
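The sample-by-sample, subframe-adaptive nature of the relation y = σ·F(x/σ) can be sketched as follows. The normalized law `example_characteristic` (slope 1 on [-1, 1], slope 1/16 outside) is used purely for illustration, and the memory of the two lookahead subframes is not modeled; all names are hypothetical.

```python
SUBFRAME_LEN = 10  # sigma is updated every 10 samples (1.25 ms at 8 kHz)


def post_process(frame, sigma_per_subframe, F):
    """Compute y = sigma * F(x / sigma) sample by sample, with no delay."""
    out = []
    for n, x in enumerate(frame):
        sigma = sigma_per_subframe[n // SUBFRAME_LEN]  # sigma > 0 by definition
        out.append(sigma * F(x / sigma))
    return out


def example_characteristic(u):
    """Illustrative normalized law: slope 1 on [-1, 1], slope 1/16 outside."""
    if u > 1.0:
        return 1.0 + (u - 1.0) / 16.0
    if u < -1.0:
        return -1.0 + (u + 1.0) / 16.0
    return u


# First subframe stays below its threshold; second one overshoots sigma = 2.
frame = [0.5] * 10 + [17.0] * 10
y = post_process(frame, [1.0, 2.0], example_characteristic)
```

Because the threshold is read from the decoded envelope for each subframe, the same input amplitude may be left untouched in a loud subframe and compressed in a quiet one.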

The flowchart of FIG. 7 details a first post-processing compression function, denoted $C_1(x)$. The beginning and end of the calculation are identified by the blocks 1000 and 1006. The output value y is first initialized to x (block 1001). Two tests are then performed (blocks 1002 and 1004) to check whether y lies in the interval [-σ, +σ]. Three cases are possible:

- if y lies in the interval [-σ, +σ], the computation of y is complete: y = x, so that $C_1(x) = x$ and $F_1(x/\sigma) = x/\sigma$,

- if y > σ, its value is modified as defined in block 1003: the difference between y and +σ is attenuated by a factor of 16,

- if y < -σ, its value is modified as defined in block 1005: the difference between y and -σ is attenuated by a factor of 16.
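The three cases above translate into a short function, sketched here. The block numbers in the comments follow the order in which the text mentions them; the exact correspondence with the flowchart is inferred rather than confirmed.

```python
def compress_c1(x, sigma):
    """First compression function: slope 1 on [-sigma, sigma], 1/16 elsewhere."""
    y = x                                   # block 1001: initialize y to x
    if y > sigma:                           # block 1002: positive overshoot?
        y = sigma + (y - sigma) / 16.0      # block 1003: attenuate the excess
    elif y < -sigma:                        # block 1004: negative overshoot?
        y = -sigma + (y + sigma) / 16.0     # block 1005: attenuate the excess
    return y


inside = compress_c1(0.5, 1.0)       # within the threshold: unchanged
positive = compress_c1(17.0, 1.0)    # excess of 16 reduced to 1
negative = compress_c1(-17.0, 1.0)   # symmetric behavior
```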

To illustrate the operation $y = C_1(x)$, FIG. 8 shows the curve of y/σ as a function of x/σ. The data are normalized by σ to make the input/output characteristic independent of the value of σ. This normalized characteristic is denoted $F_1(x/\sigma)$; as a result, $C_1(x) = \sigma \, F_1(x/\sigma)$.

FIG. 8 clearly shows that the function $C_1(x)$ performs a symmetrical amplitude compression with a "trigger threshold" set at ±σ. More precisely, the slope of $F_1(x/\sigma)$ is 1 on [-1, +1] and 1/16 elsewhere. Equivalently, the slope of $C_1(x)$ is 1 on [-σ, +σ] and 1/16 elsewhere.
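The slopes just stated, together with continuity at the threshold, determine the normalized characteristic completely. Writing u = x/σ, it can be expressed in closed form as follows (a restatement derived from the slopes given in the text, not a formula quoted from the description):

```latex
F_1(u) =
\begin{cases}
  u, & |u| \le 1, \\[4pt]
  \operatorname{sgn}(u)\left(1 + \dfrac{|u| - 1}{16}\right), & |u| > 1,
\end{cases}
\qquad
C_1(x) = \sigma \, F_1\!\left(\frac{x}{\sigma}\right).
```

For example, an input at x = 17σ gives u = 17 and an output of σ(1 + 16/16) = 2σ.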

Two variants of the post-processing are described in FIGS. 9 to 12. The corresponding functions are denoted $C_2(x)$ and $C_3(x)$ respectively.

The post-processing $C_2(x)$ shown in FIGS. 9 and 10 is identical to $C_1(x)$ but with a "trigger threshold" that goes from ±σ to ±2σ. Thus, the slope of $C_2(x)$ is 1 on [-2σ, +2σ] and 1/16 elsewhere.
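A minimal sketch of this second variant, assuming it differs from the first function only in its threshold as the text states (the function name is illustrative):

```python
def compress_c2(x, sigma):
    """Second variant: slope 1 on [-2*sigma, 2*sigma], slope 1/16 elsewhere."""
    threshold = 2.0 * sigma
    if x > threshold:
        return threshold + (x - threshold) / 16.0
    if x < -threshold:
        return -threshold + (x + threshold) / 16.0
    return x


untouched = compress_c2(1.5, 1.0)     # between sigma and 2*sigma: unchanged
compressed = compress_c2(18.0, 1.0)   # excess of 16 over 2*sigma reduced to 1
```

Raising the threshold leaves more of the signal untouched, at the cost of letting through overshoots of up to twice the decoded envelope.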

The post-processing $C_3(x)$ is a more elaborate variant of $C_1(x)$, in which the amplitude compression is performed in two successive steps. As shown in FIG. 11, the trigger interval is still set to [-σ, +σ] (blocks 1402 and 1406), but the value of y beyond the threshold is attenuated only by a factor of 1/2, unless the value of y modified by the blocks 1403 and 1407 lies outside the interval [-2.5σ, +2.5σ], in which case the value of y is further modified by the blocks 1405 and 1409. The operation of $C_3(x)$ is illustrated in FIG. 12, where it can be seen that the slope of $C_3(x)$ is:

- 1/16 on [-∞, -4σ] and [4σ, +∞],

- 1/2 on [-4σ, -σ] and [σ, 4σ],

- 1 on [-σ, +σ].
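The two-step behavior can be sketched as below. The first step (halving the excess over σ) and the 2.5σ test come from the text; the second-step factor of 8 is not stated in the description and is derived here so that the overall slope beyond 4σ equals the 1/16 shown in FIG. 12 (1/2 × 1/8 = 1/16). It should be read as an assumption.

```python
def compress_c3(x, sigma):
    """Third variant: slopes 1, 1/2 and 1/16 with breakpoints at sigma and 4*sigma."""
    y = x
    if y > sigma:
        y = sigma + (y - sigma) / 2.0             # first step: slope 1/2
        if y > 2.5 * sigma:                       # reached when x > 4*sigma
            # Second step; the factor 8 is derived, not quoted (see lead-in).
            y = 2.5 * sigma + (y - 2.5 * sigma) / 8.0
    elif y < -sigma:
        y = -sigma + (y + sigma) / 2.0
        if y < -2.5 * sigma:
            y = -2.5 * sigma + (y + 2.5 * sigma) / 8.0
    return y


mild = compress_c3(3.0, 1.0)      # first step only
knee = compress_c3(4.0, 1.0)      # exactly at the second breakpoint
strong = compress_c3(20.0, 1.0)   # both steps applied
```

Between x = 4σ and x = 20σ the output rises from 2.5σ to 3.5σ, i.e. with the slope of 1/16 given for the outer segments.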

Claims

1. A method of post-processing, in an audio decoder, a signal reconstructed by temporal and frequency shaping (805, 807) of an excitation signal obtained from at least one parameter estimated in a first frequency band, said temporal and frequency shaping being carried out on the basis of, at least, a time envelope and a frequency envelope received and decoded (801, 802) in a second frequency band, characterized in that said method comprises, after said shaping (805, 807), the steps of comparing the amplitude of said reconstructed signal with said received and decoded time envelope (σ) and, if at least one threshold that is a function of said time envelope is exceeded, applying an amplitude compression to said reconstructed signal.

2. Method according to claim 1, characterized in that said received and decoded time envelope (σ) is defined as an effective (r.m.s.) value per subframe of the signal of the second frequency band ($x_{hi}$).

3. Method according to either of claims 1 and 2, characterized in that said amplitude compression consists in applying at least one linear attenuation to the amplitude of said reconstructed signal if said amplitude is greater than at least one trigger threshold that is a function of said received and decoded time envelope (σ).

4. Method according to one of claims 1 to 3, characterized in that said amplitude compression is performed according to a piecewise linear attenuation law triggered by a plurality of trigger thresholds that are functions of said received and decoded time envelope.

5. A computer program comprising program code instructions for implementing the post-processing method according to any one of claims 1 to 4 when said program is run on a computer.

6. Module for post-processing, in an audio decoder, a signal reconstructed by temporal and frequency shaping of an excitation signal obtained from at least one parameter estimated in a first frequency band, said temporal and frequency shaping being carried out on the basis of, at least, a time envelope and a frequency envelope received and decoded in a second frequency band, characterized in that said post-processing module (808) comprises a comparator for comparing the amplitude of said reconstructed signal with said received and decoded time envelope (σ), and amplitude compression means capable, if at least one threshold that is a function of said time envelope is exceeded, of applying an amplitude compression to said reconstructed signal.

7. Audio decoder, comprising a module (501) for estimating at least one parameter of an excitation signal in a first frequency band, a module (803) for reconstructing an excitation signal from said parameter, a module (801) for decoding a time envelope (σ) in a second frequency band, a module (802) for decoding a frequency envelope in a second frequency band, a module (805) for temporally shaping said excitation signal by means of, at least, said decoded time envelope (σ), and a module (807) for shaping said excitation signal in frequency by means of, at least, said decoded frequency envelope, characterized in that said decoder further comprises a post-processing module (808) according to claim 6.

8. Decoder according to claim 7, characterized in that it comprises a frequency envelope interpolation module (806).