EP1316087A1

EP1316087A1 - Transmission error concealment in an audio signal

Info

Publication number: EP1316087A1
Application number: EP01969857A
Authority: EP
Inventors: Balazs Kovesi; Dominique Massaloux; David Deleam
Original assignee: France Telecom SA
Current assignee: Orange SA
Priority date: 2000-09-05
Filing date: 2001-09-05
Publication date: 2003-06-04
Anticipated expiration: 2021-09-05
Also published as: AU2001289991A1; US20100070271A1; ATE382932T1; WO2002021515A1; DE60132217D1; JP2004508597A; ES2298261T3; HK1055346A1; JP5062937B2; DE60132217T2; US20040010407A1; EP1316087B1; FR2813722A1; US8239192B2; US7596489B2; FR2813722B1; IL154728A0; IL154728A

Abstract

A method of concealing transmission error in a digital audio signal, wherein a signal that has been decoded after transmission is received, the samples decoded while the transmitted data is valid are stored, at least one short-term prediction operator and one long-term prediction operator are estimated as a function of stored valid samples, and any missing or erroneous samples in the decoder signal are generated using the estimated operators. The energy of the synthesized signal that is thus generated is controlled by means of a gain that is computed and adapted sample by sample.

Description

CONCEALING TRANSMISSION ERRORS IN AN AUDIO SIGNAL

1. TECHNICAL FIELD

The present invention relates to techniques for concealing consecutive transmission errors in transmission systems using any type of digital coding of the speech and / or sound signal.

There are conventionally two main categories of coders:

- so-called time-domain coders, which compress the digitized signal sample by sample (for example PCM or ADPCM coders [DAUMER] [MASTER]);

- and parametric coders, which analyze successive frames of samples of the signal to be coded in order to extract from each frame a number of parameters that are then coded and transmitted (for example vocoders [TREMAIN], IMBE coders [HARDWICK], or transform coders [BRANDENBURG]).

There are intermediate categories which supplement the coding of the parameters representative of the parametric coders by the coding of a residual time waveform. To simplify, these coders can be classified in the category of parametric coders.

This category includes predictive coders, and in particular the family of analysis-by-synthesis coders such as RPE-LTP ([HELLWIG]) or CELP ([ATAL]).

For all these coders, the coded values are then converted into a bit stream that is sent over a transmission channel. Depending on the quality of this channel and the type of transport, disturbances can affect the transmitted signal and produce errors in the bit stream received by the decoder. These errors can occur in isolation in the bit stream, but very frequently they occur in bursts. In that case a whole packet of bits, corresponding to a complete portion of the signal, is erroneous or not received. This type of problem is encountered, for example, in transmissions over mobile networks. It is also encountered in transmissions over packet networks, in particular Internet-type networks.

When the transmission system or the modules responsible for reception detect that the received data is highly erroneous (for example on mobile networks), or that a data block has not been received (for example in packet transmission systems), error concealment procedures are implemented. These procedures make it possible to extrapolate, at the decoder, the samples of the missing signal from the signals and data available from the frames preceding and possibly following the erased zones.

Such techniques have been implemented mainly in the case of parametric coders (techniques for recovering erased frames). They make it possible to greatly limit the subjective degradation of the signal perceived at the decoder in the presence of erased frames. Most of the algorithms developed are based on the technique used for the coder and the decoder, and in fact constitute an extension of the decoder.

A general object of the invention is to improve, for any speech and sound compression system, the subjective quality of the speech signal restored at the decoder when, due to poor quality of the transmission channel or due to the loss or non-reception of a packet in a packet transmission system, a set of consecutive coded data has been lost.

To this end, it proposes a technique making it possible to conceal successive transmission errors (error packets) whatever the coding technique used, the proposed technique being able to be used for example in the case of time coders whose structure lends itself less well a priori to the concealment of error packets.

2. STATE OF THE PRIOR ART

Most predictive coding algorithms offer techniques for recovering erased frames ([GSM-FR], [REC G.723.1A], [SALAMI], [HONKANEN], [COX-2], [CHEN-2], [CHEN-3], [CHEN-4], [CHEN-5], [CHEN-6], [CHEN-7], [KROON-2], [WATKINS]). The decoder is informed of the occurrence of an erased frame in one way or another, for example, in mobile radio systems, by the frame erasure information supplied by the channel decoder. The purpose of devices for recovering erased frames is to extrapolate the parameters of the erased frame from the last previous frame or frames considered to be valid. Some parameters manipulated or coded by predictive coders have a strong inter-frame correlation (this is the case of the short-term prediction parameters, also called "LPC" for "Linear Predictive Coding" (see [RABINER]), which represent the spectral envelope, and of the long-term prediction parameters for voiced sounds, for example). Because of this correlation, it is much more advantageous to reuse the parameters of the last valid frame to synthesize the erased frame than to use erroneous or random parameters.

For the CELP coding algorithm (from "Code Excited Linear Prediction", refer to [RABINER]), the parameters of the erased frame are conventionally obtained as follows:

- the LPC filter is obtained from the LPC parameters of the last valid frame, either by copying the parameters or by introducing a certain damping (cf. the G.723.1 coder [REC G.723.1A]);

- the voicing is detected to determine the degree of harmonicity of the signal at the level of the erased frame ([SALAMI]), this detection taking place as follows:

  - in the case of an unvoiced signal: an excitation signal is generated randomly (drawing of a code word with a slightly damped past excitation gain [SALAMI], random selection in the past excitation [CHEN], use of the transmitted codes, which may be totally erroneous [HONKANEN], ...);

  - in the case of a voiced signal: the LTP delay is generally the delay calculated in the previous frame, possibly with a slight "jitter" ([SALAMI]), the LTP gain being taken very close to 1 or equal to 1. The excitation signal is limited to the long-term prediction made from the past excitation.

In all the examples cited above, the procedures for concealing erased frames are strongly linked to the decoder and use modules of this decoder, such as the signal synthesis module. They also use intermediate signals available within this decoder, such as the past excitation signal stored during the processing of the valid frames preceding the erased frames.

Most of the methods used to conceal the errors produced by packets lost during the transport of data coded by time-domain coders use waveform substitution techniques such as those presented in [GOODMAN], [ERDÔL], [AT&T]. Methods of this type reconstruct the signal by selecting portions of the signal decoded before the lost period and do not use synthesis models. Smoothing techniques are also used to avoid the artifacts produced by the concatenation of the different signals.

For transform coders, the techniques for reconstructing erased frames are also based on the coding structure used: algorithms, such as [PICTEL, MAHIEUX-2], aim to regenerate the lost transformed coefficients from the values taken by these coefficients before erasure.

The method described in [PARIKH] can be applied to any type of signal; it is based on the construction of a sinusoidal model from the valid decoded signal preceding the erasure, to regenerate the part of the signal lost.

Finally, there is a family of techniques for concealing erased frames developed in conjunction with channel coding. These methods, such as that described in [FINGSCHEIDT], use information provided by the channel decoder, for example information concerning the degree of reliability of the received parameters. They are fundamentally different from the present invention, which does not presuppose the existence of a channel coder.

The prior art that can be considered closest to the present invention is described in [COMBESCURE], which proposed, for a transform coder, a method for concealing erased frames equivalent to that used in CELP coders. The disadvantages of the proposed method were the introduction of audible spectral distortions ("synthetic" voice, parasitic resonances, ...), due in particular to the use of poorly controlled long-term synthesis filters (single harmonic component in voiced sounds, generation of the excitation signal limited to the use of portions of the past residual signal). In addition, the energy control in [COMBESCURE] was carried out at the level of the excitation signal, and the energy target of this signal was kept constant throughout the duration of the erasure, which also generated annoying artifacts.

3. PRESENTATION OF THE INVENTION

The invention allows for the concealment of erased frames without marked distortion at higher error rates and / or for longer erased intervals.

It notably proposes a method for concealing a transmission error in a digital audio signal, according to which a decoded signal is received after transmission, the decoded samples are stored when the transmitted data are valid, at least one short-term prediction operator and at least one long-term prediction operator are estimated as a function of the stored valid samples, and any missing or erroneous samples in the decoded signal are generated using the operators thus estimated.

According to a first particularly advantageous aspect of the invention, the energy of the synthesis signal thus generated is controlled using a gain calculated and adapted sample by sample.

This contributes in particular to improving the performance of the technique on erasure zones of a longer duration.

In particular, the gain for controlling the synthesis signal is advantageously calculated as a function of at least one of the following parameters: energy values previously stored for the samples corresponding to valid data, fundamental period for the voiced sounds, or any parameter characterizing the frequency spectrum.

Also advantageously, the gain applied to the synthesis signal decreases progressively as a function of the duration during which the synthesis samples are generated.

Also preferably, stationary and non-stationary sounds are distinguished in the valid data, and different adaptation laws for this gain (different decay rates, for example) are applied, on the one hand, to the samples generated following valid data corresponding to stationary sounds and, on the other hand, to the samples generated following valid data corresponding to non-stationary sounds.

According to another independent aspect of the invention, the contents of the memories used for the decoding processing are updated as a function of the synthesis samples generated.

In this way, the possible desynchronization of the coder and the decoder is limited (see paragraph 5.1.4 below), and sudden discontinuities between the erased zone reconstructed according to the invention and the samples following this zone are avoided.

In particular, a coding analogous to that implemented at the transmitter is implemented at least partially on the synthesized samples possibly followed by a decoding operation (possibly partial), the data obtained serving to regenerate the memories of the decoder.

In particular, this possibly partial coding-decoding operation can advantageously be used to regenerate the first erased frame, because it makes it possible to exploit the content of the decoder memories before the interruption when these memories contain information not provided by the last valid decoded samples (for example in the case of overlap-add transform coders, see paragraph 5.2.2.2.1 point 10).

According to yet another aspect of the invention, an excitation signal is generated at the input of the short-term prediction operator which, in a voiced zone, is the sum of a harmonic component and a weakly harmonic or non-harmonic component and, in an unvoiced zone, is limited to a non-harmonic component.

In particular, the harmonic component is advantageously obtained by filtering, by means of the long-term prediction operator, a residual signal calculated by applying inverse short-term filtering to the stored samples.

The other component can be determined with the aid of a long-term prediction operator to which pseudo-random disturbances (for example gain or period disturbances) are applied.

In a particularly preferred manner, for the generation of a voiced excitation signal, the harmonic component represents the low frequencies of the spectrum, while the other component represents the high frequency part.

According to yet another aspect, the long-term prediction operator is determined from the samples of valid stored frames, with a number of samples used for this estimation varying between a minimum value and a value equal to at least twice the estimated fundamental period for voiced sound.

Furthermore, the residual signal is advantageously modified by treatments of the non-linear type to eliminate amplitude peaks.

Also, according to another advantageous aspect, voice activity is detected by estimating noise parameters when the signal is considered to be non-active, and parameters of the synthesized signal are made to tend towards those of the estimated noise.

Even more preferably, the spectral envelope of the noise is estimated from the valid decoded samples, and a synthesized signal is generated that evolves towards a signal having the same spectral envelope.

The invention also provides a method for processing sound signals, characterized in that a discrimination is made between speech and musical sounds and when musical sounds are detected, a method of the aforementioned type is implemented without the estimation of a long-term prediction operator, the excitation signal being limited to a non-harmonic component obtained for example by generating uniform white noise.

The invention further relates to a device for concealing a transmission error in a digital audio signal, which receives as input a decoded signal transmitted to it by a decoder and which generates missing or erroneous samples in this decoded signal, characterized in that it includes processing means capable of implementing the above method.

It also relates to a transmission system comprising at least one encoder, at least one transmission channel, a module capable of detecting that transmitted data has been lost or is greatly erroneous, at least one decoder and an error concealment device which receives the decoded signal, characterized in that this error concealment device is a device of the aforementioned type.

4. PRESENTATION OF THE FIGURES

Other characteristics and advantages of the invention will emerge from the following description, which is purely illustrative and not limiting and should be read in conjunction with the accompanying drawings in which:

- Figure 1 is a block diagram illustrating a transmission system according to a possible embodiment of the invention;

- Figure 2 and Figure 3 are block diagrams illustrating an implementation according to a possible embodiment of the invention;

- Figures 4 to 6 schematically illustrate the windows used with the error concealment method according to a possible embodiment of the invention;

- Figures 7 and 8 are schematic representations illustrating a possible embodiment of the invention in the case of musical signals.

5. DESCRIPTION OF ONE OR MORE POSSIBLE EMBODIMENTS OF THE INVENTION

5.1 Principle of a possible embodiment

FIG. 1 shows a device for coding and decoding the digital audio signal, comprising an encoder 1, a transmission channel 2, a module 3 making it possible to detect that the transmitted data has been lost or is strongly erroneous, a decoder 4, and a module 5 for concealing errors or lost packets in accordance with a possible embodiment of the invention.

It will be noted that this module 5, in addition to the indication of erased data, receives the decoded signal during valid periods and transmits to the decoder the signals used to update it.

More specifically, the processing implemented by module 5 is based on:

1. memorization of the decoded samples when the transmitted data are valid (processing 6);

2. during a block of erased data, the synthesis of the samples corresponding to the lost data (processing 7);

3. when the transmission is restored, the smoothing between the synthesis samples produced during the erased period and the decoded samples (processing 8);

4. updating the memories of the decoder (processing 9) (updating which is carried out either during the generation of the erased samples, or at the time of re-establishment of the transmission).

5.1.1 During a valid period

After decoding the valid data, the memory of the decoded samples is updated, containing a sufficient number of samples for the regeneration of any periods erased subsequently. Typically, a signal of the order of 20 to 40 ms is stored. The energy of the valid frames is also calculated and the energies corresponding to the last valid frames processed are stored in memory (typically of the order of 5 s).

5.1.2 During a block of erased data

The following operations are carried out, illustrated in FIG. 3:

1. Estimation of the current spectral envelope: This spectral envelope is computed in the form of an LPC filter [RABINER] [KLEIJN]. The analysis is carried out by conventional methods ([KLEIJN]) after windowing of the samples stored during the valid period. In particular, an LPC analysis is implemented (step 10) to obtain the parameters of a filter A(z), the inverse of which is used for LPC filtering (step 11). As the coefficients thus calculated do not have to be transmitted, a high order can be used for this analysis, which makes it possible to obtain good performance on music signals.

2. Detection of voiced sounds and calculation of LTP parameters:

A method for detecting voiced sounds (processing 12 in FIG. 3: V/NV detection, for "voiced/unvoiced") is applied to the last stored data. For example, the normalized correlation ([KLEIJN]) can be used for this purpose, or the criterion presented in the embodiment example that follows.

When the signal is declared voiced, the parameters allowing the generation of a long-term synthesis filter, also called LTP filter, are calculated ([KLEIJN]) (Figure 3: LTP analysis; B(z) denotes the calculated inverse LTP filter). Such a filter is generally represented by a period corresponding to the fundamental period and a gain. The accuracy of this filter can be improved by using a fractional pitch or a multi-coefficient structure [KROON].

When the signal is declared unvoiced, a particular value is assigned to the LTP synthesis filter (see paragraph 4). In this estimation of the LTP synthesis filter, it is particularly advantageous to restrict the analyzed zone to the end of the period preceding the erasure. The length of the analysis window varies between a minimum value and a value linked to the fundamental period of the signal.

3. Calculation of a residual signal

A residual signal is calculated by inverse LPC filtering (processing 10) of the last stored samples. This signal is then used to generate an excitation signal for the LPC synthesis filter 11 (see below).
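As an illustration of step 3, the following minimal Python sketch computes a residual by passing samples through A(z) and checks that the synthesis filter 1/A(z) restores them. The function names, the zero initial filter state, and the coefficient convention (a[0] = 1) are assumptions made for this example and are not taken from the text.

```python
def lpc_residual(x, a):
    """Inverse LPC filtering: e[n] = sum_k a[k] * x[n-k], with a[0] == 1.

    Samples before the start of x are treated as zero (assumption).
    """
    return [
        sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
        for n in range(len(x))
    ]


def lpc_synthesis(e, a):
    """Synthesis filtering 1/A(z): y[n] = e[n] - sum_{k>=1} a[k] * y[n-k]."""
    y = []
    for n in range(len(e)):
        feedback = sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(e[n] - feedback)
    return y


# A first-order example: x decays by 0.5 per sample, so A(z) = 1 - 0.5 z^-1
# reduces it to a single impulse, and 1/A(z) rebuilds the original samples.
x = [1.0, 0.5, 0.25, 0.125]
a = [1.0, -0.5]
e = lpc_residual(x, a)
y = lpc_synthesis(e, a)
```

In a real concealment module the filter state would be carried over from the stored samples instead of being reset to zero.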

4. Synthesis of the missing samples:

The synthesis of the replacement samples is carried out by feeding an excitation signal (calculated in 13 from the signal at the output of the inverse LPC filter) into the LPC synthesis filter 11 (1/A(z)) calculated in 1. This excitation signal is generated in two different ways depending on whether the signal is voiced or unvoiced:

4.1 In the voiced area

The excitation signal is the sum of two signals, one strongly harmonic component and the other less harmonic or not at all.

The strongly harmonic component is obtained by LTP filtering (processing module 14) using the parameters calculated in 2, of the residual signal mentioned in 3. The second component can also be obtained by LTP filtering but made non-periodic by random modifications of the parameters, by generation of a pseudo-random signal.

It is particularly interesting to limit the bandwidth of the first component to the low frequencies of the spectrum. Similarly, it will be interesting to limit the second component to the highest frequencies.
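The two-component voiced excitation described in 4.1 can be sketched as follows in Python. This is only an illustration under strong assumptions: the LTP filter is reduced to a pure repetition of the last pitch period (gain 1), the "random modifications" are reduced to pseudo-random sign flips, and the band split uses a crude two-tap average as low-pass filter with its complement as high-pass filter. All names and parameter values are hypothetical.

```python
import random


def periodic_extension(residual, period, length):
    """Repeat the last `period` residual samples (LTP filter with gain 1)."""
    last = residual[-period:]
    return [last[i % period] for i in range(length)]


def lowpass(x):
    """Crude two-tap moving average used as a placeholder low-pass filter."""
    return [0.5 * (x[n] + (x[n - 1] if n > 0 else 0.0)) for n in range(len(x))]


def voiced_excitation(residual, period, length, flip_prob=0.5, seed=0):
    """Low band from the harmonic component, high band from a disturbed copy."""
    rng = random.Random(seed)
    harmonic = periodic_extension(residual, period, length)
    # Pseudo-random sign flips make the second component weakly harmonic.
    disturbed = [-v if rng.random() < flip_prob else v for v in harmonic]
    low = lowpass(harmonic)
    # Complementary high-pass: input minus its low-passed version.
    high = [d - l for d, l in zip(disturbed, lowpass(disturbed))]
    return [l + h for l, h in zip(low, high)]


residual = [9.0, 1.0, 2.0, 3.0]
excitation = voiced_excitation(residual, period=3, length=6)
```

With `flip_prob=0.0` the disturbed component equals the harmonic one and the two bands add back to the purely periodic signal, which gives a simple sanity check of the band split.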

4.2 In unvoiced areas:

When the signal is unvoiced, a non-harmonic excitation signal is generated. It is interesting to use a generation method similar to that used for voiced sounds, with variations in parameters (period, gain, signs) allowing it to be made non-harmonic.

4.3 Control of the amplitude of the residual signal:

When the signal is unvoiced, or weakly voiced, the residual signal used for generating the excitation is processed to eliminate the amplitude peaks significantly above the average.
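One simple non-linear treatment of this kind is hard clipping relative to the mean absolute amplitude, sketched below in Python. The clipping rule and the threshold factor are assumptions chosen for illustration; the text only states that peaks significantly above the average are eliminated.

```python
def limit_peaks(residual, factor=3.0):
    """Clip samples whose magnitude exceeds `factor` times the mean magnitude.

    `factor` is a hypothetical tuning value, not a value given in the text.
    """
    if not residual:
        return []
    mean_abs = sum(abs(x) for x in residual) / len(residual)
    limit = factor * mean_abs
    return [max(-limit, min(limit, x)) for x in residual]


# Mean magnitude is 3.0, so with factor 2.0 the peak of 9.0 is clipped to 6.0.
clipped = limit_peaks([1.0, -1.0, 1.0, 9.0], factor=2.0)
```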

5. Control of the energy of the synthesis signal

The energy of the synthesis signal is controlled using a gain calculated and adapted sample by sample. In the case where the erasure period is relatively long, it is necessary to gradually lower the energy of the synthesis signal. The gain adaptation law is calculated according to different parameters: stored energy values before erasure (see in 1), fundamental period, and local stationarity of the signal at the time of cutting.

If the system includes a module allowing the discrimination of stationary (like music) and non-stationary (like speech) sounds, different adaptation laws can also be used.

In the case of overlap-add transform coders, the first half of the memory of the last correctly received frame contains fairly precise information on the first half of the first lost frame (its weight in the overlap-add is greater than that of the current frame). This information can also be used to calculate the adaptive gain.
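A gain that is updated at every sample, as described in step 5, can be sketched as follows in Python. The exponential decay law and the rate values are assumptions for illustration only; the text states that the adaptation law depends on stored energies, the fundamental period and local stationarity, without fixing a formula.

```python
def apply_decaying_gain(synth, start_gain=1.0, decay_per_sample=0.9995, floor=0.0):
    """Scale each synthesized sample by a gain that is updated sample by sample.

    Returns the scaled samples and the gain to resume from if the erasure
    continues into the next frame. The exponential law is an assumption.
    """
    out = []
    gain = start_gain
    for s in synth:
        out.append(gain * s)
        gain = max(floor, gain * decay_per_sample)
    return out, gain


def decay_rate(stationary):
    """Hypothetical rates: a slower decay after stationary sounds."""
    return 0.9999 if stationary else 0.9995


scaled, next_gain = apply_decaying_gain([1.0] * 4, decay_per_sample=0.5)
```

Returning the final gain lets a long erasure be attenuated continuously across frame boundaries instead of restarting from the initial level at each lost frame.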

6. Evolution of the synthesis procedure over time:

In the case of relatively long erasure periods, the synthesis parameters can also be changed. If the system is coupled to a voice activity detection device with estimation of the noise parameters (such as [REC-G.723.1A], [SALAMI-2], [BENYASSINE]), it is particularly advantageous to make the generation parameters of the signal to be reconstructed tend towards those of the estimated noise, in particular at the level of the spectral envelope (interpolation of the LPC filter with that of the estimated noise, the interpolation coefficients evolving over time until the noise filter is obtained) and of the energy (level progressively evolving towards that of the noise, for example by windowing).

5.1.3 When transmission is restored

When transmission is re-established, it is particularly important to avoid sudden breaks between the erased period, which has been reconstructed according to the techniques defined in the preceding paragraphs, and the periods which follow, during which all the transmitted information is available to decode the signal. The present invention performs a weighting in the time domain, with interpolation between the replacement samples preceding the re-establishment of the communication and the valid decoded samples following the erased period. This operation is a priori independent of the type of coder used.

In the case of overlap-add transform coders, this operation is combined with the updating of the memories described in the following paragraph (see embodiment).
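The time-domain weighting with interpolation described in this section can be illustrated by a linear cross-fade, sketched below in Python. It assumes that the synthesis is continued for a few samples beyond the erased period so that replacement samples and decoded samples overlap; the linear weights and the function name are choices made for this example.

```python
def crossfade(synth_tail, decoded_head):
    """Blend continued replacement samples into the first valid decoded samples.

    The weight of the decoded signal rises linearly over the overlap; samples
    of `decoded_head` beyond the overlap are passed through unchanged.
    """
    n = min(len(synth_tail), len(decoded_head))
    out = []
    for i in range(n):
        w = (i + 1) / (n + 1)
        out.append((1.0 - w) * synth_tail[i] + w * decoded_head[i])
    return out + list(decoded_head[n:])


# Fading from a synthesized level of 0.0 to a decoded level of 4.0.
smoothed = crossfade([0.0, 0.0, 0.0], [4.0, 4.0, 4.0, 4.0])
```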

5.1.4 Updating the decoder memories

When the decoding of valid samples resumes after an erased period, there may be a degradation when the decoder uses data normally produced in the previous and stored frames. It is important to properly update these memories to avoid these artifacts.

This is particularly important for coding structures using recursive methods which, for a sample or a sequence of samples, use information obtained after decoding the previous samples. These are, for example, predictions ([KLEIJN]) which make it possible to extract the redundancy of the signal. This information is normally available both at the coder, which for this purpose must have performed a form of local decoding of these preceding samples, and at the remote decoder present at reception. As soon as the transmission channel is disturbed and the remote decoder no longer has the same information as the local decoder present at transmission, there is desynchronization between the coder and the decoder. In the case of highly recursive coding systems, this desynchronization can cause audible degradations which can last a long time or even increase over time if there are instabilities in the structure. In this case, it is therefore important to endeavor to resynchronize the coder and the decoder, that is to say, to estimate decoder memories that are as close as possible to those of the coder. However, resynchronization techniques depend on the coding structure used. One technique is presented here whose principle is general, but whose complexity is potentially significant.

One possible method consists in introducing into the decoder, at reception, a coding module of the same type as that present at transmission, making it possible to carry out the coding-decoding of the samples of the signal produced by the techniques mentioned in the preceding paragraph during the erased periods. In this way, the memories necessary to decode the following samples are filled with data that are a priori close (subject to a certain stationarity during the erased period) to those which have been lost. In the event that this assumption of stationarity does not hold, after a long erased period for example, there is in any case not enough information to do better.

In fact, it is generally not necessary to carry out the complete coding of these samples; it is sufficient to run the modules necessary to update the memories.

This update can be carried out at the time the replacement samples are produced, which spreads the complexity over the entire erasure zone but adds to that of the synthesis procedure described above. When the coding structure allows it, the above procedure can also be limited to an intermediate zone at the start of the period of valid data following an erased period, the updating procedure then being combined with the decoding operation.

5.2. Description of specific implementation examples

Specific examples of possible implementation are given below. The case of TDAC or MDCT ([MAHIEUX]) type transform coders is particularly addressed.

5.2.1 Description of the device

TDAC type digital coding / decoding system.

Wideband coder (50-7000 Hz) at 24 kb/s or 32 kb/s. 20 ms frames (320 samples).

40 ms windows (640 samples) with 20 ms overlap-add. A binary frame contains the coded parameters obtained by the TDAC transformation on one window. After decoding these parameters and applying the inverse TDAC transformation, an output frame of 20 ms is obtained which is the sum of the second half of the previous window and the first half of the current window. In FIG. 4, the two window parts used for the reconstruction of frame n (in time) are marked in bold. Thus, a lost binary frame disturbs the reconstruction of two consecutive frames (the current one and the next one, Figure 5). On the other hand, by correctly replacing the lost parameters, it is possible to recover the parts of the information coming from the preceding and following binary frames (FIG. 6) for the reconstruction of these two frames.
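The overlap-add reconstruction just described can be sketched as follows in Python. The sketch assumes that each inverse-transformed window is already weighted by its synthesis window, and the function names are hypothetical. It only illustrates why one lost binary frame disturbs two consecutive output frames.

```python
WINDOW = 640  # 40 ms at 16 kHz
FRAME = 320   # 20 ms at 16 kHz


def overlap_add(prev_window, cur_window):
    """Output frame n = second half of window n-1 + first half of window n."""
    half = len(cur_window) // 2
    return [prev_window[half + i] + cur_window[i] for i in range(half)]


def frames_disturbed_by(lost_window_index):
    """A lost window contributes to output frame n and to output frame n+1."""
    return [lost_window_index, lost_window_index + 1]


frame = overlap_add([1.0] * WINDOW, [2.0] * WINDOW)
```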

5.2.2 Implementation

All the operations described below are implemented on reception, in accordance with FIGS. 1 and 2, either within the module for concealing erased frames which communicates with the decoder, or in the decoder itself (updating of memories decoder).

5.2.2.1 During a valid period

Corresponding to paragraph 5.1.2, the memory of the decoded samples is updated. This memory is used for the LPC and LTP analyses of the past signal in the event of erasure of a binary frame. In the example presented here, the LPC analysis is performed over a signal period of 20 ms (320 samples). In general, LTP analysis requires more samples to be stored. In our example, to be able to do the LTP analysis correctly, the number of samples stored is equal to twice the maximum value of the pitch. For example, if the maximum pitch value MaxPitch is fixed at 320 samples (50 Hz, 20 ms), the last 640 samples are stored (40 ms of signal). The energy of the valid frames is also calculated and stored in a circular buffer 5 s long. When an erased frame is detected, the energy of the last valid frame is compared with the maximum and minimum of this circular buffer in order to determine its relative energy.
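The bookkeeping performed during a valid period can be sketched as follows in Python, using bounded deques as circular buffers. The class name, the use of mean-square energy, and the linear scale used for the relative energy are assumptions made for this example; the text only states that the last frame energy is compared with the minimum and maximum of the buffer.

```python
from collections import deque


class ValidPeriodMemory:
    """Keeps the last 2*MaxPitch samples and about 5 s of frame energies."""

    def __init__(self, max_pitch=320, frame_len=320, fs=16000, history_seconds=5.0):
        self.samples = deque(maxlen=2 * max_pitch)
        frames = int(round(history_seconds * fs / frame_len))
        self.energies = deque(maxlen=frames)

    def push_valid_frame(self, frame):
        """Store a decoded frame and its mean-square energy."""
        self.samples.extend(frame)
        self.energies.append(sum(x * x for x in frame) / len(frame))

    def relative_energy(self):
        """Position of the last frame energy between buffer min and max (0..1)."""
        lo, hi = min(self.energies), max(self.energies)
        if hi == lo:
            return 0.5
        return (self.energies[-1] - lo) / (hi - lo)


mem = ValidPeriodMemory()
for level in (0.0, 2.0, 1.0):
    mem.push_valid_frame([level] * 320)
```

With 20 ms frames, a 5 s history corresponds to 250 stored energies.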

5.2.2.2 During a block of deleted data

When a bit frame is lost, there are two different cases:

5.2.2.2.1 First bit frame lost after a valid period

First, we analyze the stored signal to estimate the parameters of the model used to synthesize the regenerated signal. This model then allows us to synthesize 40 ms of signal, which corresponds to the lost 40 ms window. By doing the TDAC transformation followed by the TDAC reverse transformation on this synthesized signal (without coding - decoding of the parameters), the 20 ms output signal is obtained. With these TDAC - reverse TDAC operations, the information from the previous window correctly received is used (see FIG. 6). At the same time, the memories of the decoder are updated. Thus, the next bit frame, if received, can be decoded normally, and the decoded frames will be automatically synchronized (Figure 6). The operations to be carried out are as follows:

1. Windowing of the stored signal. For example, an asymmetrical Hamming window of 20 ms can be used.

2. Calculation of the autocorrelation function of the windowed signal.

3. Determination of the coefficients of the LPC filter. For this, the iterative Levinson-Durbin algorithm is conventionally used. The order of analysis can be high, especially when the coder is used to code music sequences.
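Steps 1 to 3 can be sketched as follows in pure Python. For simplicity the sketch uses a symmetric Hamming window rather than the asymmetrical one mentioned above, and the function names and coefficient convention (a[0] = 1) are assumptions of this example.

```python
import math


def hamming(n):
    """Symmetric Hamming window of length n >= 2 (stand-in for the asymmetric one)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def autocorrelation(x, order):
    """r[k] = sum_i x[i] * x[i-k] for k = 0..order."""
    n = len(x)
    return [sum(x[i] * x[i - k] for i in range(k, n)) for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.

    Returns the coefficient list a (with a[0] == 1) and the prediction error.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate input (e.g. all zeros): stop early
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        k_m = -acc / err  # reflection coefficient
        new_a = a[:]
        for k in range(1, m):
            new_a[k] = a[k] + k_m * a[m - k]
        new_a[m] = k_m
        a = new_a
        err *= 1.0 - k_m * k_m
    return a, err


def lpc_analysis(frame, order):
    """Window the frame, compute its autocorrelation, run Levinson-Durbin."""
    windowed = [w * x for w, x in zip(hamming(len(frame)), frame)]
    return levinson_durbin(autocorrelation(windowed, order), order)


# Autocorrelation of a first-order process with coefficient 0.5.
a, err = levinson_durbin([1.0, 0.5, 0.25], 2)
```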

4. Voicing detection and long-term analysis of the stored signal to model the possible periodicity of the signal (voiced sounds). In the embodiment presented, the inventors limited the estimate of the fundamental period Tp to integer values, and calculated an estimate of the degree of voicing in the form of the correlation coefficient MaxCorr (see below) evaluated at the selected period. Let Tm = max(T, Fs/200), where Fs is the sampling frequency, so that Fs/200 samples correspond to a duration of 5 ms. To better model the evolution of the signal at the end of the previous frame, the correlation coefficients Corr(T) corresponding to a delay T are calculated using only 2*Tm samples at the end of the stored signal:

where m_0, ..., m_{Lmem-1} is the memory of the previously decoded signal. From this formula, it can be seen that the length Lmem of this memory must be at least 2 times the maximum value MaxPitch of the fundamental period (also called "pitch"). The minimum value MinPitch of the fundamental period, corresponding to a frequency of 600 Hz (26 samples at Fs = 16 kHz), has also been fixed. Corr(T) is calculated for T = 2, ..., MaxPitch. If T' is the smallest delay such that Corr(T') < 0 (very short-term correlations are thus eliminated), then MaxCorr, the maximum of Corr(T) for T' < T <= MaxPitch, is sought. Let Tp be the period corresponding to MaxCorr (Corr(Tp) = MaxCorr). MaxCorrMP, the maximum of Corr(T) for T' < T <= 0.75 * MinPitch, is also sought. If Tp < MinPitch or MaxCorrMP > 0.7 * MaxCorr, and if the energy of the last valid frame is relatively low, the frame is declared unvoiced, because using the LTP prediction could produce a very annoying resonance in the high frequencies. The chosen pitch is then Tp = MaxPitch / 2, and the correlation coefficient MaxCorr is set to a low value (0.25).

We also consider the frame as unvoiced when more than 80% of its energy is concentrated in the last MinPitch samples. This indicates a speech onset, but the number of samples is not sufficient to estimate a possible fundamental period; it is better to treat the frame as unvoiced, and even to decrease the energy of the synthesized signal (to signal this, we set DiminFlag = 1).

In the case where MaxCorr > 0.6, we check that we have not found a multiple (4, 3 or 2 times) of the fundamental period. To do so, we seek the local maximum of the correlation around Tp / 4, Tp / 3 and Tp / 2. Let T1 denote the position of this maximum, and MaxCorrL = Corr (T1). If T1 > MinPitch and MaxCorrL > 0.75 * MaxCorr, we choose T1 as the new fundamental period.

If Tp is less than MaxPitch / 2, we can check whether it is really a voiced frame by seeking the local maximum of the correlation around 2 * Tp (denoted Tpp) and checking whether Corr (Tpp) > 0.4. If Corr (Tpp) < 0.4 and if the signal energy is decreasing, we set DiminFlag = 1 and decrease the value of MaxCorr; otherwise we look for the next local maximum between the current Tp and MaxPitch.

Another voicing criterion consists in checking whether, in at least 2/3 of the cases, the signal delayed by the fundamental period has the same sign as the non-delayed signal. We verify this over a length equal to the larger of 5 ms and 2 * Tp.

We also check whether or not the energy of the signal tends to decrease. If so, we set DiminFlag = 1 and decrease the value of MaxCorr as a function of the degree of decrease.

The voicing decision also takes the signal energy into account: if the energy is high, the value of MaxCorr is increased, making it more likely that the frame is declared voiced. On the other hand, if the energy is very low, the value of MaxCorr is reduced.

Finally, we take the voicing decision according to the value of MaxCorr: the frame is unvoiced if and only if MaxCorr < 0.4. The fundamental period Tp of an unvoiced frame is bounded: it must be less than or equal to MaxPitch / 2.
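The basic pitch search of step 4 can be sketched as follows, for illustration only. The exact expression of Corr (T) is not reproduced in the text above, so the sketch assumes a normalized cross-correlation between the last Tm stored samples and the same samples delayed by T; this uses at most 2 * Tm samples and is consistent with a memory of at least 2 * MaxPitch samples. The names corr and search_pitch are hypothetical, and the refinements described above (MaxCorrMP, multiples of the period, energy trend, sign test) are deliberately left out.

```python
import math


def corr(mem, lag, fs):
    """Assumed normalized correlation at delay `lag` near the end of `mem`."""
    tm = max(lag, fs // 200)  # Tm = max(T, Fs / 200), i.e. at least 5 ms
    n = len(mem)
    start = n - tm
    if start - lag < 0:
        raise ValueError("memory too short for this delay")
    num = sum(mem[i] * mem[i - lag] for i in range(start, n))
    e_cur = sum(mem[i] ** 2 for i in range(start, n))
    e_del = sum(mem[i - lag] ** 2 for i in range(start, n))
    den = math.sqrt(e_cur * e_del)
    return num / den if den > 0.0 else 0.0


def search_pitch(mem, fs, max_pitch):
    """Return (Tp, MaxCorr), or (None, 0.0) if no usable delay is found."""
    c = {t: corr(mem, t, fs) for t in range(2, max_pitch + 1)}
    # T' is the smallest delay with a negative correlation; shorter delays
    # only reflect very short-term correlation and are skipped.
    t_neg = next((t for t in range(2, max_pitch + 1) if c[t] < 0.0), None)
    if t_neg is None or t_neg >= max_pitch:
        return None, 0.0
    tp = max(range(t_neg + 1, max_pitch + 1), key=lambda t: c[t])
    return tp, c[tp]
```

On a perfectly periodic signal, such a search may return either the true period or one of its multiples, which is why the check around Tp / 4, Tp / 3 and Tp / 2 is applied afterwards.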

5. Calculation of the residual signal by inverse LPC filtering of the last stored samples. This residual signal is stored in the memory ResMem.

6. Equalization of the energy of the residual signal. In the case of an unvoiced or weakly voiced signal (MaxCorr < 0.7), the energy of the residual signal stored in ResMem may change abruptly from one part to another. Repeating this excitation leads to a very unpleasant periodic disturbance in the synthesized signal.

To avoid this, we ensure that no significant amplitude peak occurs in the excitation of a weakly voiced frame. As the excitation is constructed from the last Tp samples of the residual signal, this vector of Tp samples is processed. The method used in our example is as follows:

- We calculate the mean MeanAmpl of the absolute values of the last Tp samples of the residual signal.

- If the vector of samples to be processed contains n zero crossings, it is cut into n + 1 sub-vectors, the sign of the signal in each sub-vector therefore being invariant.

- We seek the maximum amplitude MaxAmplSv of each sub-vector. If MaxAmplSv > 1.5 * MeanAmpl, we multiply the sub-vector by 1.5 * MeanAmpl / MaxAmplSv.

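The peak equalization of step 6 can be sketched as follows, as an illustration under stated assumptions. The function name equalize_peaks is hypothetical, and a sample equal to zero is arbitrarily treated as non-negative when locating the zero crossings.

```python
def equalize_peaks(res_tail, factor=1.5):
    """Limit amplitude peaks in the last Tp residual samples (res_tail)."""
    if not res_tail:
        return []
    mean_ampl = sum(abs(v) for v in res_tail) / len(res_tail)  # MeanAmpl
    out = list(res_tail)
    start = 0
    for i in range(1, len(out) + 1):
        # A sub-vector ends at the end of the data or at a sign change.
        if i == len(out) or (res_tail[i] >= 0.0) != (res_tail[i - 1] >= 0.0):
            max_sv = max(abs(v) for v in res_tail[start:i])  # MaxAmplSv
            if max_sv > factor * mean_ampl:
                scale = factor * mean_ampl / max_sv
                for j in range(start, i):
                    out[j] = res_tail[j] * scale
            start = i
    return out
```

Scaling a whole sub-vector between two zero crossings, rather than clipping individual samples, keeps the waveform continuous at the sub-vector boundaries.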
7. Preparation of the excitation signal with a length of 640 samples, corresponding to the length of the TDAC window. We distinguish 2 cases according to the voicing:

- In the case of a voiced frame, the excitation signal is the sum of two signals: a strongly harmonic component excb, band-limited to the low frequencies of the spectrum, and a less harmonic component exch, limited to the higher frequencies.

The strongly harmonic component is obtained by LTP filtering of order 3 of the residual signal:

excb (i) = 0.15 * exc (i-Tp-1) + 0.7 * exc (i-Tp) + 0.15 * exc (i-Tp+1)

The coefficients [0.15, 0.7, 0.15] correspond to a low-pass FIR filter with 3 dB attenuation at Fs / 4.

The second component is also obtained by LTP filtering, made non-periodic by random modification of its fundamental period Tph. Tph is chosen as the integer part of a random real value Tpa. The initial value of Tpa is equal to Tp; it is then modified sample by sample by adding a random value in [-0.5, 0.5]. In addition, this LTP filtering is combined with high-pass IIR filtering:

exch (i) = -0.0635 * (exc (i-Tph-1) + exc (i-Tph+1)) + 0.1182 * exc (i-Tph) - 0.9926 * exch (i-1) - 0.7679 * exch (i-2)

The voiced excitation is then the sum of these 2 components:

exc (i) = excb (i) + exch (i)

- In the case of an unvoiced frame, the excitation signal exc is also obtained by LTP filtering of order 3 with the coefficients [0.15, 0.7, 0.15], but it is made non-periodic by increasing the fundamental period by 1 every 10 samples and by inverting the sign with a probability of 0.2.

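As a minimal sketch of the strongly harmonic component of step 7, the order-3 LTP filtering can be written as below. It is assumed here, for illustration, that the delay line holds the stored residual followed by the component's own past output; the names harmonic_component and ltp_fir_gain_db are hypothetical. The high-band component and the unvoiced case are not covered.

```python
import math

LTP_TAPS = (0.15, 0.7, 0.15)


def harmonic_component(res_mem, tp, n):
    """Generate n samples by order-3 LTP filtering with period tp."""
    if tp < 2 or len(res_mem) < tp + 1:
        raise ValueError("need tp >= 2 and at least tp + 1 residual samples")
    buf = list(res_mem) + [0.0] * n
    base = len(res_mem)
    for i in range(base, base + n):
        buf[i] = (LTP_TAPS[0] * buf[i - tp - 1]
                  + LTP_TAPS[1] * buf[i - tp]
                  + LTP_TAPS[2] * buf[i - tp + 1])
    return buf[base:]


def ltp_fir_gain_db(freq_ratio):
    """Magnitude in dB of the taps [0.15, 0.7, 0.15] at freq_ratio * Fs."""
    w = 2.0 * math.pi * freq_ratio
    # The taps are symmetric, so the response is 0.7 + 0.3 * cos(w).
    return 20.0 * math.log10(abs(0.7 + 0.3 * math.cos(w)))
```

Evaluating ltp_fir_gain_db(0.25) gives about -3.1 dB: the response 0.7 + 0.3 * cos(w) equals 0.7 at a quarter of the sampling frequency, so each pass through the long-term predictor attenuates the upper half of the band more than the lower half.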
8. Synthesis of the replacement samples by feeding the excitation signal exc into the LPC synthesis filter calculated in step 3.
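The inverse filtering of step 5 and the synthesis filtering of step 8 form an analysis/synthesis pair, which can be sketched as follows. This is an illustration only: the function names, the convention A(z) = 1 + a[1] z^-1 + ... and the zero initial state used when no history is supplied are assumptions of the sketch.

```python
def lpc_residual(signal, a):
    """Inverse filtering by A(z): e(n) = sum_k a[k] * s(n - k), with a[0] = 1.

    Samples before the start of `signal` are taken as zero.
    """
    p = len(a) - 1
    return [sum(a[k] * signal[n - k] for k in range(min(p, n) + 1))
            for n in range(len(signal))]


def lpc_synthesis(exc, a, history=()):
    """Synthesis filtering by 1/A(z): s(n) = exc(n) - sum_{k>=1} a[k] * s(n - k).

    `history` holds past output samples, oldest first (zeros if omitted).
    """
    p = len(a) - 1
    buf = [0.0] * p + list(history)
    out = []
    for e in exc:
        s = e - sum(a[k] * buf[-k] for k in range(1, p + 1))
        buf.append(s)
        out.append(s)
    return out
```

Passing the last valid decoded samples as history lets the synthesized samples continue the stored signal without a discontinuity at the start of the erased frame.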

9. Control of the energy level of the synthesis signal. Starting from the first synthesized replacement frame, the energy gradually tends towards a level fixed in advance. This level can be defined, for example, as the energy of the lowest-energy output frame found during the last 5 seconds preceding the erasure. We have defined two gain adaptation laws, which are chosen as a function of the DiminFlag flag calculated in step 4. The speed of decrease of the energy also depends on the fundamental period. There is a third, more radical adaptation law, which is used when we detect that the start of the generated signal does not correspond well to the original signal, as explained later (see point 11).

10. TDAC transformation of the signal synthesized in step 8, as explained at the beginning of this chapter. The TDAC coefficients obtained replace the lost TDAC coefficients. Then, by performing the inverse TDAC transformation, we obtain the output frame. These operations have three purposes:

- In the case of the first lost window, we thereby use the information from the correctly received previous window, which contains half of the data necessary to reconstruct the first disturbed frame (Figure 6).

- The memory of the decoder is updated for decoding the next frame (synchronization of the coder and the decoder, see paragraph 5.1.4).

- The continuous transition (without discontinuity) of the output signal is automatically ensured when the first correctly received bit frame arrives after an erased period which has been reconstructed according to the techniques presented above (see paragraph 5.1.3).

11. The overlap-add technique makes it possible to check whether or not the synthesized voiced signal corresponds well to the original signal, because for the first half of the first lost frame the weight of the memory of the last correctly received window is greater (Figure 6). So, by taking the correlation between the first half of the first synthesized frame and the first half of the frame obtained after the TDAC and inverse TDAC operations, we can estimate the similarity between the lost frame and the replacement frame. A weak correlation (< 0.65) indicates that the original signal is rather different from that obtained by the replacement method, and it is better to decrease the energy of the latter quickly to the minimum level.
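The similarity test of point 11 can be sketched as follows. The text does not state which correlation measure is used; a normalized correlation without mean removal is assumed here, and the names normalized_correlation and replacement_mismatch are hypothetical.

```python
import math


def normalized_correlation(x, y):
    """Normalized correlation (no mean removal) of two equal-length blocks."""
    if len(x) != len(y):
        raise ValueError("blocks must have the same length")
    num = sum(u * v for u, v in zip(x, y))
    den = math.sqrt(sum(u * u for u in x) * sum(v * v for v in y))
    return num / den if den > 0.0 else 0.0


def replacement_mismatch(synth_half, tdac_half, threshold=0.65):
    """True when the correlation is weak, which calls for a fast energy decay."""
    return normalized_correlation(synth_half, tdac_half) < threshold
```

Here synth_half stands for the first half of the first synthesized frame and tdac_half for the first half of the frame obtained after the TDAC and inverse TDAC operations; a True result would select the more radical gain adaptation law.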

5.2.2.2.2 Frames lost following the first frame of an erased area

In the previous paragraph, points 1-6 relate to the analysis of the decoded signal preceding the first erased frame, which allows the construction of a synthesis model (LPC and possibly LTP) of this signal. For the following erased frames, the analysis is not repeated; the replacement of the lost signal is based on the parameters (LPC coefficients, pitch, MaxCorr, ResMem) calculated during the first erased frame. We therefore only perform the operations corresponding to the synthesis of the signal and to the synchronization of the decoder, with the following modifications with respect to the first erased frame:

- In the synthesis part (points 7 and 8), only 320 new samples are generated, because the TDAC transformation window covers the last 320 samples generated during the previous erased frame and these 320 new samples.

- If the erasure period is relatively long, it is important to make the synthesis parameters evolve towards those of white noise or those of background noise (see point 5 in paragraph 3.2.2.2). As the system presented in this example does not include VAD / CNG, we have, for example, the possibility of making one or more of the following modifications:

  - Progressive interpolation of the LPC filter with a flat filter to make the synthesized signal less colored.

  - Gradual increase in the value of the pitch.

  - In voiced mode, switching to unvoiced mode after a certain time (for example when the minimum energy is reached).

5.3 Specific processing for musical signals

If the system includes a module allowing speech / music discrimination, it is then possible, after selecting a music synthesis mode, to implement a processing specific to musical signals. In FIG. 7, the music synthesis module is referenced by 15, the speech synthesis module by 16 and the speech / music switch by 17.

For example, such processing implements the following steps for the music synthesis module, illustrated in FIG. 8:

1. Estimation of the current spectral envelope:

We compute this spectral envelope in the form of an LPC filter [RABINER] [KLEIJN]. The analysis is carried out by conventional methods ([KLEIJN]). After windowing the samples stored in a valid period, an LPC analysis is used to calculate an LPC filter A(z) (step 19). A high order (> 100) is used for this analysis in order to obtain good performance on musical signals.

2. Synthesis of the missing samples:

The synthesis of the replacement samples is carried out by feeding an excitation signal into the LPC synthesis filter (1 / A(z)) calculated in step 19. This excitation signal, calculated in a step 20, is a white noise whose amplitude is chosen to obtain a signal having the same energy as that of the last N samples stored in a valid period. In FIG. 8, the filtering step is referenced by 21.

Example of the control of the amplitude of the residual signal:

If the excitation takes the form of a uniform white noise multiplied by a gain, this gain G can be calculated as follows:

Estimated gain of the LPC filter:

Durbin's algorithm gives the energy of the residual signal. Knowing also the energy of the signal to be modeled, the gain G_LPC of the LPC filter is estimated as the ratio of these two energies.

Calculation of the target energy:

The target energy is estimated to be equal to the energy of the last N samples stored in a valid period (N is typically less than the length of the signal used for the LPC analysis).

The energy of the synthesized signal is the product of the energy of the white noise by G² and by G_LPC. We choose G so that this energy is equal to the target energy.
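The choice of the gain G can be illustrated by the short sketch below. It assumes that all energies are expressed per sample and that the gain of the LPC filter is the ratio of the signal energy to the residual energy (the direction of the ratio is an assumption); the name excitation_gain is hypothetical.

```python
import math


def excitation_gain(target_energy, noise_energy, signal_energy, residual_energy):
    """Return G such that noise_energy * G**2 * G_LPC equals target_energy."""
    if min(noise_energy, signal_energy, residual_energy) <= 0.0:
        raise ValueError("energies must be positive")
    g_lpc = signal_energy / residual_energy  # assumed direction of the ratio
    return math.sqrt(target_energy / (noise_energy * g_lpc))
```

For a uniform white noise drawn in [-1, 1], the energy per sample is 1/3, which is the value that would be passed as noise_energy.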

3. Control of the energy of the synthesis signal

As for the speech signals, except that the speed of decrease of the energy of the synthesis signal is much slower, and that it does not depend on the fundamental period (which does not exist): the energy of the synthesis signal is controlled using a gain calculated and adapted sample by sample. In the case where the erasure period is relatively long, it is necessary to gradually lower the energy of the synthesis signal. The gain adaptation law can be calculated as a function of various parameters, such as the energy values memorized before the erasure and the local stationarity of the signal at the time of the interruption.

4. Evolution of the synthesis procedure over time:

As for the speech signals:

In the case of relatively long erasure periods, the synthesis parameters can also be changed. If the system is coupled to a device for detecting voice activity or musical signals with estimation of the noise parameters (such as [REC-G.723.1A], [SALAMI-2], [BENYASSINE]), it is particularly advantageous to make the generation parameters of the signal to be reconstructed tend towards those of the estimated noise: in particular the spectral envelope (interpolation of the LPC filter with that of the estimated noise, the interpolation coefficients evolving over time until the noise filter is obtained) and the energy (level progressively evolving towards that of the noise, for example by windowing).

6. GENERAL NOTE

As will be understood, the technique which has just been described has the advantage of being usable with any type of coder; in particular it makes it possible to remedy the problems of lost bit packets for time or transform coders, on speech and music signals with good performance: indeed in the present technique, the only signals memorized during periods when the data transmitted are valid are the samples from the decoder, information that is available regardless of the coding structure used.

7. BIBLIOGRAPHIC REFERENCES

[AT&T] AT&T (DA Kapilo, RV Cox) “A high quality low-complexity algorithm for frame erasure concealment (FEC) with G.711 ”, Delayed Contribution D.249 (WP 3/16), ITU, May 1999.

[ATAL] B.S. Atal and M.R. Schroeder. "Predictive coding of speech signals and subjective error criteria". IEEE Trans. on Acoustics, Speech and Signal Processing, 27: 247-254, June 1979.

[BENYASSINE] A. Benyassine, E. Shlomot and H. Y. Su. "ITU-T recommendation G.729 Annex B: A silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications". IEEE Communication Magazine, September 97, PP. 56-63.

[BRANDENBURG] K. H. Brandenburg and M. Bossi. "Overview of MPEG audio: current and future standards for low-bit-rate audio coding". Journal of Audio Eng. Soc, Vol. 45-1 / 2, January / February 1997, PP.4-21.

[CHEN] J. H. Chen, R. V. Cox, Y. C. Lin, N. Jayant and M. J. Melchner. "A low-delay CELP coder for the CCITT 16 kb / s speech coding standard". IEEE Journal on Selected Areas on Communications, Vol. 10-5, June 1992, PP.830-849.

[CHEN-2] J. H. Chen, C.R. Watkins. "Linear prediction coefficient generation during frame erasure or packet loss". Patent US5574825, EP0673018.

[CHEN-3] J. H. Chen, C.R. Watkins. "Linear prediction coefficient generation during frame erasure or packet loss". Patent 884010.

[CHEN-4] J. H. Chen, C.R. Watkins. "Frame erasure or packet loss compensation method". Patent US5550543, EP0707308.

[CHEN-5] J. H. Chen. "Excitation signal synthesis during frame erasure or packet loss". Patent US5615298, EP0673017.

[CHEN-6] J. H. Chen. "Computational complexity reduction during frame erasure of packet loss". Patent US5717822.

[CHEN-7] J. H. Chen. "Computational complexity reduction during frame erasure or packet loss". Patent US940212435, EP0673015.

[COX] R. V. Cox. "Three new speech coders from the ITU cover a range of applications". IEEE Communication Magazine, September 97, PP. 40-47.

[COX-2] R. V. Cox. "An improved frame erasure concealment method for ITU-T Rec. G728". Delayed contribution D.107 (WP 3/16), ITU-T, January 1998.

[COMBESCURE] P. Combescure, J. Schnitzler, K. Ficher, R. Kirchherr, C. Lamblin, A. Le Guyader, D. Massaloux, C. Quinquis, J. Stegmann, P. Vary. "A 16, 24, 32 kbit/s Wideband Speech Codec Based on ATCELP". Proc. of ICASSP conference, 1998.

[DAUMER] W. R. Daumer, P. Mermelstein, X. Maitre and I. Tokizawa. "Overview of the ADPCM coding algorithm". Proc. of GLOBECOM 1984, PP. 23.1.1-23.1.4.

[ERDÖL] N. Erdöl, C. Castelluccia, A. Zilouchian. "Recovery of Missing Speech Packets Using the Short-Time Energy and Zero-Crossing Measurements". IEEE Trans. on Speech and Audio Processing, Vol. 1-3, July 1993, PP. 295-303.

[FINGSCHEIDT] T. Fingscheidt, P. Vary. "Robust speech decoding: a universal approach to bit error concealment". Proc. of ICASSP conference, 1997, PP. 1667-1670.

[GOODMAN] D.J. Goodman, G.B. Lockhart, O.J. Wasem, W.C. Wong. "Waveform Substitution Techniques for Recovering Missing Speech Segments in Packet Voice Communications". IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-34, December 1986, PP. 1440-1448.

[GSM-FR] GSM Recommendation 06.11. "Substitution and muting of lost frames for full rate speech traffic channels". ETSI / TC SMG, ver. : 3.0.1. , February 1992.

[HARDWICK] J. C. Hardwick and J. S. Lim. "The application of the IMBE speech coder to mobile communications". Proc. of ICASSP conference, 1991, PP.249-252.

[HELLWIG] K. Hellwig, P. Vary, D. Massaloux, J. P. Petit, C. Galand and M. Rosso. "Speech codec for the European mobile radio system". GLOBECOM conference, 1989, PP. 1065-1069.

[HONKANEN] T. Honkanen, J. Vainio, P. Kapanen, P. Haavisto, R. Salami, C. Laflamme and J. P. Adoul. "GSM enhanced full rate speech codec". Proc. of ICASSP conference, 1997, PP. 771-774.

[KROON] P. Kroon, B.S. Atal. "On the use of pitch predictors with high temporal resolution". IEEE Trans. on Signal Processing, Vol. 39-3, March 1991, PP. 733-735.

[KROON-2] P. Kroon. "Linear prediction coefficient generation during frame erasure or packet loss". Patent US5450449, EP0673016.

[MAHIEUX] Y. Mahieux, J. P. Petit. "High quality audio transform coding at 64 kbit / s". IEEE Trans. on Com., Vol. 42-11, Nov. 1994, PP. 3010-3019.

[MAHIEUX-2] Y. Mahieux, "Concealment of transmission errors", patent 92 06720 filed on June 3, 1992.

[MASTER] X. Maitre. "7 kHz audio coding within 64 kbit / s". IEEE Journal on Selected Areas on Communications, Vol. 6-2, February 1988, PP. 283-298.

[PARIKH] V.N. Parikh, J.H. Chen, G. Aguilar. "Frame Erasure Concealment Using Sinusoidal Analysis-Synthesis and Its Application to MDCT-Based Codecs". Proc. of ICASSP conference, 2000.

[PICTEL] PictureTel Corporation. "Detailed Description of the PTC (PictureTel Transform Coder)". ITU-T Contribution, SG15 / WP2 / Q6, October 8-9, 1996 Baltimore meeting, TD7.

[RABINER] L.R. Rabiner, R.W. Schafer. "Digital processing of speech signals". Bell Laboratories Inc., 1978.

[REC-G.723.1A] ITU-T Annex A to recommendation G.723.1. "Silence compression scheme for dual rate speech coder for multimedia communications transmitting at 5.3 & 6.3 kbit / s".

[SALAMI] R. Salami, C. Laflamme, J. P. Adoul, A. Kataoka, S. Hayashi, T. Moriya, C. Lamblin, D. Massaloux, S. Proust, P. Kroon and Y. Shoham. "Design and description of CS-ACELP: a toll quality 8 kb / s speech coder". IEEE Trans. on Speech and Audio Processing, Vol. 6-2, March 1998, PP. 116-130.

[SALAMI-2] R. Salami, C. Laflamme, J. P. Adoul. "ITU-T G.729 Annex A: Reduced complexity 8 kb / s CS-ACELP codec for digital simultaneous voice and data". IEEE Communication Magazine, September 97, PP. 56-63.

[TREMAIN] T. E. Tremain. "The government standard linear predictive coding algorithm: LPC 10". Speech technology, April 1982, PP.40-49.

[WATKINS] C.R. Watkins, J.H. Chen. "Improving 16 kb / s G.728 LD-CELP Speech Coder for Frame Erasure Channels". Proc. of ICASSP conference, 1995, PP.241-244.

Claims

CLAIMS

1. Method for concealing transmission error in a digital audio signal, according to which a decoded signal is received after transmission, the decoded samples are stored when the transmitted data are valid, at least one short-term prediction operator and, at least for voiced sounds, one long-term prediction operator are estimated as a function of the stored valid samples, and any missing or erroneous samples in the decoded signal are generated using the operators thus estimated, characterized in that the energy of the synthesis signal thus generated is controlled using a gain calculated and adapted sample by sample.

2. Method according to claim 1, characterized in that the gain for the control of the synthesis signal is calculated as a function of at least one of the following parameters: energy values previously stored for the samples corresponding to valid data, fundamental period for voiced sounds, or any parameter characterizing the frequency spectrum.

3. Method according to one of the preceding claims, characterized in that the gain applied to the synthesis signal decreases progressively as a function of the duration during which the synthesis samples are generated.

4. Method according to one of the preceding claims, characterized in that a discrimination is made, on the valid data, between stationary sounds and non-stationary sounds, and in that gain adaptation laws are implemented which control the synthesis signal differently, on the one hand for the samples generated following valid data corresponding to stationary sounds and on the other hand for the samples generated following valid data corresponding to non-stationary sounds.

5. Method according to one of the preceding claims, characterized in that the content of memories used for the decoding processing is updated as a function of the synthesis samples generated.

6. Method according to claim 5, characterized in that a coding similar to that implemented at the transmitter is applied at least partially to the synthesized samples, possibly followed by an at least partial decoding operation, the data obtained being used to regenerate the memories of the decoder.

7. Method according to claim 6, characterized in that the first erased frame is regenerated by means of this coding-decoding operation, by exploiting the content of the memories of the decoder before the interruption, when said memories contain information usable in this operation.

8. Method according to one of the preceding claims, characterized in that an excitation signal is generated at the input of the short-term prediction operator which, in a voiced zone, is the sum of a harmonic component and a weakly harmonic or non-harmonic component, and, in an unvoiced zone, is limited to a non-harmonic component.

9. Method according to claim 8, characterized in that the harmonic component is obtained by implementing a filtering by means of the long-term prediction operator applied to a residual signal calculated by implementing a reverse short-term filtering on the stored samples.

10. Method according to claim 9, characterized in that the other component is determined using a long-term prediction operator to which pseudo-random disturbances are applied.

11. Method according to one of claims 8 to 10, characterized in that for the generation of a voiced excitation signal, the harmonic component is limited to the low frequencies of the spectrum, while the other component is limited to the high frequencies.

12. Method according to one of the preceding claims, characterized in that the long-term prediction operator is determined from the samples of valid stored frames, with a number of samples used for this estimation varying between a minimum value and a value equal to at least twice the estimated fundamental period for the voiced sound.

13. Method according to one of the preceding claims, characterized in that the residual signal is processed in a non-linear manner to eliminate peaks of amplitude.

14. Method according to one of the preceding claims, characterized in that speech activity is detected by estimating noise parameters and in that parameters of the synthesized signal are made to tend towards those of the estimated noise.

15. The method of claim 14, characterized in that the spectral envelope of the noise of the valid decoded samples is estimated and a synthesized signal is generated which evolves towards a signal having the same spectral envelope.

16. Method for processing sound signals, characterized in that a discrimination is made between voiced sounds and musical sounds and, when musical sounds are detected, a method according to one of the preceding claims is implemented without estimation of a long-term prediction operator.

17. Device for concealing a transmission error in an audio-digital signal which receives as input a decoded signal which is transmitted to it by a decoder and which generates missing or erroneous samples in this decoded signal, characterized in that it comprises processing means capable of implementing the method according to one of the preceding claims.

18. Transmission system comprising at least one encoder, at least one transmission channel, a module capable of detecting that transmitted data has been lost or is greatly erroneous, at least one decoder and an error concealment device which receives the decoded signal, characterized in that this error concealment device is a device according to claim 17.