EP2080195A1

EP2080195A1 - Synthesis of lost blocks of a digital audio signal, with pitch period correction

Info

Publication number: EP2080195A1
Application number: EP07871872A
Authority: EP
Inventors: Balazs Kovesi; Stéphane RAGOT
Original assignee: France Telecom SA
Current assignee: Orange SA
Priority date: 2006-10-20
Filing date: 2007-10-17
Publication date: 2009-07-22
Anticipated expiration: 2027-10-17
Also published as: EP2080195B1; DE602007013265D1; US20100318349A1; JP5289320B2; US8417519B2; RU2432625C2; CN101627423B; KR20090082415A; KR101406742B1; ES2363181T3; BRPI0718422A2; BRPI0718422B1; CN101627423A; ATE502376T1; RU2009118929A; MX2009004211A; WO2008096084A1; FR2907586A1; JP2010507121A; PL2080195T3

Abstract

The method involves determining a repetition period, e.g. a pitch period, in a valid block immediately preceding an invalid block, where the pitch period corresponds to the inverse of the fundamental frequency of an audio signal. Samples of the repetition period are corrected based on samples of another repetition period preceding the former repetition period, in order to limit the amplitude of a transient signal in the former repetition period. The corrected samples are copied into a replacement block. Independent claims are also included for the following: (1) a computer program comprising instructions for implementing a digital audio signal synthesizing method; (2) a device for synthesizing a digital audio signal.

Description

Synthesis of lost blocks of a digital audio signal, with correction of pitch period

The present invention relates to the processing of digital audio signals (especially speech signals).

It applies to a coding/decoding system adapted for the transmission/reception of such signals. More particularly, the present invention relates to processing on reception that improves the quality of the decoded signals in the presence of losses of data blocks.

Various techniques exist for digitizing and compressing an audio signal. The most common are:

- waveform coding methods, such as PCM ("Pulse Code Modulation") and ADPCM ("Adaptive Differential Pulse Code Modulation"),

- parametric analysis-by-synthesis coding methods, such as CELP ("Code Excited Linear Prediction") coding, and

- perceptual coding methods in subbands or by transform.

These techniques process the input signal sequentially, either sample by sample (PCM or ADPCM) or in blocks of samples called "frames" (CELP and transform coding).

It is briefly recalled that a speech signal can be predicted from its recent past (for example from 8 to 12 samples at 8 kHz) using parameters evaluated over short windows (10 to 20 ms in this example). These short-term prediction parameters, representative of the transfer function of the vocal tract (for example for pronouncing consonants), are obtained by LPC ("Linear Prediction Coding") analysis methods. There is also a longer-term correlation associated with the quasi-periodicities of speech (for example voiced sounds such as vowels), which are due to the vibration of the vocal cords. The aim is therefore to determine at least the fundamental frequency of the voiced signal, which typically varies from 60 Hz (deep voice) to 600 Hz (high voice) depending on the speaker. An LTP ("Long Term Prediction") analysis then determines the LTP parameters of a long-term predictor, and in particular the inverse of the fundamental frequency, often called the "pitch period". The number of samples in a pitch period is then defined by the ratio F_e / F_0 (or its integer part), where:

- F_e is the sampling rate, and

- F_0 is the fundamental frequency.

It is therefore noted that the LTP long-term prediction parameters, including the pitch period, represent the fundamental vibration of the speech signal (when it is voiced), while the LPC short-term prediction parameters represent the spectral envelope of this signal.
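As a small worked illustration of the ratio F_e / F_0, the following sketch computes the integer number of samples in a pitch period. The function name `pitch_period_in_samples` is hypothetical and chosen only for this example.

```python
def pitch_period_in_samples(sampling_rate_hz, fundamental_hz):
    """Integer part of F_e / F_0 (hypothetical helper for illustration)."""
    if sampling_rate_hz <= 0 or fundamental_hz <= 0:
        raise ValueError("rates must be positive")
    return int(sampling_rate_hz // fundamental_hz)


# Pitch period lengths at 8 kHz for the range of voices quoted in the text.
deep_voice = pitch_period_in_samples(8000, 60)    # 60 Hz fundamental
high_voice = pitch_period_in_samples(8000, 600)   # 600 Hz fundamental
mid_voice = pitch_period_in_samples(8000, 100)    # 100 Hz fundamental
```

At 8 kHz, the 60 Hz to 600 Hz range quoted above thus corresponds to pitch periods of roughly 133 down to 13 samples.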

In some coders, all of these LPC and LTP parameters, thus resulting from a speech coding, can be transmitted in blocks to a peer decoder, via one or more telecommunication networks, to then restore the initial speech signal.

However, attention is given here, by way of example, to the ITU-T standardized G.722 coding system at 48, 56 and 64 kbit/s for the transmission of wideband speech signals (sampled at 16 kHz). The G.722 coder uses an ADPCM coding scheme in two sub-bands obtained by a QMF ("Quadrature Mirror Filter") filter bank. For further details, reference may be made to the text of Recommendation G.722.

Figure 1 of the state of the art shows the coding and decoding structure according to Recommendation G.722. Blocks 101 to 103 represent the transmission QMF filter bank (spectral separation into high 102 and low 100 frequencies and subsampling 101 and 103) applied to the input signal Se. The following blocks 104 and 105 correspond respectively to the low-band and high-band ADPCM coders. The rate of the low-band ADPCM coder is specified by a mode of 0, 1 or 2, respectively indicating a rate of 6, 5 or 4 bits per sample, while the rate of the high-band ADPCM coder is fixed (two bits per sample). At the decoder, there are the equivalent ADPCM decoding blocks (blocks 106 and 107), whose outputs are combined in the reception QMF filter bank (oversampling 108 and 110, inverse filters 109 and 111, and merging of the low and high frequency bands 112) to generate the synthesis signal Ss.
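The relation between the mode and the total bit rate can be checked with a short calculation. The sketch below assumes that each sub-band is sampled at 8 kHz (half of the 16 kHz input rate) and uses the mode numbering 0, 1, 2 given in the text; the names are hypothetical.

```python
# Bits per sample of the low-band ADPCM coder for each mode (as stated in the text).
LOW_BAND_BITS = {0: 6, 1: 5, 2: 4}
HIGH_BAND_BITS = 2           # fixed rate of the high-band ADPCM coder
SUBBAND_RATE_HZ = 16000 // 2  # assumption: each sub-band runs at half the input rate


def g722_bitrate(mode):
    """Total bit rate in bit/s for a given low-band mode."""
    return (LOW_BAND_BITS[mode] + HIGH_BAND_BITS) * SUBBAND_RATE_HZ


rates = [g722_bitrate(mode) for mode in (0, 1, 2)]
```

The three modes give 64, 56 and 48 kbit/s, which matches the three rates of the G.722 system mentioned earlier.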

A general problem studied here concerns the correction of block losses at decoding. Indeed, the bit stream resulting from the coding is generally formatted in binary blocks for transmission over many types of networks. For example, one speaks of "IP packets" (for "Internet Protocol") for blocks transmitted via the Internet, of "frames" for blocks transmitted over ATM ("Asynchronous Transfer Mode") networks, and so on. The blocks transmitted after coding can be lost for various reasons:

- if a router of the network is saturated and empties its queue,

- if the block is received late (and therefore not taken into account) during continuous, real-time decoding,

- if a received block is corrupted (for example if its CRC check fails).

When one or more consecutive blocks are lost, the decoder must reconstruct the signal without any information on the lost or erroneous blocks. It relies on the information previously decoded from the valid blocks received. This problem, called "lost block correction" (or, hereafter, "erasure correction"), is actually more general than simply extrapolating missing information, because the loss of frames often causes a loss of synchronization between encoder and decoder, especially when these are predictive, as well as continuity problems between the extrapolated information and the information decoded after a loss. The correction of erased frames thus also includes state restoration, convergence and other techniques.

Appendix I of ITU-T Recommendation G.711 describes erasure correction for PCM coding. Since PCM coding is not predictive, frame loss correction simply amounts to extrapolating the missing information and ensuring continuity between a reconstructed frame and the correctly received frames that follow a loss. The extrapolation is implemented by repeating the past signal synchronously with the fundamental frequency (or, inversely, with the "pitch period"), that is to say by simply repeating pitch periods. Continuity is ensured by smoothing (or "cross-fading") between received samples and extrapolated samples.
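The smoothing between extrapolated and received samples can be sketched as follows. This is a minimal illustration that assumes a linear weighting over the overlap; the weighting actually used by a given standard may differ, and the function name `cross_fade` is hypothetical.

```python
def cross_fade(extrapolated, received):
    """Blend two equally long segments, moving gradually from the
    extrapolated samples to the validly received samples.

    Assumption: linear weights (not taken from the source text).
    """
    n = len(extrapolated)
    if len(received) != n:
        raise ValueError("segments must have the same length")
    out = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight of the received signal, increasing with i
        out.append((1.0 - w) * extrapolated[i] + w * received[i])
    return out


blended = cross_fade([1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
```

With these inputs the output decreases steadily from the extrapolated level toward the received level, so no abrupt discontinuity appears at the boundary.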

In the document "A Packet Loss Concealment Method Using Pitch Waveform Repetition and Internal State Update on the Decoded Speech for the Sub-band ADPCM Wideband Speech Codec", M. Serizawa and Y. Nozawa, IEEE Speech Coding Workshop, pp. 68-70 (2002), a correction of erased frames has been proposed for the G.722 standard encoder/decoder, in which a lost frame is extrapolated using an algorithm that repeats pitch periods (a repetition which may be similar to that described in Appendix I of Recommendation G.711). To update the states of the G.722 encoder (filter memories and step-size adaptation memories), the frame thus extrapolated is divided into two subbands which are encoded again by ADPCM coding.

However, such techniques for correcting frame loss by repeating pitch periods can only work properly if the past signal is stationary, or at least cyclostationary. They therefore rest on the implicit assumption that the signal associated with the lost frame (which must be extrapolated) is "similar" to the signal decoded up to the frame loss. In the case of a speech signal, this stationarity assumption is strictly valid only for sounds such as a portion of a vowel. For example, a vowel "a" can be repeated several times (which gives "aaaa...") without causing discomfort. However, a speech signal also includes so-called "transient" sounds (non-stationary sounds, typically including vowel attacks (onsets) and the sounds called "plosives", which correspond to short consonants such as "p", "b", "d", "t", "k"). Thus, if for example a frame is lost just after the sound "t", a frame loss correction by simple repetition will generate a burst of "t" sounds that is very unpleasant to listen to (heard in French as "teu-teu-teu-teu-teu") when several successive frames are lost (for example five consecutive losses).

FIGS. 2a and 2b illustrate this acoustic effect in the case of a wideband signal encoded by a coder according to Recommendation G.722. More particularly, Figure 2a shows a speech signal decoded on an ideal channel (without frame loss). In the example shown, this signal corresponds to the French word "temps", divided into two phonemes: /t/ then /an/. Vertical dashed lines indicate the boundaries between frames. Frames with a length of the order of 10 ms are considered here. Figure 2b shows the signal decoded according to a technique similar to the Serizawa et al. reference above when a frame loss immediately follows the /t/ phoneme. This Figure 2b shows the problem caused by repeating the past signal. It can be seen that the phoneme /t/ is repeated in the extrapolated frame. It is also present in the following frame or frames because, in the example shown, the extrapolation is slightly prolonged after a loss in order to perform a cross-fade with the decoding under normal conditions (i.e. in the presence of useful information in the received signal).

The problem of repetition of plosives has apparently never been mentioned in the known prior art.

The present invention improves the situation.

To this end, it proposes a method for synthesizing a digital audio signal represented by successive blocks of samples, in which, on receiving such a signal, a replacement block is generated from samples of at least one valid block in order to replace at least one invalid block. In general, the method comprises the following steps:

a) defining a repetition period of the signal in at least one valid block, and

b) copying the samples of the repetition period into at least one replacement block.

In the method according to the invention:

- in step a), a last repetition period is determined in at least one valid block immediately preceding an invalid block, and

- in step b), samples of the last repetition period are corrected based on samples of a previous repetition period, in order to limit the amplitude of a possible transient signal that would be present in the last repetition period. The samples thus corrected are then copied into the replacement block.

The method according to the invention is advantageously applied to the processing of a speech signal, both in the case of a voiced signal and in the case of an unvoiced signal. Thus, if the signal is voiced, the repetition period simply consists of the pitch period, and step a) of the method aims in particular to determine a pitch period (typically given by the inverse of a fundamental frequency) of a tone of the signal (for example the tone of a voice in a speech signal) in at least one valid block preceding the loss.

If the valid signal received is not voiced, there is no really detectable pitch period. In this case, provision can be made to set an arbitrary given number of samples, which will be considered as the length of the pitch period (and which can then be called generically a "repetition period"), and to carry out the method within the meaning of the invention on the basis of this repetition period. For example, the longest possible pitch period can be chosen, typically 20 ms (corresponding to 50 Hz for a very deep voice), i.e. 160 samples at a sampling frequency of 8 kHz. It is also possible to take the value corresponding to the maximum of a correlation function while limiting the search to an interval of values (for example between MAX_PITCH / 2 and MAX_PITCH, where MAX_PITCH is the maximum value in the search for the pitch period).
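The second option, a correlation maximum searched between MAX_PITCH / 2 and MAX_PITCH, can be sketched as follows. This is only an illustration: the text does not specify the correlation function, so a normalized correlation between the last `lag` samples and the `lag` samples before them is assumed here, and the function names are hypothetical.

```python
import math


def normalized_correlation(x, lag):
    """Correlation between the last `lag` samples of x and the `lag`
    samples that precede them (assumed form, not given in the source)."""
    cur = x[-lag:]
    prev = x[-2 * lag:-lag]
    num = sum(c * p for c, p in zip(cur, prev))
    den = math.sqrt(sum(c * c for c in cur) * sum(p * p for p in prev))
    return num / den if den > 0 else 0.0


def repetition_period(x, max_pitch=160):
    """Lag in [max_pitch // 2, max_pitch] that maximizes the correlation."""
    if len(x) < 2 * max_pitch:
        raise ValueError("need at least 2 * max_pitch past samples")
    lags = range(max_pitch // 2, max_pitch + 1)
    return max(lags, key=lambda lag: normalized_correlation(x, lag))


# Synthetic past signal that repeats exactly every 100 samples.
pattern = [(n * 37) % 100 - 50 for n in range(100)]
past_signal = [pattern[n % 100] for n in range(400)]
period = repetition_period(past_signal, max_pitch=160)
```

With MAX_PITCH = 160 (20 ms at 8 kHz), the search covers lags of 80 to 160 samples and, for this synthetic signal, settles on the true period of 100 samples.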

Preferably, if a plurality of consecutive invalid blocks are to be replaced upon receipt and these blocks extend over at least one repetition period, the sample correction step b) is applied to all samples of the last period of repetition, taken one by one as a current sample.

In addition, if these invalid blocks extend over several repetition periods, the repetition period thus corrected in step b) is copied repeatedly to form the replacement blocks.

In a particular embodiment, the aforementioned sample correction performed in step b) can proceed as follows. For a current sample of the last repetition period, the amplitude of this current sample, in absolute value, is compared with the amplitude, in absolute value, of at least one sample temporally positioned substantially one repetition period before the current sample, and the current sample is assigned the minimum amplitude, in absolute value, of these two amplitudes, while of course retaining the sign of its initial amplitude.

The term "positioned substantially" is understood here to mean that a neighborhood to be associated with the current sample is sought in the preceding repetition period. Thus, preferably, for a current sample of the last repetition period:

- a set of samples is formed in a neighborhood centered on a sample temporally positioned one repetition period before the current sample,

- an amplitude is chosen from among the amplitudes, taken in absolute value, of the samples of said neighborhood, and

- this chosen amplitude is compared with the amplitude of the current sample, in absolute value, in order to assign to the current sample the minimum amplitude, in absolute value, between the chosen amplitude and the amplitude of the current sample.

This amplitude chosen from the amplitudes of the samples of said neighborhood is preferably the maximum amplitude in absolute value.
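The correction described above can be sketched as follows. This is a minimal illustration rather than the reference implementation: the half-width `k` of the neighborhood and the function name `correct_last_period` are assumptions made for this example.

```python
def correct_last_period(e, t0, k=1):
    """Return a copy of e in which each sample of the last period (t0 samples)
    is limited, in absolute value, by the largest absolute amplitude found in a
    neighborhood of 2*k + 1 samples centered one period earlier.
    The sign of each sample is preserved.
    """
    total = len(e)
    if t0 <= 0 or total < 2 * t0:
        raise ValueError("need at least two repetition periods of past signal")
    out = list(e)
    for n in range(total - t0, total):
        center = n - t0                      # same position, one period earlier
        lo = max(0, center - k)
        hi = min(total - 1, center + k)
        reference = max(abs(v) for v in e[lo:hi + 1])
        magnitude = min(abs(e[n]), reference)
        out[n] = magnitude if e[n] >= 0 else -magnitude
    return out


# Last period (9, -8, 1, 1) contains a burst absent from the previous period.
signal = [1, -2, 1, -1, 9, -8, 1, 1]
corrected = correct_last_period(signal, t0=4, k=1)
```

In this example the two large samples of the last period are reduced to the level of the previous period, while the small samples are left unchanged.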

In addition, a damping (gradual attenuation) of the amplitude of the samples in the replacement blocks is usually applied. Here, advantageously, a transient character of the signal is detected before the loss of blocks and, where appropriate, a faster damping is applied than for a stationary (non-transient) signal.

It is possible, in addition or alternatively, to reset (set to zero) the memories of the filters that follow in the synthesis processing, in a manner specifically adapted to transient sounds, so as to prevent the influence of such transient sounds from carrying over into the processing of the following valid blocks.

Preferably, a transient signal preceding the block loss is detected as follows:

- for a plurality of current samples of the last repetition period, a ratio is measured, in absolute value, of the amplitude of a current sample to the aforementioned chosen amplitude (determined in the neighborhood as indicated above), and

- the number of occurrences, among the current samples, for which this ratio is greater than a first predetermined threshold (a value close to 4 for example, as will be seen later) is then counted, and the presence of a transient signal is detected if the number of occurrences is greater than a second predetermined threshold (for example if there is more than one occurrence, as will be seen later).

The steps above can also be used to trigger the correction step b) within the meaning of the invention, when a transient sound is detected in the repetition period immediately preceding the loss of a block.
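The detection just described can be sketched as follows, using the example thresholds quoted in the text (a ratio close to 4, and more than one occurrence). The neighborhood half-width `k` and the function name `detect_transient` are assumptions made for this example.

```python
def detect_transient(e, t0, k=1, ratio_threshold=4.0, count_threshold=1):
    """Return True if the last period of e (t0 samples) looks transient.

    For each sample of the last period, its absolute amplitude is compared with
    the largest absolute amplitude in a neighborhood centered one period earlier.
    The test |e(n)| > ratio_threshold * reference is equivalent to testing the
    ratio itself and avoids a division by zero.
    """
    total = len(e)
    if t0 <= 0 or total < 2 * t0:
        raise ValueError("need at least two repetition periods of past signal")
    occurrences = 0
    for n in range(total - t0, total):
        center = n - t0
        reference = max(abs(v) for v in e[max(0, center - k):center + k + 1])
        if abs(e[n]) > ratio_threshold * reference:
            occurrences += 1
    return occurrences > count_threshold


stationary = detect_transient([1, -2, 1, -1, 1, -2, 1, -1], t0=4)
single_peak = detect_transient([1, -2, 1, -1, 9, -8, 1, 1], t0=4)
burst = detect_transient([1, -2, 1, -1, 9, -9, 1, 1], t0=4)
```

Only the last signal, in which two samples exceed four times the reference amplitude, is flagged as transient; a single isolated peak is not enough with these thresholds.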

However, to decide whether or not to apply the correction step b) of the process within the meaning of the invention, the following procedure is preferentially carried out. If the digital audio signal is a speech signal, a degree of voicing is advantageously detected in the speech signal and the correction of step b) is not implemented if the speech signal is strongly voiced (which manifests itself by a correlation coefficient close to "1" in the search for a pitch period). In other words, this correction is implemented only if the signal is not voiced or if it is weakly voiced.

This avoids applying the correction of step b), and unnecessarily attenuating the signal in the replacement blocks, if the valid signal received is strongly voiced (and thus stationary), which actually corresponds to the pronunciation of a stable vowel (e.g. "aaaa").

Thus, in short, the present invention is directed to modifying the signal before the repetition period (or the "pitch" period for a voiced speech signal) is repeated, for the synthesis of lost blocks when decoding digital audio signals. The effects of repeating transients are avoided by comparing the samples of a pitch period with those of the previous pitch period. The signal is preferably modified by taking the minimum between the current sample and at least one sample at substantially the same position in the previous pitch period.

The invention offers several advantages, particularly in the context of decoding in the presence of block losses. In particular, it makes it possible to avoid artifacts arising from the erroneous repetition of transients (when a simple repetition of the pitch period is used). In addition, it performs a transient detection which can be used to adapt the energy control of the extrapolated signal (via a variable attenuation). Other advantages and characteristics of the invention will appear on examining the detailed description, given by way of example below, and the appended drawings in which, in addition to FIGS. 1, 2a and 2b presented above:

- FIG. 2c illustrates, by way of comparison, the effect of the processing within the meaning of the invention on the same signal as that of FIGS. 2a and 2b, for which a frame TP has been lost,

- FIG. 3 represents the decoder according to Recommendation G.722, but modified by integrating a device for correcting erased frames within the meaning of the invention,

- FIG. 4 illustrates the principle of extrapolation of the low band,

- FIG. 5 illustrates the principle of pitch repetition (in the excitation domain),

- FIG. 6 illustrates the modification of the excitation signal within the meaning of the invention, followed by pitch repetition,

- FIG. 7 illustrates the steps of the method of the invention, according to a particular embodiment,

- FIG. 8 schematically illustrates a synthesis device for implementing the method within the meaning of the invention,

- FIG. 8a illustrates the general structure of a two-channel quadrature mirror filter (QMF) bank,

- FIG. 8b shows the spectra of the signals x(n), xl(n) and xh(n) of FIG. 8a when the filters L(z) and H(z) are ideal (i.e. f_e' = 2 f_e).

An exemplary embodiment of the invention on the coding system according to Recommendation G.722 is described below. The description of the G.722 encoder (described above with reference to FIG. 1) is not repeated here. We restrict ourselves here to the description of a modified G.722 decoder, which incorporates a corrector of pitch periods to reproduce in case of loss of frames.

With reference to FIG. 3, the decoder within the meaning of the invention (here according to Recommendation G.722) again has an architecture in two subbands with the reception QMF filter bank (blocks 310 to 314). Compared with the decoder of Figure 1, the decoder of Figure 3 further integrates a device 320 for correcting erased frames.

The G.722 decoder generates an output signal Ss sampled at 16 kHz and cut into time frames (or sample blocks) of 10, 20 or 40 ms. Its operation differs according to the presence or not of loss of frames.

In the complete absence of frame loss (and therefore if all the frames are received and valid), the bit stream of the low frequency band BF is decoded by block 300 of the device 320 within the meaning of the invention, no cross-fading (block 303) is performed and the reconstructed signal is simply given by zl = xl. Similarly, the bit stream of the high frequency band HF is decoded by block 304. Switch 307 selects the channel uh = xh and switch 309 selects the channel zh = uh = xh.

Nevertheless, in case of loss of one or more frames, in the low band BF, the erased frame is extrapolated in block 301 from the past signal xl (pitch copy in particular) and the states of the ADPCM decoder are updated in block 302. The erased frame is reconstructed as zl = yl. This process is repeated as long as frame loss is detected. It is important to note that the extrapolation block 301 does not only generate an extrapolated signal for the current (lost) frame: it also generates 10 ms of signal for the next frame, in order to perform the cross-fade in block 303.

Then, when a valid frame is received, it is decoded by block 300 and a cross-fade 303 is performed over the first 10 milliseconds between the valid frame xl and the previously extrapolated frame yl.

In the high band HF, the erased frame is extrapolated in block 305 from the past signal xh and the states of the ADPCM decoder are updated in block 306. In the preferred embodiment, the extrapolation yh is a simple repetition of the last period of the past signal xh. Switch 307 selects the channel uh = yh. This signal uh is advantageously filtered to give the signal vh. Indeed, G.722 coding is a recursive predictive coding scheme (of the "backward" type). In each subband it uses an ARMA ("Auto-Regressive Moving Average") prediction operation and a procedure for adapting the quantization step size and the ARMA filter, identical in the encoder and in the decoder. The prediction and the step-size adaptation are based on the decoded information (prediction error, reconstructed signal).

Transmission errors, and more particularly frame losses, lead to a desynchronization between the variables of the decoder and those of the encoder. The step-size adaptation and prediction procedures are then erroneous and biased over a long period of time (up to 300-500 ms). In the high band, this bias can result, among other artifacts, in the appearance of a DC component of very low amplitude (of the order of +/- 10 for a signal of maximum dynamic range +/- 32767).

However, after passing through the QMF synthesis filter bank, this DC component takes the form of a sinusoid at 8 kHz that is audible and very annoying to the ear.

The transformation of the continuous component (or "DC component") into a sinusoid at 8 kHz is explained below. Figure 8a shows a two-channel quadrature mirror filter (QMF) bank. The signal x(n) is decomposed into two subbands by the analysis bank. A low band xl(n) and a high band xh(n) are thus obtained. These signals are defined by their z-transforms:

$$X_L(z) = \tfrac{1}{2}\left( X(z^{1/2})\,L(z^{1/2}) + X(-z^{1/2})\,L(-z^{1/2}) \right)$$

$$X_H(z) = \tfrac{1}{2}\left( X(z^{1/2})\,H(z^{1/2}) + X(-z^{1/2})\,H(-z^{1/2}) \right)$$

The low-pass filter L(z) and the high-pass filter H(z) being in quadrature, we have H(z) = L(-z). If L(z) satisfies the perfect reconstruction constraints, the signal obtained after the synthesis filter bank is identical to the signal x(n) up to a delay.

Thus, if the sampling frequency of the signal x(n) is f_e', the signals xl(n) and xh(n) are sampled at the frequency f_e = f_e' / 2. Typically, f_e' = 16 kHz, i.e. f_e = 8 kHz. It is further indicated that the filters L(z) and H(z) may be, for example, the 24-coefficient QMF filters specified in ITU-T Recommendation G.722.

Figure 8b shows the spectra of the signals x(n), xl(n) and xh(n) in the case where the filters L(z) and H(z) are ideal half-band filters. The frequency response of L(z) over the interval [-f_e'/2, +f_e'/2] is then given, in the ideal case, by:

$$L(f) = \begin{cases} 1 & \text{if } |f| \le f_e'/4 \\ 0 & \text{otherwise} \end{cases}$$

Note that the spectrum of xh(n) corresponds to the folded high band. This folding property (or "aliasing"), well known in the state of the art, can be explained visually, as well as by means of the equation above defining XH(z). The folding of the high band is "inverted" by the synthesis filter bank, which restores the spectrum of the high band in the natural order of the frequencies.

However, in practice, the filters L(z) and H(z) are not ideal. Their non-ideal nature results in the appearance of a spectral aliasing component, which is canceled by the synthesis bank. The high band nevertheless remains inverted.
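The way a DC component in the high band turns into a sinusoid at 8 kHz can be checked numerically. The sketch below only models the oversampling by two that precedes the synthesis filter; the filter itself is not modeled, and the names are hypothetical. After zero insertion, a constant c becomes the sum of a constant c/2 and an alternating term (c/2)(-1)^n, which is a sinusoid at half the output sampling rate (8 kHz for a 16 kHz output). The high-pass synthesis branch rejects the constant term and keeps the alternating one.

```python
def upsample_by_two(x):
    """Insert one zero after each sample (oversampling by a factor of 2)."""
    out = []
    for v in x:
        out.extend([v, 0.0])
    return out


dc = 10.0                      # small DC offset in the high band (order of magnitude from the text)
high_band = [dc] * 8           # high-band signal sampled at 8 kHz
upsampled = upsample_by_two(high_band)   # now at 16 kHz

# Same sequence written as a DC term plus a term alternating at the Nyquist frequency.
decomposed = [dc / 2 * (1 + (-1) ** n) for n in range(len(upsampled))]
nyquist_term = [dc / 2 * (-1) ** n for n in range(len(upsampled))]
```

The sequences `upsampled` and `decomposed` are identical, which shows that half of the DC amplitude ends up at the Nyquist frequency before filtering.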

Block 308 then performs a high-pass filtering (HPF, for "high-pass filter") which removes the DC component ("DC removal"). The use of such a filter is particularly advantageous, including outside the scope of the pitch period correction in the low band within the meaning of the invention.

Moreover, the use of such an HPF filter (block 308) eliminating the DC component in the high band could be the subject of separate protection, in the general context of frame loss at decoding. In generic terms, it will therefore be understood that, when decoding a received signal that is separated into a high frequency band and a low frequency band, and hence into at least two channels as in decoding according to the G.722 standard, a loss of signal followed by the synthesis of a replacement signal may generally cause a DC component to appear in the replacement signal on the high-frequency path of the decoder. The effect of this DC component can also persist in the decoded signal for a certain time after the received coded signal has become valid again, because of the desynchronization between the encoder and the decoder and the memory length of the filters.

Advantageously, a high-pass filter 308 is provided on the high frequency channel. This high-pass filter 308 is advantageously provided upstream, for example, of the QMF filter bank of this high-frequency channel of the G.722 decoder. This arrangement makes it possible to avoid the folding of the DC component to 8 kHz (a value given by the sampling rate f_e) when it is applied to the QMF filter bank. More generally, when the decoder makes use of a filter bank at the end of processing on the high frequency channel, the high-pass filter (308) is preferably provided upstream of this filter bank.
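The text does not specify the high-pass filter used in block 308. As an illustration only, the sketch below uses a common first-order DC-blocking filter, y(n) = x(n) - x(n-1) + r y(n-1); the structure, the coefficient `r` and the function name `remove_dc` are all assumptions.

```python
def remove_dc(x, r=0.95):
    """First-order DC blocker (assumed structure, not taken from the source).

    A constant input decays geometrically toward zero at the output,
    while rapidly varying components pass almost unchanged.
    """
    y = []
    prev_x = 0.0
    prev_y = 0.0
    for v in x:
        out = v - prev_x + r * prev_y
        y.append(out)
        prev_x, prev_y = v, out
    return y


# A constant offset of 10 is progressively removed.
filtered = remove_dc([10.0] * 2000)
```

The closer `r` is to 1, the narrower the rejected band around 0 Hz and the longer the filter takes to settle.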

Thus, with reference again to FIG. 3, the switch 309 selects the channel zh = vh as long as there is a loss of frames.

Then, as soon as a valid frame is received, it is decoded by block 304 and switch 307 selects the channel uh = xh. For some time thereafter (for example for four seconds), switch 309 still selects the channel zh = vh, but after these few seconds the decoder returns to "normal" operation, where switch 309 again selects the channel zh = uh, bypassing block 308 and therefore without applying the high-pass filter 308. In generic terms, it will therefore be understood that, preferably, this high-pass filter 308 is applied temporarily (for a few seconds for example) during and after a loss of blocks, even if valid blocks are received again. Filter 308 could be used permanently. Nevertheless, it is activated only in case of frame loss, because the disturbance due to the DC component arises only in this case, so that the output of the modified G.722 decoder (modified in that it integrates the loss correction mechanism) is identical to that of the ITU-T G.722 decoder in the absence of frame loss. This filter 308 is applied only during the frame loss correction and for a few seconds following a loss. Indeed, in case of loss, the G.722 decoder is desynchronized from the encoder for a period of 100 to 500 ms following a loss, and the DC component in the high band is typically present only for a duration of 1 to 2 seconds. The filter 308 is maintained a little longer to provide a safety margin (for example four seconds).

The decoder of FIG. 3 is not described in greater detail, it being understood that the invention is in particular implemented in block 301 for extrapolation of the low band. This block 301 is detailed in FIG. 4.
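The temporary activation of the high-pass filter can be sketched as a small per-frame decision. This is only an illustration of the switching logic: the class name `HighPassSwitch`, the per-frame interface and the default frame length are assumptions, while the four-second hold is the example value given in the text.

```python
class HighPassSwitch:
    """Decides, frame by frame, whether the filtered channel should be selected."""

    def __init__(self, frame_ms=10, hold_ms=4000):
        self.hold_frames = hold_ms // frame_ms
        self.frames_since_loss = None  # None means no recent loss

    def use_filter(self, frame_lost):
        """Return True if the high-pass filtered channel should be used."""
        if frame_lost:
            self.frames_since_loss = 0
            return True
        if self.frames_since_loss is None:
            return False
        self.frames_since_loss += 1
        if self.frames_since_loss <= self.hold_frames:
            return True          # still within the safety margin after a loss
        self.frames_since_loss = None
        return False             # back to normal operation


# Short hold (3 frames) so that the behavior is easy to follow.
switch = HighPassSwitch(frame_ms=10, hold_ms=30)
losses = [False, True, False, False, False, False, False]
decisions = [switch.use_filter(lost) for lost in losses]
```

In this run the filter is off before the loss, on during the lost frame and the three valid frames that follow, then off again.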

With reference to FIG. 4, the extrapolation of the low band is based on an analysis of the past signal xl (part of FIG. 4 referenced ANALYS) followed by a synthesis of the signal yl to be delivered (part of FIG. 4 referenced SYNTH). Block 400 performs a linear prediction (LPC) analysis on the past signal xl. This analysis is similar to that carried out in particular in the standardized G.729 coder. It can consist of windowing the signal, calculating the autocorrelation and finding the linear prediction coefficients by the Levinson-Durbin algorithm. Preferably, only the last 10 seconds of the signal are used and the LPC order is set to 8. Thus, nine LPC coefficients (denoted hereinafter a_0, a_1, ..., a_p) are obtained in the form:

$$A(z) = a_0 + a_1 z^{-1} + \dots + a_p z^{-p}$$

with p = 8 and a_0 = 1. After LPC analysis, the past excitation signal is calculated by block 401. The past excitation signal is denoted e(n) with n = -M, ..., -1, where M corresponds to the number of past samples stored.
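The analysis chain described above (autocorrelation, Levinson-Durbin recursion, then computation of the excitation) can be sketched as follows. This is a minimal illustration under simplifying assumptions: windowing is omitted, samples before the start of the buffer are taken as zero, and the function names are hypothetical.

```python
def autocorrelation(x, order):
    """Autocorrelation values r(0) ... r(order) of the sequence x."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Solve for the coefficients a_0 = 1, a_1, ..., a_p of A(z)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)                # updated prediction error energy
    return a


def excitation(x, a):
    """Residual e(n) = sum_i a_i x(n - i), i.e. x filtered by A(z)."""
    return [sum(a[i] * x[n - i] for i in range(len(a)) if n - i >= 0)
            for n in range(len(x))]


# Autocorrelation of a first-order process with correlation 0.5 between neighbors.
coeffs = levinson_durbin([1.0, 0.5, 0.25], order=2)
residual = excitation([1.0, 0.5, 0.25, 0.125], [1.0, -0.5, 0.0])
```

For this toy autocorrelation the recursion returns a_1 = -0.5 and a_2 = 0, and filtering the matching geometric sequence by A(z) leaves a single pulse as excitation.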

Block 402 estimates the fundamental frequency or its inverse, the pitch period T_0. This estimation is carried out, for example, in a manner similar to the so-called "open-loop" pitch analysis used in particular in the standardized G.729 coder.

The estimated pitch T_0 is used by block 403 to extrapolate the excitation of the current frame.

Furthermore, the past signal xl is classified in block 404. Here it is possible to seek to detect the presence of transients, for example the presence of a plosive, in order to apply the pitch period correction within the meaning of the invention; in a preferred embodiment, however, the aim is rather to detect whether the signal Se is strongly voiced (for example when the correlation with respect to the pitch period is very close to 1). If the signal is strongly voiced (which corresponds to the pronunciation of a stable vowel, for example "aaaa..."), then the signal Se is free of transients and the pitch period correction within the meaning of the invention need not be implemented. Preferably, the pitch period correction within the meaning of the invention is applied in all other cases.

The details of the detection of a degree of voicing are not presented here because they are known per se and are beyond the scope of the invention.

Referring again to FIG. 4, the synthesis SYNTH follows the model well known in the state of the art and called "source-filter". It consists in filtering the extrapolated excitation by an LPC filter. Here, the extrapolated excitation e(n) (where now n = 0, ..., L-1, L being the length of the frame to be extrapolated) is filtered by the inverse filter 1/A(z) (block 405). Then, the signal obtained is attenuated by block 407 as a function of an attenuation calculated in block 406, to be finally delivered as yl. The invention, as such, is carried out by block 403 of FIG. 4, whose functions are described in detail below.
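The source-filter synthesis can be sketched as an all-pole filtering of the extrapolated excitation followed by an attenuation. The attenuation law is not given at this point in the text, so a linear gain ramp is assumed purely for illustration; the function names are hypothetical.

```python
def synthesis_filter(e, a, past_output=()):
    """Filter e by 1/A(z): y(n) = e(n) - sum_{i=1..p} a_i y(n - i), with a[0] == 1.

    past_output holds previously synthesized samples (filter memory);
    missing memory is taken as zero.
    """
    p = len(a) - 1
    memory = [0.0] * p + list(past_output)
    y = list(memory)
    for v in e:
        y.append(v - sum(a[i] * y[-i] for i in range(1, p + 1)))
    return y[len(memory):]


def attenuate(y, start_gain=1.0, end_gain=0.5):
    """Apply a gain that moves linearly from start_gain to end_gain (assumed law)."""
    n = len(y)
    step = (end_gain - start_gain) / (n - 1) if n > 1 else 0.0
    return [(start_gain + step * i) * v for i, v in enumerate(y)]


synthesized = synthesis_filter([1.0, 0.0, 0.0], [1.0, -0.5])
damped = attenuate([2.0, 2.0, 2.0], start_gain=1.0, end_gain=0.5)
```

A single excitation pulse through 1/A(z) with a_1 = -0.5 produces a decaying response, and the gain ramp then lowers the level progressively across the frame.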

FIG. 5 shows, as an illustration, the principle of simple excitation repetition as performed in the state of the art. The excitation can be extrapolated by simply repeating the last pitch period TQ, that is to say by copying the succession of the last samples of the past excitation, the number of samples in this succession corresponding to the number of samples that includes the pitch period T ₀ .
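Simple repetition of the last pitch period can be sketched in a few lines. The function name `repeat_last_period` is hypothetical; `length` stands for the number of samples to extrapolate.

```python
def repeat_last_period(past_excitation, t0, length):
    """Extrapolate `length` samples by cyclically copying the last t0 samples."""
    if not 0 < t0 <= len(past_excitation):
        raise ValueError("t0 must be between 1 and the number of past samples")
    period = past_excitation[-t0:]
    return [period[n % t0] for n in range(length)]


extrapolated = repeat_last_period([1, 2, 3, 4, 5], t0=3, length=7)
```

Here the last three samples (3, 4, 5) are copied over and over until seven samples have been produced.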

Referring now to FIG. 6, before the last pitch period T₀ is repeated, it is modified in the sense of the invention as follows.

For each sample n = -T₀, ..., -1, the sample e(n) is modified into e_mod(n) according to a formula of the type:

$$e_{mod}(n) = \min\left( \max_{i=-k,\dots,+k} \left| e(n - T_0 + i) \right|,\; \left| e(n) \right| \right) \times \operatorname{sign}(e(n))$$

As indicated above, preferentially, this signal modification is not applied if the signal x1 (and therefore the input signal Se) is strongly voiced. Indeed, in the case of a strongly voiced signal, the simple repetition of the last pitch period, without modification, can give a better result, whereas a modification of the last pitch period and its repetition could result in a slight quality degradation.

FIG. 7 shows the processing corresponding to the application of this formula, in flowchart form, to illustrate the steps of the method according to one embodiment of the invention. The starting point is the past signal e(n) delivered by block 401. In step 70, the information as to whether or not the signal x1 is strongly voiced is obtained from the module 404 determining the degree of voicing. If the signal is strongly voiced (arrow O at the output of the test 71), the last pitch period of the valid blocks is copied as is in block 403 of FIG. 4, and the processing then continues directly with the inverse filtering 1/A(z) by the module 405.

On the other hand, if the signal x1 is not strongly voiced (arrow N at the output of the test 71), the last samples of the excitation signal e(n) corresponding to the last valid blocks received are modified, these samples extending over a pitch period T₀ (step 73) given by the module 402 of FIG. 4 (at step 72). In the embodiment illustrated in FIG. 7, all the samples e(n) over an entire pitch period T₀ are modified, with n lying between H₁-T₀+1 and H₁, e(H₁) corresponding to the last valid sample received (step 74). It will thus be understood, with these notations, that a sample e(n) with n between H₁-T₀+1 and H₁ simply belongs to the last pitch period validly received.

In step 75, each sample e(n) of the last pitch period is associated with a neighborhood NEIGH in the preceding pitch period, i.e. in the penultimate pitch period. This measure is advantageous but not necessary. The advantage it provides is described later. It is simply indicated here that this neighborhood comprises an odd number of samples, 2k + 1, in the example described. Of course, alternatively, this number may be even. Moreover, in the example of FIG. 6, k = 1. Indeed, with reference again to FIG. 6, it can be seen that the third sample of the last pitch period, denoted e(3), is selected (step 74) and the samples of the neighborhood NEIGH associated with it in the penultimate pitch period (step 75) are shown in bold and are e(2-T₀), e(3-T₀) and e(4-T₀). They are therefore distributed around e(3-T₀).

In step 76, the maximum, in absolute value, is determined among the samples of the neighborhood NEIGH (i.e. the sample e(2-T₀) in the example of FIG. 6). This feature is advantageous but not necessary. The advantage it provides is described later. Typically, as an alternative, one could choose to determine the average over the neighborhood NEIGH, for example.

In step 77, the minimum, in absolute value, is determined between the value of the current sample e(n) and the value of the maximum M found over the neighborhood NEIGH in step 76. In the example illustrated in FIG. 6, this minimum between e(3) and e(2-T₀) is indeed the sample e(2-T₀) of the penultimate pitch period. Still in this step 77, the amplitude of the current sample e(n) is then replaced by this minimum. In FIG. 6, the amplitude of the sample e(3) becomes equal to that of the sample e(2-T₀). The same method is applied to all the samples of the last period, from e(1) to e(12). In FIG. 6, the corrected samples are represented by dashed lines. The samples of the extrapolated pitch periods T_{j+1}, T_{j+2}, corrected according to the invention, are represented by closed arrows.

It will thus be understood that, by the advantageous implementation of this step 77, if a plosive is indeed present in the last pitch period T_j (high intensity of the signal, in absolute value, as represented in FIG. 6), one determines the minimum between this plosive intensity and that of the samples located substantially at the same time position in the preceding pitch period (the term "substantially" meaning here "to within a neighborhood of ± k", hence the advantage of performing step 75), and replaces, if necessary, the intensity of the plosive by a lower intensity belonging to the penultimate pitch period. On the other hand, if the intensity of the samples of the last pitch period T_j is lower than that of the penultimate period T_{j-1}, then, by selecting the minimum between the current sample e(3) and the intensity value e(2-T₀) in the penultimate pitch period T_{j-1}, the last period is not modified, which avoids the risk of copying a plosive (of high intensity) from the penultimate pitch period T_{j-1}.

Thus, in step 76, the maximum value M, in absolute value, of the samples of the neighborhood (and not another parameter such as the average over this neighborhood, for example) is determined so as to compensate for the effect of choosing the minimum in step 77 to replace the value e(n). This measure therefore avoids limiting too much the amplitude of the replacement pitch periods T_{j+1}, T_{j+2} (FIG. 6).

Moreover, the neighborhood determination step 75 is advantageously implemented because a pitch period is not always regular and, if a sample e(n) has a maximum intensity in a pitch period T₀, the same is not always true of a sample e(n + T₀) in the next pitch period. In addition, a pitch period may extend to a time position falling between two samples (at a given sampling frequency). This is referred to as a "fractional pitch". It is therefore always preferable to take a neighborhood centered around a sample e(n - T₀), if this sample e(n - T₀) is to be associated with a sample e(n) positioned in the following pitch period.

Finally, since the processing of steps 75 to 77 relates essentially to the absolute values of the samples, step 78 consists simply in reassigning the sign of the initial sample e(n) to the modified sample e_mod(n).

Steps 75 to 78 are repeated for the following sample e(n) (n becoming n + 1 in step 79), until the pitch period T₀ is exhausted (i.e. until the last valid sample e(H₁) is reached).

The modified signal e_mod(n) is thus delivered to the inverse filter 1/A(z) (reference 405 of FIG. 4) for the remainder of the decoding.
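Steps 75 to 78 can be sketched as follows. This is an illustrative reading of the flowchart rather than a reference implementation: the past excitation is assumed to be held in a list whose last element is the last valid sample, the pitch period is assumed to be an integer number of samples, and the function name is hypothetical. The neighborhood is clamped at the start of the buffer, which is a choice made for this sketch.

```python
import math


def correct_last_pitch_period(e, t0, k=1):
    """Return a copy of e in which each sample of the last pitch period
    is limited, in absolute value, by the maximum over a neighborhood of
    2k + 1 samples located one pitch period earlier."""
    if t0 <= 0 or len(e) < 2 * t0:
        raise ValueError("need at least two pitch periods of past excitation")
    n_last = len(e) - 1
    corrected = list(e)
    for n in range(n_last - t0 + 1, n_last + 1):
        center = n - t0
        # Step 75: neighborhood around e(n - T0), read from the original signal.
        neigh = e[max(0, center - k):center + k + 1]
        # Step 76: maximum M of the neighborhood, in absolute value.
        m = max(abs(v) for v in neigh)
        # Step 77: keep the smaller of |e(n)| and M.
        amp = min(abs(e[n]), m)
        # Step 78: restore the sign of the initial sample.
        corrected[n] = math.copysign(amp, e[n])
    return corrected
```

With `e = [0, 1, -2, 1, 0, 9, -1, 0.5]` and `t0 = 4`, the isolated sample of amplitude 9 is limited to 2, the largest amplitude found around the same position in the penultimate period, while the other samples of the last period keep their values.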

It should be noted, however, that two alternative embodiments are possible. It is possible to correct the last pitch period T_j, to apply this correction T'_j to the last pitch period T_j itself, and to copy the correction into the following pitch periods, i.e. T_j = T_{j+1} = T_{j+2} = T'_j. In a variant, the last pitch period T_j is left intact and its correction T'_j is copied into the following pitch periods T_{j+1} and T_{j+2}. The comparison of FIGS. 5 and 6 shows how the modification of the excitation thus made is advantageous. In short, in the case where a plosive is present in the last pitch period, it is automatically eliminated before the pitch repetition because it has no equivalent in the penultimate pitch period. This embodiment thus makes it possible to eliminate one of the most troublesome artifacts of pitch repetition, namely the repetition of plosives.

Furthermore, a faster attenuation of the synthesized and repeated signal is advantageously provided if a plosive is detected in the last pitch period. An exemplary embodiment of a transient detection, in general, can consist in counting the number of occurrences of the following condition (1):

$$\max_{i=-k,\dots,+k} \left| e(n - T_0 + i) \right| < \frac{\left| e(n) \right|}{4} \qquad (1)$$

If this condition is verified, for example, more than once in the current frame, then the past signal x1 contains a transient (for example a plosive), which makes it possible to force a fast attenuation by block 406 on the synthesis signal yl (e.g. attenuation over 10 ms).
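The counting described here, which claim 8 expresses as a ratio compared with a first threshold and a count compared with a second threshold, might be sketched as follows. The default ratio of 4 and the count threshold of 1 are assumptions for illustration, as are the function names; the slow attenuation duration is deliberately left as a required parameter because the description gives no value for it.

```python
def count_transient_samples(e, t0, k=1, ratio_threshold=4.0):
    """Count the samples of the last pitch period whose amplitude exceeds
    ratio_threshold times the neighborhood maximum one period earlier."""
    if t0 <= 0 or len(e) < 2 * t0:
        raise ValueError("need at least two pitch periods of past excitation")
    n_last = len(e) - 1
    count = 0
    for n in range(n_last - t0 + 1, n_last + 1):
        center = n - t0
        neigh = e[max(0, center - k):center + k + 1]
        m = max(abs(v) for v in neigh)
        # Written as a product to avoid dividing by a zero maximum.
        if abs(e[n]) > ratio_threshold * m:
            count += 1
    return count


def has_transient(e, t0, k=1, ratio_threshold=4.0, count_threshold=1):
    """Declare a transient when the condition holds more than
    count_threshold times (assumed default: more than once)."""
    return count_transient_samples(e, t0, k, ratio_threshold) > count_threshold


def fade_length_samples(transient, fs, slow_ms, fast_ms=10.0):
    """Attenuation length in samples: fast_ms (e.g. 10 ms) when a
    transient was detected, slow_ms otherwise (caller-supplied)."""
    ms = fast_ms if transient else slow_ms
    return int(round(fs * ms / 1000.0))
```

The result of `has_transient` can then drive the attenuation computed in block 406, a detected transient selecting the shorter fade.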

FIG. 2c then illustrates the decoded signal when the invention is implemented, for comparison with FIGS. 2a and 2b, for which a frame containing the plosive /t/ was lost. The repetition of the phoneme /t/ is avoided here, thanks to the implementation of the invention. The differences following the frame loss are not related to the actual plosive detection. In fact, the attenuation of the signal after the frame loss in FIG. 2c is explained by the fact that, in this case, the G.722 decoder is reset (complete update of the states in block 302), whereas in the case of FIG. 2b the G.722 decoder is not reset. It will be understood, however, that the invention relates to the detection of plosives for the extrapolation of an erased frame and not to the problem of restarting after a frame loss. Nevertheless, on listening, the signal shown in FIG. 2c is of better quality than that of FIG. 2b.

The present invention also relates to a computer program intended to be stored in memory of a device for synthesizing a digital audio signal. This program then comprises instructions for implementing the method in the sense of the invention, when it is executed by a processor of such a synthesis device. Moreover, Figure 7 described above can illustrate a flowchart of such a computer program.

Furthermore, the present invention also relates to a device for synthesizing a digital audio signal consisting of a succession of blocks. This device may also include a memory storing the aforementioned computer program and may consist of the block 403 of FIG. 4 with the features described above. With reference to FIG. 8, this device SYN comprises: an input E for receiving blocks of the signal e(n) preceding at least one current block to be synthesized, and an output S for delivering the synthesized signal e_mod(n) comprising at least this synthesized current block. The synthesis device SYN within the meaning of the invention comprises means such as a working memory MEM (or a memory storing the aforementioned computer program) and a processor PROC cooperating with this memory MEM, for implementing the method within the meaning of the invention and thus synthesizing the current block from at least one of the preceding blocks of the signal e(n).

The present invention also relates to a decoder of a digital audio signal consisting of a succession of blocks, this decoder comprising the device 403 within the meaning of the invention for synthesizing invalid blocks. More generally, the present invention is not limited to the embodiments described above by way of example; it extends to other variants.

In variant embodiments, the pitch period correction and/or transient detection parameters may be as follows. A neighborhood comprising a number of samples other than three can be considered in the penultimate pitch period. For example, k = 2 can be taken so as to consider five samples in all. Similarly, the threshold value for the transient detection (1/4 in the example of condition (1) above) can be adapted. In addition, the signal can be declared transient if the detection condition is satisfied at least m times, with m ≥ 1.

Moreover, the invention can also be applied to other contexts than that described above.

For example, the transient detection and the signal modification can be performed in the signal domain (rather than in the excitation domain). Typically, for the correction of frame losses in a CELP decoder (which also operates according to the source-filter model), the excitation is extrapolated by pitch repetition, possibly with the addition of a random contribution, and this excitation is filtered by a filter of type 1/A(z), where A(z) is derived from the last correctly received predictor filter.

It can also be applied to a G.711 decoder, just as naturally.

Of course, simply copying the penultimate pitch period T_{j-1} to constitute the new synthesized periods T_{j+1}, T_{j+2} would already overcome the problem of repetition of plosives, provided that care is also taken to detect plosives in the penultimate pitch period (for example using a condition of the type of condition (1) above). This embodiment is within the scope of the invention.

In addition, for clarity of the above discussion, there is described a sample correction, in step b), followed by copying of the corrected samples into the replacement block (s). Of course and in a strictly equivalent way technically, it is also possible to first copy the samples from the last repetition period and then correct them all in the replacement block (s).

Thus, sample correction and copying can be steps that can occur in any order and, in particular, be reversed.

Claims

1. A method of synthesizing a digital audio signal represented by successive blocks of samples, wherein, upon receiving such a signal, to replace at least one invalid block, a replacement block is generated from samples of at least one valid block, the method comprising the steps of: a) determining (402) a repetition period in at least one valid block, and b) copying (403) the samples of the repetition period into at least one replacement block, characterized in that: in step a), a last repetition period (T_j) is determined in at least one valid block immediately preceding an invalid block, in step b), samples (e(3)) of said last repetition period (T_j) are corrected as a function of samples (e(2-T₀), e(3-T₀), e(4-T₀)) of a repetition period (T_{j-1}) preceding said last repetition period, to limit the amplitude of a possible transient signal in said last repetition period, and the samples thus corrected are copied into said replacement block (T_{j+1}, T_{j+2}).

2. The method of claim 1, wherein the signal is a voiced speech signal, characterized in that the repetition period is a pitch period corresponding to the inverse of a fundamental frequency of the signal.

3. Method according to one of claims 1 and 2, characterized in that, in step b), a current sample (e (3)) of the last repetition period is corrected, by comparing: the amplitude of this current sample, in absolute value, with the amplitude, in absolute value, of at least one sample (e (2-T ₀ )) temporally positioned substantially at a repetition period before the current sample, and assigning to the current sample the minimum amplitude, in absolute value, of these two amplitudes.

4. Method according to claim 3, characterized in that, for a current sample (e(3)) of the last repetition period: a set of samples (75) is formed in a neighborhood centered around a sample (e(3-T₀)) temporally positioned at a repetition period before the current sample, a selected amplitude (76) is determined among the amplitudes of the samples of said neighborhood, taken in absolute value, and this selected amplitude is compared with the amplitude of the current sample, in absolute value, for assigning (77) to the current sample (e(3)) the minimum amplitude, in absolute value, among the selected amplitude and the amplitude of the current sample.

5. Method according to claim 4, characterized in that the amplitude chosen from the amplitudes of the samples of said neighborhood is the maximum amplitude in absolute value (M).

6. Method according to one of the preceding claims, wherein the digital audio signal is a speech signal, characterized in that a degree of voicing is detected in the speech signal (71), and in that steps a) and b) are implemented if the speech signal is not voiced or is weakly voiced.

7. Method according to one of the preceding claims, wherein a damping of the amplitude of the samples is applied in said replacement block, characterized in that a possible transient character of the signal is detected in the last repetition period and, if necessary, a faster damping is applied than for a stationary signal.

8. The method according to claim 7, taken in combination with one of claims 3 and 4, characterized in that, for a plurality of current samples of the last repetition period, a ratio, in absolute value, of the amplitude of a current sample to said selected amplitude is formed, and the number of occurrences, for said current samples, for which said ratio is greater than a first predetermined threshold is counted, and the presence of a transient character is detected if the number of occurrences is greater than a second predetermined threshold.

9. Method according to one of the preceding claims, characterized in that, in the case of a reception of a plurality of consecutive invalid blocks extending over at least one repetition period, the sample correction of step b) is applied to all the samples of the last repetition period, each taken in turn as the current sample.

10. Method according to claim 9, characterized in that, in the case of a reception of a plurality of consecutive invalid blocks extending over several repetition periods, to replace said plurality of invalid blocks, one copies several times the repetition period corrected in step b) to form the replacement blocks.

11. Computer program intended to be stored in memory of a device for synthesizing a digital audio signal, characterized in that it comprises instructions for implementing the method according to one of claims 1 to 10 when it is executed by a processor of such a synthesis device.

12. Device for synthesizing a digital audio signal consisting of a succession of blocks, comprising: an input (E) for receiving blocks of the signal (e(n)) preceding at least one current block to be synthesized, and an output (S) for delivering the synthesized signal (e_mod(n)) comprising at least said current block, characterized in that it comprises means (MEM, PROC) for implementing the method according to one of claims 1 to 10 for synthesizing the current block from at least one of said preceding blocks.

13. Decoder of a digital audio signal consisting of a succession of blocks, characterized in that it further comprises a device (403) according to claim 12, for synthesizing invalid blocks.