EP3138095A1 - Improved frame loss correction with voice information - Google Patents

Improved frame loss correction with voice information

Info

Publication number
EP3138095A1
Authority
EP
European Patent Office
Prior art keywords
signal
components
frame
decoding
period
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
EP15725801.3A
Other languages
German (de)
French (fr)
Other versions
EP3138095B1 (en)
Inventor
Julien Faure
Stéphane RAGOT
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Orange SA
Original Assignee
Orange SA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Orange SA filed Critical Orange SA
Publication of EP3138095A1 publication Critical patent/EP3138095A1/en
Application granted granted Critical
Publication of EP3138095B1 publication Critical patent/EP3138095B1/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/005 Correction of errors induced by the transmission channel, if related to the coding algorithm
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/028 Noise substitution, i.e. substituting non-tonal spectral components by noisy source
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/81 Detection of presence or absence of voice signals for discriminating voice from music
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/18 Vocoders using multiple modes
    • G10L19/20 Vocoders using multiple modes using sound class specific coding, hybrid encoders or object based coding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals
    • G10L2025/932 Decision in previous or following frames
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93 Discriminating between voiced and unvoiced parts of speech signals

Definitions

  • the present invention relates to the field of coding/decoding in telecommunications, and more particularly to frame loss correction at decoding.
  • a "frame" is understood to mean an audio segment composed of at least one sample (so that the invention applies equally to the loss of one or more samples coded according to the G.711 standard and to the loss of one or more packets of samples coded according to G.723, G.729, etc.).
  • the loss of audio frames occurs when a real-time communication using an encoder and a decoder is disturbed by the conditions of a telecommunication network (radio frequency problems, congestion of the access network, etc.).
  • the decoder uses frame loss correction mechanisms to try to substitute the missing signal with a reconstructed signal by using information available to the decoder (for example the audio signal already decoded for one or more past frames). This technique can maintain quality of service despite degraded network performance.
  • Frame loss correction techniques are most often very dependent on the type of coding used.
  • in the case of CELP coding, it is common to repeat certain parameters decoded in the previous frame (spectral envelope, pitch, codebook gains), with adjustments such as a modification of the spectral envelope to converge towards an average envelope, or the use of a random fixed codebook.
  • the document FR 1350845 proposes a hybrid method that combines the advantages of the two methods by making it possible to maintain phase continuity in the transformed domain.
  • the present invention falls within this framework.
  • a detailed description of the solution that is the subject of document FR 1350845 is given below with reference to FIG. 1.
  • this solution, even if particularly promising, remains to be perfected because, when the coded signal has only one fundamental period ("mono-pitch"), such as a voiced segment of a speech signal, the audio quality after lost-frame correction can be degraded and worse than with a frame loss correction based on a speech model of the CELP ("Code-Excited Linear Prediction") type, for example.
  • the invention improves the situation.
  • the method comprises the steps of period search, spectral analysis and synthesis described below.
  • the amount of noise added to the addition of components is weighted according to a voicing information of the valid signal obtained at decoding.
  • the voicing information used at decoding, transmitted at at least one coder bit rate, makes it possible to give more weight to the sinusoidal components of the past signal if this signal is voiced, or to give more weight to the noise otherwise, which gives a much more satisfying audible result.
  • the complexity of the processing is then advantageously reduced, in particular in the case of an unvoiced signal, without degrading the quality of the synthesis.
  • this noise signal is weighted by a smaller gain when the valid signal is voiced.
  • this noise signal can be obtained from the previously received frame as a residue between the received signal and the addition of the selected components.
  • the number of components selected for the addition is greater when the valid signal is voiced.
  • a complementary embodiment may be chosen, in which more components are selected if the signal is voiced while minimizing the gain to be applied to the noise signal.
  • the overall amount of energy attenuated by applying a gain smaller than 1 on the noise signal is partially offset by the selection of more components.
  • the gain to be applied to the noise signal is not decreased and fewer components are selected if the signal is unvoiced or only slightly voiced.
  • the aforesaid period can be sought in a valid signal segment of greater duration in case of voicing of the valid signal.
  • a search is carried out, by correlation in the valid signal, for a repetition period typically corresponding to at least one pitch period if the signal is voiced; in that case, especially for male voices, the pitch search can be carried out over more than 30 milliseconds, for example.
  • the voicing information is provided in a coded stream received at decoding and corresponding to the aforementioned signal comprising a succession of samples distributed in successive frames. In case of frame loss at decoding, the voicing information contained in a valid signal frame preceding the lost frame is then used.
  • the voicing information thus comes from an encoder that generates the coded stream and determines the voicing information; in a particular embodiment, the voicing information is coded on a single bit in the coded stream.
  • the encoder generation of this voicing data may be conditioned by the fact that the rate is sufficient or not on a communication network between the encoder and the decoder. For example, if the rate is below a threshold, this voicing data is not transmitted by the encoder to save bandwidth.
  • the last voicing information acquired at the decoder can be used for the frame synthesis, or alternatively it may be decided to assume the unvoiced case for the frame synthesis.
  • the value taken by the gain applied to the noise signal can also be binary: if the signal is voiced, the value of the gain is set to 0.25, and it is 1 otherwise.
  • the voicing information comes from an encoder determining a flatness or harmonicity value of the spectrum (obtained, for example, by comparing the amplitudes of the spectral components of the signal with a background noise), the encoder then delivering this value in binary form in the coded stream (on more than one bit).
  • the value of the gain may be a function of the aforementioned flatness value (for example according to a continuous variation increasing as a function of this value).
  • said flatness value can be compared to a threshold to determine:
  • that the signal is voiced if the flatness value is below the threshold, and that it is not voiced otherwise.
  • the selection criteria of the components and / or choice of signal segment duration in which the pitch is sought may be binary.
  • if the signal is voiced, the spectral components whose amplitudes are greater than those of the first neighboring spectral components are selected, as well as those first neighboring components; otherwise, only the spectral components whose amplitudes are greater than those of their first neighbors are selected, and
  • for the pitch search segment duration, for example:
  • the period is sought in a valid signal segment of duration greater than 30 milliseconds (for example 33 milliseconds),
  • the period is searched for in a valid signal segment of duration less than 30 milliseconds (for example 28 milliseconds).
  • the purpose of the invention is to improve the state of the art in the sense of document FR 1350845 by modifying various stages of the processing presented in that document (pitch search, selection of components, noise injection), depending in particular on characteristics of the original signal.
  • these characteristics of the original signal may be coded as specific information in the data stream towards the decoder (the "bitstream"), depending on the speech/music classification and, where applicable, on the speech class in particular.
  • this information in the stream at decoding makes it possible to optimize the trade-off between complexity and quality and, jointly, to: modify the gain of the noise to be injected into the sum of the selected spectral components, modify the number of components selected for the synthesis, and modify the duration of the pitch search segment.
  • such an embodiment can be implemented in an encoder for the determination of the voicing information, and more particularly in a decoder, particularly in the case of frame loss. It can be implemented as software in a coding/decoding implementation for Enhanced Voice Services ("EVS") specified by the 3GPP (SA4) group.
  • the present invention also provides a computer program comprising instructions for implementing the above method, when the program is executed by a processor. An example of a flowchart of such a program is presented in the detailed description below with reference to FIG. 4 for the decoding and with reference to FIG. 3 for the coding.
  • the present invention also relates to a device for decoding a digital audio signal comprising a succession of samples distributed in successive frames.
  • the device comprises means (such as a processor and a memory, or an ASIC component or other circuit) for replacing at least one lost signal frame by the period search, spectral analysis and synthesis steps described above.
  • the present invention also provides a device for coding a digital audio signal, comprising means (such as a memory and a processor, or an ASIC component or other circuit) for providing voicing information in the coded stream delivered by the coding device, distinguishing a speech signal, which may be voiced, from a music signal.
  • FIG. 2 schematically illustrates the main steps of a method within the meaning of the invention
  • FIG. 3 illustrates an example of steps implemented in coding, in one embodiment in the sense of the invention
  • FIG. 4 illustrates an example of steps implemented in decoding, in one embodiment in the sense of the invention
  • FIG. 5 illustrates an example of steps implemented at decoding, for the pitch search in a valid signal segment Nc
  • FIG. 6 schematically illustrates an exemplary encoder and decoder device within the meaning of the invention.
  • the audio buffer corresponds to the previous samples 0 to N-1.
  • in the case of transform coding, the audio buffer corresponds to the samples of the previous frame and cannot be modified, because this type of coding/decoding does not provide for any delay in the rendering of the signal, so that no cross-fade of sufficient duration to cover a frame loss is provided for.
  • Fc denotes the separation frequency between the low band and the high band.
  • This filtering is preferably a filtering without delay.
  • this filtering step may be optional, the following steps being carried out in full band.
  • the next step S3 consists in searching in the low band for a loop point and a segment p(n) corresponding to the fundamental period (or "pitch" hereinafter) within the buffer b(n) resampled at the frequency Fc.
  • this makes it possible to take into account the continuity of the pitch in the lost frame(s) to be reconstructed.
  • step S4 consists in breaking down the segment p(n) into a sum of sinusoidal components.
  • the discrete Fourier transform (DFT) of the signal p(n) can be computed over a duration corresponding to the length of the signal. This gives the frequency, phase and amplitude of each of the sinusoidal components (or "peaks") that make up the signal.
  • Other transforms than DFT are possible. For example, transforms of DCT, MDCT or MCLT type can be implemented.
  • Step S5 is a step of selecting K sinusoidal components so as to keep only the most important components.
  • the selection of the components corresponds firstly to selecting the amplitudes A(n) for which A(n) > A(n-1) and A(n) > A(n+1), with n ∈ [0; P/2 - 1], which ensures that the amplitudes correspond to the spectral peaks.
  • the analysis by FFT is thus performed more efficiently over a length which is a power of 2, without changing the effective pitch period (by interpolation).
  • step S6, the sinusoidal synthesis, consists in generating a segment s(n) of length at least equal to the size of the lost frame (T).
  • the synthesis signal s(n) is calculated as a sum of the selected sinusoidal components, for example s(n) = Σ_k A(k) cos(2π f(k) n + φ(k)), where A(k), f(k) and φ(k) are the amplitude, frequency and phase of the k-th selected peak.
  • Step S7 consists of "injecting noise" (filling the spectral zones corresponding to the unselected lines) so as to compensate for the energy loss linked to the omission of certain frequency peaks in the low band.
  • a particular embodiment consists in calculating the residue r(n) between the segment corresponding to the pitch, p(n), and the synthesized signal s(n).
  • this residue of size P is transformed, for example windowed and repeated with overlap between windows of variable sizes, as described in document FR 1353551.
  • Step S8 applied to the high band may simply consist of repeating the past signal.
  • in step S9, the signal is synthesized by resampling the low band to its original frequency, after being mixed with the high band filtered in step S8 (simply repeated in step S11).
  • step S10 is an overlap-add that ensures continuity between the signal before the frame loss and the synthesized signal.
  • voicing information about the signal before the frame loss, transmitted at at least one coder bit rate, is used at decoding (step DI-1) to quantitatively determine a proportion of noise to be added to the synthesis signal replacing one or more lost frames.
  • the decoder uses the voicing information to decrease, as a function of the voicing, the overall amount of noise mixed with the synthesis signal (by assigning a lower gain G(res) to the noise signal r'(k) obtained from a residue in step DI-3, and/or by selecting more amplitude components A(k) to be used for the construction of the synthesis signal in step DI-4).
  • the decoder can further adjust its parameters, including the pitch search, to optimize the quality/complexity trade-off of the processing, according to the voicing information. For example, for the pitch search, if the signal is voiced, the pitch search window Nc can be larger (in step DI-5), as will be seen later with reference to FIG. 5.
  • this information can be provided by the encoder in two ways, at at least one coder bit rate:
  • this "flatness" data Pl of the spectrum can be received on several bits at the decoder in the optional step DI-10 of FIG. 2, then compared with a threshold in step DI-11, which amounts to determining in steps DI-1 and DI-2 whether the voicing is above or below a threshold, and deducing the appropriate processing, in particular for the selection of peaks and for the choice of the duration of the pitch search segment.
  • this information (whether in the form of a single bit or of a multi-bit value) is received from the encoder (at at least one coder bit rate) in the example described here.
  • the input signal, presented in the form of frames in step C1, is analyzed in step C2.
  • the analysis step consists in determining whether the audio signal of the current frame has characteristics that would require special processing in the event of frame loss at the decoder, as is the case, for example, for voiced speech signals.
  • such an analysis may rely on a speech/music classification or the like.
  • a classification at the coder already makes it possible to adapt the technique used for the coding according to the nature of the signal (speech or music).
  • predictive coders such as, for example, the coder according to the G.718 standard also use a classification so as to adapt the parameters of the coder to the nature of the signal (voiced / unvoiced, transient, generic, inactive).
  • a bit of "characterization for frame loss" is reserved. It is added to the coded stream (or bitstream) in step C3 to indicate whether the signal is a speech signal (voiced or generic). This bit is, for example, set to 1 or to 0 according to the table below:
  • Voiced: 1; Generic: 1; Transient: 0; Unvoiced: 0; Inactive: 0
  • here, "generic" means a usual speech signal (one which is not a transient related to the pronunciation of a plosive, which is not inactive, and which is not necessarily purely voiced like the pronunciation of a vowel without a consonant).
  • in a variant, the information transmitted to the decoder in the coded stream is not binary but corresponds to a quantization of the ratio between the peak levels and the valley levels in the spectrum. This ratio can be expressed by a measure of "flatness" of the spectrum, denoted Pl, for example the ratio in dB between the geometric mean and the arithmetic mean of the amplitude spectrum: Pl = 10 log10 [ (Π_{k=0..N-1} x(k))^{1/N} / ( (1/N) Σ_{k=0..N-1} x(k) ) ],
  • where x(k) is the amplitude spectrum of size N resulting from the analysis of the current frame in the frequency domain (after FFT).
  • in a variant, a sinusoidal analysis decomposing the signal into sinusoidal components and noise is available at the encoder, and the flatness measure is obtained as the ratio between the energy of the sinusoidal components and the overall energy of the frame.
  • this voicing information (the single-bit flag or the flatness measure over several bits) is added to the coded stream in step C3.
  • the audio buffer of the encoder is conventionally coded in a step C4 before possible subsequent transmission to the decoder.
  • in the case where there is no frame loss in step D1 (KO arrow at the output of test D1 in FIG. 4), the decoder reads the information contained in the coded stream, including the "characterization for frame loss" information, in step D2 (at at least one coder bit rate). This information is stored in memory so that it can be reused if a next frame is missing. The decoder then continues the conventional decoding steps D3, etc., to obtain the synthesized output frame SYNTH. In the case where a loss of frame(s) occurs (OK arrow at the output of test D1), steps D4, D5, D6, D7, D8 and D12 are applied, corresponding respectively to steps S2, S3, S4, S5, S6 and S11 of FIG. 1. In particular:
  • step D5 (like step S3) searches for a loop point for the determination of the pitch, and
  • step D7 (like step S5) selects the sinusoidal components.
  • the noise injection in step S7 of FIG. 1 is carried out with a gain determination according to two steps D9 and D10 in FIG. 4 of the decoder within the meaning of the invention.
  • the invention consists in modifying the processing of steps D5, D7 and D9-D10, as follows.
  • the "characterization for frame loss” information is binary, and of value:
  • step D5 consists in finding a loop point and a segment p(n) corresponding to the pitch within the audio buffer resampled at the frequency Fc.
  • this technique, described in document FR 1350845, is illustrated in FIG. 5, in which:
  • the audio buffer at the decoder has a size of N' samples,
  • the loop point, denoted Pt Boucl, is located Ns samples from the correlation maximum,
  • a normalized correlation corr(n) is calculated between the target buffer segment of size Ns lying between N'-Ns and N'-1 (with a duration of, for example, 6 ms) and the sliding segment of size Ns which begins between samples 0 and Nc (with Nc ≤ N'-Ns, so that the sliding segment stays within the buffer): corr(n) = Σ_{k=0..Ns-1} b(n+k) b(N'-Ns+k) / √( Σ_{k=0..Ns-1} b(n+k)² · Σ_{k=0..Ns-1} b(N'-Ns+k)² ), for n ∈ [0; Nc].
  • in step D7 of FIG. 4, sinusoidal components are selected so as to keep only the most important components.
  • the first selection of components retains the amplitudes A(n) for which A(n) > A(n-1) and A(n) > A(n+1), with n ∈ [0; P/2 - 1].
  • the signal that one seeks to reconstruct is a speech signal (voiced or generic), hence with marked peaks and a low noise level.
  • this modification notably makes it possible to lower the noise level (in particular the level of the noise injected in steps D9 and D10 presented below) with respect to the level of the signal synthesized by sinusoidal synthesis in step D8, while maintaining an overall energy level sufficient not to cause audible artifacts related to energy fluctuations.
  • the voicing information is advantageously used here to attenuate the noise by applying a gain G at step D10.
  • the signal s(n) resulting from step D8 is mixed with the noise signal r'(n) resulting from step D9, applying however a gain G which depends on the "characterization for frame loss" information from the coded stream of the previous frame, namely:
  • G may be a constant equal to 0.25 or 1 depending on the voiced or unvoiced nature of the signal of the preceding frame, for example: voiced (bit 1): G = 0.25; unvoiced (bit 0): G = 1.
  • the "frame loss characterization" information has several discrete levels characterizing the PI flatness of the spectrum.
  • the gain G can be expressed directly as a function of the value Pl. The same applies to the limit of the segment Ne for the search for pitch and / or for the number of peaks An to be taken into account for the synthesis of the signal.
  • for example, a processing can be defined as follows:
  • the gain G is defined directly as a function of the value Pl (for example as a continuously increasing function of Pl),
  • the value Pl is compared to a threshold, for example -3 dB, bearing in mind that the value 0 corresponds to a flat spectrum and that -5 dB corresponds to a spectrum with sharp peaks,
  • if Pl is below the threshold, the duration of the pitch search segment Nc is set to 33 ms and the peaks A(n) such that A(n) > A(n-1) and A(n) > A(n+1) are selected, as well as the first neighboring peaks A(n-1) and A(n+1),
  • otherwise, the duration Nc can be chosen shorter, for example 25 ms, and only the peaks A(n) such that A(n) > A(n-1) and A(n) > A(n+1) are selected.
  • the decoding can then continue by mixing the noise, whose gain has thus been obtained, with the components thus selected, to obtain the synthesis signal in the low frequencies at step D13; this is added to the synthesis signal in the high frequencies obtained at step D14, to obtain in step D15 the overall synthesized signal.
  • FIG. 6 illustrates a decoder DECOD comprising, for example, hardware and software means such as a suitably programmed memory MEM and a processor PROC cooperating with this memory,
  • or a component such as an ASIC or other, as well as a communication interface COM,
  • for exploiting voicing information that it receives from a coder COD.
  • this coder comprises, for example, hardware and software means such as a memory MEM' suitably programmed for determining the voicing information and a processor PROC' cooperating with this memory, or alternatively a component such as an ASIC or other, as well as a communication interface COM'.
  • the coder COD is implemented in a telecommunication device such as a telephone TEL'.
  • the present invention is not limited to the embodiments described above by way of example; it extends to other variants.
  • the voicing information can take various forms. In the example described above, it may be a binary value on a single bit (voiced or not), or a value on several bits which may relate to a parameter such as the flatness of the signal spectrum, or to any other parameter characterizing (quantitatively or qualitatively) a voicing.
  • this parameter can be determined at decoding, for example according to the degree of correlation that can be measured during the identification of the pitch period.
  • an embodiment has been presented by way of example comprising a separation of the signal from previous valid frames into a high frequency band and a low frequency band, with in particular a selection of the spectral components in the low frequency band. Nevertheless, this embodiment, although advantageous in that it reduces the complexity of the processing, is optional.
  • the frame replacement method assisted by the voicing information in the sense of the invention can alternatively be carried out by considering the entire spectrum of the valid signal.
  • the aforementioned noise signal can be obtained from the residue (between the valid signal and the sum of the peaks) by weighting this residue temporally. For example, it can be weighted by overlapping windows, as in the usual framework of transform coding/decoding with overlap; a sketch of such a weighting is given below.
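By way of illustration, here is a minimal sketch of such a temporal weighting of the residue by overlapping windows, in Python with NumPy. The sine-squared window, the 50% hop and the fixed window size are illustrative assumptions, not the exact variable-size scheme of FR 1353551.

```python
import numpy as np

def extend_noise(residue, out_len):
    """Extend a short residue to out_len samples by overlap-adding windowed
    copies at 50% hop (sin^2 windows sum to 1 at this hop)."""
    win_len = len(residue)
    hop = max(1, win_len // 2)
    win = np.sin(np.pi * (np.arange(win_len) + 0.5) / win_len) ** 2
    out = np.zeros(out_len + win_len)
    pos = 0
    while pos < out_len:
        out[pos:pos + win_len] += win * residue
        pos += hop
    return out[:out_len]
```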

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)

Abstract

The invention relates to the processing of a digital audio signal, including a series of samples distributed in consecutive frames. The processing is implemented in particular when decoding said signal in order to replace at least one signal frame lost during decoding. The method includes the following steps: a) searching, in a valid signal segment available when decoding, for at least one period in the signal, determined in accordance with said valid signal; b) analysing the signal in said period, in order to determine spectral components of the signal in said period; c) synthesising at least one frame for replacing the lost frame, by construction of a synthesis signal from: an addition of components selected among said predetermined spectral components, and a noise added to the addition of components. In particular, the amount of noise added to the addition of components is weighted in accordance with voice information of the valid signal, obtained when decoding.

Description

Improved frame loss correction with voicing information
The present invention relates to the field of coding/decoding in telecommunications, and more particularly to frame loss correction at the decoder.
A "frame" is understood to mean an audio segment composed of at least one sample (so that the invention applies equally to the loss of one or more samples coded according to the G.711 standard and to the loss of one or more packets of samples coded according to the G.723, G.729, etc. standards).
Audio frame losses occur when a real-time communication using an encoder and a decoder is disturbed by the conditions of a telecommunication network (radio-frequency problems, congestion of the access network, etc.). In that case, the decoder uses frame loss correction mechanisms to try to substitute the missing signal with a signal reconstructed from information available at the decoder (for example, the audio signal already decoded for one or more past frames). This technique can maintain quality of service despite degraded network performance.
Frame loss correction techniques are most often highly dependent on the type of coding used.
In the case of CELP coding, it is common to repeat certain parameters decoded in the previous frame (spectral envelope, pitch, codebook gains), with adjustments such as modifying the spectral envelope to converge towards an average envelope, or using a random fixed codebook.
The technique most commonly used to correct frame loss in the case of transform coding consists in repeating the last received frame if a frame is lost, and in setting the repeated frame to zero as soon as more than one frame is lost. This technique is found in several standardized codecs (G.719, G.722.1, G.722.1C).
One can also cite the case of standardized G.711 coding, for which an example of frame loss correction, described in Appendix I of G.711, consists in identifying a fundamental period (called the "pitch") in the already decoded signal and repeating it, taking care to perform an overlap-add between the already decoded signal and the repeated signal. This overlap-add "smooths out" audio artifacts, but its implementation requires an additional delay at the decoder (corresponding to the duration of the overlap). Moreover, in the case of G.722.1 coding, a Modulated Lapped Transform (MLT) with a 50% overlap-add and sinusoidal windows ensures a transition between the last lost frame and the repeated frame that is slow enough to erase the artifacts linked to simple frame repetition, in the case of a single lost frame. Unlike the frame loss correction described in G.711 (Appendix I), this approach requires no additional delay, since it exploits the existing delay and the time-domain aliasing of the MLT transform to perform an overlap-add with the reconstructed signal. This technique is very inexpensive, but its main defect is an inconsistency between the signal decoded before the frame loss and the repeated signal. This results in a phase discontinuity that can produce significant audio artifacts if the overlap duration between the two frames is short, as is the case when the windows used for the MLT transform are "low delay", as described in document FR 1350845 with reference to Figures 1A and 1B of that document. In that case, even a solution combining a pitch search as in the G.711 coder (Appendix I) with an overlap-add according to the MLT window is not sufficient to suppress the audio artifacts. Document FR 1350845 proposes a hybrid method that combines the advantages of the two methods while making it possible to maintain phase continuity in the transform domain. The present invention falls within this framework. A detailed description of the solution that is the subject of document FR 1350845 is given below with reference to FIG. 1. This solution, even if particularly promising, remains to be perfected because, when the coded signal has only one fundamental period ("mono-pitch"), such as a voiced segment of a speech signal, the audio quality after lost-frame correction can be degraded and worse than with a frame loss correction based on a speech model of the CELP ("Code-Excited Linear Prediction") type, for example.
The invention improves the situation.
To that end, it proposes a method for processing a digital audio signal comprising a succession of samples distributed in successive frames, the method being implemented during decoding of said signal in order to replace at least one signal frame lost at decoding.
The method comprises the following steps (a minimal sketch of steps b) and c) is given after this list):
a) searching, in a valid signal segment available at decoding, for at least one period in the signal, determined as a function of said valid signal;
b) analyzing the signal within said period, to determine spectral components of the signal in said period;
c) synthesizing at least one replacement frame for the lost frame, by constructing a synthesis signal from:
- an addition of components selected from among said determined spectral components, and
- a noise added to the addition of components.
In particular, the amount of noise added to the addition of components is weighted as a function of voicing information of the valid signal, obtained at decoding.
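As a rough illustration of steps b) and c), here is a minimal sketch in Python with NumPy. The function names, the amplitude normalization and the noise handling are assumptions for illustration, not the exact processing of the invention (DC and Nyquist scaling are ignored for brevity).

```python
import numpy as np

def analyze_pitch_segment(p):
    """Step b): DFT of one pitch period, giving amplitude, phase and
    normalized frequency for each spectral component."""
    spec = np.fft.rfft(p)
    amp = np.abs(spec) / len(p)
    phase = np.angle(spec)
    freq = np.arange(len(spec)) / len(p)   # cycles per sample
    return amp, phase, freq

def synthesize_frame(amp, phase, freq, selected, frame_len, noise, gain):
    """Step c): sum of the selected sinusoids, plus gain-weighted noise."""
    n = np.arange(frame_len)
    s = np.zeros(frame_len)
    for k in selected:
        s += 2.0 * amp[k] * np.cos(2.0 * np.pi * freq[k] * n + phase[k])
    return s + gain * noise[:frame_len]
```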
Advantageously, the voicing information used at decoding, transmitted at at least one coder bit rate, makes it possible to give more weight to the sinusoidal components of the past signal if that signal is voiced, or more weight to the noise otherwise, which gives a much more satisfying audible result. Indeed, in the case of an unvoiced signal, or of a music signal, it is not useful to keep as many components for synthesizing the signal that replaces the lost frame; in that case, more weight can be given to the noise injected for the synthesis. The complexity of the processing is then advantageously reduced, in particular for an unvoiced signal, without degrading the quality of the synthesis. In an embodiment where a noise signal is added to the components, this noise signal is weighted by a smaller gain when the valid signal is voiced. For example, this noise signal can be obtained from the previously received frame as a residue between the received signal and the addition of the selected components (see the sketch below).
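A sketch of this residue-based noise signal and of its voicing-dependent weighting (the binary gain values 0.25 and 1 are taken from the single-bit embodiment described below; the function itself is illustrative):

```python
import numpy as np

def weighted_noise(pitch_segment, sinusoidal_part, voiced):
    """Noise signal = residue between the received segment and the sum of the
    selected components, weighted by a smaller gain when the signal is voiced."""
    residue = np.asarray(pitch_segment) - np.asarray(sinusoidal_part)[:len(pitch_segment)]
    gain = 0.25 if voiced else 1.0
    return gain * residue
```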
In a complementary or alternative embodiment, the number of components selected for the addition is greater when the valid signal is voiced. Thus, if the signal is voiced, more account is taken of the spectrum of the past signal, as indicated above.
Advantageously, a complementary embodiment may be chosen in which more components are selected if the signal is voiced, while minimizing the gain to be applied to the noise signal. The overall amount of energy attenuated by applying a gain smaller than 1 to the noise signal is thus partially offset by the selection of more components. Conversely, the gain applied to the noise signal is not decreased, and fewer components are selected, if the signal is unvoiced or only weakly voiced.
It is furthermore possible to improve the quality/complexity trade-off at decoding: in step a), the aforementioned period can be sought in a valid signal segment of greater duration when the valid signal is voiced. In an exemplary embodiment presented in the detailed description below, a search is carried out, by correlation in the valid signal, for a repetition period typically corresponding to at least one pitch period if the signal is voiced; in that case, especially for male voices, the pitch search can be carried out over more than 30 milliseconds, for example.
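A sketch of such a correlation-based period search with a voicing-dependent search duration (Python/NumPy; the 6 ms target length and the 33/28 ms durations follow the examples in this text, while the function structure is an assumption):

```python
import numpy as np

def find_period(b, fs, voiced, target_ms=6):
    """Find the lag maximizing the normalized correlation between the last
    target_ms of the past signal b and earlier segments; search over a longer
    history when the signal is voiced."""
    ns = int(fs * target_ms / 1000)              # target segment size Ns
    search_ms = 33 if voiced else 28             # voicing-dependent duration
    nc = min(int(fs * search_ms / 1000), len(b) - ns)
    target = b[-ns:]
    best_n, best_corr = len(b) - ns - 1, -np.inf
    for n in range(len(b) - ns - nc, len(b) - ns):
        seg = b[n:n + ns]
        denom = np.sqrt(np.sum(seg ** 2) * np.sum(target ** 2)) + 1e-12
        c = np.dot(seg, target) / denom
        if c > best_corr:
            best_corr, best_n = c, n
    return (len(b) - ns) - best_n                # candidate period in samples
```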
In an optional embodiment, the voicing information is provided in a coded stream received at decoding and corresponding to the aforementioned signal comprising a succession of samples distributed in successive frames. In case of frame loss at decoding, the voicing information contained in a valid signal frame preceding the lost frame is then used.
Thus, the voicing information comes from an encoder that generates the coded stream and determines the voicing information; in a particular embodiment, the voicing information is coded on a single bit in the coded stream. Nevertheless, as an exemplary embodiment, the generation of this voicing data at the encoder may be conditioned on whether or not the bit rate on the communication network between the encoder and the decoder is sufficient. For example, if the rate is below a threshold, this voicing data is not transmitted by the encoder, in order to save bandwidth. In that case, purely by way of example, the last voicing information acquired at the decoder can be used for the frame synthesis, or alternatively it may be decided to assume the unvoiced case for the frame synthesis. In the embodiment where the voicing information is coded on a single bit in the coded stream, the value taken by the gain applied to the noise signal can also be binary: if the signal is voiced, the gain is set to 0.25, and it is 1 otherwise.
In a variant, the voicing information comes from an encoder that determines a flatness or harmonicity value of the spectrum (obtained, for example, by comparing the amplitudes of the spectral components of the signal with a background noise), the encoder then delivering this value in binary form in the coded stream (on more than one bit).
In such a variant, the value of the gain may be a function of the aforementioned flatness value (for example, varying continuously and increasingly as a function of this value).
In general, said flatness value can be compared to a threshold to determine:
- that the signal is voiced if the flatness value is below the threshold, and
- that the signal is not voiced otherwise
(which amounts to characterizing the voicing in a binary manner). A sketch of such a flatness measure and of the threshold decision is given below.
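The sketch below illustrates one possible flatness measure and the binary decision (the dB formulation as a geometric-to-arithmetic mean ratio and the default threshold are assumptions, chosen to be consistent with the numerical examples given later in this text):

```python
import numpy as np

def spectral_flatness_db(x):
    """Flatness Pl of an amplitude spectrum x: 0 dB for a flat spectrum,
    increasingly negative for a spectrum with sharp peaks."""
    x = np.maximum(np.abs(x), 1e-12)
    geometric = np.exp(np.mean(np.log(x)))
    arithmetic = np.mean(x)
    return 10.0 * np.log10(geometric / arithmetic)

def is_voiced(pl_db, threshold_db=-3.0):
    """Binary voicing decision: a peaky (low-flatness) spectrum is voiced."""
    return pl_db < threshold_db
```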
Thus, in the single-bit embodiment as in its variant, the criteria for selecting the components and/or for choosing the duration of the signal segment in which the pitch is sought may be binary.
For example, for the selection of components (see the sketch after this list):
- if the signal is voiced, the spectral components whose amplitudes are greater than those of the first neighboring spectral components are selected, together with those first neighboring components, and
- otherwise, only the spectral components whose amplitudes are greater than those of the first neighboring spectral components are selected.
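A sketch of this binary selection rule (illustrative; edge bins are handled crudely):

```python
def select_components(amp, voiced):
    """Keep local maxima of the amplitude spectrum; when the signal is voiced,
    also keep their first neighbors so more energy stays in the sinusoids."""
    peaks = [k for k in range(1, len(amp) - 1)
             if amp[k] > amp[k - 1] and amp[k] > amp[k + 1]]
    if not voiced:
        return peaks
    selected = set()
    for k in peaks:
        selected.update((k - 1, k, k + 1))   # peak plus first neighbors
    return sorted(selected)
```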
For the choice of the pitch search segment duration, for example:
- if the signal is voiced, the period is sought in a valid signal segment of duration greater than 30 milliseconds (for example 33 milliseconds),
- and, otherwise, the period is sought in a valid signal segment of duration less than 30 milliseconds (for example 28 milliseconds).
Thus, the invention aims to improve the state of the art in the sense of document FR 1350845 by modifying various stages of the processing presented in that document (pitch search, selection of components, noise injection), as a function, in particular, of characteristics of the original signal.
These characteristics of the original signal may be coded as specific information in the data stream towards the decoder (the "bitstream"), depending on the speech/music classification and, where applicable, on the speech class in particular.
This information in the stream at decoding makes it possible to optimize the trade-off between complexity and quality and, jointly, to (a combined parameter-selection sketch is given after this list):
- modify the gain of the noise to be injected into the sum of the spectral components selected to construct the synthesis signal replacing the lost frame,
- modify the number of components selected for the synthesis,
- modify the duration of the pitch search segment.
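Putting the three adjustments together, a decoder might derive its concealment parameters from the voicing bit as in the sketch below. The numerical values follow the examples given in this text; grouping them in one structure is purely an illustrative choice.

```python
def concealment_params(voiced):
    """Jointly choose the noise gain, the peak-selection rule and the pitch
    search duration from the voicing information of the last valid frame."""
    if voiced:
        return {"noise_gain": 0.25, "keep_peak_neighbors": True, "search_ms": 33}
    return {"noise_gain": 1.0, "keep_peak_neighbors": False, "search_ms": 28}
```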
Such an embodiment can be implemented in an encoder, for the determination of the voicing information, and more particularly in a decoder, in particular in the case of frame loss. It can be implemented in software form within a coding/decoding implementation for Enhanced Voice Services ("EVS") specified by the 3GPP (SA4) group. As such, the present invention also relates to a computer program comprising instructions for implementing the above method when the program is executed by a processor. An example flowchart of such a program is presented in the detailed description below, with reference to FIG. 4 for the decoding and to FIG. 3 for the coding.
The present invention also relates to a device for decoding a digital audio signal comprising a succession of samples distributed in successive frames. The device comprises means (such as a processor and a memory, or an ASIC component or other circuit) for replacing at least one lost signal frame by:
a) searching, in a valid signal segment available at decoding, for at least one period in the signal, determined as a function of said valid signal;
b) analyzing the signal within said period, to determine spectral components of the signal in said period;
c) synthesizing at least one replacement frame for the lost frame, by constructing a synthesis signal from:
- an addition of components selected from among said determined spectral components, and
- a noise added to the addition of components,
the amount of noise added to the addition of components being weighted as a function of voicing information of the valid signal, obtained at decoding.
Similarly, the present invention also relates to a device for coding a digital audio signal, comprising means (such as a memory and a processor, or an ASIC component or other circuit) for providing a voicing information in a coded stream delivered by the coding device, by distinguishing a speech signal, which may be voiced, from a music signal and, in the case of a speech signal, by:
- identifying that the signal is voiced or generic, in order to consider it as globally voiced, or
- identifying that the signal is inactive, transient or unvoiced, in order to consider it as globally unvoiced.
Other features and advantages of the invention will become apparent on examining the detailed description below and the appended drawings, in which:
- FIG. 1 recalls the main steps of the frame loss correction method in the sense of document FR 1350845;
- FIG. 2 schematically illustrates the main steps of a method in the sense of the invention;
- FIG. 3 illustrates an example of steps implemented at coding, in one embodiment in the sense of the invention;
- FIG. 4 illustrates an example of steps implemented at decoding, in one embodiment in the sense of the invention;
- FIG. 5 illustrates an example of steps implemented at decoding for the pitch search in a valid signal segment Nc;
- FIG. 6 schematically illustrates an example of encoder and decoder devices in the sense of the invention.
Reference is made to FIG. 1, which illustrates the main steps described in document FR 1350845. In a first step S1, a succession of N audio samples, denoted b(n) below, is stored in a buffer memory of the decoder. These samples correspond to samples already decoded and are therefore available at the decoder for frame loss correction. If the first sample to be synthesized is sample N, the audio buffer corresponds to the preceding samples 0 to N-1. In the case of transform coding, the audio buffer corresponds to the samples of the previous frame; they cannot be modified, because this type of coding/decoding introduces no delay in the rendering of the signal, so that no cross-fade of sufficient duration to cover a frame loss can be performed.
Then a frequency filtering step S2 is performed, during which the audio buffer b(n) is separated into two bands, a low band BB and a high band BH, with a separation frequency denoted Fc (for example Fc = 4 kHz). This filtering is preferably a zero-delay filtering. The size of the audio buffer is then reduced to N' = N*Fc/fs following the decimation from fs to Fc. In variants of the invention, this filtering step may be optional, the following steps then being carried out over the full band. The next step S3 consists in searching the low band for a loopback point and a segment p(n) corresponding to the fundamental period (or "pitch" hereinafter) within the buffer b(n) resampled at the frequency Fc. This makes it possible to take the continuity of the pitch into account in the lost frame(s) to be reconstructed.
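By way of illustration, the S2 split can be sketched as follows; this is a minimal sketch assuming a numpy buffer b at rate fs, and the polyphase resampler is an illustrative choice, not the filter actually used in the patent:

```python
import numpy as np
from scipy.signal import resample_poly

def split_bands(b, fs=16000, fc=4000):
    # Low band: decimated from fs to Fc, so len(low) = N' = N * Fc / fs.
    low = resample_poly(b, up=fc, down=fs)
    # High band: complement obtained by re-upsampling the low band and
    # subtracting it from the original buffer (one zero-delay-style option).
    high = b - resample_poly(low, up=fs, down=fc)[: len(b)]
    return low, high
```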
Step S4 consists in decomposing the segment p(n) into a sum of sinusoidal components. For example, the discrete Fourier transform (DFT) of the signal p(n) can be computed over a duration corresponding to the length of the signal. The frequency, phase and amplitude of each of the sinusoidal components (or "peaks") making up the signal are thus obtained. Transforms other than the DFT are possible; for example, DCT, MDCT or MCLT transforms may be used.
Step S5 is a step of selecting K sinusoidal components so as to keep only the most significant ones. In a particular embodiment, the selection first consists in selecting the amplitudes A(n) for which A(n) > A(n-1) and A(n) > A(n+1), with n ∈ [0; P'/2 - 1], which ensures that the amplitudes correspond to spectral peaks.
To this end, the samples of the segment p(n) (pitch) are interpolated so as to obtain a segment p'(n) made up of P' samples, with P' = 2^ceil(log2(P)) ≥ P, where ceil(x) is the smallest integer greater than or equal to x. The FFT analysis is thus carried out more efficiently, over a length that is a power of 2, without modifying the effective pitch period (thanks to the interpolation). The FFT of p'(n) is computed: Π(k) = FFT(p'(n)), and the phases φ(k) and amplitudes A(k) of the sinusoidal components, with frequencies normalized between 0 and 1, are obtained directly from this transform.
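A minimal sketch of this analysis, assuming linear interpolation and numpy conventions (the patent does not mandate a particular interpolator, and the frequency normalization shown is an assumed convention):

```python
import numpy as np

def analyse_pitch_segment(p):
    P = len(p)
    P2 = 1 << int(np.ceil(np.log2(P)))     # P' = 2^ceil(log2(P)) >= P
    p2 = np.interp(np.linspace(0.0, P - 1, P2), np.arange(P), p)
    spec = np.fft.fft(p2)
    amp = np.abs(spec)[: P2 // 2]          # A(k), half spectrum
    phase = np.angle(spec)[: P2 // 2]      # phi(k)
    freq = 2.0 * np.arange(P2 // 2) / P2   # normalized: 1.0 at Nyquist (assumed)
    return amp, phase, freq
```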
Then, among the amplitudes of this first selection, the components are selected in decreasing order of amplitude, in such a way that the cumulative amplitude of the selected peaks is at least x% (for example x = 70) of the cumulative amplitude over, typically, half of the spectrum of the current frame. It is also possible, in addition, to limit the number of components (for example to 20) so as to make the synthesis less complex.
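The cumulative-amplitude selection can be sketched as follows; x = 70% and the cap of 20 components are the example values from the text, and the function name is illustrative:

```python
import numpy as np

def select_peaks(amp, x=0.70, max_peaks=20):
    # First selection: local maxima A(n) > A(n-1) and A(n) > A(n+1).
    n = np.arange(1, len(amp) - 1)
    candidates = n[(amp[n] > amp[n - 1]) & (amp[n] > amp[n + 1])]
    # Keep peaks in decreasing amplitude order until they carry at least
    # x of the cumulative amplitude of the (half-)spectrum, capped in number.
    order = candidates[np.argsort(amp[candidates])[::-1]]
    target, total = x * amp.sum(), 0.0
    kept = []
    for k in order[:max_peaks]:
        kept.append(int(k))
        total += amp[k]
        if total >= target:
            break
    return sorted(kept)
```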
The sinusoidal synthesis step S6 consists in generating a segment s(n) of length at least equal to the size of the lost frame (T). The synthesis signal s(n) is computed as a sum of the selected sinusoidal components:

s(n) = Σ_k A(k)·cos(ω(k)·n + φ(k))

where k runs over the indices of the K peaks selected in step S5 and ω(k) is the angular frequency corresponding to the normalized frequency of peak k. Step S7 consists in "injecting noise" (filling the spectral zones corresponding to the unselected lines) so as to compensate for the energy loss due to the omission of certain frequency peaks in the low band. A particular embodiment consists in computing the residual r(n) between the segment corresponding to the pitch, p(n), and the synthesized signal s(n):

r(n) = p(n) - s(n), n ∈ [0; P - 1]
This residual of size P is then transformed, for example windowed and repeated with overlaps between windows of variable sizes, as described in document FR 1353551, yielding a noise signal r'(n) defined over at least two frame durations.
The signal s(n) is then combined with the noise signal r'(n):

s'(n) = s(n) + r'(n)
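Steps S6 and S7 can be sketched as below; the cosine form built from the amplitude/phase/frequency triplets of step S4 is an assumed reading, `freq` follows the normalization of the analysis sketch above, and amplitude scaling for a real signal is left implicit:

```python
import numpy as np

def sinusoidal_synthesis(amp, phase, freq, kept, length):
    # s(n): sum of the selected sinusoids over at least one lost-frame length.
    # amp is assumed already scaled for a real sinusoid (e.g. 2/P' per bin).
    n = np.arange(length)
    s = np.zeros(length)
    for k in kept:
        s += amp[k] * np.cos(np.pi * freq[k] * n + phase[k])  # pi*f since 1.0 = Nyquist
    return s

def residual(p, s):
    # r(n) = p(n) - s(n) over the pitch segment, n in [0; P-1].
    return p - s[: len(p)]
```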
Step S8, applied to the high band, may simply consist in repeating the past signal.
In a step S9, the signal is synthesized by resampling the low band to its original sampling frequency, after it has been mixed in step S8 with the filtered high band (simply repeated in step S11). Step S10 is an overlap-add, which ensures continuity between the signal preceding the frame loss and the synthesized signal. The elements added to the method of FIG. 1 in an embodiment in the sense of the invention are now described. According to the general approach shown in FIG. 2, a voicing information of the signal before the frame loss, transmitted at at least one encoder bit rate, is used at decoding (step DI-1) to determine quantitatively a proportion of noise to be added to the synthesis signal replacing one or more lost frames. Thus the decoder uses the voicing information to reduce, as a function of the voicing, the overall amount of noise mixed with the synthesis signal (by assigning a smaller gain G(res) to the noise signal r'(k) derived from a residual in step DI-3, and/or by selecting more amplitude components A(k) for the construction of the synthesis signal in step DI-4).
The decoder can further adjust its parameters, notably for the pitch search, in order to optimize the quality/complexity trade-off of the processing as a function of the voicing information. For example, for the pitch search, if the signal is voiced, the pitch search window Nc can be made larger (in step DI-5), as will be seen below with reference to FIG. 5.
For the determination of the voicing, an information can be provided by the encoder, at at least one encoder bit rate, in two ways:
- in the form of a bit set to 1 or 0 according to a degree of voicing identified at the encoder (received from the encoder in step DI-1 and read in step DI-2 in case of frame loss, for the subsequent processing), or
- in the form of a value of the average amplitude of the peaks making up the signal at coding, compared with a background noise.
This spectral "flatness" datum Pl can be received over several bits at the decoder, in the optional step DI-10 of FIG. 2, and then compared with a threshold in step DI-11, which amounts to determining, in steps DI-1 and DI-2, whether the voicing is above or below a threshold, and to deducing the appropriate processing therefrom, in particular for the peak selection and for the choice of the duration of the pitch search segment.
This information (whether in the form of a single bit or of a multi-bit value) is received from the encoder (at at least one codec bit rate) in the example described here. Indeed, with reference to FIG. 3, at the encoder, the input signal, presented in the form of frames C1, is analyzed in step C2. The analysis step consists in determining whether the audio signal of the current frame has characteristics that would require a particular processing in case of frame loss at the decoder, as is for example the case for voiced speech signals.
In a particular embodiment, a classification (speech/music or other) already performed at the encoder is advantageously reused, so as not to increase the overall processing complexity. Indeed, in the case of encoders switching between speech and music coding modes, a classification at the encoder already makes it possible to adapt the coding technique to the nature of the signal (speech or music). Likewise, in the case of speech, predictive coders, such as the coder according to the G.718 standard, also use a classification so as to adapt the coder parameters to the nature of the signal (voiced/unvoiced, transient, generic, inactive sounds).
In a first particular embodiment, a single bit of "characterization for frame loss" is reserved. It is added to the coded stream (or "bitstream") in step C3 to indicate whether the signal is a speech signal (voiced or generic). This bit is for example set to 1 or 0 according to the cases of the table below, as a function:
- of the decision of the speech/music classifier,
- and, in addition, of the decision of the speech coding mode classifier.
Decision of the encoder classifier        Value of the characterization bit for frame loss
Music                                     0
Speech, by coding mode:
    Voiced                                1
    Unvoiced                              0
    Transient                             0
    Generic                               1
    Inactive                              0
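A minimal sketch of this mapping (the function name and its inputs are illustrative, not from the patent):

```python
def frame_loss_bit(is_music, speech_mode=None):
    # Music, or inactive/transient/unvoiced speech -> 0;
    # voiced or generic speech -> 1, per the table above.
    if is_music:
        return 0
    return 1 if speech_mode in ("voiced", "generic") else 0
```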
"Generic" is understood here to mean an ordinary speech signal (one that is not a transient linked to the pronunciation of a plosive, that is not inactive, and that is not necessarily purely voiced, such as the pronunciation of a vowel without a consonant). In a second, alternative embodiment, the information transmitted to the decoder in the coded stream is not binary but corresponds to a quantization of the ratio between the peak levels and the valley levels in the spectrum. This ratio can be expressed by a measure of "flatness" of the spectrum, denoted Pl.
The flatness measure is computed from x(k), the amplitude spectrum of size N resulting from the analysis of the current frame in the frequency domain (after FFT).
Alternatively, when a sinusoidal analysis decomposing the signal at the encoder into sinusoidal components and noise is available, the flatness measure is obtained as the ratio between the sinusoidal components and the overall energy over the frame.
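One common way to obtain such a flatness value, consistent with the landmarks used later in the text (about 0 dB for a flat spectrum, strongly negative for pronounced peaks), is the dB ratio of the geometric to the arithmetic mean of x(k). The sketch below assumes that expression; the patent's exact formula is not reproduced here:

```python
import numpy as np

def spectral_flatness_db(x, eps=1e-12):
    x = np.maximum(np.abs(x), eps)        # amplitude spectrum x(k), size N
    geometric = np.exp(np.mean(np.log(x)))
    arithmetic = np.mean(x)
    return 10.0 * np.log10(geometric / arithmetic)  # 0 dB flat, < 0 peaky
```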
Following step C3 (carrying the voicing information on a single bit, or the flatness measure over several bits), the audio buffer of the encoder is conventionally coded in a step C4 before possible subsequent transmission to the decoder.
Reference is now made to FIG. 4 to describe the steps implemented at the decoder in an exemplary embodiment of the invention.
If no frame is lost in step D1 (arrow KO at the output of test D1 in FIG. 4), the decoder reads the information contained in the coded stream, including the "characterization for frame loss" information, in step D2 (at at least one codec bit rate). The latter is stored in memory so that it can be reused in case a subsequent frame is missing. The decoder then continues with the conventional decoding steps D3, etc., so as to obtain the synthesized output frame FR SYNTH. When a frame loss occurs (arrow OK at the output of test D1), steps D4, D5, D6, D7, D8 and D12 are applied, corresponding respectively to steps S2, S3, S4, S5, S6 and S11 of FIG. 1. However, some modifications are made with respect to steps S3 and S5, namely in steps D5 (search for a loopback point for the determination of the pitch) and D7 (selection of the sinusoidal components). Moreover, the noise injection of step S7 of FIG. 1 is carried out with a gain determination over two steps, D9 and D10, in FIG. 4 of the decoder in the sense of the invention. Indeed, when the "characterization for frame loss" information is known (the previous frame having been received), the invention consists in modifying the processing of steps D5, D7 and D9-D10 as follows.
In a first exemplary embodiment, the "characterization for frame loss" information is binary, with a value:
- equal to 0 for an unvoiced signal, a music-type signal or a transient-type signal,
- equal to 1 otherwise (table above).
Step D5 consists in searching for a loopback point and a segment p(n) corresponding to the pitch within the audio buffer resampled at the frequency Fc. This technique, described in document FR 1350845, is illustrated in FIG. 5, in which:
- the audio buffer at the decoder has a size of N' samples,
- a target buffer BC of Ns samples is determined,
- the correlation search is carried out over Nc samples,
- the correlation curve "Correl" has a maximum at mc,
- the loopback point, denoted Pt Boucl, is located Ns samples after the correlation maximum,
- the pitch is then determined over the p(n) samples remaining up to N'-1.
In particular, a normalized correlation corr(n) is computed between the target buffer segment of size Ns lying between N'-Ns and N'-1 (with a duration of, for example, 6 ms) and the sliding segment of size Ns starting between sample 0 and sample Nc (with Nc > N'-Ns):

corr(n) = [ Σ_{k=0..Ns-1} b(n+k)·b(N'-Ns+k) ] / sqrt( [ Σ_{k=0..Ns-1} b(n+k)² ]·[ Σ_{k=0..Ns-1} b(N'-Ns+k)² ] ), n ∈ [0; Nc]
For music signals, given the nature of the signal, the value Nc does not need to be very large (for example Nc = 28 ms). This limitation saves computational complexity during the pitch search.
Conversely, the voicing information of the last validly received frame makes it possible to determine whether the signal to be reconstructed is a voiced speech signal (single pitch). In that case, thanks to this information, the size of the segment Nc can be increased (for example Nc = 33 ms) so as to optimize the pitch search (and potentially find a higher correlation value).
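A sketch of this search; the 28 ms / 33 ms values come from the text, the indexing follows FIG. 5 as described above, and the function name is illustrative:

```python
import numpy as np

def pitch_search(b, ns, fc=4000, voiced=False):
    nc = int((0.033 if voiced else 0.028) * fc)   # Nc enlarged for voiced speech
    target = b[-ns:]                              # target buffer: b[N'-Ns .. N'-1]
    e_t = np.dot(target, target)
    corr = np.empty(nc + 1)
    for n in range(nc + 1):                       # sliding segment starts in [0; Nc]
        seg = b[n:n + ns]
        corr[n] = np.dot(seg, target) / np.sqrt(np.dot(seg, seg) * e_t + 1e-12)
    mc = int(np.argmax(corr))                     # correlation maximum
    loop_point = mc + ns                          # "Pt Boucl", Ns samples further
    return loop_point, b[loop_point:]             # p(n): samples up to N'-1
```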
Moreover, in step D7 of FIG. 4, sinusoidal components are selected so as to keep only the most significant ones. In a particular embodiment, also presented in document FR 1350845, the first selection of components consists in selecting the amplitudes A(n) for which A(n) > A(n-1) and A(n) > A(n+1).
In the case of the invention, it is advantageously known whether the signal to be reconstructed is a speech signal (voiced or generic), hence with marked peaks and a low noise level. Under these conditions, it is preferable to select not only the peaks A(n) for which A(n) > A(n-1) and A(n) > A(n+1), as presented above, but also to widen the selection to A(n-1) and A(n+1), so that the selected peaks represent a large share of the total energy of the spectrum. This modification notably makes it possible to lower the noise level (and in particular the level of the noise injected in steps D9 and D10, presented below) relative to the level of the signal synthesized by sinusoidal synthesis in step D8, while keeping an overall energy level sufficient to avoid audible artifacts linked to energy fluctuations.
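The widening can be sketched as follows, starting from a peak list such as the one produced by the selection sketch above:

```python
def widen_selection(peaks, spectrum_len, voiced):
    # For voiced/generic speech, also keep the first neighbours A(n-1), A(n+1)
    # of each selected peak so that more of the spectrum's energy is retained.
    if not voiced:
        return list(peaks)
    widened = set()
    for n in peaks:
        widened.update(k for k in (n - 1, n, n + 1) if 0 <= k < spectrum_len)
    return sorted(widened)
```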
Furthermore, when the signal is free of noise (at least in the low frequencies), as is the case for a voiced or generic speech signal, it is observed that adding the noise corresponding to the transformed residual r'(n), in the sense of document FR 1350845, actually degrades the quality.
Thus, the voicing information is advantageously used here to attenuate the noise, by applying a gain G to it in step D10. The signal s(n) resulting from step D8 is mixed with the noise signal r'(n) resulting from step D9, this time applying a gain G which depends on the "characterization for frame loss" information read from the coded stream of the previous frame:

s'(n) = s(n) + G·r'(n)

In this particular embodiment, G can be a constant equal to 1 or to 0.25 according to the voiced or unvoiced nature of the signal of the previous frame; by way of example, G = 0.25 if the signal is voiced, and G = 1 otherwise.
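A sketch of steps D9/D10 under these example constants (s and r_prime are assumed to be arrays of the same length):

```python
def mix_noise(s, r_prime, voiced_bit):
    # s'(n) = s(n) + G * r'(n), with G driven by the characterization bit
    # of the last valid frame (0.25 if voiced, 1 otherwise, as above).
    g = 0.25 if voiced_bit == 1 else 1.0
    return s + g * r_prime
```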
In the alternative embodiment, the "characterization for frame loss" information has several discrete levels characterizing the flatness Pl of the spectrum, and the gain G can be expressed directly as a function of the value Pl. The same applies to the limit of the segment Nc for the pitch search and/or to the number of peaks A(n) to be taken into account for the synthesis of the signal.
By way of example, a processing can be defined as follows. The gain G is first defined directly as a function of the value Pl. The value Pl is then compared with an average value of -3 dB, it being understood that the value 0 corresponds to a flat spectrum and -5 dB corresponds to a spectrum with pronounced peaks.
If the value Pl is below the average threshold value of -3 dB (thus corresponding to a spectrum with pronounced peaks, typical of a voiced signal), the duration of the pitch search segment Nc can be set to 33 ms, and the peaks A(n) such that A(n) > A(n-1) and A(n) > A(n+1) are selected together with their first neighbors A(n-1) and A(n+1).
Otherwise (if the value Pl is above the threshold, which corresponds to less marked peaks and more background noise, as for example in a music signal), the duration Nc can be chosen shorter, for example 25 ms, and only the peaks A(n) such that A(n) > A(n-1) and A(n) > A(n+1) are selected.
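These choices can be grouped as in the sketch below; the threshold and durations come from the text, while the dictionary keys and the reuse of the 0.25/1 gains from the single-bit example are illustrative assumptions:

```python
def params_from_flatness(pl_db, threshold_db=-3.0):
    if pl_db < threshold_db:   # pronounced peaks: voiced-like signal
        return {"nc_ms": 33, "widen_peaks": True, "gain": 0.25}
    return {"nc_ms": 25, "widen_peaks": False, "gain": 1.0}
```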
The decoding can then continue by mixing the noise, whose gain has thus been obtained, with the components thus selected, so as to obtain the low-frequency synthesis signal in step D13; this is added to the high-frequency synthesis signal obtained in step D14, to obtain, in step D15, the overall synthesized signal.
With reference to FIG. 6, a possible implementation of the invention is illustrated, in which a decoder DECOD (comprising for example software and hardware such as a suitably programmed memory MEM and a processor PROC cooperating with this memory, or alternatively a component such as an ASIC or the like, together with a communication interface COM), installed for example in a telecommunication device such as a telephone TEL, uses, for the implementation of the method of FIG. 4, a voicing information that it receives from an encoder COD. This encoder comprises, for example, software and hardware such as a memory MEM' suitably programmed to determine the voicing information, and a processor PROC cooperating with this memory, or alternatively a component such as an ASIC or the like, together with a communication interface COM'. The encoder COD is installed in a telecommunication device such as a telephone TEL'.
Of course, the present invention is not limited to the embodiments described above by way of example; it extends to other variants. Thus, for example, it will be understood that the voicing information can take various forms. In the example described above, it can be a binary value on a single bit (voicing or not), or else a multi-bit value, which may relate to a parameter such as the flatness of the signal spectrum or to any other parameter making it possible to characterize a voicing (quantitatively or qualitatively). Furthermore, this parameter can be determined at decoding, for example as a function of the degree of correlation that can be measured when identifying the pitch period. Moreover, an embodiment has been presented above, by way of example, comprising a separation of the signal from previous valid frames into a high frequency band and a low frequency band, with in particular a selection of the spectral components in the low frequency band. Nevertheless, this implementation is optional, although advantageous in that it reduces the processing complexity. As a variant, the frame replacement method assisted by the voicing information in the sense of the invention can equally be carried out by considering the entire spectrum of the valid signal.
Furthermore, an exemplary embodiment has been described above in which the invention is implemented in the context of transform coding with overlap-add. Nevertheless, this type of method can be adapted to any other type of coding (CELP in particular).
It should be noted that, in the context of transform coding with overlap-add (in which the synthesis signal is typically constructed over at least two frame durations because of the overlap), the aforementioned noise signal can be obtained from the residual (between the valid signal and the sum of the peaks) by weighting this residual temporally. It can for example be weighted by overlap windows, as in the usual framework of transform coding/decoding with overlap.
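As an illustration, such a temporal weighting could look like the following; the sine overlap window is one conventional choice, the patent's actual windows being those of document FR 1353551:

```python
import numpy as np

def overlap_weight(r, frame_len):
    # Taper a residual-derived noise segment spanning two frame durations
    # with a symmetric overlap window, before the voicing gain is applied.
    length = 2 * frame_len
    w = np.sin(np.pi * (np.arange(length) + 0.5) / length)
    return r[:length] * w
```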
It will then be understood that the application of the gain as a function of the voicing information adds a further weighting, this time as a function of the voicing.

Claims

1. Method for processing a digital audio signal comprising a succession of samples distributed in successive frames, the method being implemented during a decoding of said signal in order to replace at least one signal frame lost at decoding, the method comprising the steps of:
a) searching, in a valid signal segment available at decoding (Nc), for at least one period in the signal, determined as a function of said valid signal,
b) analyzing the signal within said period, in order to determine spectral components of the signal within said period,
c) synthesizing at least one frame replacing the lost frame, by constructing a synthesis signal from:
- an addition of components selected from among said determined spectral components, and
- noise added to the addition of components,
wherein the amount of noise added to the addition of components is weighted as a function of a voicing information of the valid signal, obtained at decoding.
2. Method according to claim 1, characterized in that a noise signal added to the addition of components is weighted by a smaller gain in case of voicing of the valid signal.
3. Method according to claim 2, characterized in that the noise signal is obtained from a residual between the valid signal and the addition of the selected components.
4. Method according to one of the preceding claims, characterized in that the number of components selected for the addition is larger in case of voicing of the valid signal.
5. Method according to one of the preceding claims, characterized in that, in step a), the period is searched for in a valid signal segment (Nc) of longer duration in case of voicing of the valid signal.
6. Method according to one of the preceding claims, characterized in that the voicing information is provided in a coded stream, received at decoding, corresponding to said signal comprising a succession of samples distributed in successive frames,
and in that, in case of frame loss at decoding, use is made of the voicing information contained in a valid signal frame preceding the lost frame.
7. Method according to claim 6, characterized in that the voicing information comes from an encoder delivering the coded stream and determining the voicing information, and in that the voicing information is coded on a single bit in the coded stream.
8. Method according to claim 7, taken in combination with claim 2, characterized in that, if the signal is voiced, the value of the gain is 0.25, and it is 1 otherwise.
9. Method according to claim 6, characterized in that the voicing information comes from an encoder determining a spectral flatness value (Pl), obtained by comparing the amplitudes of the spectral components of the signal with a background noise, the encoder delivering said value in binary form in the coded stream.
10. Method according to claim 9, taken in combination with claim 2, characterized in that the value of the gain is a function of said flatness value.
11. Method according to one of claims 9 and 10, characterized in that said flatness value is compared with a threshold in order to determine:
- that the signal is voiced if the flatness value is below the threshold, and
- that the signal is not voiced otherwise.
12. Method according to one of claims 7 and 11, taken in combination with claim 4, characterized in that:
- if the signal is voiced, the spectral components whose amplitudes are greater than those of the first neighboring spectral components are selected, together with the first neighboring spectral components, and
- otherwise, only the spectral components whose amplitudes are greater than those of the first neighboring spectral components are selected.
13. Method according to one of claims 7 and 11, taken in combination with claim 5, characterized in that:
- if the signal is voiced, the period is searched for in a valid signal segment of duration greater than 30 milliseconds,
- and, otherwise, the period is searched for in a valid signal segment of duration less than 30 milliseconds.
14. Computer program, characterized in that it comprises instructions for implementing the method according to one of claims 1 to 13 when this program is executed by a processor.
15. Device for decoding a digital audio signal comprising a succession of samples distributed in successive frames, the device comprising means (MEM, PROC) for replacing at least one lost signal frame by:
a) searching, in a valid signal segment available at decoding (Nc), for at least one period in the signal, determined as a function of said valid signal,
b) analyzing the signal within said period, in order to determine spectral components of the signal within said period,
c) synthesizing at least one frame replacing the lost frame, by constructing a synthesis signal from:
- an addition of components selected from among said determined spectral components, and
- noise added to the addition of components,
the amount of noise added to the addition of components being weighted as a function of a voicing information of the valid signal, obtained at decoding.
16. Device for coding a digital audio signal, comprising means (MEM', PROC) for providing a voicing information in a coded stream delivered by the coding device, by distinguishing a speech signal, which may be voiced, from a music signal and, in the case of a speech signal, by:
- identifying that the signal is voiced or generic, in order to consider it as globally voiced, or
- identifying that the signal is inactive, transient or unvoiced, in order to consider it as globally unvoiced.
EP15725801.3A 2014-04-30 2015-04-24 Improved frame loss correction with voice information Active EP3138095B1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FR1453912A FR3020732A1 (en) 2014-04-30 2014-04-30 PERFECTED FRAME LOSS CORRECTION WITH VOICE INFORMATION
PCT/FR2015/051127 WO2015166175A1 (en) 2014-04-30 2015-04-24 Improved frame loss correction with voice information

Publications (2)

Publication Number Publication Date
EP3138095A1 2017-03-08
EP3138095B1 EP3138095B1 (en) 2019-06-05

Family

ID=50976942

Family Applications (1)

Application Number Title Priority Date Filing Date
EP15725801.3A Active EP3138095B1 (en) 2014-04-30 2015-04-24 Improved frame loss correction with voice information

Country Status (12)

Country Link
US (1) US10431226B2 (en)
EP (1) EP3138095B1 (en)
JP (1) JP6584431B2 (en)
KR (3) KR20170003596A (en)
CN (1) CN106463140B (en)
BR (1) BR112016024358B1 (en)
ES (1) ES2743197T3 (en)
FR (1) FR3020732A1 (en)
MX (1) MX368973B (en)
RU (1) RU2682851C2 (en)
WO (1) WO2015166175A1 (en)
ZA (1) ZA201606984B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3020732A1 (en) * 2014-04-30 2015-11-06 Orange IMPROVED FRAME LOSS CORRECTION WITH VOICE INFORMATION
EP3389043A4 (en) * 2015-12-07 2019-05-15 Yamaha Corporation Speech interacting device and speech interacting method
CA3145047A1 (en) * 2019-07-08 2021-01-14 Voiceage Corporation Method and system for coding metadata in audio streams and for efficient bitrate allocation to audio streams coding
CN111883171B (en) * 2020-04-08 2023-09-22 珠海市杰理科技股份有限公司 Audio signal processing method and system, audio processing chip and Bluetooth device

Family Cites Families (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR1350845A (en) 1962-12-20 1964-01-31 Classification process visible without index
FR1353551A (en) 1963-01-14 1964-02-28 Window intended in particular to be mounted on trailers, caravans or similar installations
US5504833A (en) * 1991-08-22 1996-04-02 George; E. Bryan Speech approximation using successive sinusoidal overlap-add models and pitch-scale modifications
US5956674A (en) * 1995-12-01 1999-09-21 Digital Theater Systems, Inc. Multi-channel predictive subband audio coder using psychoacoustic adaptive bit allocation in frequency, time and over the multiple channels
US5799271A (en) * 1996-06-24 1998-08-25 Electronics And Telecommunications Research Institute Method for reducing pitch search time for vocoder
JP3364827B2 (en) * 1996-10-18 2003-01-08 三菱電機株式会社 Audio encoding method, audio decoding method, audio encoding / decoding method, and devices therefor
WO1999010719A1 (en) * 1997-08-29 1999-03-04 The Regents Of The University Of California Method and apparatus for hybrid coding of speech at 4kbps
EP0932141B1 (en) * 1998-01-22 2005-08-24 Deutsche Telekom AG Method for signal controlled switching between different audio coding schemes
US6640209B1 (en) * 1999-02-26 2003-10-28 Qualcomm Incorporated Closed-loop multimode mixed-domain linear prediction (MDLP) speech coder
US6138089A (en) * 1999-03-10 2000-10-24 Infolio, Inc. Apparatus system and method for speech compression and decompression
US6691092B1 (en) * 1999-04-05 2004-02-10 Hughes Electronics Corporation Voicing measure as an estimate of signal periodicity for a frequency domain interpolative speech codec system
US6912496B1 (en) * 1999-10-26 2005-06-28 Silicon Automation Systems Preprocessing modules for quality enhancement of MBE coders and decoders for signals having transmission path characteristics
US7016833B2 (en) * 2000-11-21 2006-03-21 The Regents Of The University Of California Speaker verification system using acoustic data and non-acoustic data
US20030028386A1 (en) * 2001-04-02 2003-02-06 Zinser Richard L. Compressed domain universal transcoder
JP4089347B2 (en) * 2002-08-21 2008-05-28 沖電気工業株式会社 Speech decoder
US7970606B2 (en) * 2002-11-13 2011-06-28 Digital Voice Systems, Inc. Interoperable vocoder
DE10254612A1 (en) * 2002-11-22 2004-06-17 Humboldt-Universität Zu Berlin Method for determining specifically relevant acoustic characteristics of sound signals for the analysis of unknown sound signals from a sound source
JP2006508386A (en) * 2002-11-27 2006-03-09 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Separating sound frame into sine wave component and residual noise
JP3963850B2 (en) * 2003-03-11 2007-08-22 富士通株式会社 Voice segment detection device
US7318035B2 (en) * 2003-05-08 2008-01-08 Dolby Laboratories Licensing Corporation Audio coding systems and methods using spectral component coupling and spectral component regeneration
US7825321B2 (en) * 2005-01-27 2010-11-02 Synchro Arts Limited Methods and apparatus for use in sound modification comparing time alignment data from sampled audio signals
US7930176B2 (en) * 2005-05-20 2011-04-19 Broadcom Corporation Packet loss concealment for block-independent speech codecs
KR100744352B1 (en) * 2005-08-01 2007-07-30 삼성전자주식회사 Method of voiced/unvoiced classification based on harmonic to residual ratio analysis and the apparatus thereof
US7720677B2 (en) * 2005-11-03 2010-05-18 Coding Technologies Ab Time warped modified transform coding of audio signals
US8255207B2 (en) * 2005-12-28 2012-08-28 Voiceage Corporation Method and device for efficient frame erasure concealment in speech codecs
US8135047B2 (en) * 2006-07-31 2012-03-13 Qualcomm Incorporated Systems and methods for including an identifier with a packet associated with a speech signal
BRPI0711094A2 (en) * 2006-11-24 2011-08-23 Lg Eletronics Inc Method and apparatus for encoding and decoding an object-based audio signal
KR100964402B1 (en) * 2006-12-14 2010-06-17 삼성전자주식회사 Method and apparatus for determining encoding mode of audio signal, and method and apparatus for encoding/decoding audio signal using it
US8060363B2 (en) * 2007-02-13 2011-11-15 Nokia Corporation Audio signal encoding
US8990073B2 (en) * 2007-06-22 2015-03-24 Voiceage Corporation Method and device for sound activity detection and sound signal classification
CN100524462C (en) * 2007-09-15 2009-08-05 华为技术有限公司 Method and apparatus for concealing frame error of a high-band signal
US20090180531 (en) * 2008-01-07 2009-07-16 Radlive Ltd. Codec with PLC capabilities
US8036891B2 (en) * 2008-06-26 2011-10-11 California State University, Fresno Methods of identification using voice sound analysis
PL2304723T3 (en) * 2008-07-11 2013-03-29 Fraunhofer Ges Forschung An apparatus and a method for decoding an encoded audio signal
US8718804B2 (en) * 2009-05-05 2014-05-06 Huawei Technologies Co., Ltd. System and method for correcting for lost data in a digital audio signal
FR2966634A1 (en) * 2010-10-22 2012-04-27 France Telecom ENHANCED STEREO PARAMETRIC ENCODING / DECODING FOR PHASE OPPOSITION CHANNELS
WO2014036263A1 (en) * 2012-08-29 2014-03-06 Brown University An accurate analysis tool and method for the quantitative acoustic assessment of infant cry
US8744854B1 (en) * 2012-09-24 2014-06-03 Chengjun Julian Chen System and method for voice transformation
FR3001593A1 (en) * 2013-01-31 2014-08-01 France Telecom IMPROVED FRAME LOSS CORRECTION AT SIGNAL DECODING.
US9564141B2 (en) * 2014-02-13 2017-02-07 Qualcomm Incorporated Harmonic bandwidth extension of audio signals
FR3020732A1 (en) * 2014-04-30 2015-11-06 Orange IMPROVED FRAME LOSS CORRECTION WITH VOICE INFORMATION
US9697843B2 (en) * 2014-04-30 2017-07-04 Qualcomm Incorporated High band excitation signal generation

Also Published As

Publication number Publication date
BR112016024358A2 (en) 2017-08-15
FR3020732A1 (en) 2015-11-06
ES2743197T3 (en) 2020-02-18
RU2016146916A3 (en) 2018-10-26
JP6584431B2 (en) 2019-10-02
RU2016146916A (en) 2018-05-31
KR20220045260A (en) 2022-04-12
KR20230129581A (en) 2023-09-08
US10431226B2 (en) 2019-10-01
BR112016024358B1 (en) 2022-09-27
ZA201606984B (en) 2018-08-30
MX368973B (en) 2019-10-23
US20170040021A1 (en) 2017-02-09
WO2015166175A1 (en) 2015-11-05
CN106463140B (en) 2019-07-26
MX2016014237A (en) 2017-06-06
JP2017515155A (en) 2017-06-08
EP3138095B1 (en) 2019-06-05
CN106463140A (en) 2017-02-22
KR20170003596A (en) 2017-01-09
RU2682851C2 (en) 2019-03-21

Similar Documents

Publication Publication Date Title
EP1316087B1 (en) Transmission error concealment in an audio signal
EP2080195B1 (en) Synthesis of lost blocks of a digital audio signal
EP2951813B1 (en) Improved correction of frame loss when decoding a signal
EP2277172B1 (en) Concealment of transmission error in a digital signal in a hierarchical decoding structure
CA2909401C (en) Frame loss correction by weighted noise injection
EP2727107B1 (en) Delay-optimized overlap transform, coding/decoding weighting windows
EP3175444B1 (en) Frame loss management in an fd/lpd transition context
EP2080194B1 (en) Attenuation of overvoicing, in particular for generating an excitation at a decoder, in the absence of information
EP3138095B1 (en) Improved frame loss correction with voice information
EP3175443B1 (en) Determining a budget for lpd/fd transition frame encoding
EP2795618B1 (en) Method of detecting a predetermined frequency band in an audio data signal, detection device and computer program corresponding thereto
EP2347411B1 (en) Pre-echo attenuation in a digital audio signal
WO2009047461A1 (en) Transmission error concealment in a digital signal with complexity distribution

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20161007

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

RIN1 Information on inventor provided before grant (corrected)

Inventor name: RAGOT, STEPHANE

Inventor name: FAURE, JULIEN

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20171103

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: GRANT OF PATENT IS INTENDED

RIC1 Information provided on ipc code assigned before grant

Ipc: G10L 19/005 20130101AFI20181203BHEP

Ipc: G10L 19/20 20130101ALN20181203BHEP

Ipc: G10L 25/93 20130101ALN20181203BHEP

INTG Intention to grant announced

Effective date: 20190107

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE PATENT HAS BEEN GRANTED

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

Free format text: NOT ENGLISH

REG Reference to a national code

Ref country code: CH

Ref legal event code: EP

REG Reference to a national code

Ref country code: AT

Ref legal event code: REF

Ref document number: 1140799

Country of ref document: AT

Kind code of ref document: T

Effective date: 20190615

REG Reference to a national code

Ref country code: DE

Ref legal event code: R096

Ref document number: 602015031383

Country of ref document: DE

REG Reference to a national code

Ref country code: IE

Ref legal event code: FG4D

Free format text: LANGUAGE OF EP DOCUMENT: FRENCH

REG Reference to a national code

Ref country code: NL

Ref legal event code: FP

REG Reference to a national code

Ref country code: LT

Ref legal event code: MG4D

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: NO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190905

Ref country code: LT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

Ref country code: SE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

Ref country code: HR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

Ref country code: AL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

Ref country code: FI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LV

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

Ref country code: GR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190906

Ref country code: BG

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190905

Ref country code: RS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

REG Reference to a national code

Ref country code: AT

Ref legal event code: MK05

Ref document number: 1140799

Country of ref document: AT

Kind code of ref document: T

Effective date: 20190605

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

Ref country code: RO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

Ref country code: CZ

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

Ref country code: PT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191007

Ref country code: EE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

Ref country code: AT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

REG Reference to a national code

Ref country code: ES

Ref legal event code: FG2A

Ref document number: 2743197

Country of ref document: ES

Kind code of ref document: T3

Effective date: 20200218

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: IS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20191005

Ref country code: SM

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

REG Reference to a national code

Ref country code: DE

Ref legal event code: R097

Ref document number: 602015031383

Country of ref document: DE

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: TR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: DK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

Ref country code: PL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

26N No opposition filed

Effective date: 20200306

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MC

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

REG Reference to a national code

Ref country code: CH

Ref legal event code: PL

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LI

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200430

Ref country code: LU

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200424

Ref country code: CH

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200430

REG Reference to a national code

Ref country code: BE

Ref legal event code: MM

Effective date: 20200430

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: BE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200430

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: IE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20200424

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

Ref country code: CY

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20190605

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: FR

Payment date: 20230321

Year of fee payment: 9

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: IT

Payment date: 20230322

Year of fee payment: 9

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: ES

Payment date: 20230502

Year of fee payment: 9

Ref country code: DE

Payment date: 20230321

Year of fee payment: 9

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: NL

Payment date: 20240320

Year of fee payment: 10

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: GB

Payment date: 20240320

Year of fee payment: 10