IL147263A - Method for decoding and retrieving a sound signal in an asynchronous transmission system - Google Patents

Method for decoding and retrieving a sound signal in an asynchronous transmission system

Info

Publication number
IL147263A
Authority
IL
Israel
Prior art keywords
frame
frames
samples
threshold
decoding
Prior art date
Application number
IL147263A
Original Assignee
France Telecom
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by France Telecom filed Critical France Telecom
Publication of IL147263A publication Critical patent/IL147263A/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04QSELECTING
    • H04Q11/00Selecting arrangements for multiplex systems
    • H04Q11/04Selecting arrangements for multiplex systems for time-division multiplexing
    • H04Q11/0428Integrated services digital network, i.e. systems for transmission of different types of digitised signals, e.g. speech, data, telecentral, television signals
    • H04Q11/0478Provisions for broadband connections
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00Data switching networks
    • H04L12/64Hybrid switching systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00Data switching networks
    • H04L12/54Store-and-forward switching systems 
    • H04L12/56Packet switching systems
    • H04L12/5601Transfer mode dependent, e.g. ATM
    • H04L2012/5638Services, e.g. multimedia, GOS, QOS
    • H04L2012/5671Support of voice
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00Data switching networks
    • H04L12/54Store-and-forward switching systems 
    • H04L12/56Packet switching systems
    • H04L12/5601Transfer mode dependent, e.g. ATM
    • H04L2012/5678Traffic aspects, e.g. arbitration, load balancing, smoothing, buffer management
    • H04L2012/5681Buffer or queue management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)
  • Communication Control (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Time-Division Multiplex Systems (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Reduction Or Emphasis Of Bandwidth Of Signals (AREA)

Abstract

The invention concerns a method for decoding and retrieving a sound signal in an asynchronous transmission system, which consists in detecting a filling overload of a buffer memory and/or of a second memory at the input or the output of the decoding unit, and in comparing the filling level with at least one threshold. The invention is characterised in that, depending on the value of the filling level, it consists in using voice activity detection and eliminating the frames considered non-active by said detection, and in carrying out concatenation processing on two successive frames.

Description

Method for decoding and retrieving a sound signal in an asynchronous transmission system
France Telecom
C. 136885

A METHOD OF MANAGING THE DECODING AND PLAYBACK OF A SOUND SIGNAL IN AN ASYNCHRONOUS TRANSMISSION SYSTEM

The present invention relates to a method of managing asynchrony in audio transmission.
GENERAL DESCRIPTION OF THE FIELD OF THE INVENTION

In general, the invention relates to transmission systems using low data rate speech encoders, in which the signals do not carry the reference clock of the source encoding system (the sampling frequency of the encoder).
This applies, for example, to Internet protocol (IP) type transmissions or indeed to discontinuous transmissions, etc.
A general aim of the invention is to resolve the problems encountered by such systems in producing a continuous stream of decoded speech or sound.
Traditionally, telephone communications and sound channel networks have used analog transmission systems with frequency division multiplexing (primary groups, amplitude and frequency modulation). Under such conditions, the speech signal (or music signal; the term "speech" is used below throughout this document in generic manner) is converted into an electrical signal by a microphone, and it is this analog signal which is filtered and modulated in order to be presented to a receiver which amplifies it prior to presenting it to a playback system (earphone, loudspeaker, etc.).
Over the last few years, digital transmission and switching techniques have been progressively replacing analog techniques. In pulse code modulation (PCM) systems, the speech signal is sampled and converted into a digital signal using an analog-to-digital converter (ADC) driven at a fixed sampling frequency derived from a master clock delivered by the network and also known to the receiver system. This applies to analog and digital subscriber connection units in telecommunications networks. The digital signal received by the destination (in the broad sense) is converted back into analog so that it can be heard, by means of a digital-to-analog converter (DAC) driven by a clock at the same frequency as that used by the ADC of the source. Under such conditions, the system is entirely synchronous, as generally applies to present-day switching and transmission systems. These can include data rate reduction systems (for example for a telephone signal, for converting from 64 kilobits per second (kbit/s) to 32 kbit/s or 16 kbit/s or 8 kbit/s). It is the network (or terminal systems, as in the case of the integrated services digital network (ISDN) for example) which undertakes the operations of ADC, of encoding and decoding (where encoding and decoding are used in the context of reducing data rate), and of DAC. The clocks are always distributed, and the system comprising ADC, speech encoding, transmission and switching, speech decoding, and finally DAC is fully isochronous. There are no losses or repeats of speech samples in the decoder.
The above-described synchronous transmission systems require the presence of a reference clock throughout the network. Transmission systems are now making greater and greater use (initially for data) of asynchronous and packet techniques (IP protocol, asynchronous transmission mode (ATM)). In numerous new situations, the decoder has no reference concerning the sampling frequency used by the encoder and it must be capable, using its own means, of reconstituting a decoding clock which attempts to track the reference of the encoder. The present invention is thus particularly advantageous in frame relay telephone systems, in ATM telephony, or in IP telephony. The technique described can easily be used in other fields of speech or sound transmission in which there exists no effective transmission of the clock reference from the encoder to the decoder.
DESCRIPTION OF THE STATE OF THE ART

The general problem

The general problem posed by transmission systems to which the invention applies is that of mitigating the fact that the speech or sound decoder has no clock reference associated with the source encoding.
In this respect, two circumstances can be distinguished: those corresponding to "weak" asynchrony and those corresponding to "strong" asynchrony.
Weak asynchrony

As an illustration, we consider the case of a transmission system comprising the following, as shown diagrammatically in Figure 1:
- an encoding source 1 comprising an analog-to-digital converter driven by a reference clock at frequency FADC equal to 8 kilohertz (kHz) (to provide numerical values in the worked examples below) and a speech encoder (of greater or lesser complexity and reducing the data rate for transmission to a greater or lesser extent);
- an asynchronous transmission system (represented by link 2) which conveys the information produced by the encoding source using its own transmission clock and its own protocols (for example the speech encoder could produce a data rate of 8 kbit/s and the transmission system could be constituted by an RS.232 type asynchronous link operating at 9600 bit/s); and
- a reception and decoding system 3 receiving the information conveyed over the asynchronous link (whose data rate must necessarily be a little greater than the raw encoding data rate, e.g. 9600 bit/s instead of 8000 bit/s), having the function of producing the signal after decoding (decompression) and applying the signal it produces to a digital-to-analog converter connected to a transducer such as a loudspeaker, a telephone handset, a headset, or a sound card installed in a personal computer (PC).
It will be understood that since the reception and decoding system 3 has no clock reference, it must implement a strategy in order to mitigate this asynchrony between the encoder and the decoder.
Whatever the encoding technique used or the type of transmission, which does not directly convey a clock, or time markers within the transmitted frame, or indications concerning transmission instants, the above-mentioned problem can be reduced (ignoring the speech encoder, the asynchronous transmission system, and the speech decoder) to a system comprising the following, as shown in Figure 2:
- an analog-to-digital converter 4 for converting speech signals or sound from analog to digital form at a sampling frequency set by a local oscillator;
- a digital-to-analog converter 5 for playing back the sound via a transducer suitable for the field of use in question and operating at a sampling frequency given by a local oscillator which, a priori, is at the same frequency but which is never at exactly the same frequency, for reasons of acceptable manufacturing cost (highly stable and very accurate frequency sources do indeed exist, but they need to be temperature-compensated and they are unacceptably expensive for mass-produced industrial implementation); and
- a digital register 6 into which the analog-to-digital converter 4 writes at its own sampling frequency (FADC), said register being read at the sampling frequency (FDAC) of the playback system by the digital-to-analog converter (DAC).

It will be understood that since the two clock frequencies (FADC and FDAC) are different, it is necessary from time to time for the DAC to reread the same information twice over (if FDAC is greater than FADC) or on the contrary (where FDAC is less than FADC) to allow the ADC to overwrite information before the DAC can read it.
The oscillators that are commonly available in the trade are characterized by the accuracy with which they operate (within a certain temperature range).
Oscillators that are accurate to within 50 parts per million (ppm) are quite commonly available and are used to provide numerical values for the worked examples below showing how frequently samples are lost or repeated when the sampling frequency is 8 kHz (the reader can easily determine that at higher sampling frequencies samples are skipped or repeated at a frequency which is pro rata the sampling frequency: the higher the sampling frequency, the higher the frequency at which samples are skipped or repeated).
Under the least favorable conditions, an ADC operating at 8000 × (1 + 50e-6) Hz is associated with a DAC operating at 8000 × (1 − 50e-6) Hz. In this particular example, the skip period (the period at which samples are omitted in the DAC, since FDAC is less than FADC) is easily calculated by counting the number of periods of the DAC (where the period is longer than that of the ADC) for which the accumulated difference between the two periods equals one period of the DAC.
Writing the period of the DAC as PDAC (in this case 1/(8000 × (1 − 50e-6))) and the period of the ADC as PADC (in this case 1/(8000 × (1 + 50e-6))), we obtain N × (PDAC − PADC) = PDAC, where N represents the number of individual operations that stem from the period difference. Writing 50e-6 = ε and applying the simplifications that are common for small quantities, we obtain N = 1/(2ε). In this example, that immediately gives the skip period as being close to 1.25 seconds (s). If the accuracy of the local oscillators is improved (e.g. by going from 50e-6 to 5e-6), then the skip period will increase (in this case there will be one skip every 12.5 s).
In a complete transmission system including audio encoders operating on signal frames, this phenomenon of "slip" between two clocks will give rise to an absence of speech frames (no frame to be decoded in the time available for decoding) or to an overabundance of frames (i.e. two frames for decoding instead of one in the available time). Taking the example of a speech encoder operating on 30 millisecond (ms) frames at 8 kHz, i.e. 240 samples, in each 30 ms time slot the receiver, and more particularly the decoder, expects to receive one frame for decoding in order to ensure that playback of the speech signal remains continuous. Unfortunately, if FADC is less than FDAC, then on the above assumptions, there will be an absence of any frame of samples for decoding by the sound playback system once every 240 × 1.25 = 300 s, and in the converse situation there will be two frames instead of one (i.e. a frame to be "eliminated") at the decoder once every 300 s. Under such circumstances, the awkward phenomenon of samples being skipped or repeated becomes very disagreeable, since an entire block of the signal is skipped or repeated, and this needs to be managed appropriately.
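The slip arithmetic above can be sketched as follows. This is a purely illustrative helper, not part of the patented method; the function name and its parameters are assumptions.

```python
# Worst-case clock slip between an ADC and a DAC whose oscillators are
# each accurate to eps (e.g. 50 ppm): one sample is skipped or repeated
# every N = 1/(2*eps) sample periods.

def slip_periods(f_nominal_hz, eps, frame_samples=None):
    """Return (sample_slip_s, frame_slip_s) for worst-case clock offsets."""
    n = 1.0 / (2.0 * eps)              # sample periods between two slips
    sample_slip_s = n / f_nominal_hz   # ~1.25 s at 8 kHz with eps = 50e-6
    frame_slip_s = None
    if frame_samples is not None:
        # a whole frame is gained/lost once every frame_samples slips
        frame_slip_s = frame_samples * sample_slip_s
    return sample_slip_s, frame_slip_s

s, f = slip_periods(8000, 50e-6, frame_samples=240)
# s ~ 1.25 s between skipped/repeated samples,
# f ~ 300 s between missing/surplus 30 ms frames
```

With the improved 5 ppm oscillators mentioned above, the same helper gives a slip roughly every 12.5 s, matching the worked example.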
Strong asynchrony

Certain types of transmission amplify this problem of asynchrony due to the phenomenon of "slip" between clocks as explained above. This is what we refer to herein as "strong" asynchrony.
When transmission is imperfect, giving rise to samples or frames of samples being lost and also when transmission generates jitter on sample arrival times, where such jitter is associated neither with the sending clock nor with the receiving clock, but is associated with other mechanisms in the transmission system having their own clocks, then the receiver system can be confronted with an absence of several frames, or with an overabundance of several frames. This can apply, for example, with IP type networks which suffer from the phenomenon of packets being lost and from the phenomenon of jitter introduced during packet routing. These phenomena disturb the continuity of the sound playback of the audio signal very strongly. When packets are lost or when jitter delays one or more packets, the playback system finds itself without any sample (or frame of samples) to apply to the DAC for the purpose of ensuring continuity in audio playback. Conversely, when jitter is strong, the playback system can find itself with far too many frames or samples to be sent simultaneously to the DAC. When jitter is strong, sound signal frame transmission can take place in the form of bursts, thus creating phenomena both of gaps and of overabundance amongst sample frames.
It will be observed that using speech encoders operating with a system of the voice activity detector/discontinuous transmission/comfort noise generation (VAD/DTX/CNG) type, a mechanism is also introduced that is similar to the loss of a packet since in the event of silence, the sender will cease to send frames of samples. Ceasing to send samples can be perceived at the receiver as being the same as the loss of a packet or as circumstances in which the ADC clock is faster than the DAC clock, which leads to holes in the signal at the receiver, as shown above.
"Strong" asynchrony thus differs from "weak" asynchrony by involving not only cyclical skips and/or repetitions, but also holes in the signal and/or overabundance of the signal in multiple and non-cyclic manner.
Description of various existing methods

Two main methods are presently known for mitigating the drawbacks due to the fact that the speech or sound decoder has no clock reference.
The first consists merely in proceeding as described above in the paragraph describing "weak" asynchrony, i.e. by skipping or repeating samples. The decoding system produces samples at a rate that is more or less equal to that of the encoder and presents them to the digital-to-analog converter at said rate (means for implementing the above reconstruction system are known to the person skilled in the art). In some cases, for example when "strong" asynchrony applies with transmission being in the form of frames, it is preferable, when samples for playing back are missing, to send null sample frames to the DAC rather than repeating a preceding frame.
Furthermore, in the converse situation, when surplus samples are present, they are not eliminated directly, but a first-in-first-out (FIFO) register of some size can be used to absorb jitter to some extent. If the FIFO register becomes too full, then that triggers partial or complete emptying of the FIFO, thereby giving rise to new skips in sound playback.
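This first method's FIFO-based jitter absorption can be sketched as follows. This is a hypothetical, simplified implementation; the class name, buffer sizes, and flush policy are assumptions, and a real system would operate on audio sample buffers rather than Python lists.

```python
from collections import deque

class JitterFifo:
    """Bounded FIFO that absorbs jitter; overfilling triggers a partial
    flush (an audible skip), and underrun yields a null frame rather
    than a repeat of the preceding frame."""

    def __init__(self, max_frames=8, flush_to=4, frame_len=240):
        self.buf = deque()
        self.max_frames = max_frames   # overflow threshold
        self.flush_to = flush_to       # level after a partial flush
        self.frame_len = frame_len     # samples per frame

    def push(self, frame):
        self.buf.append(frame)
        if len(self.buf) > self.max_frames:
            # overflow: drop the oldest frames down to flush_to
            while len(self.buf) > self.flush_to:
                self.buf.popleft()

    def pop(self):
        if self.buf:
            return self.buf.popleft()
        # underrun: null frame for the DAC, not a repeated frame
        return [0] * self.frame_len
```

The partial flush is what the text describes as "partial or complete emptying of the FIFO, thereby giving rise to new skips in sound playback"; keeping the FIFO small bounds the added transmission delay at the cost of flushing more often.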
The second method, which is more complicated and provides better performance, requires a loop to be implemented to recover a hardware clock which is servo-controlled by the filling level of a buffer memory for the signal to be decoded (or to be transmitted as in the ATM adaptation layer number 1 (AAL1) for example) . That method of servo-control attempts to use the clock recovery loop to recover the sampling frequency of the source. The filling level of the receive buffer produces a control signal for servo-controlling a digital or analog phase-locked loop (PLL) .
The first method is extremely simple to implement but suffers from a major defect associated with the quality of the sound reproduced. A skip or elimination once every 1.25 seconds can be very disagreeable to listen to, and this can occur with "weak" asynchrony associated with correction at sample level. Similarly, for a system operating with frames of samples, the inserted repetitions or blanks, and the discontinuities in the signal due to frames being eliminated amplify loss of quality which becomes highly perceptible and very disturbing for the listener.
Furthermore, the use of a FIFO memory runs the risk of establishing a considerable delay in transmission and that also harms the overall quality of a call.
The second method is much more complex to implement and requires a clock servo-control mechanism, and thus requires special hardware. However, it provides partial synchronization and therefore avoids problems associated with managing asynchrony. Nevertheless, that method adapts poorly to discontinuous transmission systems, to systems involving lost frames, or to systems with high levels of jitter. Under such circumstances, synchronization information is no longer available.
Furthermore, that method cannot be envisaged on terminal platforms where clock servo-control is not possible, as is the case in particular with PC type terminals, for example, where the system used for playing back sound is a sound card.
Devices are already known from document WO/99 17 584 for implementing a method in accordance with the preamble of claim 1, the devices having only one buffer memory.
Document US-A 4 703 477 facilitates reading voice data by implementing a method of putting frames relating to the same voice data end-to-end.
SUMMARY OF THE INVENTION

A general object of the invention is to propose a solution to the problems associated with continuity in the playback of a speech signal in the presence of asynchronous transmission, and to do so by taking action at receiver level, i.e. at the end of the transmission system.
To this end, the invention provides a method of managing the decoding and playback of a sound signal in an asynchronous transmission system, in which any overabundance of filling of a first buffer memory and/or of a second buffer memory situated at the inlet or at the outlet of a decoding block is detected by comparing the filling level with at least one threshold, the method being characterized in that, depending on the value of the filling level:
- voice activity detection is implemented and frames considered by said detection as being non-active are eliminated; and
- concatenation processing is implemented on two successive frames to compact them into a pseudo-frame of length less than or equal to one frame, the length reduction ratio of the pseudo-frame relative to the length of the two frames being greater than or equal to two.
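A minimal sketch of this decision logic might look as follows. All names and threshold values are assumptions, the VAD predicate is supplied by the caller, and the pseudo-frame here is formed by plain averaging for brevity, whereas the description specifies a weighted concatenation.

```python
# Hypothetical sketch of the overabundance management: the buffer filling
# level is compared with thresholds t1 < t2 < t3.

def manage_overabundance(fill_level, t1, t2, t3, frames, is_active):
    """Return the frames to keep, given the buffer filling level."""
    if t1 <= fill_level < t2:
        # moderate overfill: eliminate frames the VAD marks as non-active
        return [f for f in frames if is_active(f)]
    if t2 <= fill_level < t3:
        # stronger overfill: compact each pair of successive frames into a
        # pseudo-frame no longer than one frame (here: plain averaging)
        out = [[(x + y) / 2 for x, y in zip(a, b)]
               for a, b in zip(frames[0::2], frames[1::2])]
        if len(frames) % 2:
            out.append(frames[-1])  # odd frame left over, kept as-is
        return out
    return list(frames)
```

Below t1 nothing is done; the two mechanisms only engage as the filling level rises, which is what keeps the transmission delay bounded without audible skips at low overfill.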
Such a method is simple to implement and provides a guarantee of quality by avoiding excessive increase in transmission delay and by managing holes in the speech signal effectively. Furthermore, it does not imply any specific hardware servo-control circuit, and can therefore be quickly adapted to different asynchronous networks, terminals, and platforms.
The method is advantageously associated with the various characteristics below, taken singly or in any technically feasible combination:
- voice activity detection is implemented and frames considered by said detection as being non-active are eliminated whenever the filling level lies between a first threshold and a second threshold, and concatenation processing is implemented on two successive frames whenever the filling level lies between a second threshold and a third threshold;
- the first and second thresholds are the same;
- detection is performed at the inlet or the outlet of a decoding block having a first buffer memory at its inlet and/or its outlet to determine whether any frame is missing or erroneous or whether any samples to be played back are absent, and a fake frame is generated to ensure continuity in the audio playback on detecting such a missing or erroneous frame, or on detecting such an absence of samples for playback;
- when the decoding block implements its decoding processing in cyclical manner relative to the content of the first buffer memory, detection of any missing or erroneous frame or of any absence of samples to play back is implemented at the same cyclical frequency, said detection taking place far enough in advance relative to the decoding process to make it possible to generate a fake frame in good time;
- a fake frame is not generated when a missing or erroneous frame is detected for a frame on which an absence of samples has already been detected;
- for a system of the type which can voluntarily stop sending frames, the type of the previously-generated frame is stored from one frame to the next, and this information is used to determine whether to generate fake frames or to generate frames of silence;
- in processing for concatenating two successive frames, the samples are weighted in such a manner as to give more importance to the first samples of the first frame and to the last samples of the second frame;
- the threshold(s) is/are adaptive; and
- a threshold is adapted as a function of the length of time passed with a filling level above a given threshold.
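The weighted concatenation characteristic above (more importance to the first samples of the first frame and to the last samples of the second) can be sketched with a linear weighting law. The linear ramp is an assumption for illustration; the text does not fix the weighting law, and the function name is hypothetical.

```python
# Compact two N-sample frames into one N-sample pseudo-frame (length
# reduction ratio of 2), cross-fading from frame 1 into frame 2.

def concatenate_frames(f1, f2):
    """Weighted concatenation of two successive frames of equal length."""
    n = len(f1)
    assert len(f2) == n and n >= 2
    out = []
    for i in range(n):
        w = 1.0 - i / (n - 1)   # weight ramps from 1 down to 0
        # early samples dominated by f1, late samples by f2
        out.append(w * f1[i] + (1.0 - w) * f2[i])
    return out
```

Because the output is one frame long while consuming two, each application of this processing drains one frame's worth of surplus from the buffer while avoiding the hard discontinuity that simply deleting a frame would cause.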
The invention also provides a device for playing back a speech signal, the device comprising a first buffer memory receiving coded frames, means implementing decoding processing on the frames stored in said first buffer memory, a second buffer memory receiving decoded frames output by the decoding means, and sound playback means receiving the frames output by the second buffer memory, the device being characterized in that it further comprises means for implementing the above-specified method.
As will be understood on reading the following description, these means are essentially computer means.
DESCRIPTION OF THE FIGURES

Other characteristics and advantages of the invention appear further from the following description, which is purely illustrative and non-limiting and which should be read with reference to the accompanying figures, in which:
- Figure 1 is a block diagram of an asynchronous transmission system;
- Figure 2 is a diagram showing a model of such a transmission system;
- Figure 3 is a diagram of a receiver device; and
- Figure 4 shows the signals obtained by implementing concatenation processing as proposed by the invention.
DETAILED DESCRIPTION OF ONE OR MORE EMBODIMENTS

The method proposed by the invention for managing asynchrony in transmission implements two kinds of processing corresponding to handling the two phenomena described above, namely the lack of samples and surplus samples.

1. Description of the sound playback system in a conventional transmission application

As shown in Figure 3, the playback system for a speech signal comprises three elements:
- A block 10 waiting to receive samples or frames of code coming from the network. The block 10 contains a FIFO type memory 11 or circular type buffer memory (referred to as "FIFO 1" in the description below) enabling frames to be stored on a temporary basis prior to being decoded.
- A decoding block 12 which takes the frames coming from the block 10, decodes them, and stores them in turn in a FIFO memory 13 (referred to below as "FIFO 2") .
- A playback block 14 which takes the decoded sample frames and applies them to any kind of sound playback system 15.
Depending on the terminals and the way the system is organized, the clock frequency used for sound playback (i.e. the digital-to-analog converter frequency FDAC) is not necessarily directly associated with all of the blocks. Since the block 14 is directly associated with the playback system, it is directly associated with the frequency FDAC. However the other blocks can be associated instead with the rate at which frames arrive from the network rather than with the frequency FDAC.
Taking the example of a terminal provided with a multitasking system, and in which each block is performed by a specific task, the tasks 10 and 12 can thus be associated with frame reception. The task 10 waits for a frame from a network, which frame is then decoded by the task 12 and placed in the memory FIFO 2.
Meanwhile the task 14 clocked at FDAC takes samples from the memory FIFO 2 and delivers them continuously to the sound playback system.
It can thus be seen that regardless of whether the asynchrony is "strong" or "weak", it is the way in which the memory FIFO 2 is managed that requires particular attention. Similarly, if the task 12 were strongly associated with the task 14, then particular attention would be required by the memory FIFO 1.
The mechanism constituting an implementation of the invention is described below in application to managing the memory FIFO 2, but the description includes explanations about how to transpose it, with certain adaptations, to managing the memory FIFO 1.

2. Absence of samples

In order to continue playing back sound in the absence of samples, both potential causes of samples for playback being absent are treated. The first cause corresponds to information contained in lost packets, while the second cause corresponds to the absence of any samples to play back (e.g. FIFO 2 empty) even though it is still necessary to keep on sending samples to the sound playback system.

2.1 Loss of frames or erroneous frames

The processing applied to lost frames or to erroneous frames requires a transmission system to be available that gives access to information about frames being lost and about erroneous frames being received. This is often the case in transmission systems.
For example, in IP networks, it is possible to use the marking of packets coming from the real-time transport protocol (RTP) layer, which marking gives the exact number of samples lost between two packets of audio code being received. This information about loss of frames, or in the case of IP about loss of packets (each containing one or more speech frames), generally becomes available only once the packet following the lost packet(s) is itself received.
It is not necessarily advantageous to take action so long as one or more valid frames can be decoded. With new generation speech encoders (CELP encoders, transform encoders, ...), in order to ensure that the quality of sound playback is maintained, it is often necessary to ensure a degree of synchronism between the encoder and the decoder. The loss of this encoder/decoder synchronism can be compensated by using frame loss correction algorithms associated with the speech encoder used. By way of example, these algorithms are provided in the standards for certain speech encoders (e.g.
International Telecommunications Union (ITU) standard G.723.1) . When using simpler encoders, such a mechanism is not always necessary.
When a large number of frames have been lost, the number of "fake" sample frames that need to be generated in order to pad out the memory FIFO 2 can be limited. The purpose of fake frame generation processing is to fill holes in such a manner as to ensure signal continuity while also smoothing the internal variables of the decoder, so as to avoid excessive divergence on decoding the first valid frame following the invalid or lost frames, thereby avoiding any audible discontinuity. After a few frames have been generated, it can be assumed that the variables have been smoothed, and thus that the generation of such fake frames can be limited to a small number of frames (e.g. four to six) whenever a large number of frames has been lost.
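Capping fake-frame generation after a long loss burst can be sketched as follows. The attenuate-and-repeat concealment shown here is an assumption chosen for brevity; real frame loss correction algorithms are codec-specific (e.g. the one standardized with G.723.1), and all names and constants are hypothetical.

```python
# Generate at most max_fake concealment frames, however many were lost:
# beyond a few frames the decoder state is assumed smoothed, so further
# fake frames bring no benefit.

def conceal(last_frame, n_lost, max_fake=5, decay=0.5):
    """Yield progressively attenuated copies of the last good frame."""
    for k in range(min(n_lost, max_fake)):
        g = decay ** (k + 1)          # fade out across the burst
        yield [g * s for s in last_frame]
```

The cap (here 5, within the four-to-six range quoted above) keeps the cost of concealment bounded even when an entire burst of packets is lost.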
It will thus be understood that processing is servo-controlled in this way relative to information about lost frames .
Similar processing is implemented on the basis of information about invalid frames. This information is forwarded to the decoder by the network portion of the receiver and it arrives soon enough to enable a frame correction algorithm to be implemented which, by taking account of such a non-valid frame, makes it possible to ensure continuity in the signal, and thus to avoid having another cause for samples being absent in the memory FIFO 2.
To sum up, this first process corresponds to managing information of the type "n frames lost" or "invalid frame received" coming from the network layer of the receiver. This management is characterized by implementing an algorithm for correcting frame losses (also referred to in this document as an algorithm for generating "fake" frames). This first process therefore acts at decoding task level and feeds the memory FIFO 2.

2.2 Absence of samples to be played back

This second process is associated with the clock coming from the task 14, i.e. with the clock at the frequency FDAC. As mentioned above, the memory FIFO 2 (or FIFO 1 if the task 12 is included in the task 14) can become empty of samples even though it is still necessary to supply samples to the sound playback system. It is then necessary to supply the playback system with samples and if possible to avoid playing back zeros (since this degrades the sound signal very greatly).

This second process can be thought of as a feedback loop on frame decoding. This loop causes the algorithm for correcting frame losses to be called, and as a result it needs to be activated soon enough to enable the algorithm to be executed and to enable samples to be sent to the sound playback system. Depending on the platform, this feedback can be called in different ways.
This loop can be implemented in two ways which are described below.
For a single-task receiver (e.g. a digital signal processor (DSP) without any real-time operating system (RTOS)), the audio decoder portion is tied completely to the DAC clock (FDAC) and is therefore permanently waiting, in cyclical manner, for a frame to be decoded. For example, with a speech encoder using 30 ms frames, waiting loops are built up in periods that are multiples of 30 ms.
Thus, for a 30 ms loop, the decoder will, every 30 ms, be expecting a frame for decoding to be placed in the memory FIFO 1 (which can correspond merely to a frame passing from the network layer to the task 12). On arrival of the frame, it is decoded and placed in the memory FIFO 2 for sending to the DAC. The feedback processing is implemented whenever it is observed that there is no frame for decoding in the memory FIFO 1 at the time given by:
T = To + 30 ms - Tc
where:
To = the start time of the 30 ms wait loop; and
Tc = the time required for executing the algorithm for generating fake frames, with a safety margin corresponding to interrupts and/or other auxiliary processing that might take place before the end of the loop.
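The deadline computation above can be sketched as follows (a hedged illustration; the execution-time value Tc and the function name are assumptions, not values from the text):

```python
def fake_frame_deadline(t0_ms, loop_ms=30.0, tc_ms=5.0):
    """Time T = To + loop - Tc at which, if no frame has yet arrived in
    FIFO 1, the fake-frame generation algorithm must be launched.

    tc_ms bundles the algorithm's execution time plus a safety margin
    for interrupts and auxiliary processing (assumed value).
    """
    return t0_ms + loop_ms - tc_ms
```

Launching the generator at this time, rather than at the end of the loop, guarantees that the generated samples reach FIFO 2 before the DAC needs them.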
Processing is thus implemented with a latency time deadline of Tb (loop time) - Tc (computation time + margin). With a multitasking receiver (e.g. a PC terminal), time is not managed with such precision and the processing implemented must therefore be somewhat different. (Note: this processing nevertheless remains quite close to the preceding process, since it too seeks to take account of the time Tc.) Under such circumstances, the only waiting loops available are often those associated with events, e.g. the fact that packets have been received by the network, or the fact that buffer memory n (containing one or more sample frames) sent previously to the sound playback system has been read by the DAC and is therefore again available for sending samples to the DAC.
Depending on the structure of the system and on whether or not it is necessary to respond quickly to an event, it is possible to wait for a certain length of time before filling said buffer memory prior to forwarding to the DAC. Such a latency time is selected in such a manner as to leave enough time for the algorithm for generating "fake" frames to execute, if necessary.
Then, possibly after said latency time has elapsed, the process verifies that sufficient samples are present in FIFO 2 (note: this could apply to FIFO 1 if the management takes place at its level), and if not it requests an appropriate number of fake frames to be generated in order to fill buffer memory n.
When the system is such that it is necessary to fill buffer memory n "immediately", then monitoring the availability of samples and possibly calling for the "fake" frames generation processing are implemented directly after each delivery to the DAC from the buffer memory so that the generated samples are already in the memory FIFO 2 when the event "buffer memory n available" occurs.
Thus, whatever the receiver, the process observes the absence of samples to be sent to the sound playback system by implementing a check on the content of the buffer memory FIFO 2 (or FIFO 1, depending on how the sound playback system is managed) and activates the algorithm for generating "fake" frames in order to generate the missing samples.
It will be understood that the second process responds firstly to the problem of "slip" between clocks, and more precisely to the circumstance in which the receive clock (FDAC) is faster than the send clock. It also applies to the phenomenon of frames being lost, since frame loss can lead to an absence of samples to send to the DAC even before the loss has been detected, given that such detection occurs only on receiving the frame following the loss.
In order to combine the actions of the first and second processes, the first process is prevented from generating "fake" frames on detection of frame loss whenever the corresponding frames have just been generated by the second process.
For this purpose, use is made of flags and also of counters determining the number of samples that have been generated by the second process.
2.3 Specific actions for speech encoders using VAD/DTX/CNG services
Encoders using a VAD/DTX/CNG system can voluntarily stop sending frames; under such circumstances, the absence of samples must not be considered exactly as a loss of frames, but rather as a period of silence. The only way of determining whether the frame to be generated must be silence or should correspond to a lost frame is to know the type of the previously-generated frame (i.e. a signal frame, a frame corresponding to a lost frame, a noise update frame (SID), or a frame of silence (NOT)). For this purpose, the type of the generated frame is stored, and while frames are being generated to compensate for an absent frame or a lost frame, it is decided whether fake frames should be generated using the algorithm for correcting frame losses (as applies when the preceding frame was of the FSF type), or whether frames of silence should be generated by activating the decoder appropriately (as applies when the preceding frame was of the SID or the NOT type).
3. Overabundance of samples to be played back
In order to deal with an overabundance of samples to be played back, processing is implemented to empty out frames, eliminating certain frames in full or in part before they can be taken into account by the sound playback system.
This processing enables frames to be stored in buffer memories until certain thresholds trigger actions for limiting the number of frames in memory, and thus for limiting any corresponding increase in delay across the communications system. This limited storage makes it possible to accommodate jitter phenomena on receiving frames in bursts and also slip between clocks, while nevertheless limiting transmission delay.
3.1 Emptying out processing
Any accumulation of frames is initially detectable in the memory FIFO 1, and is subsequently transferred to the memory FIFO 2.
The proposed method manages information concerning the filling level of a reference buffer memory, i.e.
FIFO 1 or FIFO 2, depending on how the tasks 10, 12, and 14 are organized in the receiver (as explained above). If the tasks 12 and 14 are associated with each other, then the filling level information used by the method relates to the memory FIFO 1, which acts as a buffer between the network and the sound playback system.
Similarly, if the tasks 10 and 12 are associated, then it is the memory FIFO 2 which acts as a buffer, and it is therefore its filling level which is taken into consideration by the management process.
The process is explained below for the second context. The first is merely an immediate transposition thereof.
In order to maintain synchronization as closely as possible between the encoder and the decoder, and thus maintain optimum sound playback, all of the frames coming from the network are decoded. The process then decides on what action to take on the decoded frame as a function of information concerning filling level. This action is described in greater detail below. To activate the processing, filling level thresholds are used. These thresholds define filling alarm levels for the FIFO memory. In order to take action that is as inaudible as possible (i.e. in order to limit quality degradation), two levels of action are selected. A first level (alarm level 1) corresponds to the filling level being excessive but not critical (far from the maximum acceptable filling level), while a second level (alarm level 2) corresponds to it being mandatory to take action on each frame (this level is quite close to the maximum acceptable level). A third or "emergency" level (alarm level 3) is also defined in order to avoid memory overflows or other problems. This level corresponds to filling being very close to the maximum acceptable level. Alarm level 3 should never be reached if the actions taken at the two preceding threshold levels are properly performed and if the thresholds are properly defined.
Each time decoding is performed, the information concerning filling level is compared with the thresholds in order to determine the state of the FIFO (in an alarm state or not) and, where appropriate, the level of the alarm.
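A possible way of performing this comparison, including the hysteresis on alarm state 1 described in section 3.2 below, is sketched here (function and variable names, and the tuple layout of the thresholds, are assumptions):

```python
def alarm_level(fill, thresholds, prev_level=0):
    """Classify the FIFO filling level against thresholds 0 to 3.

    thresholds = (t0, t1, t2, t3) with t0 < t1 < t2 < t3.  Alarm
    state 1 is entered when fill exceeds threshold 1 and persists,
    by hysteresis, until fill drops below threshold 0.
    """
    t0, t1, t2, t3 = thresholds
    if fill > t3:
        return 3
    if fill > t2:
        return 2
    if fill > t1:
        return 1
    if prev_level >= 1 and fill >= t0:
        return 1  # hysteresis: still in alarm state 1
    return 0
```

Using the example initial thresholds (5, 8, 12, 24 frames), a level of 9 frames enters alarm state 1, and a subsequent level of 6 frames keeps state 1 active because it is still above threshold 0.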
If the state obtained is not an alarm state, then no action is undertaken and the decoded frame is stored in FIFO 2. In the first alarm state, it is considered that at least 50% of the signal coming from a conversation is not useful, and therefore at this alarm level all frames presenting very little information are eliminated. For this purpose, it is possible to implement simple voice activity detection (VAD) which monitors all frames of samples after they have been decoded to decide whether or not they should be written into FIFO 2. The process can also make decisions based on information taken directly from the coded frame concerning the importance or otherwise of the information contained in the frame. In this alarm state, any frame that is considered as containing nothing but noise is simply not stored in FIFO 2 for future sound playback.
In the second alarm state (critical level), it is necessary to take action on each frame to curb growth in the filling level of the memory FIFO 2 very aggressively. At this level, the preceding processing (i.e. the processing implemented for alarm level 1) remains active. However, it is now also necessary to shorten pairs of consecutive frames down to the length of one frame or shorter. A decision is therefore taken on the basis of two non-"silent" sample frames (given that any frame that is "silent" is merely not written to FIFO 2, since alarm state 1 is already included in alarm state 2). Action on two consecutive frames is therefore undertaken only when a frame is detected as being non-"silent". The frame is initially stored, and then, if the next frame is "silent", only the first frame is written into FIFO 2.
When both frames contain important information, it becomes necessary to replace them by a single frame while minimizing loss of information and degradation of quality. It is the replacement frame that is stored in FIFO 2. Any effective solution capable of performing this task can be used and activated under such conditions (i.e. second alarm state and two non-"silent" frames). Two examples of algorithms for performing this task are described below.
In a first algorithmic solution, the two contiguous frames are replaced by a single frame in which each coefficient y(j) (where j lies in the range 0 to N-1, and where N is the number of samples per frame) is given the value (x(2j) + x(2j+1))/2 (where the coefficients x(i), with i lying in the range 0 to 2N-1, come from both original frames). This solution amounts to performing a kind of smoothed undersampling. The frequency of the played-back signal is thus doubled for this frame. Nevertheless, the inventors have found that provided alarm state 2 does not occur very frequently, this solution suffices to maintain the quality of sound playback.
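This first solution can be sketched as follows (a minimal illustration under the pairwise-averaging reading of the formula above; function and variable names are assumptions):

```python
def merge_frames_by_averaging(frame_a, frame_b):
    """Replace two contiguous N-sample frames by a single N-sample frame.

    Output coefficient j is the mean of input samples 2j and 2j+1 taken
    from the concatenation of both frames: a smoothed undersampling by 2,
    which doubles the playback frequency for this frame.
    """
    assert len(frame_a) == len(frame_b)
    x = list(frame_a) + list(frame_b)  # 2N samples from both frames
    return [(x[2 * j] + x[2 * j + 1]) / 2 for j in range(len(frame_a))]
```

The averaging, as opposed to simply dropping every other sample, is what makes the undersampling "smoothed".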
In a second solution, signal amplitude is detected to enable the two frames to be compacted into a pseudo-frame of length shorter than or equal to that of one frame. The number of samples contained in the pseudo-frame is determined by the fundamental frequency or "pitch" information, but in all events it is shorter than or equal to the length of a normal frame, while nevertheless being close to said frame length. The algorithm used ensures continuity of the played-back signal without any audible hole, and without any frequency doubling, while nevertheless dividing the amount of storage required for the signal by a factor that is greater than or equal to 2. This is explained in greater detail in paragraph 3.4 below. Furthermore, this also minimizes the loss of sound information since less than 50% of the information is in fact eliminated.
It should be observed that when the receiver implements its processing on the basis of analyzing FIFO 1, with the decoder being directly associated with the sound playback system, it is necessary to generate a number of samples that is sufficient, i.e. in the present case a number that ensures at least one frame of samples is made available in FIFO 2. The frame concatenation algorithm is then calibrated to ensure that it always generates a minimum number of samples, but at least one frame. Another solution would consist in activating the algorithm several times over, instead of only once, when that is necessary to obtain a sufficient number of samples.
In the third alarm level, which is normally never reached, no frames are stored in FIFO 2. In a variant, the system can also decide to clear out a fraction of the buffer memory suddenly (this can apply where it is management of FIFO 1 that is activated).
It should also be observed that depending on the network and on the types of problems it generates in terms of asynchrony, it is possible to decide whether or not to activate particular alarm levels. For example, when asynchrony is "weak", alarm levels 1 and 2 can be combined, and the simple solution of replacing two frames by a single frame can then constitute the only active process.
3.2 Alarm thresholds
There follows a more detailed description of the alarm thresholds and how they are managed.
As explained above, the reference memory is said to be in alarm state 1 when its filling level is above threshold 1; this state remains active until its filling level drops below threshold 0. State 1 therefore operates with hysteresis.
The memory is said to be in alarm state 2 if the filling level exceeds threshold 2, and to be in alarm state 3 if the filling level exceeds threshold 3. It is possible to envisage managing these alarm states with hysteresis as well.
Thresholds 0, 1, and 2 are adaptive. Threshold 3 is directly associated with the maximum acceptable size and is fixed. These thresholds need to be adaptable in order to accommodate different call contexts and real-time fluctuations during a call. It is appropriate to allow a greater amount of delay when the call is suffering a large amount of jitter (delaying playback remains the best way of ensuring acceptable quality in the presence of jitter). In a high-jitter context, it is therefore appropriate for the thresholds 0, 1, and 2 to be at quite high levels.
To facilitate processing, the positions of the thresholds can correspond to integer numbers of frame sizes as exchanged between the various tasks of the receiver. This frame size is written Tt.
By way of example, the initial values of these thresholds can be as follows:
Threshold 0: 5 x Tt
Threshold 1: 8 x Tt
Threshold 2: 12 x Tt
Threshold 3: 24 x Tt (fixed value)
The thresholds 0, 1, and 2 can be adapted together in steps of size Tt. Extreme acceptable values can lie in the range -1 to +8, for example.
Thus, threshold 1 can take on the values 7x, 8x, 9x, 10x, ..., 16x Tt. Threshold adaptation proper is performed on the basis of an adaptation criterion, namely the length of time spent in the alarm state. For this purpose, an alarm state percentage is evaluated about once every N seconds (e.g. N = 10). When this percentage exceeds a given threshold (5%), the alarm state thresholds are increased; conversely, when the percentage is below a given minimum threshold (0.5%), the alarm thresholds are decreased. To ensure that the system does not oscillate excessively due to its thresholds being adapted too frequently, hysteresis is applied to the adaptation decision making. The thresholds are actually increased by one step only in the presence of two consecutive increase options, and they are decreased by one step only in the presence of three consecutive decrease options. As a result, the length of time between two threshold increments is at least 2N seconds and the length of time between two threshold decrements is at least 3N seconds. The procedure for increasing thresholds can be accelerated if a large percentage of frames are in an alarm state. One accelerating procedure consists in increasing the thresholds directly, for example whenever the alarm percentage exceeds 50%.
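The adaptation rule above can be sketched as follows (a hedged illustration; names are assumptions, and the clamping of thresholds to their extreme acceptable values is omitted for brevity):

```python
def adapt_thresholds(thresholds, alarm_pct, state):
    """One adaptation decision per N-second evaluation window.

    thresholds: mutable list [t0, t1, t2] in units of Tt.
    state: dict counting consecutive increase/decrease options.
    Raise by one step after 2 consecutive increase options (> 5% of
    time in alarm), lower after 3 consecutive decrease options
    (< 0.5%); above 50%, raise immediately (accelerated procedure).
    """
    step = 1
    if alarm_pct > 50.0:               # accelerated increase
        thresholds[:] = [t + step for t in thresholds]
        state["up"] = state["down"] = 0
    elif alarm_pct > 5.0:
        state["up"] += 1
        state["down"] = 0
        if state["up"] >= 2:           # hysteresis on increases
            thresholds[:] = [t + step for t in thresholds]
            state["up"] = 0
    elif alarm_pct < 0.5:
        state["down"] += 1
        state["up"] = 0
        if state["down"] >= 3:         # hysteresis on decreases
            thresholds[:] = [t - step for t in thresholds]
            state["down"] = 0
    else:
        state["up"] = state["down"] = 0
    return thresholds
```

With N = 10 s windows, this reproduces the stated minimum spacings: at least 2N seconds between increments and 3N seconds between decrements.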
Naturally, the threshold values given for the alarm thresholds are provided purely by way of indication.
3.3 Interaction with the first process
The first process is the process which causes "fake" frames to be generated when frames are lost or erroneous. When the system is in an alarm state (overabundance of frames), there is no need to generate "fake" frames, which would merely aggravate the phenomenon of overabundance. Nevertheless, in order to maintain high quality sound playback it is important to maintain encoder/decoder synchronization by informing the decoder whenever a frame has been lost (e.g. by launching the generation of one or two fake frames, but no more). The third process will therefore act, in the alarm state, on the first process so as to curb very strongly the generation of "fake" frames.
3.4 Frame concatenation
The purpose of the concatenation process is to shorten the duration of a digital audio signal containing speech or music while introducing as little audible degradation as possible. Since the sampling frequency is given and fixed, it is the number of samples sent to the sound playback apparatus that is decreased. One obvious solution for shortening a sequence of N samples is to remove M regularly spaced-apart samples from the sequence in question. This causes the fundamental frequency to increase, which can be unpleasant for the listener, particularly when the ratio M/N is large. Furthermore, there is a danger of no longer complying with the sampling theorem. The process described below makes it possible to shorten an audio sequence without modifying its fundamental frequency and without giving rise to audible degradation due to signal discontinuity. This process is based on detecting the value of the pitch period. The number of samples eliminated by this algorithm cannot be selected freely, since it is a multiple of the pitch value P.
Nevertheless, it is possible to define a minimum number of samples to be eliminated, Nemin, which must satisfy the relationship Nemin ≤ N/2. Since the purpose is to eliminate at least 50% of the samples in the context of the device for managing asynchrony in an audio transmission, it is advantageous to set Nemin = N/2. It is also assumed that the maximum value of the pitch P is less than the length N of the sequence to be shortened. The number Ne of samples that are eliminated by the algorithm is then the smallest multiple of the pitch value P that is greater than or equal to Nemin, i.e. Ne = kP where k is a positive integer and Ne ≥ Nemin > Ne - P. The length of the output signal is then Nr = N - Ne. The input signal to be shortened is written s(n), where n = 1, ..., N, and the output signal is written s'(n), where n = 1, ..., Nr. In order to ensure continuity in the output signal, the first and the last Nr samples of the signal s(n) are merged progressively, i.e.
s'(n) = s(Ne+n).w(n) + s(n).(1 - w(n)) for n = 1, ..., Nr
where w(n) is a weighting function such that 0 ≤ w(n) ≤ 1 for n = 1, ..., Nr and w(n) ≤ w(n+1) for n = 1, ..., Nr-1. For example, w(n) can merely be the linear function w(n) = n/Nr. For an unvoiced signal where it is not possible to determine the pitch, Ne can be fixed freely.
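The shortening procedure can be sketched as follows (a hedged illustration of the formula above, using 0-based indexing and the linear weighting; the pitch value is assumed to be supplied by a separate detector, and names are assumptions):

```python
import math

def shorten_by_pitch(s, pitch, ne_min=None):
    """Shorten s by Ne samples, Ne being the smallest multiple of the
    pitch P that is >= Nemin (default N/2), then cross-fade the first
    and last Nr = N - Ne samples with the linear weight w(n) = n/Nr.
    """
    n = len(s)
    if ne_min is None:
        ne_min = n // 2                # eliminate at least 50%
    k = math.ceil(ne_min / pitch)      # smallest k with kP >= Nemin
    ne = k * pitch
    nr = n - ne
    # s'(i) = s(Ne+i).w(i) + s(i).(1 - w(i)), merging both ends of s
    return [s[ne + i] * ((i + 1) / nr) + s[i] * (1 - (i + 1) / nr)
            for i in range(nr)]
```

On the worked example of Figure 4 (N = 640, P = 47), this yields Ne = 7 x 47 = 329 and an output of Nr = 311 samples, with no discontinuity because the fade starts entirely on the first piece and ends entirely on the second.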
Figure 4, showing signal sequences A, B, C, and D, illustrates how the process is implemented on a worked example. The first sequence (A) is shown as a solid line and constitutes a piece of the signal s(n) to be shortened that is N = 640 samples long. The purpose is to shorten this sequence by at least 320 samples without changing its fundamental frequency, and without introducing any discontinuity or other audible degradation. The pitch of s(n) varies slowly, its value being equal to 49 at the beginning of the sequence and 45 at the end of the sequence. The pitch detected by a correlation method is P = 47. Thus, s(n) will be shortened by k = 7 periods, i.e. Ne = kP = 7 x 47 = 329 samples.
In this example, linear weighting has been selected. The sequences B and C show two pieces of the signal of length Nr = N - Ne = 311 that have already been weighted and that are subsequently merged together. Merging is performed by adding these two signals together. In sequence C, it can be seen that because of the slight variation in pitch, these two pieces of the signal s(n) are somewhat phase-shifted. Because of the merging technique used, this does not give rise to any discontinuity in the output signal s'(n) (continuous line in sequence D). It can also be seen in sequence D that the shortened signal s'(n) remains properly in phase with the signals that precede it and that follow it (dashed lines in Figures 1 and 4).

Claims (11)

- 28 - 147263/2
CLAIMS:
1. A method of managing the decoding and playback of a sound signal in an asynchronous transmission system, in which any overabundance of filling of a first buffer memory and/or of a second buffer memory situated at the inlet and/or at the outlet of a decoding block is detected by comparing the filling level with at least one threshold, the method being characterized in that, depending on the value of the filling level: voice activity detection is implemented and frames considered by said detection as being non-active are eliminated; and concatenation processing is implemented on two successive frames to compact them into a pseudo-frame of length less than or equal to one frame, the length reduction ratio of the pseudo-frame relative to the length of the two frames being greater than or equal to two.
2. A method according to claim 1, characterized in that voice activity detection is implemented and frames considered by said detection as being not active are eliminated whenever the filling level lies between a first threshold and a second threshold, and in that concatenation processing is implemented on two successive frames whenever the filling level lies between a second threshold and a third threshold.
3. A method according to claim 2, characterized in that the first and second thresholds are the same.
4. A method according to any preceding claim, characterized in that detection is performed at the inlet or the outlet of a decoding block having a first buffer memory at its inlet and/or its outlet to determine whether any frame is missing or erroneous or whether any samples to be played back are absent, and a fake frame is generated to ensure continuity in the audio playback on detecting such a missing or erroneous frame, or on detecting such an absence of samples for playback.
5. A method according to claim 4, characterized in that when the decoding block implements its decoding processing in cyclical manner relative to the content of the first buffer memory, detection of any missing or erroneous frame or of any absence of samples to play back is implemented at the same cyclical frequency, said detection taking place far enough in advance relative to the decoding process to make it possible to generate a fake frame in good time.
6. A method according to claim 4 or claim 5, characterized in that a fake frame is not generated when a missing or erroneous frame is detected for a frame on which an absence of samples has already been detected.
7. A method according to any one of claims 4 to 6, characterized in that, for a system of the type which can voluntarily stop sending frames, the type of the previously-generated frame is stored from one frame to the next, and this information is used to determine whether to generate fake frames or to generate frames of silence.
8. A method according to any preceding claim, characterized in that in processing for concatenating two successive frames, the samples are weighted in such a manner as to give more importance to the first samples of the first frame and to the last samples of the second frame.
9. A method according to any preceding claim, characterized in that the threshold(s) is/are adaptive.
10. A method according to claim 9, characterized in that a threshold is adapted as a function of the length of time passed with a filling level above a given threshold.
11. A device for playing back a speech signal, the device comprising a first buffer memory receiving coded frames, means implementing decoding processing on the frames stored in said first buffer memory, a second buffer memory receiving decoded frames output by the decoding means, and sound playback means receiving the frames output by the second buffer memory, the device being characterized in that it further comprises means capable of detecting any overabundance of filling of the first and/or of the second buffer memory by comparing the filling level with at least one threshold, and means which, as a function of the value of the filling level, implement voice activity detection and eliminate frames considered as being non-active, and implement concatenation processing on two successive frames to compact them into a pseudo-frame of length less than or equal to one frame, the reduction ratio of the pseudo-frame relative to the length of the two frames being greater than or equal to two. For the Applicants, REINHOLD COHN AND PARTNERS 01368851X19-01
IL147263A 1999-06-24 2001-12-24 Method for decoding and retrieving a sound signal in an asynchronous transmission system IL147263A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FR9908081A FR2795548B1 (en) 1999-06-24 1999-06-24 METHOD FOR MANAGING THE DECODING AND RESTORATION OF A SOUND SIGNAL IN AN ASYNCHRONOUS TRANSMISSION SYSTEM
PCT/FR2000/001734 WO2001001727A1 (en) 1999-06-24 2000-06-22 Method for decoding and retrieving a sound signal in an asynchronous transmission system

Publications (1)

Publication Number Publication Date
IL147263A true IL147263A (en) 2006-12-10

Family

ID=9547255

Family Applications (2)

Application Number Title Priority Date Filing Date
IL14726300A IL147263A0 (en) 1999-06-24 2000-06-22 Method for decoding and retrieving a sound signal in an asynchronous transmission system
IL147263A IL147263A (en) 1999-06-24 2001-12-24 Method for decoding and retrieving a sound signal in an asynchronous transmission system

Family Applications Before (1)

Application Number Title Priority Date Filing Date
IL14726300A IL147263A0 (en) 1999-06-24 2000-06-22 Method for decoding and retrieving a sound signal in an asynchronous transmission system

Country Status (7)

Country Link
EP (1) EP1192833B1 (en)
AT (1) ATE287625T1 (en)
DE (1) DE60017565T2 (en)
ES (1) ES2235910T3 (en)
FR (1) FR2795548B1 (en)
IL (2) IL147263A0 (en)
WO (1) WO2001001727A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI225392B (en) 2002-01-11 2004-12-21 Kwang-Ho Lee Cushion for relieving fatigue and reforming sleeping position
CN103115842B (en) * 2013-03-04 2015-05-13 江苏中烟工业有限责任公司 Method for measuring recycled stem backfilling rate

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4703477A (en) * 1986-02-28 1987-10-27 American Telephone And Telegraph Company At&T Bell Laboratories Packet information field data format
US5825771A (en) * 1994-11-10 1998-10-20 Vocaltec Ltd. Audio transceiver
US5699481A (en) * 1995-05-18 1997-12-16 Rockwell International Corporation Timing recovery scheme for packet speech in multiplexing environment of voice with data applications
US5790538A (en) * 1996-01-26 1998-08-04 Telogy Networks, Inc. System and method for voice Playout in an asynchronous packet network
US6175871B1 (en) * 1997-10-01 2001-01-16 3Com Corporation Method and apparatus for real time communication over packet networks

Also Published As

Publication number Publication date
DE60017565T2 (en) 2006-02-23
WO2001001727A1 (en) 2001-01-04
ES2235910T3 (en) 2005-07-16
ATE287625T1 (en) 2005-02-15
DE60017565D1 (en) 2005-02-24
FR2795548B1 (en) 2001-09-21
IL147263A0 (en) 2002-08-14
FR2795548A1 (en) 2000-12-29
EP1192833B1 (en) 2005-01-19
EP1192833A1 (en) 2002-04-03


Legal Events

Date Code Title Description
MM9K Patent not in force due to non-payment of renewal fees