Dropout concealment for a multi-channel arrangement
The invention relates to a method for the concealment of dropouts in one or more channels of a multi-channel arrangement comprising at least two channels, wherein a replacement signal is generated in the event of a dropout in one channel with the aid of at least one error-free channel.
The wireless transmission of audio signals has constituted an important area of research since the introduction of the wireless microphone on the market at the beginning of the 1990s. At present, these products are used as standard equipment in the area of stage performances, concerts and live shows. In comparison to analog systems, the use of digital transmission links offers the advantage of transmitting metadata in addition to the audio data. This metadata can contain, for example, information about the overall concept of a stage installation. Furthermore, the notion of combining the individual channels and exploiting their interoperability in future systems can be realised by means of digital technologies. In addition, the fast development of the underlying hardware in terms of computing power and storage capacity supports the progress of software implementations.
In general, the wireless transmission of signals is not resistant to influences that can arise along the transmission link. In the case of digital radio links, disturbances directly lead to the loss of data and hence to a total signal dropout. The degradation of the signal quality, acoustically perceptible as crackles or clicks, is unacceptable in any case and must be compensated for using appropriate technologies that are incorporated at the receiver side. Since the concealment unit represents an active element in the signal path, the impact of its inherent processing delay must be taken into consideration.
A general classification of error concealment technologies for audio and video transmissions in real time is offered by Wah B. W., Su X., and Lin D.: "A Survey of Error Concealment Schemes for Real-Time Audio and Video Transmission over the Internet"; Proc. IEEE Int. Symposium on Multimedia Software Engineering, Dec. 2000. Here, the dependence on the source coding constitutes a fundamental distinguishing characteristic by which a distinction is made between transmitter-controlled and receiver-based technologies. The method according to the invention belongs to the category "receiver-based method", i.e. it works completely decoupled from the transmitter or source coding and is therefore not affected by the additional latency inherent to transmitter-controlled technologies.
The simplest methods for the receiver-based concealment of dropouts are represented by the so-called intra-channel concealment techniques, in which each channel of a multi-channel arrangement is treated separately. Standard concealment methods apply substitution and prediction algorithms. The latter generally comprise two stages, the analysis unit and the re-synthesis model of the linear prediction error filter. The first stage serves for estimating the filter coefficients and is executed continuously during error-free signal transmission. If a dropout occurs, the lost signal samples are reconstructed by the filtering process. This corresponds to an extrapolation and is suited to the concealment of dropouts of a few milliseconds in general broadband audio signals. In some cases, in which the real-time constraint is not as stringent (for example, if the buffering of data is permissible), the extrapolation is transformed into an interpolation and longer dropouts can therefore be handled.
The expansion of one-channel systems to multi-channel systems, the so-called inter-channel concealment techniques, leads to the implementation of adaptive filters. Compared to linear prediction algorithms, the estimation of the filter coefficients is not related exclusively to the signal of the respective channel; rather, information from other parallel channels is also used for this purpose. The exploitation of the channel cross-correlations is deemed to improve the performance of the concealment method. However, the efficiency of the technique is characterised primarily by the convergence behavior of adaptive filters, which mainly depends on the stationarity of the input signals. Since, in general, broadband audio is highly non-stationary, the behavior of the adaptive filter will be quite poor. One possible implementation of this method is described in US 2005/0182996 A1 (and respective EP 1649452 A1), the entire disclosure of which is incorporated into this specification by virtue of reference.
A common feature of the abovementioned filter techniques is the processing in the time domain; some algorithms also offer an equivalent description in the frequency domain. Yet the aim of the transformation is to increase computing efficiency, whereas the characteristics of the time domain method are retained.
In the following, several concealment methods are described in brief, beginning with single-channel systems:
US 2006/0171373 A1 discloses a single-channel method for the concealment of data losses that makes use of a linear prediction estimate from the intact signal component immediately preceding the dropout. The prediction coefficients obtained by means of a spectral analysis filter are used to estimate a residual signal. A maximum repeatable range is determined for the residual signal over several stages. The spectral analysis of the transmitted signal merely serves for an improved detection of the periodicity, which leads to the classic signal repetition. This period is repeated and the all-pole filter of the linear prediction is applied to it. The residual signal emerges from preceding intact signal components that are filtered inversely with the currently calculated filter coefficients, yielding the estimated replacement signal. All computation required for signal reconstruction is performed in the time domain, which is characteristic of the suggested method and results in substantial processing delay. Hence, the method is unsuitable for real-time applications.
DE 19735675 C2 also discloses a single-channel concealment method. The algorithm incorporates a perceptually adapted subband decomposition based on psychoacoustic aspects. The notion of signal reconstruction is to maintain the spectral energy in each subband. If a dropout occurs, an estimation of the signal is obtained by a properly filtered noise signal. Large dropouts yield an unchanged "sound surface". The filter coefficients solely carry the energy information; thus, the preceding time samples are not incorporated.
EP 1 145 227 B1 discloses a single-channel concealment method for the transmission of coded audio signals in the context of the MPEG coding standard. Thus, the transmitted data comprise spectral coefficients rather than time samples. A perceptually adapted subband splitting is applied to the signal section preceding the dropout by combining several MDCT (modified discrete cosine transform) coefficients into one subband. Since a dropout affects certain subbands, these are transformed back into the time domain, and a narrow-band signal is predicted there. The estimated narrow-band signal is in turn MDCT-transformed and inserted into the MDCT stream transmitted in MPEG coding.
The article "Packet Loss Concealment for Audio Streaming Based on the GAPES Algorithm" by Ofir et al., AES 118th Convention, May 28-31, 2005, Barcelona, Spain, describes a single-channel method in the context of the MPEG coding standard and thus is also MDCT-based.
Since the properties of the MDCT prevent an adequate interpolation between successive MDCT blocks, an STFT (short-time Fourier transform) representation is computed directly from the MDCT representation. Interpolation results are obtained in the STFT domain; therefore, signal components succeeding the dropout are required, i.e. the method induces additional latency. The interpolation itself is carried out per DFT (discrete Fourier transform) bin by use of the GAPES (gapped-data amplitude and phase estimation) algorithm. After the interpolation, the STFT data are transformed back into MDCT data.
The single-channel systems described above essentially depend on past signal components, hence, the estimation of the replacement signal is based on the assumption of long-term stationary input signals. Although those methods that incorporate a spectral analysis apply the filter in the frequency domain, both the comparison with preceding samples and the prediction of the future samples occur exclusively in the time domain.
The article "Packet Loss Concealment for Multichannel Audio Using the Multiband Source/Filter Model" by Karadimou et al., 40th Annual Asilomar Conf. on Signals, Systems and Computers, Oct. 29 - Nov. 01, 2006, discloses a concealment method that relies on several channels. The transmission format is composed in such a manner that an actual audio channel is only transmitted in one single, so-called "source channel," whereas LSF (line spectral frequencies) vectors are transmitted in the remaining channels. The LSF vectors represent a (complex valued) spectral interpretation of a time signal and correspond exactly to the linear prediction coefficients. Thus, they contain all of the information on the phase relationships of the spectral envelope. In this method, dropout concealment is constrained to an error-prone "source channel". Dropouts can therefore only be handled in the LSF channels. The estimation of the LSF vectors is done by means of a Gaussian mixture model (GMM). In addition, the method incorporates subband decomposition, per band and channel prediction, and retransformation into linear prediction coefficients with appropriate filtering of the reference residuum. During computation of the replacement signal, i.e. of the LSF vectors, the entire signal information including the phase information is always transmitted. The different LSF vectors of the individual channels contain information about the characteristics of different microphones that are spaced apart from each other, and which simultaneously pick up a sound event, for example a concert. Hence, correlations between the individual LSF vectors are to be expected, and a so-called cross-channel estimation can be employed, i.e. if a dropout occurs in one LSF vector, parallel LSF vectors can be exploited.
For the substitution, a reference channel is established in advance and its LP residuum serves for the signal synthesis of all other channels (not only in the event of a dropout but also during normal operation). The fundamental assumption made is that there is a correlation between target and reference channel. However, this assumption is never verified and is definitely not true for many scenarios. All processing steps of the concealment procedure (subband filtering, LP analysis, LSF computation, synthesis filter) are implemented in the signal path, resulting in a considerable processing delay that has to be accepted, so that low latency cannot be achieved. Due to the subband technique, the computational complexity is high (the prediction is performed per subband and channel, and the all-pole filter is also implemented in each subband during resynthesis).
Another publication that deals with multi-channel concealment is "Loss Concealment for Multi-Channel Streaming Audio" by Sinha et al., NOSSDAV03, June 1-3, 2003, Monterey, California, USA. The particular application of "distributed immersive musical performance" describes the implementation of a collaborative concert of spatially separated musicians by data transfer over the internet. A possibility for signal substitution is suggested therein that is based on the spatial proximity of the loudspeaker positions to each other in the multi-channel setup. In this method, a special type of interleaved packetized transmission is essential for the concealment.
The prior art for multi-channel systems is currently limited to different implementations of adaptive filters in the time domain or to transmitter-side channel interleaving with simple substitution rules as are typical in the upmix/downmix matrixing strategy suggested by Gerzon (M. Gerzon: "Hierarchical System of Surround Sound Transmission for HDTV," AES preprint 3339, 92nd Convention, March 24-27, 1992, Vienna; and M. Gerzon: "Problems of Upward and Downward Compatibility in Multichannel Stereo Systems," AES preprint 3404, 93rd Convention, Oct. 1-4, 1992, San Francisco). The efficiency of such technologies is either mostly restricted to their area of application (for example, pre-mixed multi-channel recordings) or is characterised predominantly by the convergence behavior of the adaptive filters, and thus is highly variable due to the non-stationary input signals in connection with the dropouts of the target signal.
The aim of the present invention consists in providing a concealment method that uses the intact channels of a multi-channel system to replace the lost signal in such a way that the difference between the original signal and its replacement is rendered inaudible. In addition to the reliability of the transmission, the usability in delay-critical real-time systems constitutes an important criterion, for which reason ultra-low latency techniques are in demand for the processing of signals.
According to the invention, this objective is achieved with a method mentioned at the outset, in that during the error-free signal transmission of the channels a mapping takes place of the transmitted signals into the frequency domain, the absolute value of the frequency spectrum being determined, that spectral filter coefficients are calculated that relate the magnitude spectrum of a channel to the magnitude spectrum of at least one other channel, and that in the event of the dropout of one channel the replacement signal is generated by computation of filter coefficients prior to the dropout and application of them to a substitution signal which consists of at least one error-free channel.
The concealment filter is calculated using the magnitude spectra, i.e. without regard to the phase information, which provides a more stable filter and hence an improved quality of the replacement signal. A significant advantage compared to single-channel methods currently in use also lies in the utilisation of the interoperability between the individual signals.
As an extension of the basic method, a modified treatment of the phase information is proposed. In so doing, the constancy of the phase transition at the beginning and at the end of the dropout is improved by taking into account the average time delay between target and replacement signal. A time delay between the respective channels, independent of their source direction, emerges according to the spatial arrangement of the multi-channel recording system.
In the following, the invention is described in more detail on the basis of the drawings. Fig. 1 shows a schematic representation of the transmission chain according to the invention, and
Fig. 2 shows a detailed block diagram of the dropout concealment of the invention for a two-channel system, and
Fig. 3 shows a block diagram of a multi-channel arrangement of, for example eight channels, and
Fig. 4 shows a flowchart of the entire invention, consisting of the estimation of the spectral filter, the determination of the time delay between the channels, as well as the weighted superposition of all channels in order to generate the substitution signal, and
Fig. 5 shows the layout of the device according to the invention for dropout concealment that is to be integrated into each channel of the multi-channel arrangement.
The preferred area of application of the present invention is within the overall system of a multi-channel (optionally wireless) transmission of digital audio data. The entire structure of a transmission chain is depicted in Fig. 1 and typically comprises the following stages for one channel: signal source 1, e.g. a sensor for recording signals (microphone), analog-digital converter 2 (ADC), optional signal compression and coding on the transmitter side, transmitter 3, transmission channel, receiver 4, concealment module 5. At the output of the concealment module 5, the audio signal is available in digital form; further signal processing units can be connected directly, for example a pre-amp, equalizer, etc.
The proposed concealment method is independent of the transmitter/receiver unit as well as the source coding and acts solely on the receiver side (receiver-based technique). It can therefore be integrated flexibly as an independent module into any transmission path. In some transmission systems (e.g. digital audio streaming), different concealment strategies are implemented simultaneously. While the application shown in Fig. 1 does not provide for any further concealment units, a combination with alternative technologies is possible.
The following application scenarios are provided by way of example:
a) In concert events and stage installations, multi-channel arrangements range from stereo recordings to different variations of surround recordings (e.g. OCT Surround, Decca Tree, Hamasaki Square, etc.) potentially supported by different forms of spot microphones. Especially with main microphone setups, the signals of the individual channels are composed of similar components whose particular composition is often quite non-stationary. For example, a dropout in one main microphone channel can be concealed according to the present invention introducing little or no latency.
b) Multi-channel audio transmission in studios proceeds at different physical layers (e.g. optical fiber waveguides, AES-EBU, CAT5), and dropouts can occur for various reasons, for example due to loss of synchronization, which must be prevented or concealed especially in critical applications such as, for example, the transmission operations of a radio station. Here, too, the concealment method according to the invention can be used as a safety unit with a low processing latency.
c) While audio transmission in the internet is less delay-sensitive than the abovementioned areas, transmission errors occur more frequently, resulting in an increased degradation of the perceptual audio quality. The inventive concealment method offers an improvement of the quality of service.
d) The method according to the invention can also be used in the framework of a spatially distributed, immersive musical performance, i.e. in the implementation of a collaborative concert of musicians that are separated spatially from each other. In this case, the ultra-low latency processing strategy of the proposed algorithm benefits the system's overall delay.
The invention is not restricted to the following embodiment. It is merely intended to explain the inventive principle and to illustrate one possible implementation. In the following, the dropout concealment method is described for one channel afflicted with dropouts. If transmission errors occur in more than one channel of the multi-channel arrangement, the system can easily be expanded.
The following terminology is used in the description: The channel afflicted with dropouts is defined as target channel or signal. The replica (estimation) of this signal that is to be generated during dropout periods is referred to as replacement signal. At least one substitution channel is required for the computation of the replacement signal. The proposed algorithm is composed of two parts. Computations of the first part are carried out permanently, whereas the second part is only activated in the case of a dropout in the target channel. During error-free transmission, the coefficients of a linear-phase FIR (finite impulse response) filter of length $L_{Filter}$ are permanently being estimated in the frequency domain. The required information is provided by the optionally non-linearly distorted and optionally time-averaged short-term magnitude spectra of the target and substitution channel. This new type of filter computation disregards any phase information and thus differs fundamentally from the correlation-dependent adaptive filters.
Selection of the substitution channel or substitution channels
Fig. 2 shows a block diagram of the multi-channel dropout concealment method for a target signal $x_z$ and a substitution signal $x_s$. The individual steps of the method are each indicated by a box containing a reference symbol and are listed in the following table:
6 Transformation into a spectral representation
7 Determination of the envelope of the magnitude spectra
8 Non-linear distortion (optional)
9 Time-averaging (optional)
10 Calculation of the filter coefficients
11 Time-averaging of the filter coefficients (optional)
12 Transformation into the time domain with windowing
13 Transformation into the frequency domain (optional)
14 Filtering of the substitution signal respectively in time or frequency domain
15 Estimation of the complex coherence function or GXPSD
16 Time-averaging (optional)
17 Estimation of the GCC and maximum detection in the time domain
18 Determination of the time delay Δτ
19 Implementation of the time delay Δτ (optional)
In this example, the transition between target and replacement signal is indicated by a switch 20. A detailed explanation of the individual steps of the method is given in the following description.
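Step 12 of the table (transformation into the time domain with windowing) can be illustrated by a minimal sketch that turns a real-valued magnitude response into the taps of a linear-phase FIR filter. The function name `linear_phase_fir`, the Hann window and the restriction to an odd filter length are assumptions made for illustration only; the specification does not prescribe a particular window or tap count.

```python
import math

def linear_phase_fir(half_spectrum, length):
    """Turn a real magnitude response (bins 0..N/2) into `length` linear-phase FIR taps."""
    if length % 2 == 0:
        raise ValueError("length must be odd for a symmetric, integer-delay filter")
    # Mirror the half spectrum to a full, even-symmetric spectrum of N bins.
    full = list(half_spectrum) + list(half_spectrum[-2:0:-1])
    n_fft = len(full)
    if length > n_fft:
        raise ValueError("length must not exceed the DFT size")
    # The inverse DFT of a real, even spectrum is real and even (zero phase),
    # so only the cosine terms contribute.
    zero_phase = [
        sum(full[k] * math.cos(2 * math.pi * k * n / n_fft) for k in range(n_fft)) / n_fft
        for n in range(n_fft)
    ]
    half = length // 2
    taps = []
    for n in range(length):
        # Symmetric Hann window (assumed) that peaks at the centre tap.
        window = 0.5 - 0.5 * math.cos(2 * math.pi * (n + 1) / (length + 1))
        # A circular shift by `half` samples makes the filter causal with linear phase.
        taps.append(window * zero_phase[(n - half) % n_fft])
    return taps

# A flat magnitude response yields a pure delay of (length - 1) / 2 samples.
flat = linear_phase_fir([1.0] * 17, 9)
# A sloped magnitude response yields symmetric taps, i.e. linear phase.
sloped = linear_phase_fir([1.0 / (1 + k) for k in range(17)], 9)
```

Because the taps are symmetric about the centre, the filter delays all frequency components by the same (length - 1) / 2 samples, which is the property that makes the magnitude-only design usable without any phase estimate.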
The correct selection of a substitution channel depends on the similarity between the substitution and target signal. This correlation can be determined by estimating the cross-correlation or coherence. (See explanations on coherence and on generalized cross-power spectral density (GXPSD) at the end of the specification.) According to the invention, the GXPSD is proposed as potential selection strategy. The complex coherence function $\Gamma_{zs,j}(k)$ is used as particular example in embodiments 1. to 9. (A total of $K$ channels are observed, the channel $x_0(n)$ being designated as the target channel $x_z(n)$.):
1. For the target channel $x_z(n)$, the $J$-th channel is defined as a substitution signal by the optionally time-averaged coherence function $\Gamma_{zs,j}(k)$ between the channels $x_j(n)$, with $1 \le j \le K-1$, and the target channel $x_z(n) = x_0(n)$, whose frequency-averaged value of the complex coherence function, $\chi(j) = \frac{1}{N} \sum_{k} \Gamma_{zs,j}(k)$, has a maximum value according to: $J = \arg\max_j \chi(j)$.
2. Alternatively, a fixed allocation can be established between the channels in advance if the user (e.g. a sound engineer) knows the characteristics of the individual channels (according to the selected recording method) and hence their joint signal information.
3. Likewise, several channels can be summed to one substitution channel, optionally in a weighted manner. This weighted combination can be set up by the user a priori.
4. In an alternative realisation, the superposition of several channels to one substitution channel is carried out on the basis of broadband coherence ratios to the target channel by: $x_s(n) = \frac{\sum_j \chi(j)\, x_j(n - \Delta\tau_j)}{\sum_j \chi(j)}$, for all $do(j) = \text{false}$. Herein, $x_s(n)$ denotes the substitution channel composed of the channels $x_j(n - \Delta\tau_j)$, and $\chi(j)$ represents the frequency-averaged coherence function between the target channel $x_z(n)$ and the corresponding channel $x_j(n - \Delta\tau_j)$. The time delay between the selected channel pairs is considered by $\Delta\tau_j$ (cf. section "Estimation of the time delay between target and substitution channel"). The validity of the potential signals is verified incorporating the status bit $do(j)$.
5. A simplification of 4. is proposed that considers a pre-selected set of channels $J$ rather than all available channels $j$. The weighted sum is built using $\chi(j)$, $j \in J$. The pre-selection is intended to yield channels whose frequency-averaged coherence function exceeds a prescribed threshold $\Theta$:
$J = \{\, j \mid (1 \le j \le K-1) \wedge (\chi(j) > \Theta) \,\}$.
6. Furthermore, a maximum number of M channels (with preferably M = 2...5) can be established as a criterion, according to:
7. A joint implementation of both constraints 5. and 6. is also possible:
8. Alternatively, the selection can be carried out separately for different frequency bands, i.e. in each band the "optimal" substitution channel is determined on the basis of the coherence function, the respective band pass signals are filtered using the method according to the invention, optionally in a time-delayed manner (c.f. "Estimation of the time delay between target and substitution channel"), superposed and used as a replacement signal. In so doing, the same criteria apply as in 1., 4., 5.,
instead of the frequency-averaged function $\chi(j)$.
9. Several substitution channels can also be selected. In this case, the processing is carried out separately for each channel, i.e. several replacement signals are generated. These are weighted according to their coherence function, combined and inserted into the dropout.
Generally, the functions used in 1. to 9. are time-varying, thus a mathematically exact notation must consider the time dependency by a (block) index m . To simplify the formulations, m has been omitted.
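The selection strategies 1., 4., 5. and 6. can be sketched as follows. The sketch rests on several assumptions that are not taken from the specification: the coherence is estimated with the textbook definition (cross-spectrum divided by the root of the product of the auto-spectra, averaged over blocks), the frequency average is taken over the magnitude of the coherence, the weighted sum is normalised by the sum of the weights, and the time delays between the channels are ignored. All function and variable names are hypothetical.

```python
import cmath
import math
import random

def dft(block):
    """One-sided DFT of a real block (bins 0..N/2)."""
    n = len(block)
    return [sum(block[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def mean_coherence(target, candidate, block_len=32):
    """Frequency average of the coherence magnitude, spectra averaged over blocks."""
    bins = block_len // 2 + 1
    s_zz, s_ss, s_zs = [0.0] * bins, [0.0] * bins, [0j] * bins
    for start in range(0, len(target) - block_len + 1, block_len):
        z = dft(target[start:start + block_len])
        s = dft(candidate[start:start + block_len])
        for k in range(bins):
            s_zz[k] += abs(z[k]) ** 2
            s_ss[k] += abs(s[k]) ** 2
            s_zs[k] += z[k] * s[k].conjugate()
    gamma = [abs(s_zs[k]) / math.sqrt(s_zz[k] * s_ss[k] + 1e-12) for k in range(bins)]
    return sum(gamma) / bins

def select_channels(chi, dropped, threshold=0.0, max_channels=None):
    """Intact channels whose chi exceeds the threshold, best first, at most max_channels."""
    ranked = sorted((j for j in range(len(chi)) if not dropped[j] and chi[j] > threshold),
                    key=lambda j: chi[j], reverse=True)
    return ranked[:max_channels]

def weighted_substitution(channels, chi, selected):
    """Coherence-weighted sum of the selected channels (normalisation is an assumption)."""
    if not selected:
        raise ValueError("no intact channel passed the selection")
    total = sum(chi[j] for j in selected)
    length = len(channels[selected[0]])
    return [sum(chi[j] * channels[j][n] for j in selected) / total for n in range(length)]

rng = random.Random(0)
target = [rng.uniform(-1, 1) for _ in range(256)]
channels = [[0.5 * x for x in target],                    # strongly related channel
            [rng.uniform(-1, 1) for _ in range(256)],     # unrelated channel
            [0.8 * x for x in target]]                    # related, but itself in dropout
dropped = [False, False, True]
chi = [mean_coherence(target, ch) for ch in channels]
selected = select_channels(chi, dropped, threshold=0.8, max_channels=2)
substitute = weighted_substitution(channels, chi, selected)
```

In this toy setup only the first channel survives the selection: the second falls below the threshold and the third is excluded by its dropout flag, which mirrors the role of the status bit in item 4.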
Calculation during error-free transmission
The computation during error-free transmission is performed in the frequency domain; thus, in a first step an appropriate short-term transformation is necessary, resulting in a block-oriented algorithm that requires a buffering of target and substitution signal. Preferably, the block size should be aligned to the coding format. The estimation of the envelopes of the magnitude spectra of target and substitution signal is used to determine the magnitude response of the concealment filter. The exact narrow-band magnitude spectra of the two signals are not relevant; rather, broad-band approximations are sufficient, optionally time-averaged and/or non-linearly distorted by a logarithmic or power function. The estimation of the spectral envelopes can be implemented in various ways. The computationally most efficient possibility is the short-term DFT with short block length, i.e. with low spectral resolution. A signal block is multiplied by a window function (e.g. Hanning), subjected to the DFT, and the magnitude of the short-term DFT is optionally distorted non-linearly and subsequently time-averaged.
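The short-term DFT variant described above can be sketched in a few lines. The block length of 32 samples, the periodic form of the Hann window and the name `short_term_magnitude` are assumptions chosen for illustration; a practical implementation would use an FFT rather than the direct DFT sum shown here.

```python
import cmath
import math

def short_term_magnitude(block):
    """Hann-windowed short-term DFT magnitude of one real signal block (bins 0..N/2)."""
    n = len(block)
    # Multiply the block by a (periodic) Hann window.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * t / n)) for t, x in enumerate(block)]
    # Direct DFT sum; the magnitude discards the phase information.
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

# Test block: a cosine located exactly on bin 4 of a 32-point DFT.
block = [math.cos(2 * math.pi * 4 * t / 32) for t in range(32)]
spectrum = short_term_magnitude(block)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

With a short block the window spreads each component over its neighbouring bins, so the result is already a coarse, envelope-like approximation of the magnitude spectrum, which is all the filter estimation requires.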
Further implementations:
o Wavelet transformation (as described in Daubechies I.; "Ten Lectures on Wavelets"; Society for Industrial and Applied Mathematics; Capital City Press, ISBN 0-89871-274-2, 1992. The entire disclosure of this printed publication is incorporated into this specification by virtue of reference) with optional subsequent time-averaging of the optionally non-linearly distorted absolute values of the wavelet transformation.
o Gammatone filter bank (as described in Irino T., Patterson R.D.; "A compressive gammachirp auditory filter for both physiological and psychophysical data"; J. Acoust. Soc. Am., Vol. 109, pp. 2008-2022, 2001. The entire disclosure of this printed publication is incorporated into this specification by virtue of reference) with subsequent formation of the signal envelopes of the individual subbands, optionally followed by a non-linear distortion.
o Linear prediction (as described in Haykin S.; "Adaptive Filter Theory"; Prentice Hall Inc.; Englewood Cliffs; ISBN 0-13-048434-2, 2002. The entire disclosure of this printed publication is incorporated into this specification by virtue of reference) with subsequent sampling of the magnitude of the spectral envelopes of the signal block, represented by the synthesis filter, optionally followed by a non-linear distortion and, subsequent to this, time-averaging.
o Estimation of the real cepstrum (as described in Deller J.R., Hansen J.H.L., Proakis J.G.; "Discrete-Time Processing of Speech Signals"; IEEE Press; ISBN 0-7803-5386-2, 2000. The entire disclosure of this printed publication is incorporated into this specification by virtue of reference) followed by a retransformation of the cepstrum domain into the frequency domain and taking the antilogarithm, optionally followed by a non-linear distortion of the envelopes of the magnitude spectra so obtained and, subsequent to this, time-averaging.
o Short-term DFT with maximum detection and interpolation: Here, the maxima are detected in the magnitude spectrum of the short-term DFT and the envelope between neighboring maxima is calculated by means of linear or non-linear interpolation, optionally followed by a non-linear distortion of the envelopes of the magnitude spectra so obtained and, subsequent to this, time-averaging.
For the optionally used time-averaging of the envelopes, an exponential smoothing of the optionally non-linearly distorted magnitude spectra can be used, as represented in equations (1) with time constant α for the exponential smoothing. Alternatively, the time-averaging can be formed by a moving average filter. The non-linear distortion can, for example, be carried out by means of a power function with arbitrary exponents which, in addition, can be selected differently for the target and substitution channel, as depicted in equations (1) by the exponents γ and δ. (Alternatively, a logarithmic function can also be used.) The non-linear distortion offers the advantage of weighting time periods with high or low signal energy differently along the time-varying progression of each frequency component. The different weighting affects the results of time-averaging within the respective frequency component. Accordingly, exponents γ and δ greater than 1 denote an expansion, i.e. peaks along the signal progression dominate the result of the time-averaging, whereas exponents less than 1 signify a compression, i.e. they enhance periods with low signal energy. The optimal selection of the exponent values depends on the sound material to be expected.
where
$|S_z|$, $|S_s|$: envelopes of the magnitude spectra of target and substitution channel,
$\bar{S}_z$, $\bar{S}_s$: time-averaged versions of $|S_z|$ and $|S_s|$,
$\alpha$: time constant of the exponential smoothing, $0 < \alpha < 1$,
$\gamma$, $\delta$: exponents of the non-linear distortion of $|S_z|$ and $|S_s|$, with a preferable value range of $0.5 < \gamma, \delta \le 2$,
$m$: block index.
As an example, equations (1) constitute a special case for the calculation of the spectral envelopes of target and substitution channel with exponential smoothing and arbitrary distortion exponents. In the following, the exponents are set to γ = δ = 1 to simplify formulations (i.e. a non-linear distortion is not explicitly indicated). However, the invention comprises the method with any time-averaging methods and any non-linear distortions of the envelopes of the magnitude spectra and hence any values for the exponents γ and δ. Furthermore, the use of a logarithmic function instead of the power function is included as well. To simplify notation, the block index m is omitted, though all magnitude values such as $\bar{S}_z$, $\bar{S}_s$ or $H$ are considered to be time-variant and therefore a function of block index m.
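The optional non-linear distortion and exponential smoothing can be illustrated by the following minimal sketch. It assumes the common first-order recursion "new average = α · old average + (1 − α) · distorted magnitude"; the exact form and the weighting convention of equations (1) may differ. The function name `smooth_envelope` is hypothetical.

```python
def smooth_envelope(previous, magnitude, alpha, exponent=1.0):
    """One smoothing step per frequency bin.

    previous  -- time-averaged envelope of the preceding block
    magnitude -- magnitude envelope of the current block
    alpha     -- smoothing constant, 0 < alpha < 1 (assumed to weight the past)
    exponent  -- power-law distortion (gamma or delta in the text)
    """
    return [alpha * p + (1.0 - alpha) * m ** exponent
            for p, m in zip(previous, magnitude)]

# Feeding a constant magnitude lets the average settle at the distorted value.
state = [0.0, 0.0]
for _ in range(200):
    state = smooth_envelope(state, [2.0, 0.5], alpha=0.9, exponent=2.0)

# A single step with alpha = 0.5 averages the old state and the distorted input.
single_step = smooth_envelope([0.0], [2.0], alpha=0.5, exponent=2.0)
```

With an exponent of 2 the bin with magnitude 2.0 settles at 4.0 while the bin with magnitude 0.5 settles at 0.25, which illustrates the expansion described above: strong components gain weight relative to weak ones.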
Calculation of the concealment filter
In standard adaptive systems, concealment filters are calculated by minimizing the mean square error between the target signal and its estimation. The difference signal is given by $e(n) = x_z(n) - \hat{x}_z(n)$. In contrast, the present invention examines the error of the estimated magnitude spectra:
$E(k) = \bar{S}_z(k) - \hat{S}_z(k) = \bar{S}_z(k) - H(k)\,\bar{S}_s(k)$ (2)
$E(k)$ corresponds to the difference between the envelope of the magnitude spectrum of the optionally non-linearly distorted, optionally smoothed target signal and its estimation. The optimization problem is observed separately for each frequency component $k$. The simplest realisation of the spectral filter $H(k)$ would be determined by the two envelopes, with
$H(k) = \bar{S}_z(k) / \bar{S}_s(k)$ (3)
Alternatively, a constraint of $H(k)$ is suggested through the introduction of a regularisation parameter. The underlying intention is to prevent the filter amplification from rising disproportionately if the signal power of $\bar{S}_s$ is too weak, in which case background noise becomes audible or the system becomes perceptibly unstable. If, for example, the spectral peaks of one time-block of $\bar{S}_z$ and $\bar{S}_s$ are not located in exactly the same frequency band, $H(k)$ will rise excessively in those bands in which $\bar{S}_z$ has a maximum and $\bar{S}_s$ has a minimum. To avoid this problem, a constraint for $H(k)$ is established through the frequency-dependent regularisation parameter $\beta(k)$, yielding
With a positive real-valued $\beta(k)$, the filter amplification will not increase excessively even for small values of $\bar{S}_S$, which prevents undesired signal peaks. The optimal values for $\beta(k)$ depend on the signal statistics to be expected; according to the invention, a computation based on an estimate of the background noise power per frequency band is proposed. The background noise power $P_g(k)$ can be estimated by means of time-averaged minimum statistics. The regularisation parameter $\beta(k)$ is proportional to the rms value of the background noise power, according to $\beta(k) = c \cdot \sqrt{P_g(k)}$, with $c$ typically between 1 and 5.
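The effect of the regularisation can be illustrated with a minimal sketch. It assumes that $\beta(k)$ enters additively in the denominator of the envelope ratio and that the rms value is the square root of the noise power; both are assumptions made for illustration, and the function name and the default value of `c` are hypothetical.

```python
import math

def regularized_filter(env_target, env_subst, noise_power, c=2.0):
    """Per-bin gain H(k) from two smoothed magnitude envelopes.

    Assumption (not taken from the text): the regularisation term is added
    to the denominator, H(k) = S_Z(k) / (S_S(k) + beta(k)), with
    beta(k) = c * sqrt(P_g(k)).
    """
    gains = []
    for s_z, s_s, p_g in zip(env_target, env_subst, noise_power):
        beta = c * math.sqrt(p_g)  # regularisation from the noise-power estimate
        gains.append(s_z / (s_s + beta))
    return gains

# Bin 1 has a very weak substitution envelope; the plain ratio would be 1000.
env_z = [1.0, 1.0]
env_s = [1.0, 0.001]
noise = [0.01, 0.01]
H = regularized_filter(env_z, env_s, noise, c=2.0)
```

In this example the gain of the weak bin stays below 5 instead of reaching 1000, at the cost of a slightly reduced gain in the well-conditioned bin.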
An alternative implementation of $H$ is proposed specifically for quasi-stationary input signals. The envelopes of the magnitude spectra are first estimated without time-averaging and without the optional non-linear distortion. Both modifications are instead taken into account during the determination of the filter coefficients, according to equation (5).
In equation (5), both the block index $m$ and the frequency index $k$ are indicated, since the computation depends on both indices simultaneously in this case. The parameters $\alpha$ and $\gamma$ determine the behavior of the time-averaging and of the non-linear distortion, respectively.
Calculations in the event of dropouts in a target signal
The possibilities for detecting a dropout are numerous and well known in the prior art. For example, a status bit can be transmitted at a reserved position within the respective audio stream (e.g. between audio data frames) and continuously monitored at the receiver side. It would also be conceivable to perform an energy analysis of the individual frames and to identify a dropout if the frame energy falls below a certain threshold. A dropout could also be detected through synchronization between transmitter and receiver.
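The energy-based variant can be sketched as follows. This is only an illustration of the idea; the function names and the threshold value are hypothetical and would have to be matched to the signal levels of a real system.

```python
def frame_energy(frame):
    """Mean squared sample value of one audio frame."""
    return sum(x * x for x in frame) / len(frame)

def is_dropout(frame, threshold=1e-6):
    """Flag a frame as a dropout if its energy falls below the threshold.

    The threshold is a hypothetical value chosen for this sketch.
    """
    return frame_energy(frame) < threshold

silent_frame = [0.0] * 64
active_frame = [0.1, -0.1] * 32
```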
If a dropout is detected in the target signal (e.g. as represented in Fig. 2 by a status bit "dropout y/n"; the dotted line denotes the status bit that is actually transmitted contiguously with the audio signal), the replacement signal must be generated using the most recently estimated filter coefficients and the substitution channel(s), and is fed directly to the output of the concealment unit. During a dropout, the estimation of the filter coefficients is deactivated. In principle, the transition between target and replacement signal can be implemented by a switch, provided that any switching artefacts remain inaudible. According to the invention, a cross-fade between the signals is proposed as being advantageous, but this requires buffering of the target signal and hence induces additional latency. In particularly delay-critical real-time systems that do not allow for any additional buffering, a cross-fade is not readily possible. In this case, an extrapolation of the target signal is proposed, for example by means of linear prediction. The cross-fade is then carried out between the extrapolated target signal and the replacement signal, using the method according to the invention.
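A cross-fade of this kind can be sketched as below. The linear ramp and the function name are assumptions made for illustration; the text does not prescribe a particular fade shape. The first argument may be either the buffered target signal or its extrapolation.

```python
def crossfade(fade_out, fade_in):
    """Linear cross-fade from fade_out to fade_in over equally long segments."""
    n = len(fade_out)
    if n != len(fade_in):
        raise ValueError("both segments must have the same length")
    out = []
    for i in range(n):
        g = (i + 1) / (n + 1)  # weight of fade_in, strictly between 0 and 1
        out.append((1.0 - g) * fade_out[i] + g * fade_in[i])
    return out

# Fading from a constant 1.0 (target) to a constant 0.0 (replacement).
faded = crossfade([1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
```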
The replacement signal is finally generated by filtering the substitution signal with the filter coefficients retransformed into the time domain. The inverse transformation of the filter coefficients, $T^{-1}\{H\}$, should be carried out with the same method as the first transformation.
Prior to the filtering, the filter impulse response is optionally time-limited by a windowing function $w(n)$ (e.g. rectangular, Hanning).
$$h(n) = T^{-1}\{H(k)\} \quad \text{or} \quad h_w(n) = w(n)\,T^{-1}\{H(k)\} \tag{6}$$
The impulse response $h(n)$ or $h_w(n)$, respectively, needs to be calculated only once at the beginning of the dropout, since the continuous estimation of the filter coefficients is deactivated during the dropout. For the sample-wise determination of the replacement signal $\hat{x}_Z$, an appropriate vector $\mathbf{x}_S$ of the substitution signal is necessary,
$$\hat{x}_Z(n) = \mathbf{h}^{T}\,\mathbf{x}_S(n) \quad \text{or} \quad \hat{x}_Z(n) = \mathbf{h}_w^{T}\,\mathbf{x}_S(n) \tag{7}$$
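The time-domain path (inverse transform, optional windowing, sample-wise filtering) can be sketched as follows. The sketch assumes a real-valued gain per DFT bin, a naive inverse DFT in place of a fast transform, and a circular shift by half the block length to obtain a causal impulse response; the shift and all function names are assumptions made for illustration.

```python
import cmath
import math

def inverse_dft_real(spectrum):
    """Real part of the inverse DFT (naive O(N^2) version for illustration)."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]

def hann_periodic(n):
    """Periodic Hann window; its maximum lies at index n // 2 for even n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def impulse_response(gains, window=None):
    """h(n) or h_w(n): inverse transform of H(k), shifted to be causal."""
    n = len(gains)
    h = inverse_dft_real(gains)
    h = h[n // 2:] + h[:n // 2]  # assumed shift: introduces a delay of n // 2 samples
    if window is not None:
        h = [w * c for w, c in zip(window, h)]
    return h

def replacement_sample(h, x_s, n):
    """Inner product of h with the vector [x_S(n), x_S(n-1), ...]."""
    return sum(h[i] * x_s[n - i] for i in range(len(h)) if n - i >= 0)

# A flat gain of 1 in all bins must reproduce the input, delayed by n // 2.
h_w = impulse_response([1.0] * 8, hann_periodic(8))
x_s = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x_hat = [replacement_sample(h_w, x_s, n) for n in range(len(x_s))]
```

With a flat gain the output is the substitution signal delayed by four samples, which illustrates the filter delay that has to be compensated when aligning target and replacement signal.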
In some applications, the filtering can be performed in the frequency domain. To this end, the coefficients, optionally windowed in the time domain, are transformed back into the frequency domain, so that the replacement signal of a block is computed by:
$$\hat{x}_Z(n) = T^{-1}\{H_w(k)\,X_S(k)\} \tag{8}$$
Successive blocks are combined using methods such as overlap and add or overlap and save. The replacement signal is continued beyond the end of the dropout to enable a cross-fade into the re-existing target signal.
Estimation of the time delay between target and substitution signal
In a particularly preferred embodiment of the present concealment method, the time alignment of target and replacement signal can be improved as well. For this purpose, a time delay is estimated in parallel with the spectral filter coefficients; it takes two components into account. On the one hand, the delay $\tau_1$ of the replacement signal resulting from the filtering process must be compensated for. On the other hand, a time delay $\tau_2$ between target and substitution channel arises from the spatial arrangement of the respective microphones. This can be estimated, for example, by means of the generalized cross-correlation (GCC), which requires the computation of complex short-term spectra. In a preferred implementation, the short-term DFT employed for the estimation of the concealment filter can be reused, avoiding additional computational complexity. (For more information about the characteristics of the GCC, see especially Carter, G. C.: "Coherence and Time Delay Estimation"; Proc. IEEE, Vol. 75, No. 2, Feb. 1987; and Omologo M., Svaizer P.: "Use of the Crosspower-Spectrum Phase in Acoustic Event Location"; IEEE Trans. on Speech and Audio Processing, Vol. 5, No. 3, May 1997. The entire disclosures are incorporated into this specification by virtue of reference.) The GCC is calculated using the inverse Fourier transform of the estimated generalized cross-power spectral density (GXPSD), which is defined by:
$$\Phi_{G,ZS}(k) = G(k)\,X_Z(k)\,X_S^{*}(k) \tag{9}$$
(again, in equations 9-12, the block index m is omitted.)
In equation (9), $X_Z(k)$ and $X_S(k)$ are the DFTs of a block of the target and substitution channel, respectively; $*$ denotes complex conjugation. $G(k)$ represents a pre-filter, the aim of which is explained in the following.
The time delay $\tau_2$ is determined by locating the index of the maximum of the cross-correlation. The detection of the maximum can be improved by approximating its shape to a delta function. The pre-filter $G(k)$ directly affects the shape of the GCC and thus enhances the estimation of $\tau_2$. A suitable realisation is the phase transform (PHAT) filter:
$$G_{\mathrm{PHAT}}(k) = \frac{1}{\left|X_Z(k)\,X_S^{*}(k)\right|} \tag{10}$$
This results in the GXPSD with PHAT filter:
$$\Phi_{G,ZS}(k) = \frac{X_Z(k)\,X_S^{*}(k)}{\left|X_Z(k)\,X_S^{*}(k)\right|} = \frac{\Phi_{ZS}(k)}{\left|\Phi_{ZS}(k)\right|} \tag{11}$$
where $\Phi_{ZS}$: cross-power spectral density of target and substitution signal.
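The delay estimation with the PHAT pre-filter can be sketched as follows. The sketch uses a naive DFT on a single block without time-averaging, treats the delay as circular within the block, and adds a small constant `eps` to avoid division by zero; the function names, `eps` and the test signal are assumptions made for illustration.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) DFT, sufficient for a short illustrative block."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def inverse_dft_real(spectrum):
    """Real part of the inverse DFT."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]

def gcc_phat_delay(target, subst, eps=1e-12):
    """Delay of target relative to subst, in samples (positive: target lags)."""
    x_z, x_s = dft(target), dft(subst)
    cross = [a * b.conjugate() for a, b in zip(x_z, x_s)]  # X_Z(k) X_S*(k)
    whitened = [c / (abs(c) + eps) for c in cross]         # PHAT weighting
    gcc = inverse_dft_real(whitened)
    n = len(gcc)
    peak = max(range(n), key=lambda i: gcc[i])
    return peak if peak <= n // 2 else peak - n            # map to signed lag

# Broadband test block; the target is the substitution block delayed by 3 samples.
subst = [math.sin(0.7 * t * t) for t in range(32)]
target = [subst[(t - 3) % 32] for t in range(32)]
delay = gcc_phat_delay(target, subst)
```

Because the PHAT weighting discards the magnitude and keeps only the phase, the correlation of this example collapses to a single sharp peak at the true lag.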
Another possibility is offered by the complex coherence function, whose pre-filter can be calculated from the power density spectra, yielding:
$$\Gamma_{ZS}(k) = \frac{\Phi_{ZS}(k)}{\sqrt{\Phi_{ZZ}(k)\,\Phi_{SS}(k)}} \tag{12}$$
where $\Phi_{ZZ}$: auto-power spectral density of the target signal, $\Phi_{SS}$: auto-power spectral density of the substitution signal.
The transformation of the signals into the frequency domain is usually implemented by means of the short-term DFT. On the one hand, the block length must be large enough to yield peaks in the GCC that are detectable for the expected time delays; on the other hand, excessive block lengths increase the storage capacity required. To adequately track variations of the time delay $\tau_2$, time-averaging of the GXPSD or of the complex coherence function is proposed (e.g. by exponential smoothing).
(13)
In equations (13) and (14), $m$ refers to the block index. The smoothing constants are designated $\mu$ and $\nu$. These must be adapted to the hop size of the short-term DFT and to the stationarity of $\tau_2$ in order to obtain the best possible estimation of the coherence function or the generalized cross-power spectral density, respectively.
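The time-averaging can be illustrated with a first-order recursion per frequency bin. The exact form of the recursion is an assumption of this sketch (weight `mu` on the previous estimate, `1 - mu` on the current block), and the function names are hypothetical.

```python
def smooth_cross_spectrum(previous, current, mu=0.9):
    """One block update: mu * previous + (1 - mu) * current, per frequency bin.

    Assumed recursion; a larger mu averages over more blocks and therefore
    reacts more slowly to changes of the time delay.
    """
    return [mu * p + (1.0 - mu) * c for p, c in zip(previous, current)]

def smooth_blocks(blocks, mu=0.9):
    """Run the recursion over a sequence of per-block spectra, starting at zero."""
    state = [0j] * len(blocks[0])
    for block in blocks:
        state = smooth_cross_spectrum(state, block, mu)
    return state

# A stationary cross-spectrum value is approached gradually over the blocks.
one_step = smooth_cross_spectrum([0j], [1 + 1j], mu=0.5)
settled = smooth_blocks([[1 + 1j]] * 50, mu=0.9)
```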
After the retransformation into the time domain and the detection of the maximum of the GCC, the overall time delay between target and replacement signal can be formulated as
$$\Delta\tau = \tau_2 - \tau_1 \tag{15}$$
The individual processing steps are summarized in a block diagram in Fig. 2 for one target and one substitution signal. The transition between target and replacement signal, or vice versa, is depicted as a simple switch in the figure; as already mentioned, a cross-fade of the signals is recommended.
The inventive notion of a multi-channel setup with more than two channels is depicted in Fig. 3. Depending on which channel is affected by dropouts and hence becomes the target channel, the substitution signal is generated from the remaining intact channels. The discrete blocks of Fig. 3 correspond to the following processing steps:
21 Selection of the substitution channel(s)
22 Calculation of the filter coefficients
23 Application of a time delay
24 Generation of a replacement signal
In the uppermost row of Fig. 3, a replacement signal is generated for channel 1, which is afflicted by dropouts. To achieve this, one, several, or all of the channels 2 to 7 can be used. The second row corresponds to the reconstruction of channel 2, etc.
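The first step of this scheme, collecting the intact channels that may serve as substitution channels for a given target channel, can be sketched as follows. Only the status information is evaluated here; any further selection criteria are outside this sketch, and the function name and the data layout are hypothetical.

```python
def select_substitution_channels(dropout, target):
    """Return the channel numbers that can substitute for the target channel.

    dropout maps a channel number to True if that channel currently
    suffers from a dropout (hypothetical data layout for this sketch).
    """
    return sorted(ch for ch, lost in dropout.items() if ch != target and not lost)

# Seven channels; channel 1 is the target and channel 4 is also disturbed.
status = {1: True, 2: False, 3: False, 4: True, 5: False, 6: False, 7: False}
candidates = select_substitution_channels(status, target=1)
```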
Fig. 4 shows a schematic of the basic algorithm in combination with the expansion stage (i.e. time delay estimation) to illustrate the mutual dependencies of the individual processing steps. To simplify the block diagram, parallel signals (DFT blocks) or (spectral) mappings derived therefrom are merged into one (solid) line, the number of which is indicated by $K$ or $K-1$, respectively. The dotted connections denote the transfer or input of parameters. The first selection of the substitution channels is made in the block labeled "selector" according to the GXPSD. On the one hand, this affects the computation of the envelopes of the magnitude spectra of the substitution signal and, on the other hand, it is needed for the weighted superposition of the same. The second selection criterion is provided by the time delay $\tau_2$. The status bits of the channels are not depicted explicitly, but their verification is taken into account in the relevant signal-processing blocks. Additionally, the specific determination of the target signal can be omitted from this illustration.
Hardware implementation
According to the invention, the method for dropout concealment works as an independent module and is intended for installation into a digital signal processing chain, wherein the software-specified algorithm is implemented on a commercially available digital signal processor (DSP), preferably a special DSP for audio applications. Accordingly, for each channel of a multi-channel arrangement, an appropriate device, such as the one depicted by way of example in Fig. 5, is necessary; it may preferably be integrated directly into the apparatus for receiving and decoding the transmitted digital audio data.
The apparatus for dropout concealment is equipped with a primary audio input that adopts the digital signal frames from the receiver unit and temporarily stores them in a storage unit 25. The apparatus is equipped with at least one secondary audio input, optionally several secondary audio inputs, at which the digital data of the substitution channel(s) are available and likewise stored temporarily in one, optionally several, storage unit(s) 25. In addition, the device features an interface for the transmission of control data such as the status bit of the signal frames (dropout y/n) or an information bit for the selection of the
substitution channel(s), the latter requiring (a) a bidirectional data line and (b) a temporary storage unit 25.
In order to forward the original or concealed data frames of the primary channel, the apparatus is equipped with an audio output. A separate storage unit for the data blocks to be output is not necessary, since they can be stored as needed in the storage unit of the input signal.