GB2571371A - Signal processing for speech dereverberation - Google Patents

Signal processing for speech dereverberation

Info

Publication number: GB2571371A
Application number: GB1809609.9A
Other versions: GB201809609D0 (en)
Authority: GB (United Kingdom)
Inventor: Tom Birchall
Original and current assignee: Cirrus Logic International Semiconductor Ltd
Legal status: Withdrawn
Related application: GB2016409.1A, published as GB2589972B
Prior art keywords: signal processing, processing circuit, determination unit, signal, reverberation

Classifications

    • G10L21/0208: Noise filtering (under G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation)
    • G10L2021/02082: Noise filtering in which the noise is echo or reverberation of the speech
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use (under G10L25/00)


Abstract

In a speech dereverberation system 100, a determination unit 130 selects a number of samples in an audio signal x(n,k) captured by microphone M according to control inputs A (e.g. the SNR of the acoustic space) and B (e.g. room impulse response, reverb energy decay, RT60 or threshold time Tth) and sends these via a delay (50, fig. 3) or variable buffer 160 to a reverberation coefficient determination unit 150, which calculates a regression vector of reverb coefficients gk via linear prediction or auto-regressive modelling. Each reverb component is then subtracted from the input signal to provide a desired signal dn,k (60, fig. 8). A longer threshold time means that more signal samples are required to calculate the reverb coefficients accurately (see equation 10), while a highly noisy signal (fig. 6B) will swamp echoes more quickly, leading to fewer samples (i.e. fewer blocks or frames) being passed on.

Description

Signal Processing for Speech Dereverberation
Technical Field
This application relates to techniques for speech dereverberation. In particular this application describes signal processing techniques for reducing the effects of reverberation when capturing speech signals in an acoustic environment.
Background
Sound waves that are emitted from a source travel in all directions. Sound that is captured by a microphone in a given space will therefore comprise sound waves that have travelled on a direct path to reach the microphone, as well as sound waves that have been reflected from surfaces of the walls and other obstacles in the space. The persistence of sound waves after the sound source stops, and as a consequence of reflections, is called reverberation.
It will be appreciated that reflected or reverberant sounds captured by a microphone will have travelled on a longer path compared to the direct path and will therefore arrive after sound waves which have travelled on the direct path and at an attenuated level due to power being absorbed by surfaces and the extra distance travelled through the air. Thus, sound signals that are captured by a microphone in a real-world environment will contain multiple delayed and attenuated copies of the signal obtained via the direct path. Reverberations can be considered to be correlated delayed reflections of the source signal.
Speech signals derived from sounds that are captured by a microphone are used for many purposes including voice communication, recording and playback.
Furthermore, applications which rely on voice control as a method of interacting with hardware and associated functionality are becoming more prevalent and many of these applications rely on Automatic Speech Recognition (ASR) techniques.
A typical ASR system configuration is illustrated in Figure 1. Firstly, acoustic features which characterise essential features present in an input speech signal are extracted by an extraction unit 22 from a time frame of the speech signal. Then, on the basis of these features, the most likely text is identified by a decoding unit 23. The decoding unit may use a model, stored in a model storage unit 24, which comprises the knowledge required to decode the features into phonemes. The model is typically trained on a set of acoustic features that are extracted from an undistorted speech signal. Therefore, if the input signal to the ASR system is corrupted by reverberant signals, the recognition performance of the system is degraded.
It is therefore known that reverberation can result in a degradation in the intelligibility of speech signals that are captured by an acoustic sensor such as a microphone. Further, whilst speech recognition systems may perform well in conditions where the source to microphone distance is relatively small, the performance of speech recognition tends to degrade as the distance increases. In the field of home automation for example, where a smart home device operable to receive and process speech commands is typically placed within an acoustic environment such as an indoor room at some distance (e.g. 0.5m to 6m) from a user, the need for dereverberation of audio signals detected by the microphone of the device is particularly apparent.
Mitigating the effects of reverberation is therefore an important consideration in any application which utilises speech signals, i.e. electric signals derived by an acoustic sensor in response to incident sounds which include speech. Reducing the effects of reverberation is therefore important for improving the quality of voice calls and also in the context of applications utilising speech recognition systems.
A number of approaches to dereverberation have been proposed. For example, inverse filtering methods have been considered which are based on the principle of obtaining an inverse filter for the room or space, which is the cause of the reverberation, and deconvolving the captured signal with the inverse filter in order to recover the direct signal component. It will be appreciated that if the room impulse response (RIR) which describes the linear relation between the source and the microphone is known, then the inverse filter of the RIR can accurately recover the source signal. In most speech applications, however, the RIR is not known and must be estimated. The problem of estimating the RIR is compounded by the fact that the acoustic properties of the environment are potentially changeable i.e. not fixed.
A number of so-called blind dereverberation methods have been proposed in which attempts are made to estimate the inverse filter without prior knowledge of the room impulse response. In particular, some previously proposed dereverberation techniques involve using a linear prediction based dereverberation algorithm to estimate reverberant coefficients, wherein reverberant components may be removed from the input signal based on the estimated coefficients. Those skilled in the art will understand that linear prediction refers to a mathematical operation in which future values of a discrete time signal are estimated as a linear function of previous samples.
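As a concrete illustration of this operation, the short sketch below fits prediction coefficients to past samples by least squares. The function name and the decaying-sinusoid test signal are illustrative assumptions only, not part of any disclosed system:

```python
import numpy as np

def lp_coefficients(x, order):
    """Estimate coefficients a such that x[n] ~ sum_i a[i] * x[n-1-i],
    fitted over the whole signal by least squares."""
    # Each row holds the `order` most recent past samples: [x[n-1], ..., x[n-order]].
    rows = [x[n - order:n][::-1] for n in range(order, len(x))]
    A = np.array(rows)
    b = x[order:]                      # the samples being predicted
    a, *_ = np.linalg.lstsq(A, b, rcond=None)
    return a

# A decaying sinusoid obeys an exact second-order recursion, so a
# two-tap predictor recovers it almost perfectly.
n = np.arange(200)
x = np.exp(-0.01 * n) * np.sin(0.3 * n)
a = lp_coefficients(x, order=2)
pred = a @ x[-2:][::-1]                # one-step prediction of the next sample
```

For this test signal the fitted taps match the analytic recursion of a damped sinusoid, a[0] close to 2e^(-0.01)cos(0.3) and a[1] close to -e^(-0.02), which is why linear prediction models decaying, correlated tails such as reverberation well.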
Details of previously proposed linear prediction based dereverberation methods are described, for example, in:
1) "Speech dereverberation based on variance-normalized delayed linear prediction", T. Nakatani et al., IEEE Trans. Audio, Speech and Language Processing, vol. 18, no. 7, pp. 1717-1731, Sept. 2010. In this document an approach for blind speech dereverberation based on multi-channel linear prediction (MCLP), i.e. a multichannel autoregressive model, has been proposed.
2) "Blind speech dereverberation with multi-channel linear prediction based on short time Fourier transform representation", T. Nakatani et al., Proc. International Conference on Acoustics, Speech and Signal Processing, Las Vegas, USA, May 2008, pp. 85-88. This paper describes an autoregressive generative model for the acoustic transfer functions and models the spectral coefficients of the desired clean speech signal using a Gaussian distribution. Dereverberation is then performed by maximum likelihood estimation of all unknown model parameters.
3) "Suppression of late reverberation effect on speech signal using long-term multiple-step linear prediction", IEEE Trans. Audio, Speech and Language Processing, vol. 17, no. 4, pp. 534-545, May 2009. In this paper a further delayed linear prediction method has been proposed.
It will be appreciated that in most real-world applications, for example in the context of a smart home device operable to receive and process speech commands, the level of background noise will vary over time. Unfortunately, despite improvements in the performance of dereverberation systems, previously considered techniques struggle to maintain good performance in noise. Furthermore, previously proposed dereverberation systems may experience issues such as speech suppression and distortion when subject to time-varying, noisy conditions. The low frequencies of speech are especially affected, as this is where the longest reverberation times occur and where the lowest signal to noise ratios (SNR) arise.
Aspects described herein are concerned with improving the quality of speech signals derived by a dereverberation system. In particular, aspects described herein are concerned with improving dereverberation performance in noisy environments or in environments which experience time varying noise levels.
According to an example of a first aspect there is provided a signal processing circuit of a speech dereverberation system, the signal processing circuit comprising: a reverberation coefficient determination unit configured to determine one or more reverberation coefficients of a portion of an input signal generated by an acoustic sensor provided in an acoustic space; and a determination unit operable to determine a number of past samples of the portion of the input signal to be passed to the reverberation coefficient determination unit, based on:
i) information about the background noise in the acoustic space; and ii) information about energy of reverberant sound in the acoustic space.
The information about background noise in the acoustic space may comprise information about the SNR or NSR. The information about the energy of the reverberant sound may comprise the decay in the energy of the reverberant sound in the acoustic space. The information about the energy of reverberant sound may be determined from a representation of the room impulse response (RIR) for the acoustic space. The representation of the RIR may be estimated.
According to at least one example the determination unit may be operable to determine a threshold time at which a level of the reverberant energy falls below a predetermined value relative to a respective level of the noise. Alternatively or additionally, the determination unit is operable to determine a threshold time at which a level of the energy of the decaying reverberant sound is substantially equal to a level of the NSR. The threshold time may be selected to be the time at which the ratio of the level of reverberant sound energy to the level of the NSR is at or above a predetermined value. The number of past samples input to the dereverberation coefficient determination unit may thus be calculated based on the threshold time. The determination unit may be beneficially configured to determine a number of samples that will maintain or achieve a positive reverberant sound energy to NSR level ratio.
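The threshold-time determination described above can be sketched numerically. Assuming, purely for illustration, that the reverberant energy decays linearly in dB at a rate of 60 dB per RT60 and that the noise floor sits SNR dB below the direct sound, the crossing time follows directly:

```python
def threshold_time(rt60_s, snr_db):
    """Time (in seconds) at which reverberant energy, decaying at
    60 dB per RT60, falls to the background-noise level sitting
    snr_db below the direct sound."""
    decay_db_per_s = 60.0 / rt60_s     # dB lost per second of decay
    return snr_db / decay_db_per_s

# RT60 = 0.6 s with a 20 dB SNR: the echoes drop below the noise
# floor after 0.2 s, so older samples no longer help the estimate.
t_th = threshold_time(0.6, 20.0)
```

A longer RT60 or a higher SNR both lengthen the threshold time, consistent with the statement above that the number of useful past samples grows with the reverberant-to-noise energy ratio.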
According to one or more examples the signal processing circuit further comprises a selection mechanism operable to select the number of samples of the input signal to be passed to the reverberation coefficient determination unit based on the number of samples determined by the determination unit. The selection mechanism may comprise an adjustable length buffer. Alternatively, the selection mechanism may be operable to cause adjustment of the number of samples that are processed by a correlation unit of the signal processing circuit.
According to an example of a second aspect there is provided a signal processing circuit comprising:
a determination unit operable to determine a number of samples of an input signal to be passed to a reverberation coefficient determination unit that will maintain or achieve a positive reverberant sound to noise ratio.
The signal processing circuit may further comprise a reverberation coefficient determination unit configured to determine one or more reverberation coefficients of a portion of an input signal generated by an acoustic sensor provided in an acoustic space.
According to one or more examples an inverse filter may be obtained from the reverberation coefficients determined by the reverberation coefficient determination unit. The inverse filter may be convolved with the portion of the input signal to obtain an estimate of the reverberant component of the portion. Furthermore, the estimate of the reverberant component of the portion may be subtracted from, or deconvolved with, the input signal to give a dereverberated signal dn,k. The reverberation coefficient determination unit may determine the reverberation coefficients based on a linear prediction algorithm.
According to one or more examples the signal processing circuit may further comprise a delay unit configured to apply a delay to the input signal.
According to one or more examples the signal processing circuit may further comprise a Fast Fourier Transform (FFT) unit operable to determine the amplitude of the input signal generated by the acoustic sensor in a plurality of frequency ranges, wherein the reverberation coefficient prediction unit is operable to determine the reverberant coefficients in one or more of the frequency ranges.
According to one or more examples the signal processing circuit may be provided in the form of a single integrated circuit.
A device may be provided comprising the signal processing circuit according to an example of one or more of the above aspects. The device may comprise, inter alia: a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller, a domestic appliance or a smart home device. The device may comprise an automatic speech recognition system. The device may comprise one or a plurality of microphones.
According to at least one example the signal processing circuit further comprises a beamformer configured to time-align signals from the plurality of microphones in a direction of incident speech sound.
According to an example of a further aspect there is provided a method of signal processing comprising:
determining a number of samples of a portion of an input signal generated by an acoustic sensor provided in an acoustic space based on:
i) information about the background noise in the acoustic space; and ii) information about energy of reverberant sound in the acoustic space.
The method may comprise estimating at least one reverberation coefficient of the portion of the input signal.
According to another aspect of the present invention, there is provided a computer program product, comprising a computer-readable tangible medium, and instructions for performing a method according to the previous aspect.
According to another aspect of the present invention, there is provided a non-transitory computer-readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to the previous aspect.
Brief Description of Drawings
For a better understanding of the present invention and to show how the same may be carried into effect, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 illustrates a typical ASR system configuration;
Figure 2 illustrates an acoustic space comprising a smart home device 10;
Figure 3 provides a simplified illustration of a dereverberation system;
Figure 4a illustrates the amplitude of a room impulse response (RIR) of an acoustic environment;
Figure 4b illustrates the decay in the energy of a room impulse response;
Figure 5 illustrates a first example of a dereverberation system;
Figures 6a and 6b each provide a graphical representation of the level of reverberant sound in a given acoustic space as well as the level of noise in the acoustic space;
Figure 7 is a flow diagram illustrating a processing method according to one example of the present aspects;
Figure 8 is a block diagram illustrating a processing system for carrying out the processing method illustrated in Figure 7;
Figure 9 is a flow diagram illustrating a processing method according to a further example of the present aspects; and
Figure 10 is a block diagram illustrating a processing system for carrying out the processing method illustrated in Figure 9.
Detailed Description
The description below sets forth examples according to the present disclosure. Further example embodiments and implementations will be apparent to those having ordinary skill in the art. Further, those having ordinary skill in the art will recognize that various equivalent techniques may be applied in lieu of, or in conjunction with, the examples discussed below, and all such equivalents should be deemed as being encompassed by the present disclosure.
The methods described herein can be implemented in a wide range of devices and systems. However, for ease of explanation of one example, an illustrative example will be described, in which the implementation occurs in a smart home device utilising automatic speech recognition.
Figure 2 illustrates an acoustic space comprising a smart home device 10 having a microphone M for detecting ambient sounds. The microphone is used for detecting the speech of a user and may be typically located at a distance of greater than 0.5m from the user. It will be appreciated that the smart home device may comprise multiple microphones, although this is not necessary for an understanding of the presently described example aspects.
Sound waves travel along a direct sound path D between a voice source V and the microphone M of the device. Sound waves also travel along a plurality of reverberant sound paths Ri...n, wherein the sound is reflected by the surface of a ceiling 11, or floor, of the acoustic space. It will be appreciated that numerous other reflected sound paths other than those illustrated will be set up following the emission of voice sound. The microphone will also detect background noise N arising within the space and the level of this noise may vary. It will be appreciated that noise is mostly additive and, in contrast to reverberation, is uncorrelated with speech.
The smart home device comprises circuitry for processing sound signals detected by the microphone. In particular, the smart home device 10 may comprise an Automatic Speech Recognition system such as the ASR system illustrated in Figure 1. The device further comprises a dereverberation system operable to facilitate dereverberation of audio signals detected by the microphone M. The dereverberation system may be provided at the front-end of the ASR system. Thus, an audio input signal that is derived from a microphone in response to incident sounds including speech, can be processed to derive a dereverberated signal (i.e. a signal in which one or more components of reverberation have been removed) which may be input to an ASR system.
Speech that is captured by a microphone is generally assumed to consist of three parts: a direct-path response, early reflections and late reverberation. Early reflections may be defined as the reflection components that arise after the direct-path response within a time interval of about 30-50 ms, and the late reverberation as all later reflections. It has been demonstrated that late reverberations are a major cause of the degradation of ASR performance and loss of speech intelligibility. In view of this, dereverberation systems may focus on estimating the late reverberation, in order to recover the anechoic signal (clean speech) together with the early reflections.
Considering this mathematically, and as set out in an article entitled "Speech dereverberation using weighted prediction error with Laplacian model of the desired signal" by A. Jukic and S. Doclo, we can consider a scenario where a single speech source in an enclosure is captured by M microphones.
Let $s_{n,k}$ denote the clean speech signal in the STFT domain, with time frame index $n \in \{1,\dots,N\}$ and frequency bin index $k \in \{1,\dots,K\}$. The reverberant speech signal observed at the m-th microphone, $m \in \{1,\dots,M\}$, is typically modelled in the STFT domain as:
$$x^m_{n,k} = \sum_{l=0}^{L_h-1} \left(h^m_{l,k}\right)^{*} s_{n-l,k} + e^m_{n,k} \qquad (2)$$
where $h^m_{l,k}$ models the acoustic transfer function (ATF) between the speech source and the m-th microphone in the STFT domain, the length of the ATF equals $L_h$, and $(\cdot)^{*}$ denotes the complex conjugate operator. The additive term $e^m_{n,k}$ jointly represents modelling errors and the additive noise signal. The convolutive model in (2) is often rewritten as
$$x^m_{n,k} = d^m_{n,k} + \sum_{l=D}^{L_h-1} \left(h^m_{l,k}\right)^{*} s_{n-l,k} + e^m_{n,k} \qquad (3)$$
where the signal
$$d^m_{n,k} = \sum_{l=0}^{D-1} \left(h^m_{l,k}\right)^{*} s_{n-l,k} \qquad (4)$$
is composed of the anechoic speech signal and early reflections at the m-th microphone, and $D$ corresponds to the duration of the early reflections. As previously mentioned, dereverberation methods often aim to recover the anechoic signal together with the early reflections, since the early reflections tend to improve speech intelligibility. Thus, $d_{n,k}$ is the desired or dereverberated signal.
In several methods it has been proposed to replace the convolutive model in (2) and (3) with an autoregressive model. The model has been further simplified by assuming $e^m_{n,k} = 0, \;\forall n, k, m$. Under these assumptions, the signal observed at the first microphone ($m = 1$) can be written in the well-known multi-channel linear prediction form:
$$x^1_{n,k} = d_{n,k} + \sum_{m=1}^{M} \left(g^m_k\right)^{H} \tilde{x}^m_{n-D,k} \qquad (5)$$
where $d_{n,k}$ is the desired signal, and $(\cdot)^{H}$ denotes the conjugate transposition operator.
The vector $g^m_k \in \mathbb{C}^{L_k}$ is the regression vector of order $L_k$ for the m-th channel, and $\tilde{x}^m_{n,k}$ is defined as
$$\tilde{x}^m_{n,k} = \left[x^m_{n,k},\, x^m_{n-1,k},\, \dots,\, x^m_{n-L_k+1,k}\right]^{T} \qquad (6)$$
with $(\cdot)^{T}$ denoting the transposition operator. The MCLP model (5) can be written in a compact form using the multi-channel regression vector $g_k \in \mathbb{C}^{M L_k}$:
$$x^1_{n,k} = d_{n,k} + g^{H}_k \tilde{x}_{n-D,k} \qquad (7)$$
with the following notation:
$$g_k = \left[\left(g^1_k\right)^{T}, \dots, \left(g^M_k\right)^{T}\right]^{T}, \qquad \tilde{x}_{n,k} = \left[\left(\tilde{x}^1_{n,k}\right)^{T}, \dots, \left(\tilde{x}^M_{n,k}\right)^{T}\right]^{T} \qquad (8)$$
where $x^1_{n,k}$ is the observed signal, $d_{n,k}$ is the desired signal and $g^{H}_k \tilde{x}_{n-D,k}$ represents the late reverberation.
Thus, the above derivation formulates the problem of speech dereverberation as the blind estimation of the desired signal $d_{n,k}$, consisting of the direct speech signal and early reflections, from the reverberant observations $x^m_{n,k},\;\forall m, n, k$.
It is reported that blind channel dereverberation using linear prediction holds exactly for the multi-channel case and is a good approximation for the single-channel case. In theory, the room's convolutive system is invertible with a causal FIR filter in the time domain only if the system is minimum phase. However, it has also been reported that clean speech spectral components may be well recovered with causal FIR filters in the time-frequency domain even when the room's convolutive system is non-minimum phase in the time domain, since the frequency components are assumed to be uncorrelated and each frequency bin is treated as a sub-band filter. Furthermore, the AR model has been confirmed experimentally to be effective for dereverberation.
It therefore follows that the multichannel formulation in (5) can be written in single-channel form as:
$$d_{n,k} = x_{n,k} - g^{H}_k \tilde{x}_{n-D,k} \qquad (9)$$
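A minimal single-channel sketch of this subtraction is given below. It estimates the regression coefficients for one frequency bin by plain least squares, whereas practical systems such as weighted prediction error (WPE) iterate a weighted estimate; the function name and the synthetic test signal are illustrative assumptions only:

```python
import numpy as np

def dereverb_bin(x, order, delay):
    """Delayed linear prediction for one STFT frequency bin: fit
    coefficients g so that X @ g predicts x[n] from `order` frames
    lying at least `delay` steps in the past, then subtract that
    prediction as the late-reverberation estimate."""
    n0 = delay + order - 1
    # Each row: the `order` frames ending `delay` steps before frame n.
    X = np.array([x[n - delay - order + 1:n - delay + 1][::-1]
                  for n in range(n0, len(x))])
    g, *_ = np.linalg.lstsq(X, x[n0:], rcond=None)
    d = x.copy()
    d[n0:] = x[n0:] - X @ g            # subtract predicted late reverb
    return d
```

Because the predictor only sees frames at least `delay` steps old, the short-term correlation of the desired speech is largely left untouched; plain least squares still tends to over-subtract speech in noise, which is one motivation for the weighted and sample-limited estimates discussed in this disclosure.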
Figure 3 illustrates a schematic of an audio signal processing circuit in which reverberation coefficients are calculated and the reverberant component of speech is estimated and removed. Specifically, an input signal xn,k generated by a microphone M following detection of an incident sound is provided via a first branch to a reverberation coefficient prediction unit 50 operable to calculate one or more reverberant coefficients gk.
The reverberation coefficients prediction unit 50 is operable to calculate predicted reverberant coefficients gk based on e.g. a linear prediction algorithm or an autoregressive modelling approach, which is performed in the short-time Fourier transform domain on a portion or frame of the input signal. The linear prediction algorithm may then enable an estimation of future reverberant components on the basis of one or more buffered frames of the input signal. The system may introduce a time delay at delay unit 40 so that the frames input to the reverberation coefficient prediction unit 50 allow an estimation of the later reverberations. The delay applied by the delay unit 40 may be, for example, 32 ms, or may be some other amount of delay. The estimated coefficients are matrix multiplied with the buffered vector of previous frames to obtain an estimate of the reverberation for each frame n (not shown) and frequency bin. The estimated reverberant component of that respective frequency bin is then subtracted from the input signal at module 60 to output, in the frequency domain, a dereverberated signal dn,k. It will be appreciated that the dereverberated signal may be represented by equation (9) above.
A dereverberation system may comprise a final stage (not shown) which uses spectral filtering techniques to remove the late reverberant component still present in the signal.
It will be appreciated that after a sound is produced reflections will build up and then decay as the sound is absorbed by the surfaces of the acoustic environment. Reflected sounds will eventually lose enough energy and drop below the level of perception. The amount of time a sound takes to die away is called the reverberation time. A standard measurement of an environment's reverb time is the amount of time required for a sound to fade by 60 dB. This time is often called RT60. It will be appreciated that other measurements of the reverberation time are also possible.
Figure 4a provides a graphical representation of a room impulse response (RIR) of an acoustic environment and plots the amplitude of an emitted impulse signal against time. Figure 4b provides a graphical representation of the decay in the energy of the room impulse response and plots a) the energy of the room impulse response (RIR) as a function of time and b) an exponential gradient of the RIR. The exponential gradient may be considered to represent the amount of reverberation that is present in the environment following the production of a sound impulse and allows the reverberation time RT60 - i.e. the time taken for the impulse to fade by 60dB - to be obtained.
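One common way to obtain RT60 from a measured RIR, consistent with the energy-decay curves of Figures 4a and 4b, is Schroeder backward integration. The function name and the -5 dB to -25 dB fitting range below are illustrative choices rather than anything mandated by the present disclosure:

```python
import numpy as np

def rt60_from_rir(rir, fs):
    """Estimate RT60 via the Schroeder backward-integrated energy
    decay curve (EDC): fit a line to the decay between -5 dB and
    -25 dB, then extrapolate the time needed for a 60 dB drop."""
    edc = np.cumsum(rir[::-1] ** 2)[::-1]        # energy remaining at each t
    edc_db = 10.0 * np.log10(edc / edc[0])       # normalised decay in dB
    t = np.arange(len(rir)) / fs
    fit = (edc_db <= -5.0) & (edc_db >= -25.0)   # linear region of the decay
    slope, _ = np.polyfit(t[fit], edc_db[fit], 1)  # dB per second
    return -60.0 / slope

# Synthetic check: a purely exponential tail constructed to have an
# RT60 of exactly 0.5 s.
fs = 8000
t = np.arange(fs) / fs
rir = np.exp(-(3.0 * np.log(10.0) / 0.5) * t)
rt60 = rt60_from_rir(rir, fs)
```

The fitted slope is exactly the "exponential gradient" of Figure 4b: for the synthetic tail above the decay is 120 dB per second, giving the constructed 0.5 s RT60 back on extrapolation to 60 dB.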
It will also be appreciated that a speech signal that is detected by a microphone will be affected by noise originating from various sources. Thus, past samples input to the reverberation coefficient prediction unit will typically also include a noise component. It will be appreciated that when noise is present in the microphone signal the dereverberation processing may lead to overestimation of the reverberant components, which leads to speech suppression. The level of the background noise may be similar to, or may even exceed, the power of the reverberation.
Figure 5 illustrates a first example of a dereverberation system 100 according to the present aspects. The system may be provided as part of an audio signal processing system in a device, which may for example be a smart home device incorporating an automatic speech recognition system or a communication device.
The dereverberation system 100 comprises a reverberation coefficient determination unit 150 configured to receive a portion (e.g. one or more buffered frames) of an input signal x(n, k) and to derive one or more reverberant coefficients (e.g. at least one reverberant coefficient per frame of the portion). The dereverberation system further comprises a determination unit 130 which is operable to determine a number of past samples of the input signal that is to be passed to the reverberation coefficient determination unit 150.
The reverberant coefficients g(n, k) may be subsequently applied to the portion of the input signal in order to obtain an estimation of the reverberation (not shown). The reverberant component of that respective frequency bin is then subtracted from the input signal to give a dereverberated signal dn,k. According to the present example, the determination unit 130 receives first and second control inputs.
The first control input A optionally represents the information about the background noise of the acoustic space, or may comprise information to allow the same to be determined (either by calculation or estimation). For example the first control input may comprise information about the SNR (signal to noise ratio), the NSR (Noise to Signal ratio) or information about the level of noise which may, e.g. be considered to be the noise floor (and which may be derived from the SNR). The information about the background noise may be obtained explicitly, i.e. based on a measured value of the SNR or noise floor, or may be estimated e.g. from an estimate or long term estimate of the SNR. According to at least one example the SNR is calculated directly in one of the previous blocks/frames. However, the instantaneous SNR is time varying so in order to get a more stable value of the SNR, a long term estimate is calculated using smoothing.
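The long-term SNR estimate obtained by smoothing can be sketched as a one-pole recursive average. The function name, the smoothing constant of 0.95 and the assumption that per-frame signal and noise powers are available separately are all illustrative:

```python
import math

def smoothed_snr_db(frame_power, noise_power, state, alpha=0.95):
    """One frame of a long-term SNR estimate: exponentially smooth the
    instantaneous signal and noise powers, then form their ratio in dB.
    `state` is the (sig_avg, noise_avg) pair carried between frames."""
    sig_avg, noise_avg = state
    sig_avg = alpha * sig_avg + (1.0 - alpha) * frame_power
    noise_avg = alpha * noise_avg + (1.0 - alpha) * noise_power
    return 10.0 * math.log10(sig_avg / noise_avg), (sig_avg, noise_avg)

# Fed constant per-frame powers, the estimate settles at the true SNR.
state = (1.0, 1.0)
for _ in range(500):
    snr_db, state = smoothed_snr_db(10.0, 1.0, state)
```

The smoothing constant trades tracking speed against stability, which is exactly the motivation given above for preferring a long-term estimate over the time-varying instantaneous SNR.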
The second control input optionally represents information about the energy of reverberant sound in the acoustic space. For example, according to at least one example the second control input represents the decay in the power/energy of reverberant sound in the acoustic space as a function of time (the reverberation time RT60). This may be determined from a room impulse response (RIR) for the acoustic space, or may comprise information to allow the same to be determined.
Thus, according to one or more examples of the present aspects, the determination unit 130 is configured to derive a number of samples of the input signal that is provided to the reverberation coefficient determination unit 150 based on:
i) information about the background noise in the acoustic space; and ii) information about the power/energy of reverberant sound in the acoustic space.
The speech dereverberation system further comprises a selection mechanism or module 160 operable to implement the selection or adjustment of the appropriate number of samples, or past samples, of the input signal to be passed to the reverberation coefficient determination unit based on the number of samples determined by the determination unit. It will be appreciated that the selection of the appropriate number of samples may be implemented in a number of ways, for example by providing a variable buffer prior to the reverberation coefficient determination unit 150 which is configured to allow the length of the buffer to be adjusted. It will be appreciated that signal processing systems may comprise one or more correlation units operable to correlate the input signal or a segment of the input signal against another signal. Thus, it will be appreciated that rather than varying the amount of data stored in the buffer, the amount of data that is processed by one or more correlation units of the signal processing circuit may instead be adjusted. In this sense, examples described herein may refer to a variable effective buffer length, wherein the amount of data stored in a buffer may be varied or the amount of buffered data processed may be varied.
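The variable effective buffer length idea can be sketched with a fixed-capacity buffer whose reads are limited to the most recent frames. This is an illustrative sketch (the class and method names are not from the patent), assuming the downstream units only ever need the newest portion of the history:

```python
from collections import deque

class VariableEffectiveBuffer:
    """Fixed-capacity frame buffer with a variable *effective* length:
    writes always keep the last `capacity` frames, but reads hand only
    the most recent `effective_len` of them to downstream processing."""
    def __init__(self, capacity):
        self._frames = deque(maxlen=capacity)  # oldest frames are overwritten

    def push(self, frame):
        self._frames.append(frame)

    def read(self, effective_len):
        # Only the newest `effective_len` frames reach the correlation /
        # coefficient-estimation stages.
        return list(self._frames)[-effective_len:]

buf = VariableEffectiveBuffer(capacity=8)
for n in range(10):
    buf.push(n)
long_portion = buf.read(6)   # low-noise case: use more history
short_portion = buf.read(3)  # high-noise case: use less history
```

Adjusting the argument of `read` mimics varying the amount of buffered data processed, as opposed to resizing the buffer itself.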
Figures 6a and 6b each provide a graphical representation of the power of reverberant sound in a given acoustic space as well as the level of noise in the acoustic space. Specifically, Figure 6a illustrates these two variables in a low noise scenario, whilst Figure 6b illustrates a high noise scenario.
The graphical representation of the power of the reverberant sound component represents the time taken for the sound power level to decay by 60dB - i.e. the reverberation time or RT60. At any given time a ratio of the level of reverberant sound energy to the NSR can be determined. According to at least one example of the present aspects the portion length determination unit is operable to determine a threshold time tTH at which the level of the reverberant energy falls below a predetermined value relative to the level of the noise (which may be represented by the NSR). Thus, the threshold time can be considered to be the time at which the ratio of the level of reverberant sound energy to the level of the NSR is at or above a predetermined value.
The number of samples of the input signal to be passed to the reverberation coefficient determination unit 150 may then be calculated based on the threshold time tTH.
According to one or more examples the threshold time is set to be the time at which the level of the energy of the decaying reverberant sound is substantially equal to the level of the NSR (noise-to-signal ratio). At this time, and bearing in mind that dB is a logarithmic measure and that the logarithm of a ratio x/x is 0, the threshold time can be considered to be the time at which a threshold ratio RTH of the level of reverberant sound energy to the level of the NSR is zero. Thus, a signal processing circuit according to at least one example comprises a determination unit configured to determine a number of samples that will maintain or achieve a positive reverberant sound energy to NSR level ratio.
Depending on the particular requirements of the system, it will be appreciated that the threshold ratio RTH may be set to be a value other than 0 and may, for example, be greater than 0.
This is illustrated in Figures 6a and 6b as the time at which the plot of the power of reverberant sound with respect to time intersects the plot of the NSR. In the low noise scenario illustrated in Figure 6a the threshold time is around 0.33s. In the high noise scenario illustrated in Figure 6b the threshold time tTH is around 0.13s. Thus, the portion length adjustment mechanism is operable to adjust the portion length L based on the threshold time in order that samples that are saturated by background noise are not included in the input signal that is passed to the reverberation coefficient determination unit.
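Assuming the standard linear decay in dB implied by RT60 (60 dB of decay per RT60 seconds), the intersection point can be computed in closed form. The numeric inputs below are illustrative values chosen to reproduce the approximate figure values of 0.33s and 0.13s; the function and parameter names are not from the patent:

```python
def threshold_time(rt60_s, initial_level_db, noise_floor_db, threshold_ratio_db=0.0):
    """Time at which the decaying reverberant energy drops to within
    `threshold_ratio_db` of the noise floor, given linear dB decay
    at a rate of 60 dB per RT60 seconds."""
    decay_rate_db_per_s = 60.0 / rt60_s
    margin_db = initial_level_db - (noise_floor_db + threshold_ratio_db)
    return max(margin_db, 0.0) / decay_rate_db_per_s

# Low-noise room (quiet floor): a longer usable reverberant tail.
t_low = threshold_time(rt60_s=0.5, initial_level_db=0.0, noise_floor_db=-40.0)
# High-noise room: the tail is saturated by noise much sooner.
t_high = threshold_time(rt60_s=0.5, initial_level_db=0.0, noise_floor_db=-16.0)
```

With these inputs the low-noise threshold time is about 0.33s and the high-noise one about 0.13s, matching the scenarios of Figures 6a and 6b.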
This can be represented mathematically by:
Lbuffer = (tTH × fs) / Nb (10) where Lbuffer is the buffer length or the number of samples in the buffer, fs is the sample rate and Nb is the frame size.
According to at least one example the threshold time may be used to derive a number of blocks or frames of the input signal that is to be passed to the reverberation coefficient determination unit.
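The conversion from threshold time to a whole number of frames can be sketched as follows. Rounding up to a whole frame is an assumption here; the text does not state a rounding rule, and the numeric inputs are illustrative:

```python
import math

def buffer_length_frames(t_th_s, fs_hz, frame_size):
    """Equation (10) in frame units: L_buffer = t_TH * fs / Nb,
    rounded up to a whole frame (rounding choice assumed)."""
    return math.ceil(t_th_s * fs_hz / frame_size)

# e.g. a 0.25 s threshold time at fs = 16 kHz with 256-sample frames:
frames = buffer_length_frames(0.25, 16000, 256)  # ceil(4000 / 256) = 16
```

A shorter threshold time (higher noise) yields fewer frames, so noise-saturated samples never reach the coefficient estimation stage.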
It will be appreciated that the number of samples need not always be determined on a one to one basis with respect to the threshold time and that other correlations between the number of samples (amount of data) and the threshold time (and thus the threshold ratio) may be applied.
The shaded area X therefore represents the reverberant samples that are likely to have a positive reverberant energy level to noise level ratio and which are therefore input to the reverberation coefficient determination unit 150. It will be appreciated that the shaded area represents samples having a higher energy than the noise with respect to the speech. Thus, below the level of the NSR the reverberant components are saturated by noise. As such, embodiments of the present example advantageously allow the effective buffer length and/or number of samples to be adjusted based on the level of background noise in order that samples that are saturated or overpowered by the background noise level are preferably not included in the input signal that is passed to the reverberation coefficient determination unit. Thus, preferred examples of the present aspects derive a number of samples to be input to the reverberation coefficient determination unit that will advantageously maintain or achieve a positive reverberation energy level to noise level ratio.
The present examples advantageously allow the input to the dereverberation system to be tuned based on a consideration of the SNR and also on a consideration of the room impulse response (in particular the reverberation time RT60 derived from the RIR). This allows a more adaptive and bespoke approach to dereverberation which has demonstrated improvements in the performance and/or accuracy of ASR systems which utilise a signal derived from or processed by a dereverberation unit.
According to one or more examples, the determination unit is operable to determine a number of samples of the input signal, i.e. an amount of data, to be passed to the reverberation coefficient determination unit that will maintain or achieve a positive reverberant energy level to noise level ratio.
Examples of the present aspects can be considered to be performed using a sub-band scheme - in that processing is performed independently within each frequency bin k - to allow for frequency dependent noise and reverberation profiles. Thus, examples may benefit from a particular improvement in the quality of low frequency speech signals obtained following the dereverberation process where the issues of speech suppression are more acute.
Figure 7 is a flow diagram illustrating a processing method according to one example of the present aspects. Initially, at step 80, a Fourier Transform is performed on an acoustic signal generated by a microphone M in response to an incident acoustic stimulus. A delay is applied (not shown) to give an input signal x(n, k). The input signal is passed to a frequency bin buffer. At step 82 the length of the buffer is selected/adjusted based on a number of samples that is determined by a sub-process S. The sub-process involves, at step 71, calculating a threshold time tTH based on first and second control inputs A and B. The first control input comprises a representation of the reverberation time for the acoustic space. For example, this may be estimated using blind estimation techniques or non-intrusive estimation based on prediction from the filter coefficients of the adaptive echo cancellation in a prior block. The second control input comprises a long-term estimate of the SNR. This may be obtained, for example, from a speech presence probability estimation circuit used to control the step size of one or more adaptive filters of a noise reduction section in prior circuitry blocks. The speech presence probability (SPP) may be obtained using minima controlled recursive averaging (MCRA) and decision-directed methods.
According to the present example the calculation of the threshold time involves determining the time at which the reverberant to noise power ratio is approximately zero. At step 72 the threshold time is converted to a number of samples/blocks/frames and, at step 82, the buffer is adjusted or selected accordingly based on the number of samples which corresponds to the determined threshold time. At step 84 the portion of the input signal that is output from the buffer is subjected to correlation techniques, which may involve auto-correlation and/or cross-correlation of the output. At step 86 reverberation coefficients are estimated, for example using a linear prediction algorithm or auto-correlation technique based on statistical models of speech.
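The linear-prediction step can be illustrated for a single frequency bin. This is an illustrative sketch rather than the patent's implementation: it solves the least-squares prediction problem directly with `numpy.linalg.lstsq`, which is a shortcut for forming and solving the normal equations from the auto- and cross-correlations, and all names are assumed:

```python
import numpy as np

def estimate_reverb_coeffs(x, delay, order):
    """Least-squares linear prediction for one frequency bin: predict x[n]
    from `order` past samples starting `delay` samples back."""
    rows, targets = [], []
    for n in range(delay + order - 1, len(x)):
        rows.append([x[n - delay - i] for i in range(order)])  # lags delay..delay+order-1
        targets.append(x[n])
    g, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return g

# Synthetic check: a signal with one reverberant tap at lag D is recovered.
rng = np.random.default_rng(0)
src = rng.standard_normal(4000)
D, g_true = 3, 0.6
x = np.zeros_like(src)
for n in range(len(src)):
    x[n] = src[n] + (g_true * x[n - D] if n >= D else 0.0)
g_hat = estimate_reverb_coeffs(x, delay=D, order=2)
```

On this synthetic signal the first estimated coefficient approaches the true tap value and the spurious second tap stays near zero.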
Figure 8 is a block diagram illustrating a processing system for carrying out the method illustrated in Figure 7. An electrical input signal generated in response to an acoustic stimulus detected by a microphone M is passed to a Fast Fourier Transform (FFT) block 30 which is operable to determine the amplitude of the microphone signal in each of several frequency ranges or bins. The system comprises a first node X at which the signal line is branched into first, second and third branches. On a first branch the signal is passed to a delay unit 40 which applies a predetermined delay to the input signal. The delay applied by the delay unit 40 may be, for example, 32 ms, or may be some other amount of delay. The signal is passed to a buffer 41 which may, for example, take the form of a circular buffer having an area of memory to which data is written, with that data being overwritten when the memory is full. According to this example the buffer 41 is an adjustable-length buffer wherein the buffer length, e.g. the number of frames or data samples that may be written to the buffer, can be selected. The selected buffer length is calculated by a determination unit 130. As previously described, the determination unit 130 is configured to determine a number of samples of the input signal to be passed to the reverberation coefficient determination unit, based on:
i) information about the background noise in the acoustic space; and ii) information about energy of reverberant sound in the acoustic space
The amount of data or number of samples of the input signal that are to be provided to the reverberation coefficient determination unit 150 depends, in this example, on the effective buffer length that is selected for the variable buffer 41. The buffered portion of the input signal is subject to known correlation techniques. Specifically, in this example, at unit 170 the delayed buffered samples are cross-correlated with the non-delayed input signal which is passed via a third branch. Furthermore, at unit 180 the buffered sample is correlated with itself (auto-correlation). The correlated signals are input to the reverberation coefficient determination unit 150 which is configured to determine one or more reverberation coefficients based on the buffered sample. The reverberation coefficients directly represent the inverse filter and are applied to the buffered vector of previous samples to estimate the reverberation component of the respective frequency bin. The reverberant component of that respective frequency bin is then subtracted from the input signal to give a dereverberated signal d(n, k).
Figure 9 is a flow diagram illustrating a processing method according to a further example of the present aspects whilst Figure 10 is a schematic illustration of a processing system for carrying out the method illustrated in Figure 9. The processing method is similar to the process steps illustrated in Figure 7 except that the buffer 42 comprises a fixed-length buffer. Therefore, rather than adjusting the amount of data that can be stored in the buffer, an adjustment is made to the amount of data or number of samples that are processed by the correlation units 170 and 180. It will be appreciated that the sizes of the vectors of the cross-correlation and the auto-correlation are directly proportional to the input buffer. In this example everything up until this point is calculated with the maximum buffer size that corresponds to a maximum reverberation time (e.g. 800 ms) that the system should be able to operate in. This corresponds to a maximum buffer size given by Lmax = (800 ms × fs) / Nb, where Nb is the frame size and fs is the sample rate. The sizes of the vectors of the cross-correlation and auto-correlation are directly proportional to the maximum buffer size Lmax. The expected values of the auto- and cross-correlations, E[ak] of size (Lmax × Lmax) and E[qk] of size (Lmax × 1), are hence calculated with exponential averaging to get a smoothed output. At this point, the length in frames determined by block 72, Lvariable, is used to adjust the sizes of E[ak] and E[qk] to (Lvariable × Lvariable) and (Lvariable × 1) respectively.
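The trimming of the fixed-size correlation statistics to the variable length can be sketched as follows. The stand-in arrays are toy values; keeping the leading lags when trimming is an assumption about how the buffer is oriented:

```python
import numpy as np

L_MAX = 8  # toy value; in practice e.g. (800 ms * fs) / Nb frames
E_a = np.arange(L_MAX * L_MAX, dtype=float).reshape(L_MAX, L_MAX)  # stand-in for E[a_k]
E_q = np.arange(L_MAX, dtype=float)                                # stand-in for E[q_k]

def trim_correlations(E_a, E_q, L_variable):
    """Restrict the smoothed correlations to the first L_variable lags:
    (L_max x L_max) -> (L_variable x L_variable), (L_max x 1) -> (L_variable x 1)."""
    return E_a[:L_variable, :L_variable], E_q[:L_variable]

E_a_v, E_q_v = trim_correlations(E_a, E_q, L_variable=5)
```

Because only views of the pre-computed statistics are taken, the variable length can change per frame without recomputing the correlations.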
The skilled person will recognise that some aspects of the above-described apparatus and methods may be embodied as processor control code, for example on a non-volatile carrier medium such as a disk, CD- or DVD-ROM, programmed memory such as read only memory (Firmware), or on a data carrier such as an optical or electrical signal carrier. For many applications examples of the invention will be implemented on a DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array). Thus the code may comprise conventional program code or microcode or, for example code for setting up or controlling an ASIC or FPGA. The code may also comprise code for dynamically configuring re-configurable apparatus such as re-programmable logic gate arrays. Similarly the code may comprise code for a hardware description language such as Verilog TM or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, the code may be distributed between a plurality of coupled components in communication with one another. Where appropriate, the examples may also be implemented using code running on a field-(re)programmable analogue array or similar device in order to configure analogue hardware.
Note that as used herein the term unit or module shall be used to refer to a functional unit or block which may be implemented at least partly by dedicated hardware components such as custom defined circuitry and/or at least partly be implemented by one or more software processors or appropriate code running on a suitable general purpose processor or the like. A unit may itself comprise other units, modules or functional units. A unit may be provided by multiple components or sub-units which need not be co-located and could be provided on different integrated circuits and/or running on different processors.
Examples may be implemented in a host device, especially a portable and/or battery powered host device such as a mobile computing device for example a laptop or tablet computer, a games console, a remote control device, a home automation controller or a domestic appliance including a smart home device a domestic temperature or lighting control system, a toy, a machine such as a robot, an audio player, a video player, or a mobile telephone for example a smartphone.
It should be noted that the above-mentioned examples illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative examples without departing from the scope of the appended claims. The word comprising does not exclude the presence of elements or steps other than those listed in a claim, a or an does not exclude a plurality, and a single feature or other unit may fulfil the functions of several units recited in the claims.
Any reference numerals or labels in the claims shall not be construed so as to limit their scope.

Claims (32)

1. A signal processing circuit of a speech dereverberation system, the signal processing circuit comprising:
a reverberation coefficient determination unit configured to determine one or more reverberation coefficients of a portion of an input signal generated by an acoustic sensor provided in an acoustic space; and a determination unit operable to determine a number of samples of the portion of the input signal to be passed to the reverberation coefficient determination unit, based on:
i) information about the background noise in the acoustic space; and ii) information about energy of reverberant sound in the acoustic space.
2. A signal processing circuit as claimed in claim 1, wherein the information about background noise in the acoustic space comprises information about the SNR or NSR.
3. A signal processing circuit as claimed in claim 1 or 2, wherein the information about the energy of the reverberant sound comprises the decay in the energy of the reverberant sound in the acoustic space.
4. A signal processing circuit as claimed in any preceding claim, wherein the information about the energy of reverberant sound is determined from a representation of the room impulse response (RIR) for the acoustic space.
5. A signal processing circuit as claimed in claim 4, wherein the representation of the RIR is estimated.
6. A signal processing circuit as claimed in any preceding claim wherein the determination unit is operable to determine a threshold time at which a level of the reverberant energy falls below a predetermined value relative to a respective level of the noise.
7. A signal processing circuit as claimed in any preceding claim wherein the determination unit is operable to determine a threshold time at which a level of the energy of the decaying reverberant sound is substantially equal to a level of the NSR.
8. A signal processing circuit as claimed in claim 6 wherein the threshold time is the time at which the ratio of the level of reverberant sound energy to the level of the NSR is at or above a predetermined value.
9. A signal processing circuit as claimed in any one of claims 6 to 8 wherein the number of samples is calculated based on the threshold time.
10. A signal processing circuit as claimed in any preceding claim wherein the determination unit is configured to determine a number of samples that will maintain or achieve a positive reverberant sound energy to NSR level ratio.
11. A signal processing circuit as claimed in any preceding claim, further comprising a selection mechanism operable to select the number of samples of the input signal to be passed to the reverberation coefficient determination unit based on the number of samples determined by the determination unit.
12. A signal processing circuit as claimed in claim 11, wherein the selection mechanism comprises an adjustable length buffer.
13. A signal processing circuit as claimed in claim 11, wherein the selection mechanism is operable to cause adjustment of the number of samples that are processed by a correlation unit of the signal processing circuit.
14. A signal processing circuit comprising: a determination unit operable to determine a number of samples of an input signal to be passed to a reverberation coefficient determination unit that will maintain or achieve a positive reverberant sound to noise ratio.
15. A signal processing circuit as claimed in claim 14, further comprising a reverberation coefficient determination unit configured to determine one or more reverberation coefficients of a portion of an input signal generated by an acoustic sensor provided in an acoustic space.
16. A signal processing circuit as claimed in any one of claims 1 to 13 or claim 15, wherein an inverse filter is obtained from the reverberation coefficients determined by the reverberation coefficient determination unit and wherein the inverse filter is convolved with the portion of the input signal to obtain an estimate of the reverberant component of the portion.
17. A signal processing circuit as claimed in claim 16, wherein the estimate of the reverberant component of the portion is subtracted from or deconvolved with the input signal to give a dereverberated signal d(n, k).
18. A signal processing circuit as claimed in claim 17, wherein the dereverberated signal is represented by:
d(n, k) = x^m(n, k) − g_k x(n−D, k), where x^m(n, k) is the observed signal at the acoustic sensor m, and g_k x(n−D, k) represents late reverberant sound.
19. A signal processing circuit as claimed in any preceding claim, wherein the reverberation coefficient determination unit determines the reverberation coefficients based on a linear prediction algorithm.
20. A signal processing circuit as claimed in any preceding claim, further comprising a delay unit configured to apply a delay to the input signal.
21. A signal processing circuit as claimed in any preceding claim, further comprising a Fast Fourier Transform (FFT) unit operable to determine the amplitude of the input signal generated by the acoustic sensor in a plurality of frequency ranges, wherein the reverberation coefficient determination unit is operable to determine the reverberant coefficients in one or more of the frequency ranges.
22. A signal processing circuit as claimed in any preceding claim, in the form of a single integrated circuit.
23. A device comprising a signal processing circuit according to any preceding claim.
24. A device as claimed in claim 23, wherein the device comprises a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller, a domestic appliance or a smart home device.
25. A device as claimed in claim 23 or 24, wherein the device comprises an automatic speech recognition system.
26. A device as claimed in any one of claims 23 to 25, wherein the device comprises a plurality of microphones.
27. A device as claimed in claim 26, further comprising a beamformer configured to time align the plurality of microphones in a direction of incident speech sound.
28. A method of signal processing comprising:
determining a number of samples of a portion of an input signal generated by an acoustic sensor provided in an acoustic space based on:
i) information about the background noise in the acoustic space; and ii) information about energy of reverberant sound in the acoustic space.
29. A method as claimed in claim 28, wherein the information about background noise in the acoustic space comprises information about the SNR or NSR.
30. A method as claimed in claim 28 or 29, wherein the information about the energy of the reverberant sound comprises the decay in the energy of the reverberant sound in the acoustic space.
31. A method as claimed in any one of claims 28 to 30, wherein the determining comprises determining a number of samples of the input signal that will maintain or achieve a positive reverberant sound energy to NSR level ratio.
32. A method as claimed in any one of claims 28 to 31, further comprising: estimating at least one reverberation coefficient of the portion of the input signal.