GB2537923A - A speech processing system and speech processing method - Google Patents


Info

Publication number
GB2537923A
Authority
GB
United Kingdom
Prior art keywords
frame
speech
power
modification
importance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1507486.7A
Other versions
GB201507486D0 (en)
GB2537923B (en)
Inventor
Stylianou Ioannis
Petkov Petko
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd
Priority to GB1507486.7A
Publication of GB201507486D0
Publication of GB2537923A
Application granted
Publication of GB2537923B
Status: Expired - Fee Related


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 — Noise filtering
    • G10L2021/02082 — Noise filtering, the noise being echo or reverberation of the speech

Abstract

Speech intelligibility is enhanced by extracting active speech frames from an input S101, calculating a measure of the frame importance S102 (based on e.g. Mel cepstral coefficients indicating dissimilarity to the previous frame), and modifying the frames S106 (by e.g. overlap-add, fig. 7) based on their importance and an estimation of a contribution due to late reverberation S103 (e.g. assuming a pulse-train model of the decaying impulse response). Frame power may also be computed and modified (fig. 8).

Description

A speech processing system and speech processing method
FIELD
Embodiments described herein relate generally to speech processing systems and speech processing methods.
BACKGROUND
Reverberation is a process under which acoustic signals generated in the past reflect off objects in the environment and are observed simultaneously with acoustic signals generated at a later point in time. It is often necessary to understand speech in reverberant environments such as train stations and stadiums, large factories, concert and lecture halls.
It is possible to enhance a speech signal such that it is more intelligible in such environments.
BRIEF DESCRIPTION OF THE DRAWINGS
Systems and methods in accordance with non-limiting embodiments will now be described with reference to the accompanying figures, in which:
Figure 1 is a schematic of a speech intelligibility enhancing system in accordance with an embodiment;
Figure 2 shows a schematic illustration of the processing steps provided by the program in the speech intelligibility enhancing system in accordance with an embodiment;
Figure 3 shows a plot of the frame importance on the vertical axis, against the active frame index on the horizontal axis;
Figure 4 shows three plots relating to modelling the late reverberation;
Figure 5(a) shows mapping curves for different noise levels and λ = 0;
Figure 5(b) shows mapping curves for different noise levels and λ = λm;
Figure 5(c) shows mapping curves for different noise levels and λ = λM;
Figure 6 shows a sample λ(ξ) curve, where the frame importance is shown on the horizontal axis and the corresponding value of λ is shown on the vertical axis;
Figure 7 is a schematic illustration of the local time scale modification, or time warp, process according to an embodiment;
Figure 8 shows a schematic illustration of the processing steps provided by the program in the speech intelligibility enhancing system in accordance with an embodiment;
Figure 9 shows signal waveforms for natural speech, Adaptive Gain Control and Time Warp-modified speech (AGC-TW) and Time Warp-modified speech (TW);
Figure 10(a) shows a narrow-band spectrogram for the natural speech;
Figure 10(b) shows a narrow-band spectrogram for the Adaptive Gain Control and Time Warp-modified speech;
Figure 10(c) shows a narrow-band spectrogram for the Time Warp-modified speech;
Figure 11 shows a schematic illustration of reverberation in different acoustic environments.
DETAILED DESCRIPTION
In an embodiment, there is provided a speech intelligibility enhancing system for enhancing speech, the system comprising: a speech input for receiving speech to be enhanced; an enhanced speech output to output the enhanced speech; and a processor configured to convert speech received from the speech input to enhanced speech to be output by the enhanced speech output, the processor being configured to: i) extract frames of the speech received from the speech input; ii) calculate a measure of the frame importance for each frame; iii) estimate a contribution due to late reverberation for each frame; iv) modify the frame of the speech received from the speech input, wherein the amount of modification is calculated using the measure of the frame importance and the contribution due to late reverberation.
The frame importance is a measure of the dissimilarity between the current frame and previous frames. In an embodiment, the measure of the frame importance is a continuous measure. In an embodiment, the measure of the frame importance is a measure of the dissimilarity of the frame to the previous frame, weighted by a function of the number of previous frames for which the frame importance is below a threshold level.
In an embodiment, the measure of the frame importance is a measure of the dissimilarity of the mel cepstrum of the frame to that of the previous frame, weighted by a function of the number of previous frames for which the frame importance is below a threshold level.
In an embodiment, the measure of the frame importance for a frame i of the speech received from the speech input is:

ξ_i = φ_i (1 − |m_i · m_{i−1}| / (‖m_i‖ ‖m_{i−1}‖))

where m_i is a set of Mel frequency cepstral coefficients derived from frame i, φ_i = ζ^{O_i}, where ζ ∈ (0, 1) and O_i is the number of preceding frames for which ξ_j < ξ̄, where ξ̄ is a threshold value of ξ.
In an embodiment, step iv) comprises applying a spectral modification. The spectral modification may be a modification of the frame power.
In an embodiment, the amount of modification of the frame power is determined by calculating a prescribed modified frame power value that minimizes a distortion measure subject to a penalty term which is a penalty on power increase, wherein the penalty term comprises a multiplier which is a function of the measure of the frame importance.
The amount of modification of the frame power may be determined by applying a post-filtering to the prescribed modified frame power value that minimizes the distortion measure subject to the penalty term.
In an embodiment, step iii) comprises estimating the frame power due to the late reverberation. The estimate of the frame power due to the late reverberation is included in the distortion measure as additive noise.
The contribution due to late reverberation for frame i is the contribution to the frame from the late reverberation signal. The late reverberation power for frame i is the power of the late reverberation signal in frame i. The late reverberation signal at each discrete time index k in frame i is generated by the superposition of reflected, delayed and attenuated instants from the past of the signal, produced at least t_l seconds before the time instant corresponding to time index k in frame i. In this case, the signal refers to the speech signal that has already been modified.
The frame power due to the late reverberation may be estimated by modelling the impulse response of the environment as a pulse train that is amplitude-modulated with a decaying function. The decaying function ensures a 60 dB power decay between the initial instant, t = 0, which corresponds to arrival of the direct path, and the reverberation time T60. The reverberation time may be measured from the intended acoustic environment, or may be determined from a model of the intended acoustic environment. The frame power due to the late reverberation is estimated by generating a model of the late reverberation signal from the model of the impulse response of the environment and the modified speech signal corresponding to earlier frames.
In an embodiment, the prescribed modified frame power value that minimizes the distortion measure subject to the penalty term is given in closed form in terms of the following quantities: y is the prescribed modified frame power, x is the frame power of the speech received from the speech input, n is the estimate of frame power due to the late reverberation, c1 and c2 are constants with a dependence on λ, b is a shape parameter of a Pareto distribution fitted to sample speech frame powers, and λ is the multiplier and is a function of the measure of the frame importance.
In an embodiment, the amount of modification of the frame power is calculated from the value of the prescribed modified frame power y. In an embodiment, the amount of modification of the frame power is y_i^p / x_i, where x_i is the frame power of the frame of speech received from the speech input, and y_i^p is the prescribed modified frame power value that minimizes the distortion measure subject to the penalty term. In an embodiment, the amount of modification of the frame power is √(y_i^p / x_i), where x_i is the frame power of the frame of speech received from the speech input, and y_i^p is the prescribed modified frame power value that minimizes the distortion measure subject to the penalty term.
In an embodiment, λ is bounded by an upper value λM and a lower value λm, and λ is given by a decreasing function λ(ξ) of the measure of the frame importance ξ, parameterized by a constant a.
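The explicit form of the λ(ξ) mapping is not reproduced in this text. The following is a purely illustrative sketch of one bounded, decreasing mapping consistent with the sample curve of Figure 6: a logistic curve clamped between λm and λM, where the slope a, the midpoint xi0 and the function name are hypothetical.

```python
import numpy as np

def lambda_of_xi(xi, lam_min, lam_max, a=10.0, xi0=0.5):
    """Illustrative penalty-multiplier mapping: high frame importance ->
    small penalty on power increase. Bounded by lam_min and lam_max;
    the logistic form, slope a and midpoint xi0 are assumptions."""
    return lam_min + (lam_max - lam_min) / (1.0 + np.exp(a * (xi - xi0)))
```

The logistic form is only one plausible choice; any smooth monotone decrease between the two bounds would match the qualitative behaviour described above.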
In an embodiment, step i) comprises extracting overlapping frames of the speech received from the speech input.
In an embodiment, step iv) comprises applying a local time scale modification. In an embodiment, the local time scale modification is a time warping of the signal. In an embodiment, step iv) comprises adding an extension to the frame, after the frame has been overlap added to the enhanced speech signal.
The amount of time scale modification for a frame i may be calculated using the amount of modification of the frame power.
The amount of time scale modification may be determined from a range of possible amounts, wherein the value of the upper end of the range and the value of the lower end of the range are dependent on the amount of modification of the frame power, when the modified frame power is less than the frame power of the input speech.
The value of the upper end of the range and the value of the lower end of the range may have a linear relationship with the amount of modification of the frame power, when the modified frame power is less than the frame power of the input speech.
In an embodiment, step iv) comprises: applying a modification of the frame power to a frame of speech received from the speech input to output a modified frame, wherein the amount of modification of the frame power is calculated using the frame importance; overlap adding the modified frame to the enhanced speech signal, to output a new enhanced speech signal; calculating the correlation between a last segment of the new enhanced speech signal and each of a plurality of target segments of the new enhanced speech signal, wherein the target segments correspond to a range of earlier segments of the new enhanced speech signal, wherein the range is calculated using the amount of modification of the frame power, when the modified frame power is less than the frame power of the input speech; determining the target segment corresponding to the highest correlation value; replicating the section of the new enhanced speech signal from the target segment to the end of the new enhanced speech signal; overlap-adding this replicated section to the last segment of the new enhanced speech signal.
In an embodiment, the last segment is the modified frame.
In an embodiment, the time scale modification is performed with a probability p. In an embodiment, the time scale modification is not performed for consecutive frames.
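The correlation search and replication described above can be sketched as follows. This is an illustrative, WSOLA-style sketch: the search range [lo, hi] of candidate offsets is passed in directly rather than derived from the power modification, a linear cross-fade stands in for the overlap-add window, and the function name is hypothetical.

```python
import numpy as np

def time_warp_extend(y, seg_len, lo, hi):
    """Extend signal y by replicating its most self-similar earlier segment.
    Finds the segment starting lo..hi samples before the last seg_len
    samples that correlates best with them, replicates the signal from
    that segment to the end, and cross-fades it onto the tail."""
    last = y[-seg_len:]
    best_start, best_corr = None, -np.inf
    for back in range(lo, hi + 1):
        start = len(y) - seg_len - back
        cand = y[start : start + seg_len]
        corr = np.dot(last, cand)          # correlation with the tail
        if corr > best_corr:
            best_corr, best_start = corr, start
    rep = y[best_start:].copy()            # replicate target..end
    fade = np.linspace(0.0, 1.0, seg_len)  # linear cross-fade (overlap-add)
    return np.concatenate([y[:-seg_len],
                           (1 - fade) * last + fade * rep[:seg_len],
                           rep[seg_len:]])
```

On a periodic segment the best match lies one pitch period back, so the extension is nearly artefact-free, which is why low-importance (more periodic) frames are preferred for warping.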
In an embodiment, the processor is further configured to: v) overlap-add the modified frames to produce the enhanced speech signal.
According to another embodiment, there is provided a method of enhancing speech, the method comprising the steps of: receiving speech to be enhanced; extracting frames of the received speech; calculating a measure of the frame importance for each frame; estimating a contribution due to late reverberation for each frame; modifying the frame of the received speech, wherein the amount of modification is calculated using the measure of the frame importance and the contribution due to late reverberation.
adding the modified frames to produce the enhanced speech signal.
According to another embodiment, there is provided a carrier medium comprising computer readable code configured to cause a computer to perform a method of enhancing speech, the method comprising the steps of: receiving speech to be enhanced; extracting frames of the received speech; calculating a measure of the frame importance for each frame; estimating a contribution due to late reverberation for each frame; modifying the frame of the received speech, wherein the amount of modification is calculated using the measure of the frame importance and the contribution due to late reverberation.
adding the modified frames to produce the enhanced speech signal.
Since some methods in accordance with embodiments can be implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal.
Figure 1 is a schematic of a speech intelligibility enhancing system 1 in accordance with an embodiment.
The system 1 comprises a processor 3 which comprises a program 5 which takes input speech and enhances the speech to increase its intelligibility. The storage 7 stores data that is used by the program 5. Details of the stored data will be described later.
The system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to an input 15 for data relating to the speech to be enhanced. The input 15 may be an interface that allows a user to directly input data.
Alternatively, the input may be a receiver for receiving data from an external storage medium or a network.
Connected to the output module 13 is audio output 17.
In use, the system 1 receives data through data input 15. The program 5, executed on processor 3, enhances the inputted speech in the manner which will be described with reference to figures 2 to 11. The system is configured to increase the intelligibility of speech under reverberation.
The system modifies plain speech such that it has higher intelligibility in reverberant conditions.
In the presence of reverberation, multiple, delayed and attenuated copies of an acoustic signal are observed simultaneously. The phenomenon is more pronounced in enclosed environments, where the contained acoustic energy affects auditory perception until propagation attenuation and absorption by reflecting surfaces render the delayed signal copies inaudible. Similar to additive noise, high reverberation levels degrade intelligibility. The system is configured to apply a signal modification that mitigates the impact of reverberation on intelligibility.
In one embodiment, the system is configured to apply a spectral modification based on the frame importance and an estimate of the contribution due to late reverberation. Signal portions with low importance often have high energy. Reducing the power of these portions improves the detectability of adjacent sounds of higher importance and prominence.
In another embodiment, the system is configured to apply a time-scale modification based on the frame importance and an estimate of the contribution due to late reverberation. Speech segments with lower importance facilitate time warping as the inherent periodicity in these segments leads to a reduced level of audible artefacts.
In another embodiment, the system is configured to apply a spectral modification based on the frame importance and the estimate of a contribution due to late reverberation and a time scale modification based on the amount of spectral modification. The combination of spectral modification and time scale modification within a unified framework and subject to a single distortion criterion is a general and powerful approach to mitigating the impact of late reverberation.
A speech modification framework taking these aspects into consideration is described in relation to Figure 2. In this framework, the input speech signal is split into overlapping frames for which importance evaluation is performed. In parallel, a statistical model of late reverberation provides an estimate of the expected reverberant power at the resolution of the speech frame. An auditory distortion criterion is optimized to determine the frame-specific gain adjustment. The criterion is composed of an auditory distortion measure and a constraint on the output power. The constraint term is a function of the frame importance. The estimate of the expected late reverberant power is included in the distortion measure as additive noise. The criterion is used to derive the optimal gain adjustment for a given frame. In the event of frame power reduction, the time scale is warped, i.e. the signal is extended, adaptively, to reflect the level of suppression.
Figure 2 shows a schematic illustration of the processing steps provided by program 5 in accordance with an embodiment, in which speech received from a speech input 15 is converted to enhanced speech to be output by an enhanced speech output 17.
Blocks S101, S106, S108 and S104 are part of the signal processing backbone. Steps S102 and S103 incorporate context awareness, including both acoustic properties of the environment and local speech statistics. Steps S105 and S107 are associated with measuring distortion induced by late reverberation and the corresponding signal adjustment in the spectral and the time domains.
The operation can be summarized as follows. The input speech signal is split into overlapping frames and each of these is characterized in terms of information content, or frame importance. In parallel, a statistical model of late reverberation provides an estimate of the expected reverberant power at the resolution of the speech frame. A generic spectral distortion criterion determines the optimal local modification, referred to as the prescribed spectral modification, where the power of late reverberation is interpreted as additive noise. In the event of signal power reduction, time warping, or slow-down, is initiated with a probability derived from the reduction factor. Apart from a short processing delay, the described approach is causal and therefore suitable for online applications, subject to the constraints of time warping.
Step S101 is "Extract active speech frames". This step comprises extracting overlapping frames from the speech signal x received from the speech input 15. The frames may be windowed, for example using a Hann window function.
Frames x_i are output from step S101. In one embodiment, the duration of the frame, T, is 25 ms. Instantaneous power may be approximated by the power of a signal frame of duration 25 ms.
In one embodiment, the frame overlap is 50%. A 50% frame overlap may reduce discontinuities between adjacent frames.
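A minimal sketch of this framing step (S101), using the 25 ms frame duration and 50% overlap given above; the 16 kHz sampling rate and the function name are illustrative assumptions.

```python
import numpy as np

def extract_frames(x, fs=16000, frame_ms=25, overlap=0.5):
    """Split signal x into Hann-windowed, overlapping frames (step S101)."""
    frame_len = int(fs * frame_ms / 1000)      # 25 ms -> 400 samples at 16 kHz
    hop = int(frame_len * (1.0 - overlap))     # 50% overlap -> 200-sample hop
    window = np.hanning(frame_len)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        frames[i] = x[i * hop : i * hop + frame_len] * window
    return frames
```

The Hann window tapers each frame to zero at its edges, which is what makes the later overlap-add reconstruction smooth.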
Step S102 is "Evaluate frame importance". In this step, a measure of the frame importance is determined.
The frame importance characterizes the dissimilarity of the current frame to previous frames. Low dissimilarity indicates less new information, and therefore higher redundancy and lower importance: a frame with a low dissimilarity to previous frames has a low frame importance.
The output of this step for each frame x_i is the corresponding frame importance ξ_i.
The frame importance is based on measuring the auditory domain dissimilarity between the current and the previous frame, i.e. assessing the change between two consecutive frames in an auditory domain. In an embodiment, the frame importance is a measure of the dissimilarity of the mel cepstra of the frame to the previous frame, weighted by a function of the number of previous frames for which the frame importance is below a threshold level. In one embodiment, the frame importance is given by:

ξ_i = φ_i (1 − |m_i · m_{i−1}| / (‖m_i‖ ‖m_{i−1}‖))     (1)

where: m_i represents the set of Mel frequency cepstral coefficients (MFCCs) derived from signal frame i, i.e. the MFCC vector at frame i; and φ_i is the inter-frame redundancy weight, where φ_i = ζ^{O_i}, ζ ∈ (0, 1), and O_i is the number of frames leading to frame i for which ξ_j < ξ̄, j < i, for some value ξ̄. In an embodiment, ξ̄ = 0.1. In an embodiment, ζ = 0.5. Values for ξ̄ and ζ may be stored in the storage 7 of the system shown in Figure 1.
As a result of the weighting given by φ_i, a sequence of redundant frames is assigned progressively decreasing importance. This is shown in Figure 3, where the importance parameter is computed for the active frames of a randomly selected speech utterance using ξ̄ = 0.1 and ζ = 0.5. Figure 3 shows the frame importance on the vertical axis, against the active frame index on the horizontal axis. The dashed line shows the unweighted dissimilarity; the solid line shows ξ_i. It can be seen that a sequence of frames with a low dissimilarity has a decreasing ξ_i. For the above relationship given in equation (1), ξ_i ∈ (0, 1). This means that the frame importance parameter can be interpreted as a probability of observing new information.
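The computation of the frame importance can be sketched as follows. The absolute-cosine form of the MFCC dissimilarity is an assumption (the original equation is not legible here); the redundancy weight φ_i = ζ^{O_i} follows the text, with O_i tracked as the length of the current run of low-importance frames.

```python
import numpy as np

def frame_importance(mfccs, zeta=0.5, xi_bar=0.1):
    """Frame importance per equation (1). mfccs holds one MFCC vector per
    frame. Dissimilarity form (1 - |cosine similarity|) is an assumption."""
    xi = np.zeros(len(mfccs))
    run = 0  # O_i: length of the current run of low-importance frames
    for i in range(1, len(mfccs)):
        m, m_prev = mfccs[i], mfccs[i - 1]
        cos = np.dot(m, m_prev) / (np.linalg.norm(m) * np.linalg.norm(m_prev))
        xi[i] = (zeta ** run) * (1.0 - abs(cos))   # weight by phi_i = zeta**O_i
        run = run + 1 if xi[i] < xi_bar else 0     # extend or reset the run
    return xi
```

With identical consecutive MFCC vectors the dissimilarity is zero, so a steady sound yields a run of low-importance frames whose weight ζ^{O_i} shrinks geometrically, reproducing the decay seen in Figure 3.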
The output from step S102 is ξ_i, the frame importance of the frame i.
In this embodiment, the notion of information content of a segment, or frame, is generalized. Explicit probabilistic modelling is not used, however the adopted parameter space is capable of approximating the information content with a high resolution, i.e. with a continuous measure, as opposed to a binary classifier.
A rigorous estimation of the amount of information in the speech signal at a given time using probabilistic modelling and the notion of entropy can alternatively be used to determine a measure of the frame importance based on an information theoretic perspective on redundancy estimation.
Step S103 is "Model late reverberation".
Reverberation can be modelled as a convolution between the impulse response of the particular environment and the signal. The impulse response splits into three components: direct path, early reflections and late reverberation.
Early reflections have high power, arrive within a short time window after the direct sound (after one or two reflections) and are individually distinguishable when examining the room impulse response (RIR). They depend on the geometry of the space and on the positions of the speaker and the listener. Early reflections are not considered harmful to intelligibility, and in fact can improve intelligibility.
Late reverberation is diffuse in nature due to longer acoustic paths: it is composed of delayed and attenuated replicas of the signal that have reflected more times (three or more) than the replicas in the early reflections. Identifying individual reflections is hard as their number increases while their magnitudes decrease. Late reverberation is considered more harmful to intelligibility because it is the primary cause of masking between neighbouring sounds in the speech signal. This is relevant for communication in places such as train stations and stadiums, large factories, concert and lecture halls.
The late reverberation model in step S103 is used to assess the reverberant power that is considered to have a negative impact on intelligibility at a given time instant, i.e. that decreases intelligibility at a given time instant.
The boundary t_l between early reflections and late reverberation in a RIR is the point where distinct reflections turn into a diffuse mixture. In an embodiment, t_l is in the range 50 to 100 ms after the arrival of the sound following the direct path, i.e. the direct sound. In an embodiment, t_l is 90 ms after the arrival of the direct path.
In step S103, the late reverberation is modelled. In one embodiment, the late reverberation can be modelled accurately to reproduce closely the acoustics of a particular hall. In alternative embodiments, simpler models that approximate the masking power due to late reverberation can be used, because the objective is power estimation of the late reverberation.
In an embodiment, the Velvet Noise model can be used to model the contribution due to late reverberation.
Figure 4 shows three plots relating to use of the Velvet Noise model to model the late reverberation signal.
The first plot shows an example acoustic environment, which is a hall with dimensions fixed to 20 m x 30 m x 8 m, the dimensions being width, length and height respectively. Length is shown on the vertical axis and width is shown on the horizontal axis. The speaker and listener locations are {10 m, 5 m, 3 m} and {10 m, 25 m, 1.8 m} respectively. These values are used to generate the model RIR used for illustration of an RIR in the second plot. For late reverberation power modelling, the particular location of the listener is not required, and the particular values in the RIR are not important.
The second plot shows the normalized room impulse response. The propagation delay and attenuation are normalized to the direct sound. Time is shown on the horizontal axis in seconds. The normalized room impulse response shown here is a model RIR based on knowledge of the intended acoustic environment, which is shown in the first plot. The model is generated with the image-source method, given the dimensions of the hall shown in the first plot and a target T60.
The room impulse response may be measured, and the value of the boundary t_l between early reflections and late reverberation and the reverberation time T60 can be obtained from this measurement. The reverberation time is the decay time of the late reverberation, in seconds per 60 dB of decay.
The third plot shows the same normalised room impulse response model as the second plot, as well as the portion of h corresponding to the late reverberation, discussed below. The late reverberation room impulse response model is generated using the Velvet Noise model.
In one embodiment, the model of the late reverberation is based on the assumption that the power of late reverberation decays exponentially with time. Using this property, a model is implemented to estimate the power of late reverberation in a signal frame. A pulse train with appropriate density is generated using the framework of the Velvet Noise model, and is amplitude modulated with a decaying function. The continuous form of the decaying function, or envelope, is:

e(t) = 10^(−3t/T60)

The discretized envelope is given by:

e[k] = 10^(−3kTs/T60)     (2)

This relationship ensures a 60 dB power decay between the initial instant, t = 0, which corresponds to arrival of the direct path, and the reverberation time T60.
The model of the late reverberation represents the portion of the RIR corresponding to late reverberation as a pulse train, of appropriate density, that is amplitude-modulated with a decaying function of the form given in (2). The pulse train is described below.
An approximation to the late reverberation signal n, which is the noise caused by late reverberation, for the duration of the target frame is computed from:

n[k] = Σ_m h[t_l·f_s + m] · y[k − t_l·f_s − m]     (3)

where h is the late reverberation room impulse response model, i.e. the artificial, pulse-train-based impulse response, f_s is the sampling frequency and the beginning of the target frame is associated with time index k = 0.
The late reverberation room impulse response model is obtained from the Hadamard, or element-wise, product of a pulse train t[k] and the envelope e[k]: h[k] = t[k] · e[k],
where e[k] is given by equation (2) above, and t[k] is a pulse train given by:

t[k] = Σ_m a[m] · u[k − round((Td/Ts)(m + rnd(m)))]

where a[m] is a randomly generated sign of value +1 or −1, rnd(m) is a random number uniformly distributed between 0 and 1, "round" denotes rounding to an integer, Td is the average time in seconds between pulses and Ts is the sampling interval. Td is constant for a particular acoustic environment. u denotes a pulse with unit magnitude. This pulse train is the Velvet Noise model.
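The pulse train and envelope of equation (2) can be sketched together as follows. The pulse density Td = 2 ms, the choice of random number generator and the function name are illustrative assumptions; only the decay law and the velvet-noise structure come from the text.

```python
import numpy as np

def velvet_late_rir(T60, fs, t_l, Td=0.002, rng=None):
    """Late-reverberation RIR model: a velvet-noise pulse train modulated
    by the envelope e[k] = 10**(-3*k*Ts/T60) of equation (2), with the
    early part (t < t_l) zeroed so only late reverberation remains."""
    rng = np.random.default_rng() if rng is None else rng
    Ts = 1.0 / fs
    n = int(T60 * fs)                  # model the RIR up to T60
    h = np.zeros(n)
    spacing = Td / Ts                  # average pulse spacing in samples
    m = 0
    while True:
        # pulse position: round((Td/Ts) * (m + rnd(m)))
        k = int(round(spacing * (m + rng.uniform())))
        if k >= n:
            break
        h[k] = rng.choice([-1.0, 1.0]) # random sign a[m]
        m += 1
    h *= 10.0 ** (-3.0 * np.arange(n) * Ts / T60)  # decaying envelope e[k]
    h[: int(t_l * fs)] = 0.0           # keep only the late part (t >= t_l)
    return h
```

Because the model is only used to estimate late reverberant power, the random pulse positions and signs need not match any measured RIR; only the pulse density and the 60 dB decay matter.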
Thus equation (2) is the envelope applied to the pulse train to generate h. From equation (2), at k = 0, e[k] = 1, meaning there is no decay for the direct path, which is used as the reference. At k = T60/Ts, e[k] = 10^−3, which in the power domain corresponds to −60 dB.
In one embodiment, t_l = 90 ms. In one embodiment, t_l is extracted from a model RIR based on knowledge of the intended acoustic environment. Alternatively, t_l is extracted from the measured RIR. Alternatively, an estimated value based on previous studies of similar environments is used.
In one embodiment, T60 = 1.1 s. In one embodiment, T60 = 1.5 s. The reverberation time for the intended acoustic environment may be measured, and this measured value is used as the value of T60. Alternatively, an estimated value based on previous studies of similar environments is used. Alternatively, the reverberation time can be derived from a model, for example, if the dimensions and the surface reflection coefficients are known.
In one embodiment, f_s = 16 kHz. In one embodiment, f_s = 8 kHz. T_s is equal to 1/f_s.
y[k -tfs -m] corresponds to part of the output "buffer", i.e. the already modified signal corresponding to previous frames xp, where p < i. The convolution of h from t onwards and the signal history from the output buffer give a sample or model realization of the late reverberation signal.
For a frame i, the value of n[k] for each value of k is determined, resulting in a set of values n, each corresponding to a value of k for the frame. The power n_i of the late reverberation signal for a frame i is computed from n.
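A sketch of the estimate in equation (3) and the frame power computed from it. The indexing conventions (k = 0 at the frame start, with `y_buffer` holding the already modified signal up to, and excluding, that instant) and the function name are assumptions made for illustration.

```python
import numpy as np

def late_reverb_frame_power(h_late, y_buffer, t_l, fs, frame_len):
    """Estimate the late-reverberation frame power per equation (3):
    convolve the late part of the RIR model with the modified signal
    history (the output buffer), then average the squared samples."""
    offset = int(t_l * fs)                 # t_l * f_s in samples
    n = np.zeros(frame_len)
    for k in range(frame_len):
        # n[k] = sum_m h[t_l*fs + m] * y[k - t_l*fs - m]
        for m in range(len(h_late) - offset):
            idx = len(y_buffer) + k - offset - m   # buffer index of y[k - t_l*fs - m]
            if 0 <= idx < len(y_buffer):
                n[k] += h_late[offset + m] * y_buffer[idx]
    return np.mean(n ** 2)                 # frame power n_i
```

Because every term draws on samples at least t_l seconds old, the estimate depends only on frames already emitted, which is what keeps the overall scheme causal.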
Values for T60, t_l and f_s may be stored in the storage 7 of the system shown in Figure 1.
Step S103 may be performed in parallel to step S102.
The following steps S104 to S106 are directed to calculating a prescribed spectral modification that reduces the distortion between the natural speech and the modified speech plus late reverberant power in some domain. The domain may be the power dynamics for example. In step S104, auditory features corresponding to the unmodified frame of speech and the modified frame of speech plus the late reverberation contribution are calculated. In step S105, these values are used to calculate the prescribed spectral modification to be applied to the speech frame. When the domain is the power dynamics, the amount of modification of the frame power may be calculated from a value of the prescribed modified frame power that minimizes a distortion measure, subject to some penalty term which comprises a multiplier which is a function of the frame importance. The prescribed spectral modification calculated is then applied to the frame of input speech in step S106. The amount of modification of the frame power may be calculated by applying a post-filtering, or smoothing, to the value of the prescribed modified frame power.
A distortion measure is used to evaluate the instantaneous (in practice approximated by frame-based) deviation between a set of signal features, in the perceptual domain, extracted from clean speech and from modified reverberated speech. Minimizing this distortion provides the optimal modification parameters.
In a general case, the calculation of the prescribed spectral modification can be represented as follows.
Step S104 is "Compute auditory features". In this step, the auditory features of the unmodified speech and the modified speech plus the late reverberant power are calculated. The auditory features may include the mel filter-bank log band-powers or gammatone filter-bank powers for example. For the first iteration, the speech has not been modified, thus the auditory features of the unmodified speech and the unmodified speech plus the late reverberant power are calculated.
Step S105a is "Evaluate perceptual distortion". In this step, the distortion between the natural, unmodified speech and the modified speech plus noise is evaluated. For the first iteration, the distortion between the unmodified speech and the unmodified speech plus noise is evaluated. This is calculated from the auditory features computed in step S104.
Step S105b is "Optimize spectral modification". A spectral modification is applied to the unmodified speech frame xi. This is output as the modified speech frame yi. The steps S104 to S105b are then repeated for the new modified speech frame yi. These steps are iterated, to find the prescribed spectral modification that reduces the distortion calculated in step S105a, subject to some penalty term. In an embodiment, calculating a prescribed spectral modification value comprises using a searching algorithm to find a local minimum for the prescribed spectral modification value.
In one embodiment, there is a closed form solution to the optimization problem. In this case an iterative search for the optimum prescribed spectral modification is not performed. The auditory features are computed in step S104. In step S105 the values for the auditory features are inputted into an equation for the prescribed spectral modification, which corresponds to the solution of the optimization problem. The prescribed spectral modification calculated is then applied in step S106. There may be some further alteration to the prescribed spectral modification before it is applied, for example a smoothing filter. There is no iteration to determine the prescribed spectral modification in this case. The prescribed spectral modification is simply calculated from a pre-determined function.
A set of processing steps S104 to S106 in accordance with an embodiment in which there is a closed-form solution to the optimization problem are now described.
In these steps, the prescribed spectral modification is determined by minimizing a distortion measure in the power domain, subject to a penalty term, wherein the penalty term comprises a multiplier which is a function of the frame importance. In these steps, the prescribed power of the modified frame is calculated using a function which is a solution for the minimum of the distortion measure subject to the penalty term.
A composite criterion, comprising the distortion term and a power increase penalty, is used to prevent excessive increase in output power. To facilitate the analysis, late reverberation is locally, i.e., for the duration of the current frame, regarded as uncorrelated, additive noise. This is motivated by i) the time separation between the current frame and the period when the interfering speech was produced and ii) the long-term non-stationary nature of the speech signal.
Any composite distortion criterion for speech in noise having a distortion term and a power gain penalty can be used to determine the prescribed spectral modification in this step. A speech in noise criterion is used because late reverberation can be interpreted as additive uncorrelated non-stationary noise.
In one embodiment, a criterion composed of an auditory distortion measure and a constraint on the output power is used to derive the optimal prescribed spectral modification at a given time: (4) where x, y and n are the instantaneous powers of the waveforms x, y and n, in practice approximated by frame powers. Italic font is used to indicate the frame powers. Thus for a particular frame there is a value x, where x is the frame power of the original frame of speech signal. There is also a value of n, where n is the power of the noise in that frame, estimated in step S103. The prescribed modified power for the frame is denoted by y.
ψ and a are bounds for the interval of interest. In one embodiment, the parameter a is set to the minimum observed frame power in the data used for fitting fX(x|b). In one embodiment, the upper bound ψ of the optimal range for modification is set to ψ = β, where β is described later on. In one embodiment, the upper bound ψ of the optimal range for modification is set to a number that is at least one order of magnitude larger than the maximum frame power for a sample speech signal.
fX(x|b) is the probability density function of the Pareto distribution with shape parameter b. The Pareto distribution is given by:

fX(x|b) = b·a^b / x^(b+1), x ∈ [a, ∞)
The value of b is obtained from a maximum likelihood estimation of the parameters of the (two-parameter) Pareto distribution fitted to a sample data set, for example standard pre-recorded speech. The Pareto distribution may be fitted off-line to variance-equalized speech data, and a value for b obtained. In one embodiment, b is less than 1.
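The two-parameter Pareto maximum-likelihood fit has a standard closed form, sketched below; the function name and the synthetic-data check are illustrative.

```python
import numpy as np

def fit_pareto(frame_powers):
    """ML fit of f(x|b) = b * a**b / x**(b+1), x >= a:
    a_hat is the sample minimum; b_hat is the closed-form ML shape estimate."""
    x = np.asarray(frame_powers, dtype=float)
    a_hat = x.min()
    b_hat = len(x) / np.sum(np.log(x / a_hat))
    return a_hat, b_hat
```

The fit would be performed off-line on variance-equalized frame powers, and the resulting b stored along with a and ψ.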
Values for b, a and ψ may be stored in the storage 7 of the system shown in Figure 1.
The first term under the integral in equation (4) is the distortion in the instantaneous power dynamics and the second term is the penalty on the power gain. This distortion criterion for speech in noise is used due to the flexibility and low complexity of the resulting modification. The late reverberant power n is included in the distortion term as additive noise. The term λ is a multiplier for the penalty term. The penalty term also includes a factor n.
The integrand is a Lagrangian and λ is a Lagrange multiplier. Generally, for a Lagrangian there will be an explicit constraint, i.e. an equality or inequality, which would be solved for λ. In this case, the constraint is implicit in equation (4), and a value for λ is calculated from the frame importance.
The solution in closed form for the minimum of the functional (4), found using calculus of variations, is: (5) where c1 and c2 are constants set using the boundary conditions: (6) (7) where c ∈ (0, 1). In one embodiment, c = 0.9. c1 and c2 are thus dependent on λ and b. yp is the prescribed power of the modified speech frame. The power gain, or the amount of modification of the frame power for a frame i, is thus yp/xi, if no post-filtering is performed.
The second boundary condition given in equation (7) is used as it provides the functionality and desired flexibility while facilitating the analysis and the calibration of the power mapping curves. This particular form induces compressive behaviour in the presence of noise and convergence to feed through behaviour, i.e. unity gain, in its absence.
Since the optimization of the criterion in (4) gives a closed-form solution in equation (5), the speech modification has low-complexity.
The value of λ for the target frame i is calculated in step S105b.
Mapping curves for different noise levels and λ = 0 are shown in Figure 5(a). Figure 5(a) shows the gain for λ = 0 and different noise levels. Figure 5(a) is a plot of the output in decibels (vertical axis) against the input in decibels (horizontal axis). Unity gain is shown as a straight solid line. This corresponds to the case where n → −∞ dB. The gain for n = 10 dB is shown by the dotted line. The gain for n = 30 dB is shown by the dotted and dashed line. The gain for n = 50 dB is shown by the dashed line.
An increase in the noise, i.e. the late reverberation, power induces an increase in the speech output power. In a reverberant environment, this behaviour can lead to system instability through a recursive increase of signal power: increasing the speech power also increases the power of the late reverberation. Calibration is thus included to prevent this instability, which is done by limiting the growth of the mapping curves. Bounding the power mapping curves ensures stability.
This can be achieved by lower bounding λ with λmin such that: (8) where β is, for example, the highest expected short-term power in the input speech.
Alternatively, β is the maximum observed frame power in the data used to fit fX(x|b). The data may be pre-recorded standard speech.
Thus, in an embodiment, the parameter a may be set to the minimum observed frame power in the data used for fitting fX(x|b) and the parameter β may be set to the maximum observed frame power in the same data. Consistency in the estimation is achieved when the utterances in the fitting data have the same power as the input speech signal. The power referred to here is a long-term power measured over several seconds, for example over a time scale equal to the utterance duration. Therefore, in an embodiment, the values of β and a are scaled in real time. If the variance of the input speech signal is not the same as that of the data to which the Pareto distribution is fitted, the parameters of the Pareto distribution are updated accordingly. The long-term variance of the input speech is thus monitored and the values of the parameters β and a are scaled with the ratio of the current input speech signal variance to the reference variance, i.e. that of the sample data. The variance is the long-term variance, i.e. on a time scale of 2 or more seconds.
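The real-time scaling of the fitted bounds can be sketched as follows; the function name and argument order are illustrative.

```python
def scale_pareto_bounds(a_ref, beta_ref, var_ref, var_now):
    """Scale the fitted bounds a and beta by the ratio of the current
    long-term input-signal variance to the reference (fitting-data) variance."""
    ratio = var_now / var_ref
    return a_ref * ratio, beta_ref * ratio
```

Because frame power scales linearly with signal variance, one ratio updates both bounds consistently.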
Alternatively, β is derived such that on average signal power is preserved, i.e. β is selected based on a long-term power preservation constraint. Solving (8) for λmin gives: (9) Given that λ is a Lagrange multiplier, the lower bound is: (10) Similarly, an upper bound λmax can be established to limit the suppression of the input signal: (11) where λmax is obtained from: (12) The upper bound of the gain penalty multiplier λ has no implication on the stability.
Values for β and c may be stored in the storage 7 of the system shown in Figure 1.
Figures 5(b) and 5(c) illustrate families of power mapping curves for λ = λmin and λ = λmax respectively. Monotonicity is guaranteed on x ∈ [a, ψ] for λ ∈ [λmin, λmax] when b < 1.
Figure 5(b) shows the gain for λ = λmin and different noise levels. Figure 5(b) is a plot of the output in decibels (vertical axis) against the input in decibels (horizontal axis). Unity gain is shown as a straight solid line. This corresponds to the case where n → −∞ dB. The gain for n = 30 dB is shown by the dotted line. The gain for n = 50 dB is shown by the dotted and dashed line. The gain for n = 70 dB is shown by the dashed line.
An increase in the noise, i.e. the late reverberation, power induces an increase in the speech output power. However, the increase is limited: an increase in noise from 50 dB to 70 dB causes only a relatively small increase in speech output power.
Figure 5(c) shows the gain for λ = λmax and different noise levels. Figure 5(c) is a plot of the output in decibels (vertical axis) against the input in decibels (horizontal axis). Unity gain is shown as a straight solid line. This corresponds to the case where n → −∞ dB. The gain for n = 10 dB is shown by the dotted line. The gain for n = 30 dB is shown by the dotted and dashed line. The gain for n = 50 dB is shown by the dashed line.
Again, an increase in the noise, i.e. the late reverberation, power induces an increase in the speech output power. However, the increase is limited.
The value of λ for the target frame i is calculated using the frame importance. Thus the frame importance is linked with the Lagrange multiplier. The Lagrange multiplier is used to regulate the output frame power as a function of the frame importance. This leads to partial suppression of redundant frames in the presence of reverberation. Establishing a connection between the frame importance parameter and the power-gain penalty λ provides the possibility for short-term power suppression or boosting as a function of the redundancy in the speech signal.
In one embodiment, a smooth, monotonic sigmoid-like relationship is used to link the frame importance and λ: (13) The parameter a is used for tuning. In one embodiment, a = 9. In one embodiment, a = 10. A value for a may be stored in the storage 7 of the system shown in Figure 1.
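Equation (13) is not reproduced above; the following is an illustrative monotonically decreasing sigmoid with the described properties (the centring at a frame importance of 0.5 is an assumption), mapping frame importance to a λ strictly inside (λmin, λmax).

```python
import math

def gain_penalty(iota, lam_min, lam_max, a=10.0):
    """High frame importance iota -> lambda near lam_min (weaker power penalty);
    high redundancy (low iota) -> lambda near lam_max (stronger suppression)."""
    s = 1.0 / (1.0 + math.exp(-a * (iota - 0.5)))   # sigmoid, in (0, 1)
    return lam_max + (lam_min - lam_max) * s
```

Larger values of the tuning parameter a sharpen the transition, emphasizing the relative importance of peaks in the frame-importance trace.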
This relationship enforces the constraint λ ∈ (λmin, λmax). It also emphasizes the relative importance of peaks in the plot in Figure 3. A sample λ curve is shown in Figure 6, for a = 10 and a particular "noise" power n. The frame importance is shown on the horizontal axis. The corresponding value of λ is shown on the vertical axis. Increasing frame importance corresponds to decreasing values of λ. Figure 5(b) thus illustrates the mapping curves for a frame which is highly informative, corresponding to λmin, and Figure 5(c) illustrates the mapping curves for a frame which is highly redundant, corresponding to λmax.

Step S106 is "Modify speech frame". A spectral modification based on the prescribed modified power yp determined in step S105 is applied to the frame of input speech xi. In an embodiment, the spectral modification is achieved by multiplying the input frame xi by √(yp/xi), where yp is the prescribed value of y obtained from equation (5) above for the frame i, for example.
The mapping curves change for every frame if the noise level changes. Updating the mapping curves at the frame rate can be a source of discontinuities with a negative impact on intelligibility. Smoothing the gain factors reduces this risk. This requires a trade-off between high temporal resolution and reduced risk of discontinuities. In one embodiment, the signal frame gain is smoothed instantaneously by taking its square root. This preserves the temporal resolution and the responsiveness of the system to changes in frame importance. Thus, in an embodiment, the spectral modification is obtained by multiplying the input frame by (yp/xi)^(1/4), where yp is the prescribed value of y obtained from equation (5) above, for example. This is an example of a post-filtering procedure. Use of the further square root can reduce discontinuities that occur due to over-boosting of the signal. It can compensate for the boosting effect of early reflections, which may add to the power of the direct sound. The prescribed modified power yp is determined with reference only to the direct path and the late reverberation, and thus spectral modification by multiplying by √(yp/xi) can over-boost the signal. The post-filtering smoothes the gain applied to the signal, reducing discontinuity between adjacent frames.
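The smoothed gain application can be sketched as follows; computing the frame power as a mean square is an assumption.

```python
import numpy as np

def apply_smoothed_gain(frame, y_prescribed):
    """Square-root the power gain y_p/x (instantaneous smoothing), so the
    amplitude factor applied to the frame samples is (y_p/x) ** 0.25."""
    x = float(np.mean(frame ** 2))             # frame power x_i
    smoothed_gain = np.sqrt(y_prescribed / x)  # square root of the power gain
    return frame * smoothed_gain ** 0.5        # amplitude factor (y_p/x)^(1/4)
```

The resulting frame power is the geometric mean of x and yp, which halves (in dB) the step between the prescribed and original powers.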
The modification reduces the signal energy in segments with high-redundancy, or low importance, thereby decreasing reverberant power and masking.
The spectral modification, in this case power modification, is derived by optimizing a distortion measure that models the effect of late reverberation, subject to a penalty term which is dependent on the frame importance.
The spectral modification utilizes an explicit model of late reverberation as noise and optimizes the modification for the impact of this noise as characterized by a distortion measure. Any arbitrary distortion criterion can be used for the spectral modification. However, the parameter describing signal segment information content is introduced into the distortion criterion.
The spectral modification mitigates the impact of late reverberation. Late reverberation can be modelled statistically due to its diffuse nature. At a particular time instant, late reverberation can be seen as additive noise that, given the time offset to the generation instant, or the time separation to its origin, can be assumed to be uncorrelated with the direct or shortest path speech signal. Boosting the signal is an effective intelligibility-enhancing strategy for additive noise since it improves the detectability of the sound.
In an embodiment in which the system is configured to apply a spectral modification, the modified speech frames are simply overlap added at this point, and the resulting enhanced speech signal is output.
Further speech enhancement is achieved by introducing an additional modification dimension. Under reverberation, boosting the signal can be counter-productive, as the boosted signal generates more noise in the future. Overlap-masking between sounds caused by acoustic echoes is a major contributor to the loss in intelligibility. Time-scaling reduces the effective overlap-masking between closely-situated sounds. Extending portions of the signal by time scaling results in reduced masking in these portions from previous sounds, as the late reverberation power decays exponentially with time. This effect improves intelligibility but also reduces the transmission rate.
Slowing down the signal reduces the overlap-masking between closely situated sounds and improves intelligibility, but also slows down the transfer of information.
In an embodiment in which the system is configured to apply a spectral modification and a subsequent time scale modification based on the spectral modification, the time scale modification is performed in steps S107 and S108.
Step S107 is "Warp time scale". Time scaling improves intelligibility by reducing overlap-masking among different sounds. The time-warping functionality searches for the optimal lag when extending the waveform. The method allows for local warping.
Time warping occurs when the frame power is reduced below that of the unmodified input frame power.
In this step, using the history of the output signal y, the correlation sequence for a frame i is computed as: (14) where T is the frame duration (in seconds). The value for T may be stored in the storage 7 of the system shown in Figure 1. The variable k is used in the context of time warping to denote a lag. It is not used as in the context of modelling the late reverberation, where the beginning of the target frame is associated with time index k = 0.
The optimal lag, k*, is then calculated from: (15) where the lag is a discrete time index, or sample index, and kmin and kmax are the minimum and maximum lag of the search interval.

The minimum and maximum lag of the search interval are translated in time for the different frames such that higher power reduction leads to more extensive signal stretching. In other words, the search window is shifted depending on the reduction rate, such that more reduction gives a larger extension. For a given frame i, the values of kmin and kmax are calculated from the amount of the spectral modification which was applied to the frame in step S106, where the modified frame power is less than the unmodified input frame power.
In one embodiment, the minimum and maximum lag of the search interval each have a linear dependence on the amount of modification of the frame power, when the frame power is reduced below that of the unmodified input frame power. When the gain, or the amount of modification of the frame power is 1 or more, there is no time warping, or extension, and the frame is simply overlap added.
In one embodiment, the minimum and maximum lag of the search interval each have a linear dependence on the gain, or the amount of reduction of the frame power, applied in step S106, when the frame power is reduced below that of the unmodified input frame power. The gain, or the amount of reduction of the frame power, is yi/xi, where yi is the modified frame power of the frame output from step S106 above and xi is the input frame power. In an embodiment, the gain, or the amount of reduction of the frame power, is yp/xi, where yp is the prescribed modified frame power obtained from equation (5) above and xi is the input frame power, for example. In an embodiment in which the spectral modification is smoothed, the gain, or the amount of reduction of the frame power, is √(yp/xi).
In one embodiment, a linear relationship translates kmax from 15 ms to 25 ms prior to the last frame as the frame gain decreases from one to zero. In one embodiment, kmin is separated from kmax by 13 ms. Any monotonic relationship between the lag and the gain can be used.
In one embodiment, for G < 1:

kmax = −10×10^-3 fs G + 25×10^-3 fs
kmin = kmax − 13×10^-3 fs

where G is the gain. The values of kmin and kmax obtained from the above equations are rounded to the nearest integer.
In general, for G < 1:

kmax = γG + ε
kmin = kmax − θ

for some values of γ, ε and θ, where θ > 0. The values for γ, ε and θ may be stored in the storage 7 of the system shown in Figure 1. The values of kmin and kmax obtained from the above equations are rounded to the nearest integer.
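The search-interval computation can be sketched as follows. Which bound the 15-25 ms translation applies to is ambiguous in the text, so neutral names are used; the linear dependence on G and the 13 ms separation follow the description.

```python
def lag_search_interval(G, fs):
    """Lag bounds (in samples) for the time-warp search, for gain G < 1.
    The far bound moves linearly from 15 ms (G = 1) to 25 ms (G = 0) prior
    to the last frame; the near bound sits 13 ms closer. Returns None when
    G >= 1, i.e. when no time warping is performed."""
    if G >= 1.0:
        return None
    k_far = round((-10e-3 * G + 25e-3) * fs)
    k_near = round(k_far - 13e-3 * fs)
    return k_near, k_far
```

Shifting the whole window with the gain keeps the warp inside steadier, more suppressed regions of the signal.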
Figure 7 is a schematic illustration of the time scale modification process according to an embodiment.
The modified frames after the overlap and add process performed in step S108 of Figure 2 form an output "buffer".
In the time scale modification process, a new frame yi is output from step S106 of Figure 2, having been gain adjusted. This frame is overlap-added to the buffer in step S108. This corresponds to step S701 of the time scale modification process shown in Figure 7. The "new frame" is also referred to as the "last frame". The point k=0 is the start of the last frame.
All frames are overlap added to the buffer in this manner. However, if the power of the frame is reduced in step S106, i.e. if G<1, then the time will be warped around this point, in the manner described in the following steps.
In step S107, it is desired to determine a time scale modification amount that will time-warp the signal without introducing discontinuities. This involves calculating the correlation, from equation (14), of the "last frame" of the signal with a target segment of the buffer signal, corresponding to k = kmax in equation (14). This is repeated for target segments corresponding to k = kmax − 1 down to k = kmin. This corresponds to step S702 of the time scale modification process.
The value of k corresponding to the maximum peak in the correlation function gives the optimum lag k*. This is determined in step S703 of the time scale modification process.
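The lag search in steps S702 and S703 amounts to picking the lag whose buffer segment best matches the last frame. A minimal sketch, using a plain dot product as the correlation; the exact correlation measure of equation (14) may differ.

```python
import numpy as np

def optimal_lag(buffer, frame, k_min, k_max):
    """Return k* in [k_min, k_max]: the lag, in samples back from the end
    of the output buffer, whose segment correlates best with the last frame."""
    T = len(frame)
    best_k, best_c = k_min, -np.inf
    for k in range(k_min, k_max + 1):
        start = len(buffer) - k
        seg = buffer[start:start + T]
        if len(seg) < T:                       # pad if the segment overruns
            seg = np.pad(seg, (0, T - len(seg)))
        c = float(np.dot(seg, frame))
        if c > best_c:
            best_k, best_c = k, c
    return best_k
```

Extracting the extension from k* onwards guarantees the replicated waveform lines up with the last frame, avoiding discontinuities.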
The buffer signal is then extracted from this point on, i.e. the segment of the buffer signal from k=k* to the end of the buffer is replicated in step S704, and this is overlap added with the "last frame" from the point k=0 in step S705. In an embodiment, the overlap-add is on a scale twice as large as the frame-based processing.
This overlap-adding therefore results in left over, or extra, samples on the end of the buffer signal, containing the "last frame". This is the signal extension or the time warp effect.
In this embodiment, the time warp is thus related to the gain adjustment, as the search interval for the optimal lag used for the time warp is shifted depending on the gain, when the frame power is reduced below that of the unmodified input frame power.
Reducing the frame power below the original power triggers the time warp. The higher the reduction of the power of the input signal in step S106, the larger the extension. Thus the frame importance is linked with the amount of time scale modification. In this case, the amount of time scale modification is related to the frame importance through the gain adjustment. Frame reduction occurs for frames with low importance, i.e. high redundancy. Thus generally, if a frame has low frame importance, it will be subject to higher suppression and the time warp will be more significant.
For this case, time warping thus substantially occurs within the steady-state. This reduces the possibility for discontinuities. The extent of time warping thus adapts to both the environment and the local speech signal statistics. The extent of the time warping adapts to the environment through the dependence of the gain on the reverberation power, which in turn depends on T60, the value of which can be measured or based on previous studies of similar environments, as described above.
The extent of the time warping adapts to the local speech signal statistics through the dependence of the gain on the frame importance.
In step S108 therefore, the waveform extension is extracted from the position identified by k* and overlap-added to the last frame using complementary windows of appropriate length. The waveform extension is overlap-added using smooth "half" windows in the overlap area. Finally, the end of the extension is smoothed using the original overlap-add window, to prepare for the next frame.
Artefacts can be caused by over-periodicity. The artefacts can be reduced by reducing the time warp rate.
In one embodiment, to reduce artefacts due to over-periodicity, time warping is not performed for consecutive frames. This is a deterministic approach.
In an alternative embodiment, to reduce the possibility of artefacts due to over-periodicity, the time is warped with a probability given by: (16) This is a probabilistic approach.
Thus, more aggressive suppression implies more likely time warping.
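Equation (16) is not reproduced above; as an illustrative probabilistic rule consistent with "more aggressive suppression implies more likely time warping", one could take the warp probability to be 1 − G.

```python
import random

def should_warp(G, rng=None):
    """Warp with probability max(0, 1 - G): heavy suppression (small G)
    makes warping near-certain; no warp is triggered when G >= 1."""
    rng = rng or random.Random()
    return rng.random() < max(0.0, 1.0 - G)
```

Randomizing the decision breaks up runs of consecutive warps, which is the source of over-periodicity artefacts.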
Speech intelligibility in reverberant environments decreases with an increase in the reverberation time. This effect is attributed primarily to late reverberation, which can be modelled statistically and without knowledge of the exact hall geometry and positions of the speaker and the listener. The system described above uses a low-complexity speech modification framework for mitigating the effect of late reverberation on intelligibility. Distortion in the speech power dynamics, caused by late reverberation, triggers multi-modal modification comprising adaptive gain control and local time warping. Estimates of the late reverberation power and the redundancy in the speech signal at the frame level allow for context-aware adaptation of the modification depth. Subjective evaluation of the proposed speech modification framework validates its effectiveness in a comparison with natural speech for two test conditions. Relative improvements of 13% and 11.5% were measured for reverberation times of 1.1 s and 1.5 s respectively.
The system is adaptive to the environment, and provides multi-modal, i.e. in time and frequency, speech modification over a wide operating range. The system uses a high-resolution, frame-importance-aware distortion criterion for more efficient use of signal power. The system operates with low delay and complexity, thus allowing a wide range of applications to be addressed. The modularity of the framework facilitates incremental sophistication of individual components.
Figure 8 is a schematic illustration of the processing steps provided by program 5 in accordance with an embodiment, in which speech received from a speech input 15 is converted to enhanced speech to be output by an enhanced speech output 17.
Step S200 is "Extract frame xi". This corresponds to step S101 shown in the framework in Figure 2. This step comprises extracting frames from the speech signal x received from the speech input 15. Frames A are output from the step S200.
Step S201 is "Compute frame power xi".
Step S202 is "Compute frame importance". This corresponds to step S102 in the framework shown in Figure 2.
The frame importance is a measure of the dissimilarity of the frame to the previous frame, weighted by a function of the number of previous frames for which the frame importance is below a threshold level. In one embodiment, the frame importance is given by equation (1) above. The output from step S202 is the frame importance of the frame i.
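Equation (1) appears earlier in the document and is not reproduced here; the following is an illustrative stand-in for the description above, using a spectral dissimilarity to the previous frame scaled up with the length of the preceding run of low-importance frames. The weighting function, the parameter w and the spectral distance are all assumptions.

```python
import numpy as np

def frame_importance(frame, prev_frame, low_run, w=0.1):
    """Dissimilarity of the frame to the previous frame, weighted by a
    function of the number of preceding low-importance frames (low_run)."""
    a = np.abs(np.fft.rfft(frame))
    b = np.abs(np.fft.rfft(prev_frame))
    a = a / (np.linalg.norm(a) + 1e-12)
    b = b / (np.linalg.norm(b) + 1e-12)
    dissim = 1.0 - float(np.dot(a, b))   # 0 for spectrally identical frames
    return dissim * (1.0 + w * low_run)  # a run of redundant frames raises importance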
Step S203 is "Calculate late reverberation power ni". This corresponds to step S103 in the framework shown in Figure 2. The parameters T60, t and fs and the previously modified frames yp output from step S212 where p<i, are used to calculate the late reverberation power in this step.
In one embodiment an approximation to the masking signal n, which is the noise caused by late reverberation, for the duration of the target frame is computed from equation (3) above. The power rt of the masking signal for a frame i is computed from Steps S200, S202 and S203 may be performed in parallel or in any order.
Step S204 is "Calculate λmin and λmax". The values of λmin and λmax may be calculated using equations (9), (10), (11) and (12). The parameters b, ψ, c, a and β, and the late reverberation power output from step S203, are used to calculate λmin and λmax.

Step S205 is "Calculate λi". In one embodiment, λi is calculated from equation (13) above. The parameter a, the frame importance output from step S202 and the values of λmin and λmax output from step S204 are used to calculate λi.

Step S206 is "Calculate prescribed modified power yp".
The prescribed modified power is calculated from equation (5) above. This corresponds to step S105 in the framework in Figure 2 above. The parameters c1, c2 and b, as well as the frame power xi, the late reverberation power ni, and λi, are used to calculate the prescribed modified power yp.
Step 207 is "Calculate spectral modification". The step, the factor by which the input speech frame x, is modified is calculated. In one embodiment, the spectral modification is j3±', where the value yip is obtained from step 5206. In one embodiment, the spectral xi modification is
X
Step S208 is "Generate modified speech frame yi". This corresponds to step S106 above.
Step S209 is "Overlap-add yi to buffer". In this step, the modified speech frame output from step S208 is overlap-added to the enhanced speech signal generated for previous frames, referred to as the "buffer".
Step S210 is "Calculate kmin and kmax". This step is only performed if the modified frame power is reduced below that of the unmodified input frame power. The minimum and maximum lag of the search interval, kmin and kmax, are calculated from the gain applied to the frame yi, when the frame power is reduced below that of the unmodified input frame power.
Step S211 is "Calculate k*". k* may be determined using equations (14) and (15) as described above.
Step S212 is "Overlap add extension". In this step, the waveform extension is extracted from the position identified by k* and overlap-added to the last frame.
Steps S210 and S211 correspond to step S107 in the framework described above. Steps S209 and S212 correspond to step S108 in the framework described above.
Figure 9 shows the signal waveforms for natural speech, corresponding to the top waveform; Adaptive Gain Control and Time Warp-modified speech (AGCTW), corresponding to the middle waveform; and Time Warp-modified speech (TW), corresponding to the bottom waveform. All the waveforms are for T60 = 1.1 s. The natural speech is a sentence from the listening test used to evaluate the signal modification, described below.
Reverberation was simulated using a model RIR obtained with a source-image method. The hall dimensions were fixed to 20 m x 30 m x 8 m. The speaker and listener locations used for RIR generation were {10 m, 5 m, 3 m} and {10 m, 25 m, 1.8 m} respectively. The propagation delay and attenuation were normalized to the direct sound. Effectively, the direct sound is equivalent to the sound output from the speaker.
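As a simpler alternative to the source-image method, the impulse response can be modelled, as described elsewhere in this document, as a pulse train amplitude-modulated by a decaying function whose rate is set by the reverberation time T60. A sketch, with the pulse period and RIR length chosen arbitrarily for illustration:

```python
import numpy as np

def synthetic_rir(t60, fs, length_s=1.5, pulse_period=0.004):
    """Model RIR: a pulse train amplitude-modulated by an exponential
    decay whose amplitude drops by 60 dB (a factor of 1000) after t60
    seconds. Pulse period and length are illustrative choices."""
    n = int(length_s * fs)
    h = np.zeros(n)
    period = int(pulse_period * fs)
    h[::period] = 1.0  # pulse train standing in for discrete reflections
    decay = np.exp(-np.log(1000.0) * np.arange(n) / (t60 * fs))
    return h * decay
```

Convolving clean speech with such an RIR gives a controllable approximation of the reverberant signal for a target T60.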
The signal modification framework described in relation to Figure 2 was validated with a listening test. Six native English listeners were recruited for the purpose. A total of 120 sentences were presented to each listener following an established test protocol.
Three modification strategies were considered in two reverberation conditions giving a total of 20 utterances per modification and condition for each subject. The intelligibility scores for natural speech and adaptive gain control and time warping (AGCTW) are presented in Table 1. Figure 9 illustrates the corresponding waveforms before reverberation for a test utterance in the T60 = 1.1 s condition.
Adaptive gain control and time warping (AGCTW) is used to denote the system described in relation to Figures 2 and 8 above, in which spectral modification and time scale modification are applied to the input speech.
In both conditions modified speech is, on average, more intelligible than natural speech.
Table 1: Subjective evaluation results for the intelligibility of i) natural and ii) frame-gain-modified and time-warped (AGCTW) speech for two test conditions.

T60      Natural   AGCTW
1.1 s    0.800     0.904
1.5 s    0.720     0.803

The subjective scores validate the benefit from signal modification. The results come from an open-set large-vocabulary test with six native listeners. This is the realistic expectation for a natural communication situation with native speakers.
To facilitate the interpretation of the above results, the waveforms of natural and processed speech are shown in Figure 9. The spectral modification in AGCTW suppresses redundant frames and moderately emphasizes highly-informative frames. As a result, the impact of self-masking between powerful voiced regions and ensuing weaker segments with low predictability is reduced. Time-warping inside the redundant regions in combination with suppression allows for the gradual reduction of reverberant power. This in turn reduces the need for large amplification of highly-informative segments and reduces distortion at a later time instant.
The spectrograms of the reverberated signals, for T60 = 1.1 s, corresponding to the waveforms from Figure 9 are shown in Figure 10.
Figure 10 shows narrow-band spectrograms for the original, in Figure 10(a); spectrally-modified and time-warped, in Figure 10(b); and time-warped only, in Figure 10(c); speech signals under reverberation with T60 = 1.1 s. For the time-warped-only waveform the system warps the time in the same way as AGCTW, but preserves the original frame power, i.e. does not apply any spectral modification.
It can be seen that the spectrally-modified and time-warped method provides more cues to the listener compared to the original speech and time-warped speech without spectral modification.
Figure 11 shows a schematic illustration of reverberation in different acoustic environments. The figures show examples of the paths travelled by speech signals generated at the speaker, for an oval hall, a rectangular hall, and an environment with obstacles.
Sufficiently high reverberation reduces speech intelligibility. Degradation of intelligibility can be encountered in large enclosed environments for example. It can affect public announcement systems and teleconferencing. Degradation of intelligibility can be a more severe problem for the hard of hearing population.
Reverberation reduces modulation in the speech signal. The resulting smearing is seen as the source of intelligibility degradation.
Speech signal modification provides a platform for efficient and effective mitigation of the intelligibility loss.
The framework in Figure 2 is a framework for multi-modal speech modification, which introduces context awareness through a distortion criterion. Both signal-side aspects, i.e. frame redundancy evaluation, and environment-side aspects, i.e. late reverberation power, are represented by context awareness. Multi-modal modification maintains high intelligibility in severe reverberation conditions. The multi-modal modification combines, subject to a single criterion, power dynamics recovery, i.e. a full-band spectral modification, and adaptive time scale warping, i.e. a full-band local time scale modification.
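Claims 3 and 4 describe the frame-redundancy (importance) measure as the dissimilarity of consecutive mel cepstra, weighted by a function of the number of preceding low-importance frames. A minimal illustrative sketch; the exact weighting function, threshold and constant c here are assumptions, not the patented formula:

```python
import numpy as np

def frame_importance(mfcc_prev, mfcc_cur, low_run, threshold=0.1, c=0.9):
    """Hypothetical frame-importance measure.

    Dissimilarity of consecutive mel-cepstra, scaled up the longer the
    run of preceding low-importance frames (so a change after a long
    redundant stretch is treated as more informative).
    """
    dissim = np.linalg.norm(mfcc_cur - mfcc_prev)
    weight = 1.0 + (1.0 - c ** low_run)  # grows with the low-importance run
    importance = dissim * weight
    # Track how many consecutive frames stayed below the threshold.
    new_run = low_run + 1 if importance < threshold else 0
    return importance, new_run
```

A stationary segment yields near-zero dissimilarity and is marked redundant, while a spectral change after a redundant run is boosted.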
The modification is characterized by a low algorithmic delay and a low complexity. In an embodiment, the most computationally costly operations are the search for the optimal lag k*, the MFCC computation in the frame redundancy estimator and the convolution with h in equation (3).
The modification can significantly improve intelligibility in reverberant environments.
In one embodiment, the system implements context awareness in the form of adaptation to reverberation time T60 and local speech signal redundancy. The system allows modification optimality as a result of using an auditory-domain distortion criterion in determining the depth of the speech modification. The system allows simultaneous and coherent modification along different signal dimensions allowing for reduced processing artefacts.
The system is based on a general theoretical framework that facilitates method analysis.
In an embodiment, the system can be used for public announcements in large spaces such as train stations, airports, lecture halls, tunnels and covered stadiums. Alternatively, the system can be used for teleconferencing or disaster prevention systems.
As described above, Figure 2 shows a general framework for improving speech intelligibility in reverberant environments through speech modification. Simultaneous modification of the frame-specific power and the local time scale provide a modified speech signal with low level of artefacts and higher intelligibility under reverberation.
The framework provides a unified and general framework that combines context-awareness with multi-modal modifications. These support good performance in a wide range of conditions. The information content, or importance, of a speech segment is measured, and this information is used when optimizing the modification.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed the novel methods and apparatus described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of methods and apparatus described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms of modifications as would fall within the scope and spirit of the inventions.

Claims (20)

CLAIMS: 1. A speech intelligibility enhancing system for enhancing speech, the system comprising: a speech input for receiving speech to be enhanced; an enhanced speech output to output the enhanced speech; and a processor configured to convert speech received from the speech input to enhanced speech to be output by the enhanced speech output, the processor being configured to: i) extract frames of the speech received from the speech input; ii) calculate a measure of the frame importance for each frame; iii) estimate a contribution due to late reverberation for each frame; iv) modify the frame of the speech received from the speech input, wherein the amount of modification is calculated using the measure of the frame importance and the contribution due to late reverberation.
  2. A system according to claim 1, wherein the measure of the frame importance is a continuous measure.
  3. A system according to claim 1, wherein the measure of the frame importance is a measure of the dissimilarity of the frame to the previous frame, weighted by a function of the number of previous frames for which the frame importance is below a threshold level.
  4. A system according to claim 1, wherein the measure of the frame importance is a measure of the dissimilarity of the mel cepstrum of the frame to that of the previous frame, weighted by a function of the number of previous frames for which the frame importance is below a threshold level.
  5. A system according to claim 1, wherein step i) comprises: extracting overlapping frames of the speech received from the speech input.
  6. A system according to claim 1, wherein step iv) comprises applying a spectral modification.
  7. A system according to claim 6, wherein step iv) comprises modifying the frame power, wherein the amount of modification is determined by calculating a prescribed modified frame power value that minimizes a distortion measure subject to a penalty term which is a penalty on power increase, wherein the penalty term comprises a multiplier which is a function of the measure of the frame importance.
  8. A system according to claim 7, wherein step iii) comprises estimating the frame power due to the late reverberation.
  9. A system according to claim 8, wherein the estimate of the frame power due to the late reverberation is included in the distortion measure as additive noise.
  10. A system according to claim 9, wherein the frame power due to the late reverberation for frame i is the power of the late reverberation signal in frame i, wherein the late reverberation signal at each discrete time index k in frame i is generated by the superposition of reflected, delayed and attenuated instants from the past of the signal produced up to ti seconds before the time instant corresponding to time index k in frame i.
  11. A system according to claim 10, wherein the frame power due to the late reverberation is estimated by modelling the impulse response of the environment as a pulse train that is amplitude-modulated with a decaying function.
  12. A system according to claim 11, wherein the prescribed modified frame power is calculated from the relationship: where y is the prescribed modified frame power, x is the frame power of the speech received from the speech input, n is the estimate of frame power due to the late reverberation, c1 and c2 are constants with a dependence on A, b is a shape parameter of a Pareto distribution fitted to sample speech data and A is a function of the measure of the frame importance.
  13. A system according to claim 12, wherein A is bounded by an upper value Amax and a lower value Amin, and wherein A is given by: where the argument is the measure of the frame importance and a is a constant.
  14. A system according to claim 6, wherein step i) comprises: extracting overlapping frames of the speech received from the speech input; wherein step iv) further comprises: applying a time scale modification, wherein the amount of time scale modification is calculated using the amount of modification of the frame power.
  15. A system according to claim 14, wherein the amount of time scale modification is determined from a range of possible amounts, wherein the value of the upper end of the range and the value of the lower end of the range are dependent on the amount of modification of the frame power, when the modified frame power is less than the frame power of the input speech.
  16. A system according to claim 1, wherein step iv) comprises: applying a modification of the frame power to a frame of speech received from the speech input to output a modified frame, wherein the amount of modification of the frame power is calculated using the frame importance; overlap-adding the modified frame to the enhanced speech signal, to output a new enhanced speech signal; calculating the correlation between a last segment of the new enhanced speech signal and each of a plurality of target segments of the new enhanced speech signal, wherein the target segments correspond to a range of earlier segments of the new enhanced speech signal, wherein the range is calculated using the amount of modification of the frame power, when the modified frame power is less than the frame power of the input speech; determining the target segment corresponding to the highest correlation value; replicating the section of the new enhanced speech signal from the target segment to the end of the new enhanced speech signal; overlap-adding this replicated section to the last segment of the new enhanced speech signal.
  17. A system according to claim 1, the processor being further configured to: v) overlap-add the modified frames to produce the enhanced speech signal.
  18. A system according to claim 1, wherein the measure of the frame importance for a frame i of the speech received from the speech input is a function of the dissimilarity between mi and the mel cepstrum of the previous frame, where mi is a set of Mel frequency cepstral coefficients derived from frame i, weighted by a function of a constant c, where c is in (0, 1), and Oi, where Oi is the number of preceding frames for which the frame importance is below a threshold.
  19. A method of enhancing speech, the method comprising the steps of: receiving speech to be enhanced; extracting frames of the received speech; calculating a measure of the frame importance for each frame; estimating a contribution due to late reverberation for each frame; modifying the frame of the received speech, wherein the amount of modification is calculated using the measure of the frame importance and the contribution due to late reverberation; and overlap-adding the modified frames to produce the enhanced speech signal.
  20. A carrier medium comprising computer readable code configured to cause a computer to perform the method of claim 19.
GB1507486.7A 2015-04-30 2015-04-30 A speech processing system and speech processing method Expired - Fee Related GB2537923B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1507486.7A GB2537923B (en) 2015-04-30 2015-04-30 A speech processing system and speech processing method


Publications (3)

Publication Number Publication Date
GB201507486D0 GB201507486D0 (en) 2015-06-17
GB2537923A true GB2537923A (en) 2016-11-02
GB2537923B GB2537923B (en) 2021-05-12

Family

ID=53489001

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1507486.7A Expired - Fee Related GB2537923B (en) 2015-04-30 2015-04-30 A speech processing system and speech processing method

Country Status (1)

Country Link
GB (1) GB2537923B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090043570A1 (en) * 2007-08-07 2009-02-12 Takashi Fukuda Method for processing speech signal data
US20140249812A1 (en) * 2013-03-04 2014-09-04 Conexant Systems, Inc. Robust speech boundary detection system and method




Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20230430