GB2576320A - A processing method, a processing system and a method of training a processing system - Google Patents


Info

Publication number
GB2576320A
Authority
GB
United Kingdom
Prior art keywords
audio signal
frequency spectrum
frames
magnitude
algorithm
Prior art date
Legal status
Granted
Application number
GB1813189.6A
Other versions
GB2576320B (en)
GB201813189D0 (en)
Inventor
Nikolov Petkov Petko
Stylianou Yannis
Tsiaras Vassilis
Current Assignee
Toshiba Corp
Original Assignee
Toshiba Corp
Priority date
Filing date
Publication date
Application filed by Toshiba Corp filed Critical Toshiba Corp
Priority to GB1813189.6A
Publication of GB201813189D0
Publication of GB2576320A
Application granted
Publication of GB2576320B
Legal status: Expired - Fee Related
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/0208 Speech enhancement, e.g. noise reduction or echo cancellation; noise filtering
    • G10L 21/0232 Noise filtering characterised by the method used for estimating noise; processing in the frequency domain
    • G10L 25/30 Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L 2021/02082 Noise filtering, the noise being echo, reverberation of the speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Stereophonic System (AREA)

Abstract

An audio (e.g. speech) enhancement system suppresses reverberation by using a supporting neural network 201 to generate linearly predicted (LP) magnitude spectra √Λ̃ for audio frames from an input signal ym, and then generating magnitude spectra √Λ̂ for frames of the desired output signal x̃m (i.e. reverberation-suppressed frames 202) having particular filter coefficients g, such that subtracting √Λ̂ from √Λ̃ yields zero or nearly so (i.e. the cost function 204 is minimised; see p. 25). The neural network is trained using non-parallel data.

Description

A processing method, a processing system and a method of training a processing system
FIELD
The present disclosure relates to an audio signal processing method, an audio signal processing system and a method of training an audio signal processing system.
BACKGROUND
Audio signal processing systems are used in many applications, for example to enhance audio signals (such as speech signals) in an enhanced listening device. For example, the input audio signals may be enhanced before they are output by a hearing aid device, a loudspeaker device or a mobile phone. Such processing methods may be used in all forms of enhanced listening devices. Such processing methods may also be used to enhance speech signals before they are inputted to an automatic speech recognition system (ASR) for example.
There is a continuing need to enhance audio signals, particularly signals from reverberant environments. Reverberation is the process by which multiple delayed copies of a signal are observed simultaneously. It degrades the perceptual quality (and for speech, possibly also intelligibility) of the signal by overlap-masking. It also degrades ASR performance, increasing the error rates. Late reverberation (LR), i.e. reflections with longer propagation paths and low correlation with the direct signal, is considered to be most detrimental to performance. Speech signal enhancement may be performed to improve the intelligibility of speech in a reverberant environment. Furthermore, there is a continuing need to improve the performance of ASR systems used in such environments.
BRIEF DESCRIPTION OF FIGURES
Systems and methods in accordance with non-limiting examples will now be described with reference to the accompanying figures in which:
Figure 1 shows a schematic illustration of an audio signal processing system in accordance with an example;
Figure 2 shows a schematic illustration of a configuration for a parallel (left) training data method of training a system and a non-parallel (right) training data method of training a system in accordance with an example;
Figure 3(a) shows a schematic illustration of the steps of an audio signal processing method according to an example;
Figure 3(b) shows a schematic illustration of the steps of a method of training an audio signal processing system according to an example;
Figure 3(c) shows a schematic illustration of an example unit in a recurrent neural network used in a method of audio signal processing according to an example;
Figure 3(d) shows an example filter coefficient vector for a frequency k, with two different ordering possibilities, generated in a method of audio signal processing according to an example;
Figure 3(e) shows operations performed in a method of audio signal processing according to an example and in a method of training an audio signal processing system according to an example;
Figure 4 shows a schematic illustration of an audio signal processing system combined with an automatic speech recognition system in accordance with an example;
Figure 5 shows experimental results based on automatic speech recognition performance;
Figure 6 shows experimental results based on perceptual performance;
Figure 7 shows experimental results in which a representation of the audio signal is shown.
DETAILED DESCRIPTION
According to one example, there is provided an audio signal processing method, comprising:
receiving a discrete input audio signal;
generating an estimate of a magnitude of a frequency spectrum of each of a plurality of frames of a desired audio signal corresponding to a first segment of the input signal, by inputting a magnitude of a frequency spectrum of each of the plurality of frames of the first segment of the input audio signal into a trained algorithm;
generating a frequency spectrum corresponding to each frame of the desired audio signal, by subtracting a first frequency spectrum corresponding to each frame from a frequency spectrum corresponding to each frame of the first segment of the input audio signal, wherein the first frequency spectra are generated from the frequency spectra corresponding to a plurality of further segments of the input audio signal, the plurality of further segments each being located at least partly prior to the first segment and each being weighted by a set of frequency dependent coefficients, wherein the coefficients are generated from the estimate of the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm.
In an example, the trained algorithm comprises a trained neural network. The trained neural network may be a recurrent neural network, for example comprising a long short-term memory layer.
In an example, a log magnitude of the frequency spectrum of each of the plurality of frames of the first segment of the input audio signal is inputted into the trained neural network, and wherein the trained neural network outputs an estimated log magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal.
In an example, the first frame of each of the plurality of further segments of the input audio signal is located at least a minimum number of frames prior to the first frame of the first segment.
The input audio signal may comprise multiple channels, wherein generating the estimate of the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal corresponding to the first segment of the input signal comprises generating the estimate for one channel, wherein generating the frequency spectrum corresponding to each frame of the desired audio signal comprises generating the frequency spectra for the one channel, wherein the first frequency spectrum corresponding to each frame is subtracted from the frequency spectrum corresponding to each frame of the first segment of the input audio signal for the one channel, and wherein the first frequency spectra are generated from the frequency spectra corresponding to a plurality of further segments of the input audio signal from the one channel and/or from one or more other channels.
In an example, the method further comprises generating the frequency spectrum corresponding to each frame of the desired audio signal for the one or more other channels.
In an example, the method further comprises generating an output audio signal from the frequency spectra of the desired audio signal.
The audio signal may be a speech signal. In an example, the audio signal is an audio speech signal.
In an embodiment, the method further comprises performing automatic speech recognition.
According to another example, there is provided an audio signal processing system, comprising:
an input configured to receive a discrete input audio signal;
an output configured to output information relating to a desired audio signal;
a processor configured to:
generate an estimate of a magnitude of a frequency spectrum of each of a plurality of frames of a desired audio signal corresponding to a first segment of the input signal, by inputting a magnitude of a frequency spectrum of each of the plurality of frames of the first segment of the input audio signal into a trained algorithm;
generate a frequency spectrum corresponding to each frame of the desired audio signal, by subtracting a first frequency spectrum corresponding to each frame from a frequency spectrum corresponding to each frame of the first segment of the input audio signal, wherein the first frequency spectra are generated from the frequency spectra corresponding to a plurality of further segments of the input audio signal, the plurality of further segments each being located at least partly prior to the first segment and each being weighted by a set of frequency dependent coefficients, wherein the coefficients are generated from the estimate of the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm.
According to another example, there is provided a method of training an audio signal processing system, comprising:
receiving a discrete input audio signal;
generating an estimate of a magnitude of a frequency spectrum of each of a plurality of frames of a desired audio signal corresponding to a first segment of the input signal, by inputting a magnitude of a frequency spectrum of each of the plurality of frames of the first segment of the input audio signal into an algorithm;
generating a frequency spectrum corresponding to each frame of the desired audio signal, by subtracting a first frequency spectrum corresponding to each frame from a frequency spectrum corresponding to each frame of the first segment of the input audio signal, wherein the first frequency spectra are generated from the frequency spectra corresponding to a plurality of further segments of the input audio signal and/or one or more related input audio signals, the plurality of further segments each being located at least partly prior to the first segment and each being weighted by a set of frequency dependent coefficients, wherein the coefficients are generated from the estimate of the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm;
generating the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal;
updating the algorithm based on a measure of the difference between the estimate of the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm and the magnitude of the frequency spectrum of each of the plurality of frames of the output desired audio signal.
In an example, updating the algorithm is further based on the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm or the magnitude of the frequency spectrum of each of the plurality of frames of the output desired audio signal.
In an example, updating the algorithm comprises updating the algorithm based on the difference measure between the estimate of the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm and the magnitude of the frequency spectrum of each of the plurality of frames of the output desired audio signal, multiplied by the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm or the magnitude of the frequency spectrum of each of the plurality of frames of the output desired audio signal.
The difference measure may be multiplied by the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm or the magnitude of the frequency spectrum of each of the plurality of frames of the output desired audio signal to the power of a constant.
In an example, updating the algorithm comprises updating the algorithm based on the difference between the estimate of the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm and the magnitude of the frequency spectrum of each of the plurality of frames of the output desired audio signal, added to the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm or the magnitude of the frequency spectrum of each of the plurality of frames of the output desired audio signal multiplied by a constant.
The value of the constant may be selected based on the desired application. For example, a value of the constant greater than 0 may be selected when the application uses a human listener, and a value of the constant equal to zero may be selected when the application comprises automatic speech recognition.
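Writing √Λ̂ for the magnitude spectrum estimated by the algorithm and √Λ̃ for the magnitude spectrum of the output desired audio signal (the notation used in the detailed description below), the two cost-function variants described above may, on one reading, be summarised as:

$$C_1 = \left\| \left( \sqrt{\tilde{\Lambda}} - \sqrt{\hat{\Lambda}} \right) \odot \left( \sqrt{\hat{\Lambda}} \right)^{c} \right\|, \qquad C_2 = \left\| \sqrt{\tilde{\Lambda}} - \sqrt{\hat{\Lambda}} \right\| + \delta \left\| \sqrt{\hat{\Lambda}} \right\|$$

where ⊙ denotes element-wise multiplication, c and δ are the constants referred to above, and √Λ̃ may be used in place of √Λ̂ in the weighting or penalty term. The precise form used in an example is given in the detailed description.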
In an example, the trained algorithm comprises a trained neural network. The trained neural network may be a recurrent neural network, for example comprising a long short-term memory layer.
According to another example, there is provided an audio signal processing system, trained according to any of the above methods of training.
The methods are computer-implemented methods. Since some methods in accordance with examples can be implemented by software, some examples encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal. The carrier medium may comprise a non-transitory computer readable storage medium.
According to an example, there is provided a carrier medium comprising computer readable code configured to cause a computer to perform any of the above methods.
Figure 1 shows a schematic illustration of an audio signal processing system 1 in accordance with an example.
The system 1 comprises a processor 3, and takes an input audio signal. It may output an audio signal and/or information relating to a desired audio signal (such as spectral data relating to a desired audio signal for example). A computer program 5 is stored in non-volatile memory. The non-volatile memory is accessed by the processor and the stored code is retrieved and executed by the processor 3. The processor 3 may comprise logic circuitry that responds to and processes the instructions in the stored code. The storage 7 stores data that is used by the program 5.
The system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to an input 15 for receiving the signal. The input 15 may be a receiver for receiving data from an external storage medium or a communication network. Alternatively, the input 15 may comprise hardware such as a microphone. Connected to the output module 13 is output 17. The output 17 may comprise hardware, such as a speaker. Alternatively, the output may be a transmitter for transmitting data to an external storage medium or through a communication network. Alternatively, the output may direct to, for example, an ASR application. For example, the system may be used as the front-end of an ASR system.
In an example, the system 1 may be located in a common system with hardware such as a microphone, and/or speaker for inputting and outputting audio signals. Alternatively, the system 1 may be a remote system 1, which receives data regarding the input signal transmitted from another unit, and/or transmits data regarding the output signal to the same or a different unit. For example, the system may be implemented on a cloud computing system, which receives and transmits data. Although in the described system, a single processor 3 located in a device is used, the system may comprise two or more remotely located processors configured to perform different parts of the processing and transmit data between them. For example, where the system is located in a hearing aid device, some or all of the processing may be performed at a remote server, and data transmitted to the hearing aid device (which comprises the microphone and speaker).
In use, the system 1 receives data corresponding to the input signal through data input 15. The program 5, executed on processor 3, outputs data corresponding to the output signal through the output 17 in the manner which will be described with reference to the following figures. The system may be used for reverberation suppression in listening enhancement devices or in an ASR front end.
Figure 2 shows a schematic illustration of a configuration for parallel (left) and non-parallel (right) training data methods. A flow chart of a training method and a processing method in accordance with a first example is shown on the right hand side of the figure, in which a desired audio signal xm is generated from an input audio signal y. The input audio signal may comprise a single audio channel, or two or more audio channels. For example, y may comprise the signals y1 to yM from an array of M microphones. Where there is more than one channel, a channel selection 208 is performed to fix the target channel. The process may, however, be repeated for as many channels as desired, by fixing a different target channel and repeating the process for the input audio signal segment. The output signal corresponds to the desired audio signal for the selected channel (in this case channel m). The desired audio signal xm may be generated so as to reduce the effect of reverberation in the input signal. The method uses a supporting neural network 201 labelled design II, which is trained using observed audio signals, in the manner which will be described below.
The method is a prediction based method, and uses (multi-channel) linear prediction ((MC)LP) 202. Although (multi-channel) linear prediction is referred to, as explained above, there may be a single input channel only. The method uses an estimated magnitude spectrum to generate the desired output signal. In the first example, illustrated on the right-hand side of the figure, the estimated magnitude spectrum is generated from the output of the supporting neural network 201. The squared magnitude spectrum (or power spectrum) can be considered to be equivalent to the instantaneous variance. Using a supporting neural network 201 to estimate the magnitude spectrum, rather than using an iterative optimization procedure for example, can provide reduced distortion in the output and reduce instabilities, even for input signals of short duration. It can provide improved robustness, and may reduce the need for repeated computationally heavy operations to be performed during operation. The neural network 201 is trained to predict a magnitude spectrum, for example a log magnitude spectrum, and then a single-shot estimation is performed to generate filter coefficients g (described in more detail below) from the estimated magnitude spectrum, and use these coefficients to generate the desired output signal.
A processing method according to a comparative example is shown on the left hand side of the figure, in which a “clean” audio signal xm is generated from the input audio signal y. The “clean” audio signal is also generated so as to reduce the effect of reverberation. The method uses a supporting neural network 206 labelled design I, which is trained using observed audio signals together with corresponding “clean” signals as targets. During the training stage for the comparative example, the supporting neural network 206 is trained using parallel data 205. Each audio signal used to train the system requires a corresponding clean signal, which is used as a target to train the neural network 206. The training of the neural network labelled design I thus requires parallel data, i.e., both reverbed and “clean” data. Clean signals comprise only the direct sound and early reflections.
For the method of the comparative example, during implementation, the input is the signal array y. Where there is more than one channel, a channel selection 208 is performed, in which one of the channels is selected, and the signal from the channel is inputted into the trained neural network 206. The output of the supporting neural network 206 is used to generate a predicted “clean” signal magnitude spectrum, for example a log magnitude spectrum, in the time frequency plane. This is then inputted to the (MC)LP algorithm 207, together with the input channel signal and optionally further data from one or more further channels. The (MC)LP algorithm outputs information relating to the “clean” audio signal x̂m, for example spectral data. The dereverbed audio signal may be generated from the information.
For the method according to the first example on the other hand, each audio signal used to train the neural network may be an independent, observed signal. Such signals are straightforward to obtain, even for particular or new acoustic environments. In the model according to the first example, the training of the supporting neural net 201 is thus achieved without parallel data. The supporting neural net 201 may be trained using only observed audio signals, for example reverberant speech 203. The audio signals used to train the neural network need not be signals observed in the same environment as that in which the system will be used, but may be audio signals from any acoustic environment. For example, audio signals from a first acoustic environment may be used to train the neural network, and the system may then be implemented in a second acoustic environment. Multi-condition training may be performed, using signals from many different acoustic environments. Performing multi-condition training allows the neural network to generalise better during implementation. Furthermore, by gradually reducing the learning rate over time during the training stage, overfitting to the training data can be mitigated, and the generalisation may be further improved.
Furthermore, the supporting neural network 201 is trained using the output of the (multi-channel) linear prediction of the desired signal spectrum. Such a training method is light-weight, can be integrated into larger systems, is less computationally demanding and less volatile. Furthermore, it works solely on the audio signal level. Furthermore, combination with other enhancement methods such as noise reduction and speech separation is possible.
During the training stage, the reverbed training data 203 is inputted. Each signal in the training data 203 comprises one or more audio channels. Where there is more than one channel, one of the channels is selected, and the signal from the channel is inputted into the neural network 201. The output of the supporting neural network 201 is used to generate a predicted desired magnitude spectrum, for example a log magnitude spectrum, in the time frequency plane. This is then inputted to the (MC)LP algorithm 202, together with the input channel signal and optionally further data from one or more further channels in the training data. The (MC)LP algorithm 202 outputs information relating to the desired audio signal xm, for example spectral data. The magnitude spectrum of the desired audio signal is then calculated. The difference between the magnitude spectrum of the desired audio signal xm output from the (MC)LP 202 and the predicted magnitude spectrum generated from the output of the neural network 201 is then used in the cost optimizer 204 to train the neural network 201.
During implementation, the input is the signal array y. Where there is more than one channel, a channel selection 208 is performed, in which one of the channels is selected, and the signal from the channel is inputted into the trained neural network 201. The output of the supporting neural network 201 is used to generate a predicted desired magnitude spectrum, for example a log magnitude spectrum, in the time frequency plane. This is then inputted to the (MC)LP algorithm 202, together with the input channel signal and optionally further data from one or more further channels. The (MC)LP algorithm outputs information relating to the desired audio signal xm, for example spectral data. The desired time domain audio signal may then be generated from this data for example, or this data may be inputted into an ASR system.
The dashed box in both parts of the figure shows the system components involved when training the supporting neural network. For the neural network 206 having design I, the training data comprises both inputs and targets. For the case of the neural network 201 having design II, which uses non-parallel data, the steps performed during the training stage include all system components used to generate the internal reference needed for training, i.e. the cost function used for optimisation. In particular, the steps of the (MC)LP stage 202 are also performed during training, since the output is used for optimisation 204.
A more detailed schematic illustration showing the steps of the method according to the first example is shown in Figure 3(a). The illustration also shows the steps for training the supporting neural network in Figure 3(b). Core linear prediction operations are enclosed by a dashed line. The neural network is trained in line with the signal flow of the linear prediction model.
In the first example, the neural network 201 is an LSTM (long short-term memory) DNN (deep neural network); however, the neural network 201 may alternatively be a different kind of neural network. For example, it may be a different kind of recurrent neural network. Alternatively, it may be a type of neural network other than a recurrent neural network, and context information may be provided as part of the input (namely data from one or more frames prior to and/or subsequent to the target frame). Alternatively, other kinds of machine learning algorithms may be used.
In these figures, solid arrows (paths) symbolize input data representations. Dotted arrows represent data which depends on the supporting neural network 201. The two paths flowing into the cost specification box provide all the information for the optimization process. Once the system is trained, the path providing Λ̃ and Λ̂ to the cost function is disconnected and the output is provided by the dashed arrow, as shown in Figure 3(a).
The input signal, both during training and during implementation, is an audio signal y. y may be a single channel audio signal or a multi-channel audio signal. Each channel will be designated with the suffix m, and the total number of channels is designated M, where M=1 for the single channel case. The inputs may be signals acquired by one or more microphones (a microphone array). A single or multi-microphone set-up may be used for example. Training of the neural network 201 can be performed using a single channel or multiple channels. During operation, a single or multiple channel signal can then be used. Different numbers of channels may be used during training and implementation. Using more channels may provide improved de-reverberation performance for example.
The input signal is processed in segments. Each segment may correspond to an utterance (for example a sentence), and the segment length may be of the order of a couple of seconds for example. Each input segment may have different length. The input signal y may be divided into segments. The signal may be segmented into utterances by identifying silences, which occur between the utterances. The input signals are sampled signals, comprising discrete time samples of the audio input.
To allow generation of complex valued representations of the input signal segment, for each channel the segment is framed, for example overlapping frames are extracted.
The frames may then be windowed with an appropriate smooth window function. Each frame is denoted with the index n, where N denotes the total number of frames in each segment. Since each segment may be of a different length, N may take a different value for each input. In an example, N may be of the order of 500. The choice of the window ensures that if no processing is done, i.e. the input is not modified, then at the output side the original signal can be re-synthesized.
Each frame is then transformed into the frequency domain, for example using a discrete Fourier transform, for example based on a Fast Fourier Transform algorithm. The discrete Fourier Transform converts the frame of equally-spaced samples of the audio signal in the time domain into a same-length sequence of equally-spaced samples of the discrete-time Fourier transform (DTFT), which is a complex-valued function of frequency. Each sample is denoted with the index k, and there are K total samples in the frequency spectrum corresponding to the frame n. The interval at which the DTFT is sampled is the reciprocal of the duration of the input frame (i.e. number of time domain samples in the input frame). The output sample values are the coefficients of unique complex sinusoids at the corresponding frequencies k. In other words, only half of the spectral values are used (the other half being complex conjugates). In an embodiment, K=257.
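As an illustration of this pre-processing stage, the following minimal numpy sketch frames, windows and transforms a single-channel signal into a KxN complex spectrogram. The specific frame length, hop size, sample rate and sqrt-Hann window are illustrative assumptions; only K = 257 (i.e. a 512-point transform) follows the example in the text.

```python
import numpy as np

def stft_frames(y, frame_len=512, hop=128, n_keep=257):
    """Frame, window and DFT a time-domain signal into a K x N complex matrix.

    Assumes len(y) >= frame_len.
    """
    window = np.sqrt(np.hanning(frame_len))       # smooth analysis window (sqrt-Hann assumed)
    n_frames = 1 + max(0, len(y) - frame_len) // hop
    Y = np.empty((n_keep, n_frames), dtype=complex)
    for n in range(n_frames):
        frame = y[n * hop: n * hop + frame_len] * window
        Y[:, n] = np.fft.rfft(frame)[:n_keep]     # single-sided spectrum: K = 257 bins
    return Y                                      # K rows (frequencies), N columns (frames)

# Example: one second of noise at an assumed 16 kHz sample rate
Y_m = stft_frames(np.random.randn(16000))
print(Y_m.shape)                                  # (257, N)
```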
These initial signal pre-processing operations are not illustrated in Figure 3, in which the input signal Y comprises the complex spectra corresponding to the frames for the segment. In this specification, upper-case letters may represent the complex spectrum (e.g. Y) whereas lower case letters may represent the time domain real valued signal (e.g. y).
A channel selection step 208 is performed on the input signal Y, in which, for the case of two or more channels, one of the channels is selected as the target channel Ym. Ym is a KxN matrix comprising K rows (corresponding to each frequency) and N columns (corresponding to each frame in the segment). Each column in the matrix corresponds to the complex spectrum for the frame n. Each entry in the matrix corresponds to a complex number for the frequency k and frame n. For the single channel case, this step is omitted, and Ym = Y for each input segment.
In step 301, a matrix |Ym| comprising the magnitude spectrum values of Ym is generated. Each entry (k, n) in the matrix |Ym| corresponds to the modulus of the complex number in the entry (k, n) in Ym.
The matrix log|Ym| is then generated, by taking the natural log of each value in |Ym|. The log of the magnitude spectrum is used in order to compress the range, however this step may optionally be omitted. Numerical robustness is enhanced by operating in log domain.
The log magnitude spectrum for each frame may be inputted into the neural network in 201, one frame at a time. Each vector of length K corresponding to a frame of the segment is inputted into the neural network in sequence. The neural network outputs a vector of length K corresponding to each input vector. These are combined to form a KxN output matrix corresponding to the segment.
In an alternative example, each input to the neural network may comprise the target frame together with context information. In this case, step 302 is performed in order to generate the input comprising the context information for each target frame. In step 302, a splice operation is performed on the matrix log|Ym|. In this step, for each target frame n (corresponding to a column n), a “spliced” vector comprising the data from p previous frames and s subsequent frames is generated. In an example, for each target frame n, the features from frame n-5 to frame n+5, with the target frame n in the middle, are concatenated into a vector. This is done for each target frame, in sequence from n=1 to n=N, such that N input vectors, each of length {(Kxp)+K+(Kxs)}, are generated. Where the target frame is one of the first p frames in the segment, the missing frame entries may be replaced by zeroes, or by repeating one or more frames for example. Similarly, where the target frame is one of the last s frames in the segment, the missing frame entries may be replaced by zeroes or by repeating one or more frames for example.
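A minimal sketch of the splice operation of step 302 is given below, using zero padding at the segment edges (one of the two options mentioned above); the p = s = 5 context follows the ±5 frame example in the text.

```python
import numpy as np

def splice(log_mag, p=5, s=5):
    """Build N spliced input vectors of length (K*p) + K + (K*s) from a K x N matrix."""
    K, N = log_mag.shape
    # Zero-pad so that frames before n=1 and after n=N are available as context
    padded = np.concatenate([np.zeros((K, p)), log_mag, np.zeros((K, s))], axis=1)
    spliced = np.empty((N, (p + 1 + s) * K))
    for n in range(N):
        # padded columns n .. n+p+s correspond to original frames n-p .. n+s
        spliced[n] = padded[:, n: n + p + 1 + s].T.reshape(-1)
    return spliced
```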
The vector for each target frame is then inputted into the neural network in 201, one vector at a time. The neural net has learned a mapping from the input vectors to the log-magnitude spectrum of the enhanced signal, i.e. the desired signal.
The operation of an LSTM neural network will be described here, however alternative machine learning algorithms can be used in this step, for example a feed-forward neural network.
Each target vector of length {(Kxp)+K+(Kxs)}, for example Kx11, corresponding to a target frame of the segment is inputted into the LSTM NN in sequence. The LSTM NN outputs a vector of length K corresponding to each input vector. These are combined to form a KxN output matrix corresponding to the segment.
The method may be implemented using a machine learning toolkit such as TensorFlow, CNTK, Chainer, or Caffe for example. In an example, the neural network comprises a single long short-term memory (LSTM) layer having 500 units. This may be followed by one or more fully connected layers, for example two fully-connected layers with 2048 nodes each. Rectified linear unit (ReLU) activations may be used for these layers. A final linear layer, i.e. a fully-connected layer with no activation, may be included. This maps to a K dimensional output corresponding to the single-sided (log) magnitude spectrum, for example a 257 dimensional output. The input to the network comprises the center, i.e. target frame, plus a context of 10 (±5) frames in this example.
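A sketch of this example architecture, written with the TensorFlow Keras API (one of the toolkits mentioned above), is shown below. The layer sizes follow the text; the choice of Keras itself and the handling of the batch and time dimensions are assumptions.

```python
import tensorflow as tf

K = 257        # single-sided (log) magnitude spectrum size
CONTEXT = 11   # target frame plus +/-5 frames of context

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, K * CONTEXT)),      # sequence of spliced frame vectors
    tf.keras.layers.LSTM(500, return_sequences=True),
    tf.keras.layers.Dense(2048, activation="relu"),
    tf.keras.layers.Dense(2048, activation="relu"),
    tf.keras.layers.Dense(K),                       # final linear layer: log magnitude output
])
model.summary()
```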
During operation, each vector is input to the LSTM layer in sequence. Figure 3(c) shows a schematic illustration of an example LSTM layer. The σ and tanh in the dashed boxes each represent a neural network layer with the respective non-linear activation function (sigmoid and tanh). In the example, each of these layers comprises 500 nodes (or units). The output of the LSTM layer in this case is thus a vector of length 500. The tanh and other operations in the solid boxes represent point-wise operations. The dashed arrows represent the information passed on to the next time step (i.e. corresponding to the next input vector). In this case, the output (lower dashed line) for the input target vector corresponding to n is passed on to the next time step, and input at the point indicated by the lower dashed line. Furthermore, the cell state (upper dashed line) is passed on to the next time step and input at the point indicated by the upper dashed line.
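For reference, the gate and cell-state updates described above correspond to the standard LSTM equations (a conventional formulation given for context; the exact parametrisation of the layer in Figure 3(c) is not specified in the text):

$$
\begin{aligned}
i_n &= \sigma(W_i x_n + U_i h_{n-1} + b_i), & f_n &= \sigma(W_f x_n + U_f h_{n-1} + b_f),\\
o_n &= \sigma(W_o x_n + U_o h_{n-1} + b_o), & u_n &= \tanh(W_c x_n + U_c h_{n-1} + b_c),\\
c_n &= f_n \odot c_{n-1} + i_n \odot u_n, & h_n &= o_n \odot \tanh(c_n),
\end{aligned}
$$

where $x_n$ is the input vector at step n, $h_{n-1}$ is the output passed on from the previous step (the lower dashed line), $c_{n-1}$ is the cell state (the upper dashed line), and ⊙ denotes point-wise multiplication.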
The output of each layer is then fed as the input to the subsequent layer. Each node in the fully connected layers computes a weighted sum of all of its inputs (being the outputs of each node/unit in the previous layer) and an additive bias term, and then applies the activation function to the result. The weights and biases which characterise each layer are learned before operation during the training stage, which will be described below. These are the trainable parameters.
The output of the neural network comprises a log magnitude spectrum of the desired audio signal corresponding to the segment. The matrix is a KxN matrix, comprising K rows (corresponding to each frequency sample) and N columns (corresponding to each frame in the segment). Each column in the matrix corresponds to the log magnitude spectrum of the desired signal for the frame n. Each entry in the matrix corresponds to the log magnitude spectral value for the frequency k and frame n.
The exponential of the output matrix is then taken in 304, to generate a magnitude spectrum |X̂m|. The exponential of each value in the output matrix is taken in this step. This magnitude spectrum matrix corresponds to the magnitude spectrum √Λ̂ for the desired output signal, which is a KxN matrix. Thus the square of the exponential of the neural network output corresponds to the prediction for the second moment (instantaneous variance), i.e. the predicted power spectrum (power spectral density), of the desired signal in each time-frequency cell (n, k).
In 305, each element of the magnitude spectrum matrix is raised to the power of -2, in other words each value in the matrix is raised to the power of -2, to generate Φ. The matrix Φ is also a KxN matrix, and is equivalent to |X̂m|^-2. The notation $|X_{n,k,m}|^2 = \Lambda_{n,k,m} = \varphi_{n,k,m}^{-1}$ is used.
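Steps 304 and 305 amount to the following minimal numpy sketch (the random matrix simply stands in for the KxN network output):

```python
import numpy as np

log_mag_hat = np.random.randn(257, 500)    # placeholder for the K x N neural network output
sqrt_lambda_hat = np.exp(log_mag_hat)      # |X_hat_m|, i.e. the magnitude spectrum (step 304)
phi = sqrt_lambda_hat ** -2.0              # Phi = |X_hat_m|^(-2) = Lambda_hat^(-1)  (step 305)
```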
In the next steps, (MC)LP is performed based on a weighted prediction error (WPE) method. In these steps, for each frequency k, a matrix Diag(Φk) is generated, which comprises the values from the kth row of the matrix Φ in the diagonal entries, with all other entries being zero. Diag(Φk) is thus an NxN matrix of real values.
In the linear prediction method, a complex spectrum corresponding to each frame of the enhanced signal, i.e. the desired output signal is generated. The below provides an explanation of the derivation of the calculation performed to generate the complex spectrum. The calculations actually performed will then be described.
The complex spectrum may be derived as the minimum variance estimator:

$$\tilde{X} = \int X \, p_{X|Y}(X|Y) \, dX$$

where pX|Y(X|Y) is the conditional probability of X given Y. Given the observations Y, this expression is the minimum variance estimator of X.
The conditional probability density function is obtained through Bayes' rule from:

$$p_{X|Y}(X|Y) = \frac{p_{Y|X}(Y|X)\, p_X(X)}{\int p_{Y|X}(Y|X)\, p_X(X)\, dX}$$
In these expressions, the oblique font denotes the random variables for which the probability density function is defined. In this expression, pX is the source model (i.e. the probability of the clean (i.e. enhanced) signal X), and pY|X is the room acoustics model (i.e. the probability of Y given X); both models are complex-valued distributions. Operating in the short-term Fourier transform (STFT) domain, i.e. looking at one frequency over multiple frames which are offset by, e.g., 8 ms (i.e. partially overlapping), offers computational efficiency. In addition, the de-correlating effect of the transform justifies the independent enhancement of individual spectral bins.
The mean X̃ is given in the following expression, where the choice of an autoregressive propagation model leads to a moving-average expression for the conditional expectation X̃, giving the optimally de-reverbed spectrum:

$$\tilde{X}_{n,k,m} = Y_{n,k,m} - \sum_{t=D_k+1}^{D_k+L} g^H_{t,k,m}\, Y_{n-t,k} = Y_{n,k,m} - g^H_{k,m}\, S_{n,k}$$
D is a delay preventing over-prediction and L is the number of filter coefficients for each observation microphone channel m. Use of D relaxes the estimation of the source to that of the direct plus early reflections (ER) spectrum. Both Yn-t,k and its stacked representation Sn,k may contain L observations from all microphone channels. The suffix m refers only to the target channel for the output signal (thus Yn-t,k and Sn,k may include observations from many channels, whereas Yn,k,m is the input signal corresponding to the target channel m only). In general, the suffix m refers to the target channel. Including all values gt,k,m for t from Dk+1 to Dk+L in a single vector gives gk,m. The length of gk,m is ML (assuming the same number of observations is used for each channel). If M=3, then each gt,k,m contains 3 numbers; for M=1, it is a single number. Thus, for M=1, there are L coefficients in the vector gk,m; if M=3, gk,m contains 3L coefficients. The coefficients and stacked representation will be discussed in further detail below.
In this example, Dk is the same for each frequency k, however Dk can be dependent on the frequency channel k.
Furthermore, in this example, L is the same for each frequency k, however, L can be made dependent on the frequency channel k. Furthermore, in this example, L is the same for each microphone channel, however L can be made dependent on m.
The calculation of the complex spectrum corresponding to each frame of the enhanced signal, i.e. the desired output signal (the derivation of which is described above),

$$\tilde{X}_{n,k,m} = Y_{n,k,m} - g^H_{k,m}\, S_{n,k},$$

is performed in the following steps. This equation is the form of the mean of the posterior distribution, taken as the point estimate of the desired parameter, as described above. This is the weighted prediction error solution to the de-reverberation problem (at time instant n).
In the present application, H denotes the conjugate transpose of a matrix, obtained by taking the transpose of the matrix and then taking the complex conjugate of each entry.
Yn,k,m corresponds to the input KxN matrix comprising K rows (corresponding to each frequency sample) and N columns (corresponding to each frame in the segment) for the selected channel m (where m is always equal to 1 for the single channel case), where each entry in the matrix corresponds to a complex number for the frequency k and frame n. It is the observed signal, for example the observed reverberant speech spectrum.
X̃n,k,m is the desired output signal for the segment. For example, it may be the de-reverbed speech spectrum at time n and frequency k for channel m.
The gk,m are the filter coefficients, and are derived from the predicted matrix Φ in the manner described below. They may be the optimal de-reverberation filter coefficients.
The Sn,k are signal observations comprising multiple time instants for an individual microphone or (if more microphone channels are available) multiple microphones, which will be described in further detail below. They are the stacked observations preceding the target time instant, excluding a “buffer” of observations immediately preceding the target observation. Excluding this buffer acts to mitigate over-prediction (flattening). Sk is the matrix of observations over all time instants.
As described above, suffixes n, k and m represent the time instant (frame index), the frequency channel and the target microphone channel respectively. Throughout this application, the notation ^ (hat) denotes an estimate from the neural network, and ~ (tilde) denotes the linear prediction output.
Deriving the conditional mean results in the expressions for the filter coefficients given below.
As described above, a prediction for |Xm|^-2, denoted |X̂m|^-2, is generated from the output of the neural network. This is then used to calculate the filter coefficients. The calculation of the filter coefficients is described below. This avoids an iterative estimation of the filter coefficients and |Xm|^-2. The supporting neural network 201 in this case predicts |Xm| in the log domain.
The filter coefficients gk,m thus depend on the magnitude spectrum of the desired output signal. For example, if it is desired to generate a “clean” signal, comprising only direct sound and early reflections, the filter coefficients depend on the power spectrum of the clean signal. However, only the observed signal is known initially. The filter coefficients are calculated in the below steps, from the output of the neural network.
In the calculation described below, each frequency channel is processed separately from the others. For each frequency k, the calculation

$$r_{k,m} = S_k \,\mathrm{Diag}(\Phi_k)\, Y^H_{k,m}$$

is performed. rk,m is referred to as the correlation vector, and represents the correlation between the segments in Sk and the input Yk,m. These are temporal correlations based on the lag, and can also be spatial and temporal correlations if there are multiple microphone channels in Sk.
For each frequency k, Sk is an (M*L)xN matrix, where N is the number of frames in each segment, and L is the total number of observations used for each microphone channel, M being the number of channels. Each observation l corresponds to a different segment of the input audio signal, from the same or from a different audio channel (i.e. microphone channel), of length N frames, and located prior to the segment currently being processed.
For example, where the first frame of the input audio signal is designated n = 1 and where N=500, the matrix Sk may comprise the complex values corresponding to the frequency k for L = 40 segments. Each column in Sk corresponds to a frame n, from n=1 to N=500. For the case of a single microphone channel, for a column n, the rows correspond to the frames of the input signal n-t, from t=D+1 to t=D+L, where, for example, D=5. For each adjacent column n, the observation moves forward one time index. For M microphone channels, row entries may correspond to n-t, from t=D+1 to t=D+L, for each channel m, in any order. Thus, where D=5, M=2, L=40 and N = 500, the column corresponding to n=101 comprises the entries corresponding to frames 56 to 95 for one microphone channel. Zeroes may be used for entries corresponding to frames prior to n=1 and frames after n=N, for example.
In an example, N >> L. For example, where N=500 (frames), L=30. All microphones and observations pertaining to a single target frame n are stacked in one column of Sk. In other words, each column n of the matrix Sk comprises the observations used to predict the late reverberation spectral density at target frame n, and used to compute the enhanced (de-reverbed) spectral density. The order of the observations in the rows is arbitrary, but is consistent throughout.
Each segment is generated corresponding to a time period which excludes a “buffer” of observations immediately preceding the target segment to mitigate over-prediction (flattening). If only a single channel is present, all the observations are taken from the single channel. If other channels are present, observations may be taken from one or more of the channels, for example all of the channels, for example the observations may be evenly split between the channels.
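The construction of Sk is illustrated below for the single-channel case, with zeros for out-of-range frames and the illustrative D = 5, L = 40 values used in the example above; for M channels, the rows for each channel would be stacked in a fixed, consistent order.

```python
import numpy as np

def build_stacked_observations(Y_k, D=5, L=40):
    """Build the L x N matrix S_k for one frequency bin k and a single microphone channel.

    Y_k: length-N vector of complex values (row k of Y_m).
    Column n holds Y_k[n - t] for t = D+1 .. D+L, i.e. past observations excluding
    a buffer of D frames immediately preceding the target frame n.
    """
    N = len(Y_k)
    S_k = np.zeros((L, N), dtype=complex)
    for n in range(N):
        for i, t in enumerate(range(D + 1, D + L + 1)):
            if n - t >= 0:                 # zeros are kept for frames before the segment start
                S_k[i, n] = Y_k[n - t]
    return S_k
```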
Yk,m corresponds to the values in the row k of the matrix Ym, and is a 1xN matrix. Yk,m is a vector of observations from microphone channel m and frequency channel k. The output rk,m is an (M*L)x1 vector of complex values.
For each frequency sample k, the calculation

$$R_{k,m} = S_k \,\mathrm{Diag}(\Phi_k)\, S_k^H$$

is also performed. The output Rk,m is an (L*M)x(L*M) matrix of complex values and is a weighted auto-correlation matrix. Note that, as shown in Figure 3(e), the product Sk Diag(Φk) is computed only once. Furthermore, in practice, due to the zero entries in Diag(Φk), the full matrix operations may not be computed; the matrix algebra notation is used here for illustration. More efficient implementations that are numerically equivalent but require less memory and time to compute may be used.
The filter coefficients are then calculated as

$$g_{k,m} = R_{k,m}^{-1}\, r_{k,m}$$

where Rk,m^-1 is the matrix inverse of Rk,m (and is an (L*M)x(L*M) matrix) and gk,m is an (L*M)x1 vector of complex values. Figure 3(d) shows the vector g with two different ordering possibilities. The length of gk,m is LM. The suffix m denotes the target channel, whereas the index in brackets denotes the channel on which the coefficient operates (assuming multiple microphone channels). The order of the coefficients is arbitrary, but the same ordering must be used for Sk as well. Although the first subscript t is shown running from 1 to L, this is just for convenience; as described above, t runs from D+1 to D+L.
The observations are then weighted by the filter coefficients, in the calculation gk,m^H Sk, which outputs a 1xN matrix of complex values. The outputs for each frequency k are then combined to give a matrix corresponding to the frequency spectrum for each frame n, i.e. the 1xN matrices corresponding to each frequency k form the rows of the combined matrix, which is a KxN matrix of complex values. This matrix therefore comprises a first frequency spectrum corresponding to each frame, where each column corresponds to a first frequency spectrum for the frame n.
This KxN matrix of complex values (i.e. comprising the first frequency spectrum corresponding to each frame n) is then subtracted from the input matrix Ym, to give the frequency spectrum for the desired output signal, i.e. X̃, a KxN matrix of complex values, comprising K rows (corresponding to each frequency) and N columns (corresponding to each frame in the segment). Each column in the matrix corresponds to the complex spectrum for the frame n. Each entry in the matrix corresponds to a complex number for the frequency k and frame n.
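Putting the above steps together, a minimal single-channel numpy sketch of the per-frequency computation of rk,m, Rk,m, gk,m and the final subtraction is shown below. The delay D, filter length L and the small diagonal loading added before the matrix inverse (for numerical stability) are assumptions; the full matrix products are written out for clarity rather than efficiency.

```python
import numpy as np

def wpe_dereverb_single_channel(Y_m, phi, D=5, L=40):
    """Weighted-prediction-error step: Y_m and phi are K x N; returns the K x N spectrum X_tilde."""
    K, N = Y_m.shape
    X_tilde = np.empty_like(Y_m)
    for k in range(K):                                    # each frequency bin handled separately
        Y_k = Y_m[k]
        S_k = np.zeros((L, N), dtype=complex)             # stacked delayed observations
        for n in range(N):
            for i, t in enumerate(range(D + 1, D + L + 1)):
                if n - t >= 0:
                    S_k[i, n] = Y_k[n - t]
        W = S_k * phi[k]                                  # S_k Diag(Phi_k), computed once
        r_k = W @ Y_k.conj()                              # correlation vector (L x 1)
        R_k = W @ S_k.conj().T                            # weighted auto-correlation matrix (L x L)
        g_k = np.linalg.solve(R_k + 1e-8 * np.eye(L), r_k)   # g_k = R_k^{-1} r_k (with loading)
        X_tilde[k] = Y_k - g_k.conj() @ S_k               # subtract the predicted late reverberation
    return X_tilde
```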
This spectral data may be output directly to an Automatic Speech Recognition system for example. Figure 4 shows an example of a method according to the first example performed on the input to an ASR system.
Alternatively, for some applications the signal is resynthesized from the enhanced spectrum. An inverse discrete Fourier transform, for example an inverse FFT, is performed for each frame, and the output time domain frames are overlap-added to synthesize the time domain audio signal. This audio signal may then be outputted through a speaker of an enhanced listening device for example. Alternatively, this may be inputted to an ASR system. Thus re-synthesis followed by feature extraction may be performed for an ASR system.
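A matching overlap-add re-synthesis sketch is given below, using the same assumed framing parameters and sqrt-Hann window as the analysis sketch earlier; the normalisation by the summed squared window makes an unmodified spectrum reconstruct the original signal up to the segment edges.

```python
import numpy as np

def resynthesize(X_tilde, frame_len=512, hop=128):
    """Inverse-FFT each frame of the K x N spectrum and overlap-add to a time-domain signal."""
    K, N = X_tilde.shape
    window = np.sqrt(np.hanning(frame_len))
    y_out = np.zeros((N - 1) * hop + frame_len)
    norm = np.zeros_like(y_out)
    for n in range(N):
        frame = np.fft.irfft(X_tilde[:, n], n=frame_len) * window   # synthesis window
        y_out[n * hop: n * hop + frame_len] += frame
        norm[n * hop: n * hop + frame_len] += window ** 2
    return y_out / np.maximum(norm, 1e-8)                            # overlap-add normalisation
```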
In the above described method, a step of selecting a channel is performed in 208. The output of the method is then a desired audio signal (or information relating to the desired audio signal such as the spectral data) corresponding to the selected channel. If it is desired to output a desired audio signal for two or more channels (for example corresponding to each input channel), then the method must be repeated for each output channel, selecting the corresponding input channel in 208 each time. The same prior observations S may be used for each repeat of the method, as the same observations may be used for any target channel. Furthermore, the same neural network 201 is used, regardless of the target channel. A separate output signal corresponding to each channel is then generated. These may be combined to form fewer channels, for example a combined output channel. The method may be performed as part of the input to a beam-former for example.
Figure 3(b) shows the additional steps performed during the training stage in order to train the neural network. The method performs neural-net-supported de-reverberation avoiding the need for parallel training data. During the training stage, input audio signals are inputted in the same manner as described previously. Multiple microphone channels may be used for training. Different numbers of channels may be used during training, testing and implementation. Furthermore, the training signals may come from the same or a different acoustic environment as the eventual implementation. The steps described above in relation to Figure 3(a) are performed for each input segment. Training of the supporting neural net is then performed in the absence of parallel training data.
During the training stage, the following additional steps are performed. The magnitude of the predicted frequency spectrum for the desired output signal X̃ outputted from 202 is taken, i.e. a matrix |X̃| comprising the magnitude values is generated. Each entry (k, n) in the matrix corresponds to the modulus of the complex number in the entry (k, n) in X̃. This is equivalent to the magnitude spectrum, or the square root of the instantaneous variance.
The difference between the two values, i.e. the magnitude spectrum √Λ̃ determined from the generated frequency spectrum of the desired output signal and the predicted magnitude spectrum √Λ̂ for the desired output signal generated from the log magnitude spectrum outputted from the neural network, is then used in an optimisation function used to train the neural network 201. In other words, the difference between the estimate of the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the neural network 201 and the magnitude of the frequency spectrum of each of the plurality of frames of the output desired audio signal from 202 is used for the optimisation.
The magnitude spectrum √Λ̃ determined from the generated frequency spectrum of the desired output signal and the predicted magnitude spectrum √Λ̂ for the desired output signal generated from the log magnitude spectrum outputted from the neural network should be approximately the same, therefore the neural network is trained to minimise this difference. In other words, the second moment of the desired signal spectrum output at a time-frequency cell should approximate the estimated power spectrum Λ̂ from the output of the neural network at the same cell. That is, once the signal information is generated, the resulting power spectrum Λ̃ (computed from X̃) should be close to the prediction Λ̂ generated from the neural network output.
By optimising a cost function, the parameters of the supporting neural network 201 may be learned. By exploiting the properties of the de-reverberation model, namely that Λ and λ are two estimates of the same random variable, the supporting neural network can be trained using observed signals only, for example reverberant speech. This training involves back-propagating the gradients through the complete de-reverberation system, as described below. Exploiting the specifics of the linear prediction model allows an unsupervised approach to training the supporting neural network without involving an acoustic model, and the resulting framework offers efficiency and modularity. The linear prediction model estimates the de-reverbed spectrum X under the assumption that the late reverberation (LR) and the “clean” spectrum are uncorrelated. Low correlation, and consequently effective de-reverberation, may be achieved due to the non-stationary nature of speech (i.e. there is a difference between the previous observations and the current signal, since speech changes over time).
The enhanced power spectrum Λ should approximate λ, computed from the neural network output. The neural network may therefore be trained based on a measure of the difference between these two quantities. Since Λ and λ are two estimators of the same random variable, the neural network can be trained to enhance the similarity of the two estimates.
In an example, the cost function is:
O₀ = ||√Λ − √λ||, namely the Frobenius norm of the K×N matrix Δ obtained by the element-wise subtraction √Λ − √λ = Δ; equivalently, O₀ = ||Δ||. The neural network is trained to minimise the cost function. The optimisation is performed over the whole segment (or utterance), i.e. the N frames, not in a frame-by-frame manner.
In an example, the cost function comprises a term ||√Λ − √λ||. This term may be referred to as the core objective. It is an example of a measure of the difference between the estimate of the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the neural network, √λ, and the magnitude of the frequency spectrum of each of the plurality of frames of the output desired audio signal, √Λ.
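A minimal sketch of the core objective follows; the variable names are illustrative, √Λ is computed here from the de-reverbed spectra X, and the network output is assumed to be a natural-log magnitude spectrum:

```python
import numpy as np

def core_objective(X, log_mag_pred):
    """Core objective ||sqrt(Lambda) - sqrt(lambda)|| over one segment.

    X            : (K, N) complex de-reverbed spectra for the segment
    log_mag_pred : (K, N) log magnitude spectrum predicted by the network
                   (natural log assumed here)
    Returns the Frobenius norm of the element-wise difference between the
    two magnitude-spectrum estimates.
    """
    sqrt_Lambda = np.abs(X)             # magnitude from the output signal
    sqrt_lambda = np.exp(log_mag_pred)  # magnitude predicted by the network
    return np.linalg.norm(sqrt_Lambda - sqrt_lambda)   # Frobenius norm
```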
The optimisation may also take into account the desired magnitude spectrum of the output signal for example. A cost function of the general form:
C(||√Λ − √λ||, ||√λ||) may therefore be used for the optimisation.
For example, it may be desired to minimize the core objective while also favouring lower absolute output magnitude spectrum for the desired signal. In an example, the cost function is:
O₀ = ||√Λ − √λ|| + δ||√λ||
where δ is a real number selected in the manner described below. The case δ = 0 reduces to the cost function ||√Λ − √λ||. Again, the neural network is trained to minimise the cost function. This cost function may be used to obtain better de-reverberation behaviour in the output signal. Due to the non-convexity of the optimisation problem for the neural network parameters, there may be multiple local minima where Λ and λ come close together without necessarily achieving effective de-reverberation. A penalty term biasing the solution towards lower output spectral instantaneous variance, i.e. more aggressive de-reverberation, may therefore be added. In general, the magnitude spectrum of the signal is expected to decrease as late reverberation is removed.
In the above cost function, the second term may be replaced with δ||√Λ||. However, using √λ results in a shorter path from the cost to the optimisable parameters of the neural network, and therefore the back-propagation is more robust (a shorter path for gradient back-propagation). In an example, δ is greater than or equal to 0 and less than or equal to 1.
The level of the instantaneous variance of the desired output signal (equivalent to the squared magnitude spectrum, or power spectrum) can be controlled. If the magnitude spectrum is lower, then once the signal is re-synthesised, lower instantaneous variance is observed. By selecting an appropriate value of δ, the level of the instantaneous variance in the desired output signal can therefore be controlled. Selecting a higher value of δ generates a desired output signal with lower instantaneous variance. An output signal with lower instantaneous variance may be desirable for an application with a human listener, for example a hearing aid or other enhanced listening device, where improved performance is seen with a lower instantaneous variance. In an example, δ is greater than 0. Selecting a lower value of δ generates a desired output signal with a higher instantaneous variance. An output signal with a higher instantaneous variance may be desirable for an ASR application, where improved performance is seen with a relatively higher instantaneous variance. In an example, δ is equal to 0. The amount of instantaneous variance in the desired output signal can thus be controlled using the value of δ, with no fixed maximum instantaneous variance imposed by the cost. Use of a composite cost function comprising a distortion criterion and a penalty term thus provides an additional level of control.
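A sketch of the composite cost with the δ penalty, using the same illustrative names as above:

```python
import numpy as np

def composite_cost(X, log_mag_pred, delta=0.5):
    """O0 = ||sqrt(Lambda) - sqrt(lambda)|| + delta * ||sqrt(lambda)||.

    delta = 0 recovers the core objective; a larger delta biases the
    solution towards lower output instantaneous variance (more aggressive
    de-reverberation).
    """
    sqrt_Lambda = np.abs(X)
    sqrt_lambda = np.exp(log_mag_pred)
    core = np.linalg.norm(sqrt_Lambda - sqrt_lambda)
    penalty = np.linalg.norm(sqrt_lambda)   # uses the network prediction:
    return core + delta * penalty           # shorter back-propagation path
```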
In an alternative example, the cost function may be multiplicative, for example:
O₂ = ||(√Λ − √λ) ∘ √Λ||², where ∘ denotes element-wise multiplication.
For a multiplicative cost, √Λ is used for the weighting factor, since otherwise the cost is trivially minimised by a zero output from the supporting neural network.
A cost function of the form:
O₃ = ||(√Λ − √λ) ∘ (√Λ)^δ||² may be used in order to provide control over the level of the instantaneous variance of the desired output signal (equivalent to the squared magnitude spectrum, or power spectrum). By selecting an appropriate value of δ, the level of the instantaneous variance in the desired output signal can be controlled. In this case, if δ = 0 the cost function only takes into account the difference between the predicted magnitude spectrum and the output magnitude spectrum, while selecting a higher value of δ generates a desired output signal with lower instantaneous variance. In this example, the difference measure is multiplied by the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm, or the magnitude of the frequency spectrum of each of the plurality of frames of the output desired audio signal, raised to the power of a constant.
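A sketch of the exponent-weighted multiplicative cost, again with illustrative names; the weighting here uses √Λ, computed from the output signal:

```python
import numpy as np

def multiplicative_cost(X, log_mag_pred, delta=0.5):
    """O3 = ||(sqrt(Lambda) - sqrt(lambda)) o sqrt(Lambda)**delta||^2.

    delta = 0 reduces to the squared core objective; a larger delta weights
    each time-frequency difference by the output magnitude, favouring lower
    output instantaneous variance.
    """
    sqrt_Lambda = np.abs(X)
    sqrt_lambda = np.exp(log_mag_pred)
    weighted = (sqrt_Lambda - sqrt_lambda) * sqrt_Lambda ** delta  # element-wise
    return np.linalg.norm(weighted) ** 2
```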
The cost function is used to update the parameters of the neural network 201 by performing back-propagation. However, both the input (Y) and output (X) of the system are complex Fourier spectra. As a result, the majority of the operations in the dashed box in Figure 3 are also complex-valued. Optimizing the neural network parameters thus involves back-propagation through complex-valued operations. The operations performed in the back-propagation algorithm used to optimize the weights of the neural net are derived in a manner suited to complex valued operations. For this specific part (i.e. the steps in the dashed box) the corresponding sub-graph is shown in Figure 3(e), showing the operations in the forward and the backward passes.
The forward pass is the “normal” direction taken to obtain the model output. The backward pass propagates the gradient used to adjust the values of the neural network parameters. This is illustrated in the forward and backward pass operations for the sub-graph of the complex-valued operations shown in Figure 3(e). The complex-valued operations in the backward pass (corresponding to the sub-graph with complex-valued operations in Figure 3(e)) may be derived using Wirtinger calculus, to facilitate the implementation and the identification of stability caveats. For example, where the forward pass comprises a matrix inverse operation on a matrix comprising complex values, and the gradient is needed in the backward pass, Wirtinger calculus allows an expression for the gradient to be derived. Bold font in the forward pass indicates dependence on the neural network parameters. In Figure 3(e), ∇_I represents the incoming gradient from the previous operations, where I is used to represent the input.
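As an illustration of one such complex-valued backward pass (a sketch, not the patent's own derivation): for the matrix-inverse operation Y = A⁻¹ and a real-valued cost, the Wirtinger (conjugate-gradient) convention gives the gradient with respect to A from the incoming gradient G = ∂C/∂Y as −(A⁻¹)ᴴ G (A⁻¹)ᴴ.

```python
import numpy as np

def inverse_forward(A):
    """Forward pass: Y = A^{-1} for a complex square matrix A."""
    return np.linalg.inv(A)

def inverse_backward(A_inv, grad_Y):
    """Backward pass for Y = A^{-1} under the Wirtinger convention for a
    real-valued cost of complex variables:
        grad_A = - A^{-H} @ grad_Y @ A^{-H}
    """
    A_inv_H = A_inv.conj().T
    return -A_inv_H @ grad_Y @ A_inv_H

# demonstrate the shapes on random, well-conditioned complex data
A = np.random.randn(4, 4) + 1j * np.random.randn(4, 4) + 4 * np.eye(4)
Y = inverse_forward(A)
G = np.random.randn(4, 4) + 1j * np.random.randn(4, 4)   # incoming gradient
grad_A = inverse_backward(Y, G)
```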
One or more additional steps may be included in the operations to stabilize the optimization process during the training process. These steps may also be included during testing and implementation.
In an example, the magnitude spectrum √λ is limited from below prior to computing its reciprocal value in step 305. This may be done using differentiable operations built around a sigmoid S, where ξ is a parameter of S controlling its knee shape, ε is a small number, and lb stands for lower-bounded; for example, ε = 1e-5 and ξ = 0.1. Lower-bounding √λ prevents its inverse from becoming too large, which could otherwise introduce a bias and cause problems in the matrix R⁻¹.
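One possible differentiable lower bound is sketched below; the exact expression used in the patent is not reproduced here, so the sigmoid blend shown is an assumption made purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_lower_bound(sqrt_lam, eps=1e-5, xi=0.1):
    """Differentiable lower bound on the predicted magnitude spectrum.

    Assumed form (illustrative only): blend between the value and the floor
    eps, gated by a sigmoid whose knee shape is controlled by xi, so the
    result stays away from zero (on the order of eps or above) while
    remaining differentiable everywhere.
    """
    gate = sigmoid((sqrt_lam - eps) / xi)   # ~1 well above eps, ~0.5 near eps
    return gate * sqrt_lam + (1.0 - gate) * eps
```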
Alternatively, instead of including the lower bound operation, noise may be included on the diagonal.
In an example, Sk is normalised prior to computing rk and Rk.
An upper bound, ensuring that the neural network-based power spectrum prediction λ does not exceed the corresponding value computed from |Y|, may also be used, where ub stands for upper-bounded. This operation is shown performed on the lower-bounded λ; however, the two operations may be performed in either order. The upper bound ensures that the enhanced power spectrum may not exceed the observed power spectrum. All of these operations may form part of the computational graph.
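A sketch of one possible realisation of the upper bound follows; the element-wise minimum against the observed power spectrum is an assumption for illustration, since the exact expression is not reproduced here.

```python
import numpy as np

def upper_bound(lam_lb, Y):
    """Cap the (lower-bounded) power spectrum prediction by the observation.

    Assumed realisation: an element-wise minimum against the observed power
    spectrum |Y|^2, so the enhanced power spectrum cannot exceed the observed
    one. A smooth approximation could be used instead if a fully
    differentiable operation is required.
    """
    return np.minimum(lam_lb, np.abs(Y) ** 2)
```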
Thus, for a single input training segment (comprising N frames), the gradient of the cost with respect to each of the parameters (i.e. the weights and biases in the neural network) is calculated. Every operation performed in the forward pass is differentiable, and therefore a functional expression for the derivative of the cost with respect to every single parameter can be determined by the chain rule. The gradient values are calculated from these expressions using the back-propagated error and the activations (the inputs to each layer, cached during the forward pass). This results in a gradient value corresponding to each parameter of the neural network 201. Many of the operations (i.e. those having dependence on R, S and Y) depend on all frames in the segment, and thus frames are not treated completely independently, i.e. there is a single cost and back-propagation process for the segment. At the point in the back-propagation process where the gradient reaches the neural network, however, there is one value of the gradient per target frame, since the neural network takes a separate input for each target frame. This is then averaged over the frames, to give a single gradient value corresponding to each neural network parameter. Where more than one target channel is used, the back-propagation may be performed for each channel, and an average of the gradient values across the channels used. Furthermore, the training may be performed in batches, and the update performed based on the batch-average value of the gradient for each parameter. In general, in back-propagation, the incoming gradient from preceding (in the reverse direction) operations is multiplied by the (partial) derivatives of the outputs of the current block with respect to its inputs.
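A minimal sketch of the frame and channel averaging described above (array shapes and names are illustrative):

```python
import numpy as np

def average_gradients(per_frame_grads):
    """Combine per-frame gradient values into one update per parameter.

    per_frame_grads: list over channels of (N, P) arrays holding the gradient
    of the segment cost with respect to each of the P network parameters,
    one row per target frame. Frames are averaged first, then channels;
    batch averaging over several segments would follow the same pattern.
    """
    per_channel = [g.mean(axis=0) for g in per_frame_grads]  # average over frames
    return np.mean(per_channel, axis=0)                      # average over channels
```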
The gradient for each parameter is then used to calculate the updated parameter value from the previous parameter value using an optimiser function (i.e. a gradient descent type optimiser function). For example, a Momentum stochastic gradient descent (SGD) optimiser with a weight of 0.9 may be used. The input to the optimiser function for each parameter is the previous parameter value, the corresponding gradient value and a learning rate parameter. In general, gradient descent based optimisers update the parameter in the direction of steepest descent of the cost function with respect to the parameter, scaled by a learning rate; for example, an initial learning rate of 0.0002 may be used and gradually reduced over time to prevent over-fitting. The parameters are replaced with the new values and the process iterates with another training signal segment.
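A sketch of one Momentum SGD update with the example settings given above (names are illustrative):

```python
import numpy as np

def momentum_sgd_step(params, grads, velocity, lr=2e-4, momentum=0.9):
    """One Momentum SGD update.

    params, grads, velocity: 1-D arrays of equal length (one entry per
    network parameter). The velocity accumulates a geometrically weighted
    sum of past gradients; parameters move against it, scaled by lr.
    """
    velocity = momentum * velocity + grads
    params = params - lr * velocity
    return params, velocity
```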
The supporting network 201, used to predict the magnitude spectrum of the dereverbed signal, is trained without using parallel data. The neural network is a supporting neural network used in a (MC)LP-based de-reverberation algorithm. The training is performed using unsupervised learning.
Figure 4 shows the system used as a de-reverberation front-end to an ASR system. As described above, the output complex spectra may be inputted directly to the ASR, or the time-domain audio signal may be reconstructed first. The ASR is trained separately from the DRV. Multi-condition training may be performed to train the acoustic model of the ASR system; using different types of data to train the acoustic model allows it to generalise. Enhanced data, i.e. data output from the de-reverberation system, may also be used in the training of the acoustic model.
During implementation, features for the ASR system can be extracted directly from the enhanced (de-reverbed) spectrum. For feature extraction in the ASR system, frequencies may be combined. Higher resolution at low frequency may be obtained by combining fewer frequency bins than are combined at higher frequency, using a warped filter-bank. The power in each band is taken, followed by the logarithm. The resulting filter-bank features are then used for the ASR system. Alternatively, an additional transformation, e.g. a DCT, may be applied to give cepstral coefficients, e.g. Mel-frequency cepstral coefficients (MFCCs). Either set of features may be used in ASR.
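A sketch of this feature extraction from an enhanced power spectrum, using librosa's mel filter bank and scipy's DCT; the input names and dimensions are illustrative:

```python
import numpy as np
import librosa
import scipy.fftpack

def features_from_power_spectrum(power, sr=16000, n_fft=512,
                                 n_mels=64, n_mfcc=13):
    """power: (1 + n_fft//2, N) enhanced (de-reverbed) power spectrum.

    Returns log mel filter-bank features and cepstral coefficients obtained
    with a DCT. Fewer FFT bins are combined per band at low frequency than
    at high frequency (warped filter bank).
    """
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, bins)
    band_power = mel_fb @ power                   # power in each band
    log_fbank = np.log(band_power + 1e-10)        # log() of the band power
    mfcc = scipy.fftpack.dct(log_fbank, type=2, axis=0, norm='ortho')[:n_mfcc]
    return log_fbank, mfcc
```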
Experimental results will now be described. In these results, the method according to the first example is referred to as InL, followed by a letter (A, B or C) and an index (1 or 2). The letter corresponds to the value of δ used, with A corresponding to δ = 0 and C to δ = 2/3. The index corresponds to the number of microphone channels used for training. Thus, InL C2 stands for using δ = 2/3 with two microphone channels available at training time. Elsewhere, the number of channels refers to the test-time set-up.
In the first set of results, the method described in relation to Figure 3 is applied as a front-end to an ASR system as illustrated in Figure 4. The neural network 201 is trained using O₀ = ||√Λ − √λ|| + δ||√λ|| as the cost function.
The acoustic model (AM) used in the ASR system is a 10-layer convolutional neural net followed by two fully-connected layers (CNN). All the CNN layers use 3x3 filter kernels and are followed by batch normalisation and ReLU layers. Max-pooling is applied after every two CNN layers. The progression of filters in the 10 layers is as follows: 64; 64; 128; 128; 128; 128; 256; 256. The output of the CNN layers is fed through two fully-connected layers having 2048 units each before the output layer. A dropout of 0.5 is applied after each max-pooling layer and on the fully-connected layers. The model is trained on multi-condition training data, using 64-dimensional Mel filter-bank features. The features were mean and variance normalised. Multi-condition data, excluding the enhanced data, was used. Enhancement is thus applied only on the test set.
Test results are based on the real evaluation set of the same task. Word error rates are presented in Figure 5, for the method of the first example described in relation to Figure 3 (InL), for no enhancement, for enhancement using iterative techniques (vanilla) and for the method of the comparative example using parallel training data (on-line). The performance of InL A1 matches, on average, that of the Online method (using parallel data). Both methods improve on the iterative Vanilla method. The more aggressive version of the method (InL C2) is also better than the Vanilla method but suggests a slight degradation in ASR performance compared to the other two methods. A possible explanation is the increasing deviation of the enhanced data from the multi-condition training data, which in the absence of the parallel data may have a detrimental effect. The system of the first example is able to match the performance of the comparative example system trained using parallel data for all three microphone set-ups. Furthermore, performance improves as more microphone channels become available at test time. Figure 5 shows the word error rate (WER) on real test data. The ASR performance is on par with a system trained with parallel de-reverbed speech. It is seen that on average the proposed de-reverberation method matches the performance corresponding to the parallel data case.
The signal enhancement reduces the enhanced signal power, relative to the observed one, by removing late reverberation. More effective enhancement is expected to achieve lower output signal instantaneous variance. However, modifications that are too aggressive may cause distortion. Given the considered range of values for δ in the method according to the first example (In-line), it is observed that the On-line method (parallel data) is the most aggressive method. As expected, increasing δ reduces the output signal instantaneous variance. Moderate decrease in instantaneous variance for a fixed δ is observed as the number of channels for training increases. This is related to the resulting lower value of Λ, which serves as the target for neural network training. A decrease in the output signal instantaneous variance (for the test set) is seen as the weight δ on the second term in the cost function increases. Perceptually, the decrease in signal instantaneous variance did not result in signal distortions.
To evaluate the perceptual effect of the different processing algorithms, a listening test was conducted. Using a comparative category rating scale ranging from -3 (much worse), through 0 (same), to 3 (much better), pairs of methods were compared blindly. To keep the complexity manageable, a subset of all systems was considered. These include the most and the least aggressive In-line method variants according to Figure 6(a), On-line (parallel data), Vanilla (iterative) and the original signal (No enh.). Each comparison includes InL C2 and another method. Each method is applied to the same 20 sentences from the real evaluation test set. These are the 20 sentences from the complete test set for which InL C2 and InL A1 showed the largest instantaneous variance gap. Such a pre-selection facilitates the evaluation. The total number of comparisons was 80 and the duration of the test was 20 minutes on average. The presentation order was randomised across the pairs. The average preference scores presented in Figure 6(b) indicate that the aggressive mode of the method according to the first example (In-line) and On-line (parallel data) are indistinguishable. InL C2 was preferred over Vanilla (iterative) and InL A1, which is consistent with the ranking from Figure 6(a). As expected, the gain over the observed signal is most substantial.
The experimental results show that on average, the method according to the first example matches the performance of a system trained with parallel data both in terms of perceptual quality and ASR performance.
The de-reverberation effect at the signal level is illustrated in Figure 7. The cost function O₀ = ||√Λ − √λ|| was used. Training was again performed on a single microphone channel. While the de-reverbed signals look similar, the model based on parallel data results in smaller instantaneous variance for the de-reverbed signal. This is an expected result given that, in the case without parallel data, estimation is performed from a noisy, reverberant signal, whereas in the case with parallel data the noise is duplicated in both data sub-sets. Figure 7 shows the waveforms of reverberant and de-reverbed speech using eight microphone channels.
The system performs neural-network-supported de-reverberation using only reverberant data for training. A robust and efficient training procedure is used, based on complex-valued back-propagation. The training method is light-weight, robust and effective. The system allows a large degree of scalability both in training and operation, in other words it is scalable in terms of the usage of multiple microphone channels in training and testing.
The method may be used for speech signal enhancement in reverberant environments, with application to automatic speech recognition and all forms of enhanced listening devices. The method may perform reverberation reduction.
Mathematical relations (i.e. the cost function) derived from the de-reverberation model itself are used to create an internal reference allowing training without parallel data.
Effective neural-net supported signal enhancement, in the context of linear predictive methods, is achieved by the method through unsupervised learning. The need for parallel training data or alignment information, for the case of joint training with an acoustic model, is avoided. Setting the value of the parameter δ controls the aggressiveness of the algorithm.
While certain arrangements have been described, these arrangements have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed the methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of methods and systems described herein may be made.

Claims (20)

CLAIMS:
1. An audio signal processing method, comprising:
receiving a discrete input audio signal;
generating an estimate of a magnitude of a frequency spectrum of each of a plurality of frames of a desired audio signal corresponding to a first segment of the input signal, by inputting a magnitude of a frequency spectrum of each of the plurality of frames of the first segment of the input audio signal into a trained algorithm;
generating a frequency spectrum corresponding to each frame of the desired audio signal, by subtracting a first frequency spectrum corresponding to each frame from a frequency spectrum corresponding to each frame of the first segment of the input audio signal, wherein the first frequency spectra are generated from the frequency spectra corresponding to a plurality of further segments of the input audio signal, the plurality of further segments each being located at least partly prior to the first segment and each being weighted by a set of frequency dependent coefficients, wherein the coefficients are generated from the estimate of the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm.
2. The method according to claim 1, wherein the trained algorithm comprises a trained neural network.
3. The method according to claim 2, wherein the trained neural network is a recurrent neural network.
4. The method according to claim 1 or 2, wherein a log magnitude of the frequency spectrum of each of the plurality of frames of the first segment of the input audio signal is inputted into the trained neural network, and wherein the trained neural network outputs an estimated log magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal.
5. The method according to any preceding claim, wherein the first frame of each of the plurality of further segments of the input audio signal is located a minimum number of frames prior to the first frame of the first segment.
6. The method according to any preceding claim, wherein the input audio signal comprises multiple channels, wherein generating the estimate of the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal corresponding to the first segment of the input signal comprises generating the estimate for one channel, wherein generating the frequency spectrum corresponding to each frame of the desired audio signal comprises generating the frequency spectra for the one channel, wherein the first frequency spectrum corresponding to each frame is subtracted from the frequency spectrum corresponding to each frame of the first segment of the input audio signal for the one channel, and wherein the first frequency spectra are generated from the frequency spectra corresponding to a plurality of further segments of the input audio signal from the one channel and/or from one or more other channels.
7. The method according to claim 6, further comprising generating the frequency spectrum corresponding to each frame of the desired audio signal for the one or more other channels.
8. The method according to any preceding claim, further comprising generating an output audio signal from the frequency spectra of the desired audio signal.
9. The method according to any preceding claim, wherein the audio signal is a speech signal and further comprising performing automatic speech recognition.
10. An audio signal processing system, comprising:
an input configured to receive a discrete input audio signal;
an output configured to output information relating to a desired audio signal;
a processor configured to:
generate an estimate of a magnitude of a frequency spectrum of each of a plurality of frames of a desired audio signal corresponding to a first segment of the input signal, by inputting a magnitude of a frequency spectrum of each of the plurality of frames of the first segment of the input audio signal into a trained algorithm;
generate a frequency spectrum corresponding to each frame of the desired audio signal, by subtracting a first frequency spectrum corresponding to each frame from a frequency spectrum corresponding to each frame of the first segment of the input audio signal, wherein the first frequency spectra are generated from the frequency spectra corresponding to a plurality of further segments of the input audio signal, the plurality of further segments each being located at least partly prior to the first segment and each being weighted by a set of frequency dependent coefficients, wherein the coefficients are generated from the estimate of the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm.
11. A method of training an audio signal processing system, comprising:
receiving a discrete input audio signal;
generating an estimate of a magnitude of a frequency spectrum of each of a plurality of frames of a desired audio signal corresponding to a first segment of the input signal, by inputting a magnitude of a frequency spectrum of each of the plurality of frames of the first segment of the input audio signal into an algorithm;
generating a frequency spectrum corresponding to each frame of the desired audio signal, by subtracting a first frequency spectrum corresponding to each frame from a frequency spectrum corresponding to each frame of the first segment of the input audio signal, wherein the first frequency spectra are generated from the frequency spectra corresponding to a plurality of further segments of the input audio signal and/or one or more related input audio signals, the plurality of further segments each being located at least partly prior to the first segment and each being weighted by a set of frequency dependent coefficients, wherein the coefficients are generated from the estimate of the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm;
generating the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal;
updating the algorithm based on a measure of the difference between the estimate of the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm and the magnitude of the frequency spectrum of each of the plurality of frames of the output desired audio signal.
12. The method according to claim 11, wherein updating the algorithm is further based on the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm or the magnitude of the frequency spectrum of each of the plurality of frames of the output desired audio signal.
13. The method according to claim 12, wherein updating the algorithm comprises updating the algorithm based on the difference measure between the estimate of the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm and the magnitude of the frequency spectrum of each of the plurality of frames of the output desired audio signal, multiplied by the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm or the magnitude of the frequency spectrum of each of the plurality of frames of the output desired audio signal.
14. The method according to claim 13, wherein the difference measure is multiplied by the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm or the magnitude of the frequency spectrum of each of the plurality of frames of the output desired audio signal to the power of a constant.
15. The method according to claim 12, wherein updating the algorithm comprises updating the algorithm based on the difference between the estimate of the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm and the magnitude of the frequency spectrum of each of the plurality of frames of the output desired audio signal, added to the magnitude of the frequency spectrum of each of the plurality of frames of the desired audio signal generated using the algorithm or the magnitude of the frequency spectrum of each of the plurality of frames of the output desired audio signal multiplied by a constant.
16. The method according to claim 14 or 15, wherein the value of the constant is selected based on the desired application.
17. The method according to any of claims 12 to 16, wherein the algorithm comprises a neural network.
18. The method according to any of claims 12 to 16, wherein the algorithm comprises a recurrent neural network.
19. An audio signal processing system, trained according to the method of any of claims 12 to 18.
20. A carrier medium comprising computer readable code configured to cause a computer to perform the methods of any of claims 1 to 9 and 11 to 18.
GB1813189.6A 2018-08-13 2018-08-13 A processing method, a processing system and a method of training a processing system Expired - Fee Related GB2576320B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1813189.6A GB2576320B (en) 2018-08-13 2018-08-13 A processing method, a processing system and a method of training a processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1813189.6A GB2576320B (en) 2018-08-13 2018-08-13 A processing method, a processing system and a method of training a processing system

Publications (3)

Publication Number Publication Date
GB201813189D0 GB201813189D0 (en) 2018-09-26
GB2576320A true GB2576320A (en) 2020-02-19
GB2576320B GB2576320B (en) 2021-04-21

Family

ID=63667220

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1813189.6A Expired - Fee Related GB2576320B (en) 2018-08-13 2018-08-13 A processing method, a processing system and a method of training a processing system

Country Status (1)

Country Link
GB (1) GB2576320B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101334991B1 (en) * 2012-06-25 2013-12-02 서강대학교산학협력단 Method of dereverberating of single channel speech and speech recognition apparutus using the method
US20180308503A1 (en) * 2017-04-19 2018-10-25 Synaptics Incorporated Real-time single-channel speech enhancement in noisy and time-varying environments

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAO ET AL. "Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation" - EURASIP Journal on Advances in Signal Processing 2016:4 https://doi.org/10.1186/s13634-015-0300-4 13 January 2016 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2623110A (en) * 2022-10-06 2024-04-10 Nokia Technologies Oy Apparatus, methods and computer programs for audio signal enhancement using a dataset

Also Published As

Publication number Publication date
GB2576320B (en) 2021-04-21
GB201813189D0 (en) 2018-09-26

Similar Documents

Publication Publication Date Title
Chou et al. One-shot voice conversion by separating speaker and content representations with instance normalization
US9721559B2 (en) Data augmentation method based on stochastic feature mapping for automatic speech recognition
Qian et al. Speech Enhancement Using Bayesian Wavenet.
Zhao et al. Wasserstein GAN and waveform loss-based acoustic model training for multi-speaker text-to-speech synthesis systems using a WaveNet vocoder
US10147442B1 (en) Robust neural network acoustic model with side task prediction of reference signals
WO2019204547A1 (en) Systems and methods for automatic speech recognition using domain adaptation techniques
Zhang et al. On loss functions and recurrency training for GAN-based speech enhancement systems
Zhao et al. A two-stage algorithm for noisy and reverberant speech enhancement
Ravanelli et al. A network of deep neural networks for distant speech recognition
Xue et al. Speaker adaptation of hybrid NN/HMM model for speech recognition based on singular value decomposition
Sadhu et al. Continual Learning in Automatic Speech Recognition.
Hwang et al. LP-WaveNet: Linear prediction-based WaveNet speech synthesis
Liu et al. Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition.
US11315548B1 (en) Method and system for performing domain adaptation of end-to-end automatic speech recognition model
Wang et al. Unsupervised speaker adaptation of batch normalized acoustic models for robust ASR
Meng et al. Modular hybrid autoregressive transducer
Braun et al. Effect of noise suppression losses on speech distortion and ASR performance
KR20200025750A (en) Method and apparatus of personalizing voice recognition model
Nesta et al. A flexible spatial blind source extraction framework for robust speech recognition in noisy environments
Ravanelli et al. Automatic context window composition for distant speech recognition
Martinez et al. Prediction of speech intelligibility with DNN-based performance measures
Ueda et al. Single-channel dereverberation for distant-talking speech recognition by combining denoising autoencoder and temporal structure normalization
Astudillo et al. Integration of beamforming and uncertainty-of-observation techniques for robust ASR in multi-source environments
Wang et al. TeCANet: Temporal-contextual attention network for environment-aware speech dereverberation
Liao et al. Joint uncertainty decoding for robust large vocabulary speech recognition

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20230813