WO2024018390A1 - Method and apparatus for speech enhancement - Google Patents

Method and apparatus for speech enhancement

Info

Publication number
WO2024018390A1
Authority
WO
WIPO (PCT)
Prior art keywords
phase
model
training
amplitude
corrupted
Prior art date
Application number
PCT/IB2023/057347
Other languages
French (fr)
Inventor
Alberto Gil C. P. RAMOS
Abhinav Mehrotra
Sourav Bhattacharya
Original Assignee
Samsung Electronics Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co., Ltd. filed Critical Samsung Electronics Co., Ltd.
Publication of WO2024018390A1 publication Critical patent/WO2024018390A1/en


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0232Processing in the frequency domain
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L2021/02087Noise filtering the noise being separate speech, e.g. cocktail party

Definitions

  • the present application generally relates to a method, apparatus and system for speech enhancement.
  • the present application provides a method for training a machine learning, ML, model to reduce or remove noise in audio signals that comprise speech, and a method of using a trained ML model to clean noisy speech signals and thereby improve telephony and/or automatic speech recognition performance.
  • Speech recognition systems are ubiquitous, and most often used on personal devices, such as mobile phones or voice assistant devices.
  • machine learning models are the state-of-the-art in speech recognition.
  • Machine learning relies on using training data to train a machine learning model, such as a deep neural network, to perform a specific task. Solving complex tasks may require large amounts of data and possibly large neural networks that demand substantial computational resources, both for training and at inference time.
  • the applicant has therefore identified the need for an improved method to perform speech enhancement.
  • a computer-implemented method for training, on a server, a machine learning, ML, model for speech enhancement comprising: obtaining a training dataset comprising a plurality of corrupted audio signals, the corrupted audio signals containing speech of individual speakers; and training the ML model to enhance an amplitude (also referred to herein as a ‘norm’) and a phase of each corrupted audio signal of the training dataset, by modelling phase correction.
  • the present techniques use a ML model having a single neural network to perform both amplitude and phase enhancement of corrupted (noisy) audio signals. This is advantageous because it ensures that both important parts of an audio signal are enhanced, without increasing the model size or complexity, which results in enhanced audio signals even on resource-constrained devices, such as smartphones.
  • the plurality of corrupted audio signals may be time domain signals.
  • the method may comprise transforming the time domain signals into frequency domain signals prior to using the corrupted audio signals to train the ML model. This is useful because there are advantages of modelling phase in the frequency domain.
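By way of illustration only, the following is a minimal sketch of such a time/frequency transformation using scipy.signal; the sampling rate and window length are illustrative assumptions rather than values specified by the present techniques.

```python
import numpy as np
from scipy.signal import stft, istft

def to_frequency_domain(x, fs=16000, nperseg=512):
    # STFT: time-domain waveform -> complex spectrogram, split into
    # amplitude (norm) and phase. fs and nperseg are illustrative choices.
    _, _, X = stft(x, fs=fs, nperseg=nperseg)
    return np.abs(X), np.angle(X)

def to_time_domain(amplitude, phase, fs=16000, nperseg=512):
    # Inverse STFT: reassemble the complex spectrogram and reconstruct
    # the time-domain waveform from it.
    _, x = istft(amplitude * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return x
```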
  • the method may comprise: inputting, into the ML model, the amplitude, cosine of the phase and sine of the phase of each corrupted audio signal.
  • the inputting may comprise inputting a tensor into the ML model, the tensor comprising the amplitude, cosine of the phase and sine of the phase.
  • the amplitude, cosine of the phase and sine of the phase may be interleaved prior to input into the ML model.
  • the interleaving of the inputs makes more effective use of spatial operations, and has particularly important advantages for, e.g., grouped convolution operations.
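A minimal sketch of the interleaving, assuming spectrogram-shaped arrays, is given below; the channel-last layout is an illustrative choice suited to Conv2D-style layers, not a layout mandated by the present techniques.

```python
import numpy as np

def interleave_inputs(amplitude, phase):
    # Stack norm, cos(phase) and sin(phase) as the channels of one tensor
    # of shape (F, T, 3), so that a spatial operation (e.g. a grouped
    # convolution) sees the three quantities of each time-frequency bin
    # together rather than in three distant concatenated blocks.
    return np.stack([amplitude, np.cos(phase), np.sin(phase)], axis=-1)
```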
  • the method may further comprise: outputting, from the ML model, an enhanced amplitude, a cosine correction to the phase and a sine correction to the phase.
  • the output of the ML model is not an enhanced audio signal, but the parameters required to generate the enhanced audio signal. This is advantageous because it reduces modelling difficulty and complexity.
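By way of illustration, the three outputs could be recombined with the phase of the corrupted input via the standard angle-sum identities, as sketched below; the function and argument names are hypothetical.

```python
import numpy as np

def apply_phase_correction(amp_hat, cos_corr, sin_corr, noisy_phase):
    # cos(theta + delta) = cos(theta)cos(delta) - sin(theta)sin(delta)
    # sin(theta + delta) = sin(theta)cos(delta) + cos(theta)sin(delta)
    # These identities are fixed trigonometry: nothing here is learned.
    cos_n, sin_n = np.cos(noisy_phase), np.sin(noisy_phase)
    cos_enh = cos_n * cos_corr - sin_n * sin_corr
    sin_enh = sin_n * cos_corr + cos_n * sin_corr
    return amp_hat * (cos_enh + 1j * sin_enh)  # enhanced complex spectrogram
```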
  • the training dataset may comprise a plurality of noise samples, a plurality of clean audio samples that each contain speech of individual speakers, and a speaker embedding vector for each individual speaker.
  • obtaining the training dataset may comprise: generating, using the clean audio samples, the corrupted audio samples by adding at least one noise sample to each clean audio sample.
  • Training the ML model may comprise: training neural networks of the ML model, using the training dataset, to remove the noise from the corrupted audio samples while maintaining the speech of the individual speakers and thereby generate enhanced audio samples; comparing each enhanced audio sample with the corresponding clean audio sample and determining how well the generated enhanced audio sample matches the corresponding clean audio sample; calculating, using a result of the comparing, at least one loss function; and updating the neural networks to minimise the at least one loss function.
  • calculating at least one loss function may comprise calculating a loss function using an amplitude of the enhanced audio sample and an amplitude of the corresponding clean audio sample.
  • calculating at least one loss function may comprise calculating a loss function using an amplitude, a phase, a cosine phase and/or a sine phase of the enhanced audio sample and an amplitude, a phase, a cosine phase and/or a sine phase of the corresponding clean audio sample.
  • Training the ML model may comprise combining multiple loss functions using coefficients that are obtained deterministically or stochastically.
  • this means that the ML model may be trained to generate enhanced audio signals that are suitable for multiple applications.
  • training the ML model may comprise combining multiple loss functions and training the ML model to generate enhanced audio signals for both telephony and automatic speech recognition (ASR).
  • a server for training a machine learning, ML, model for speech enhancement comprising: at least one processor coupled to memory and arranged to: obtain a training dataset comprising a plurality of corrupted audio signals, the corrupted audio signals containing speech of individual speakers; and train the ML model to enhance an amplitude and a phase of each corrupted audio signal of the training dataset, by modelling phase correction.
  • a computer-implemented method for using a trained machine learning, ML, model to perform speech enhancement for a target user comprising: obtaining a corrupted audio signal comprising speech of the target user and noise; inputting the corrupted audio signal and a speaker embedding vector for the target user into the trained ML model; and using the trained ML model to enhance an amplitude and a phase of the corrupted audio signal.
  • the corrupted audio signal may be a time domain signal and the method may comprise: transforming the time domain signal into a frequency domain signal prior to input into the trained ML model.
  • the method may comprise: inputting, into the trained ML model, the amplitude, cosine of the phase and sine of the phase of the corrupted audio signal.
  • the amplitude, cosine of the phase and sine of the phase may be interleaved prior to input into the trained ML model.
  • the method may comprise: outputting, from the trained ML model, an enhanced amplitude, a cosine correction to the phase and a sine correction to the phase.
  • the method may further comprise: using the cosine correction and the sine correction to the phase, and trigonometric identities to compute a cosine of the phase and a sine of the phase of an enhanced audio signal; and generating the enhanced audio signal.
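Putting the sketches above together, a hypothetical end-to-end enhancement pass might look as follows; model, speaker_embedding and noisy_waveform are assumed inputs, not names taken from the present disclosure.

```python
import numpy as np

# Hypothetical composition of the sketches above.
amp, phase = to_frequency_domain(noisy_waveform)
inputs = interleave_inputs(amp, phase)
amp_hat, cos_corr, sin_corr = model(inputs, speaker_embedding)
X_hat = apply_phase_correction(amp_hat, cos_corr, sin_corr, phase)
enhanced_waveform = to_time_domain(np.abs(X_hat), np.angle(X_hat))
```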
  • the enhanced amplitude may be higher than the amplitude of the corrupted audio signal.
  • the method may further comprise: transforming the enhanced audio signal from a frequency domain signal into a time domain signal.
  • an apparatus for using a trained machine learning, ML, model to perform speech enhancement for a target user comprising: an audio capture device; and at least one processor coupled to memory and arranged to: obtain a corrupted audio signal comprising speech of the target user and noise; input the corrupted audio signal and a speaker embedding vector for the target user into the trained ML model; and use the trained ML model to enhance an amplitude and a phase of the corrupted audio signal, and thereby generate an enhanced audio signal.
  • the corrupted audio signal may be obtained during an audio call and the at least one processor may be configured to transmit the enhanced audio signal to another participant in the audio call.
  • the at least one processor may be configured to input the enhanced audio signal into an automatic speech recognition (ASR) system.
  • the apparatus may be a constrained-resource device that nevertheless has the minimum hardware capabilities to use a trained neural network.
  • the apparatus may be any one of: a smartphone, tablet, laptop, computer or computing device, virtual assistant device, a vehicle, an autonomous vehicle, a robot or robotic device, a robotic assistant, image capture system or device, an augmented reality system or device, a virtual reality system or device, a gaming system, an Internet of Things device, or a smart consumer device (such as a smart fridge). It will be understood that this is a non-exhaustive and non-limiting list of example apparatus.
  • a computer-readable storage medium comprising instructions which, when executed by a processor, causes the processor to carry out any of the methods described herein.
  • present techniques may be embodied as a system, method or computer program product. Accordingly, present techniques may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
  • the present techniques may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon.
  • the computer readable medium may be a computer readable signal medium or a computer readable storage medium.
  • a computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • Computer program code for carrying out operations of the present techniques may be written in any combination of one or more programming languages, including object oriented programming languages and conventional procedural programming languages.
  • Code components may be embodied as procedures, methods or the like, and may comprise sub-components which may take the form of instructions or sequences of instructions at any of the levels of abstraction, from the direct machine instructions of a native instruction set to high-level compiled or interpreted language constructs.
  • Embodiments of the present techniques also provide a non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out any of the methods described herein.
  • the techniques further provide processor control code to implement the above-described methods, for example on a general purpose computer system or on a digital signal processor (DSP).
  • the techniques also provide a carrier carrying processor control code to, when running, implement any of the above methods, in particular on a non-transitory data carrier.
  • the code may be provided on a carrier such as a disk, a microprocessor, CD- or DVD-ROM, programmed memory such as non-volatile memory (e.g. Flash) or read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier.
  • Code (and/or data) to implement embodiments of the techniques described herein may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as Python, C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog (RTM) or VHDL (Very high speed integrated circuit Hardware Description Language).
  • a controller which includes a microprocessor, working memory and program memory coupled to one or more of the components of the system.
  • a logical method may suitably be embodied in a logic apparatus comprising logic elements to perform the steps of the above-described methods, and that such logic elements may comprise components such as logic gates in, for example a programmable logic array or application-specific integrated circuit.
  • Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored and transmitted using fixed or transmittable carrier media.
  • the present techniques may be realised in the form of a data carrier having functional data thereon, said functional data comprising functional computer data structures to, when loaded into a computer system or network and operated upon thereby, enable said computer system to perform all the steps of the above-described method.
  • the method described above may be wholly or partly performed on an apparatus, i.e. an electronic device, using a machine learning or artificial intelligence model.
  • the model may be processed by an artificial intelligence-dedicated processor designed in a hardware structure specified for artificial intelligence model processing.
  • the artificial intelligence model may be obtained by training.
  • "obtained by training” means that a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence model with multiple pieces of training data by a training algorithm.
  • the artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values and performs neural network computation by computation between a result of computation by a previous layer and the plurality of weight values.
  • the present techniques may be implemented using an AI model.
  • a function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor.
  • the processor may include one or a plurality of processors.
  • one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU).
  • the one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory.
  • the predefined operating rule or artificial intelligence model is provided through training or learning.
  • being provided through learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule or AI model of a desired characteristic is made.
  • the learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
  • the AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and performs a layer operation through computation between a result of computation by a previous layer and the plurality of weight values.
  • Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann Machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
  • the learning algorithm is a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction.
  • Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
  • FIG. 1 is a schematic diagram of an existing speech enhancement method.
  • FIG. 2 is a schematic diagram of the speech enhancement method of the present techniques.
  • FIG. 3 is a flowchart of example steps to train an ML model according to the present techniques.
  • FIG. 4 is a flowchart of example steps to use a trained ML model to perform speech enhancement according to the present techniques.
  • FIG. 5 is a block diagram of a system for training and using a ML model to perform speech enhancement.
  • embodiments of the present techniques provide a method for training a machine learning, ML, model to reduce or remove noise in audio signals that comprise speech, and a method of using a trained ML model to clean noisy speech signals and thereby improve telephony and/or automatic speech recognition performance.
  • the present techniques use a ML model having a single neural network to perform both amplitude and phase enhancement of corrupted (noisy) audio signals. This is advantageous because it ensures that both important parts of an audio signal are enhanced, without increasing the model size or complexity, which results in enhanced audio signals even on resource-constrained devices, such as smartphones.
  • the present applicant proposes an architecture-independent modelling strategy, by considering as inputs to the network the norm, the cosine of the phase and the sine of the phase, interleaved or channelized rather than concatenated, for effective use of spatial operations, with important advantages for grouped convolutions, for instance. Furthermore, the present applicant decreases the modelling difficulty by considering as outputs of the network the norm, the cosine correction and the sine correction rather than the signals themselves, and leveraging the corrections via known trigonometric identities, which reduces the cost of computing the phase since the identities do not need to be learned.
  • DCCRN is another discriminative technique, which proposes another hand-crafted architecture for complex neural networks that operate on the real and imaginary parts of the signal, but in this particular case with convolutional and recurrent blocks in an encoder (conv), filter (rnn) and decoder (conv) manner.
  • the present techniques are advantageous over DCCRN because they provide a general solution built from any layer type that is already supported in software/hardware, and that yields better performance due to modelling the norm, the cosine of the correction to the phase and the sine of the correction to the phase rather than the real and imaginary parts.
  • PHASEN is another discriminative technique, which predicts norm, cosine of phase and sine of phase.
  • the PHASEN architecture is hand-crafted as a two-stream network, where amplitude stream and phase stream are dedicated to amplitude and phase prediction, with cross connections at engineered points.
  • the present techniques model the phase correction. This is possible by leveraging trigonometric identities of phase addition, and importantly allows the network of the present techniques to focus on an easier modeling problem, which therefore yields better performance at lower computational cost.
  • the present techniques only require one network for both amplitude and phase, which lends itself naturally to any form factor and architecture search technique.
  • Generative approaches take advantage of the fact that measurements of phase are highly sensitive to sampling time, and most often cast speech enhancement as a generative task in the time domain. However, they are often too expensive for real-time on-device deployment.
  • Figure 1 is a schematic diagram of an existing speech enhancement method.
  • the standard approach in Figure 1 consists in taking an audio signal in the time domain, converting it to the frequency domain via the STFT, and learning a model, e.g., a neural network, whose task is to predict a mask with values in [0, 1].
  • the enhanced signal, when converted to the time domain for a telephony application or used as-is for a downstream ASR system, contains only the audio signal that matters, i.e., that of the device owner, without any ambient noise (such as fan noise) or babble noise (e.g., other people talking in the background).
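As a point of reference, a minimal sketch of this standard masking baseline is given below; mask_model is a hypothetical network whose outputs lie in [0, 1].

```python
import numpy as np

def mask_based_enhancement(X, mask_model):
    # Predict a real-valued mask in [0, 1] from the noisy amplitude and
    # scale each time-frequency bin of the complex spectrogram X by it.
    # The phase is untouched, so every enhanced bin keeps the angle of
    # the corresponding noisy bin and its norm can only be attenuated.
    M = mask_model(np.abs(X))
    return M * X
```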
  • Figure 2 is a schematic diagram of the speech enhancement method of the present techniques.
  • the present techniques solve the above-mentioned issues by enhancing the phase in addition to the norm of the signal. Furthermore, the present techniques allow for increasing gain (not just decreasing) across both time and frequency.
  • the aforementioned approach illustrated in Figure 1 defines the enhanced signal as the predicted mask applied multiplicatively to the noisy signal, so the norm can only be attenuated and the phase is left unchanged.
  • the approach of the present techniques instead defines the enhanced signal in terms of an enhanced norm and an additively corrected phase, with the norm enhancement expressed using the sigmoid and inverse sigmoid functions so that the gain may increase as well as decrease.
  • the phase is enhanced in the proposed approach by adding a learned correction to the phase of the corrupted signal.
  • the approach of the present techniques allows the enhanced signal to move anywhere in the circle as required to improve telephony or ASR performance.
  • One advantage of the present techniques is that they are architecture agnostic, which is possible by assembling in a spatially aware form the input based on the norm, the cosine of the phase and the sine of the phase, whereas previous approaches rely on hand-crafted custom layers to learn how to process this information.
  • Another advantage is that the present techniques focus on modeling the output as phase corrections rather than the phase itself, by taking advantage of trigonometric identities, which are not learned and therefore lead to better and/or smaller speech enhancement models.
  • Table 1 below presents a comparison of the benefits that can be gained from enhancing the phase in two personalized sound enhancement systems.
  • Table 1 shows an example of performance metrics on two personalized speech enhancement systems without and with phase enhancement.
  • SI-SNR: Scale-Invariant Signal-to-Noise Ratio
  • SDRi: Source-to-Distortion Ratio improvement
  • enrolment data is collected once at setup time and converted to an embedding or reference vector.
  • the objective of the system is to enhance the noisy (corrupted) signal into the de-noised (enhanced) signal.
  • the model typically learns in a supervised manner by comparing the aforementioned de-noised signal against the ground-truth clean signal.
  • Input pre-processing: given that phase is notoriously discontinuous when represented in a compact interval, say [-π, π), and that phase unwrapping techniques are neither perfect nor ideal, given they result in outputs which may increase without bound as time increases, the present applicant instead considers the cosine and the sine of the phase. Whereas this representation increases the input dimension, that disadvantage is greatly offset by the continuous nature of the representation.
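The discontinuity can be seen in a few lines of NumPy: wrapping forces jumps of roughly 2π at the interval boundary, while the cosine/sine representation of the same angles varies smoothly.

```python
import numpy as np

theta = np.linspace(0.0, 4.0 * np.pi, 8)        # a smoothly growing angle
wrapped = np.angle(np.exp(1j * theta))          # forced into (-pi, pi]
print(np.round(wrapped, 2))                     # exhibits ~2*pi jumps
print(np.round(np.cos(theta), 2))               # continuous
print(np.round(np.sin(theta), 2))               # continuous
```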
  • the input to the neural network should be based on the norm, the cosine of the phase and the sine of the phase, in the form of a tensor shaped appropriately for Conv1D, or for Conv2D and Conv2DTranspose.
  • Output post-processing: given a neural network, its outputs are un-interleaved, scaled, normalized and combined via the sum formulas for trigonometric functions.
  • Loss functions: given the present techniques consider the data representation as norm, cosine of the phase and sine of the phase, rather than the obvious approach of the real and imaginary parts or the norm and phase directly, the present applicants have more loss functions readily available than otherwise is possible, namely:
  • Deterministic: a suitable combination of the norm term and the cosine/sine phase terms corresponds to complex Mean Squared Error (MSE), whereas other combinations of the available terms may also be used.
  • the model may be trained using a deterministic combination, i.e. using coefficients that are determined deterministically.
  • the coefficients are stochastically determined.
  • the coefficients may be the multiplication of a sample drawn from a Bernoulli distribution (i.e., whether to include the corresponding loss term) and of a sample drawn from a Gamma distribution (by how much to weight the corresponding loss term).
  • the model is trained using the proposed stochastic combination.
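A minimal sketch of such a stochastic combination is given below; the Bernoulli and Gamma hyperparameters are illustrative assumptions.

```python
import numpy as np

def combine_losses(losses, p=0.5, shape=2.0, scale=1.0, rng=None):
    # Each coefficient is the product of a Bernoulli(p) draw (whether to
    # include the corresponding loss term) and a Gamma(shape, scale) draw
    # (by how much to weight it). p, shape and scale are illustrative.
    rng = rng or np.random.default_rng()
    include = rng.binomial(1, p, size=len(losses))
    weight = rng.gamma(shape, scale, size=len(losses))
    return sum(b * w * l for b, w, l in zip(include, weight, losses))
```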
  • the method may comprise: obtaining a training dataset comprising a plurality of corrupted audio signals, the corrupted audio signals containing speech of individual speakers (step S100); and training the ML model to enhance an amplitude (also referred to herein as a ‘norm’) and a phase of each corrupted audio signal of the training dataset, by modelling phase correction (step S102).
  • the loss functions mentioned above may be used to train the ML model in a supervised manner, using the data described above.
  • the training dataset may comprise a plurality of noise samples, a plurality of clean audio samples that each contain speech of individual speakers, and a speaker embedding vector for each individual speaker.
  • the step S100 of obtaining the training dataset may comprise: generating, using the clean audio samples, the corrupted audio samples by adding at least one noise sample to each clean audio sample.
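By way of illustration, corrupted samples could be synthesized as below. Mixing at a target signal-to-noise ratio is a common convention; the SNR knob is an assumption, since the present techniques only require that at least one noise sample is added to each clean sample.

```python
import numpy as np

def make_corrupted_sample(clean, noise, snr_db=5.0, rng=None):
    # Assumes the noise clip is at least as long as the clean clip.
    rng = rng or np.random.default_rng()
    start = rng.integers(0, len(noise) - len(clean) + 1)
    noise = noise[start:start + len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```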
  • step S102 of training the ML model may comprise: training neural networks of the ML model, using the training dataset, to remove the noise from the corrupted audio samples while maintaining the speech of the individual speakers and thereby generate enhanced audio samples; comparing each enhanced audio sample with the corresponding clean audio sample and determining how well the generated enhanced audio sample matches the corresponding clean audio sample; calculating, using a result of the comparing, at least one loss function; and updating the neural networks to minimise the at least one loss function.
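A hedged sketch of one such supervised update is shown below, written against the TensorFlow/Keras API suggested by the layer names used elsewhere in this document; the model and the list of loss callables are assumptions.

```python
import tensorflow as tf

def train_step(model, optimizer, noisy_inputs, clean_targets, loss_fns):
    # One supervised update: enhance, compare against the clean reference,
    # combine the available loss terms (amplitude, cos-phase, sin-phase,
    # etc., possibly weighted as sketched above) and step the optimizer.
    with tf.GradientTape() as tape:
        enhanced = model(noisy_inputs, training=True)
        loss = tf.add_n([fn(clean_targets, enhanced) for fn in loss_fns])
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```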
  • the training requires speech and noise datasets.
  • Examples of publicly available datasets for training include LibriSpeech (Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: an ASR corpus based on public domain audio books. In ICASSP, 2015), for spoken text. This is suitable for generating appropriate datasets for training and testing. For example, the datasets may be generated by taking 100h and 360h of clean speech from LibriSpeech, which are recorded using close-talk microphones, without any background noise.
  • Another dataset for training is the DEMAND dataset, which may be suitable for providing the noise samples (Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent. DEMAND: a collection of multi-channel recordings of acoustic noise in diverse environments, June 2013. Supported by Inria under the Associate Team Program VERSAMUS).
  • the method may comprise: obtaining a corrupted audio signal comprising speech of the target user and noise (step S200); inputting the corrupted audio signal and a speaker embedding vector for the target user into the trained ML model (step S202); and using the trained ML model to enhance an amplitude and a phase of the corrupted audio signal (step S204).
  • the method may comprise using the cosine correction and the sine correction to the phase, and trigonometric identities to compute a cosine of the phase and a sine of the phase of an enhanced audio signal; and generating the enhanced audio signal.
  • the enhanced amplitude may be higher than the amplitude of the corrupted audio signal.
  • FIG. 5 is a block diagram of a system for training and using a ML model to perform speech enhancement.
  • the system comprises a server 100 for training a machine learning, ML, model 106 for speech enhancement.
  • the server comprises: at least one processor 102 coupled to memory 104 and arranged to: obtain a training dataset 108 comprising a plurality of corrupted audio signals, the corrupted audio signals containing speech of individual speakers; and train the ML model 106 to enhance an amplitude and a phase of each corrupted audio signal of the training dataset, by modelling phase correction.
  • the system comprises an apparatus 200 for using a trained machine learning, ML, model 206 (i.e. the model trained by the server 100) to perform speech enhancement for a target user.
  • the apparatus 200 comprises: an audio capture device 208 for capturing audio signals, including the corrupted audio signals.
  • the apparatus 200 comprises at least one processor 202 coupled to memory 204 and arranged to: obtain a corrupted audio signal comprising speech of the target user and noise; input the corrupted audio signal (e.g. that captured by the audio capture device 208) and a speaker embedding vector 210 for the target user into the trained ML model; and use the trained ML model 206 to enhance an amplitude and a phase of the corrupted audio signal, and thereby generate an enhanced audio signal.
  • the apparatus 200 may be any one of: a smartphone, tablet, laptop, computer or computing device, virtual assistant device, a robot or robotic device, a robotic assistant, image capture system or device, an Internet of Things device, and a smart consumer device. It will be understood that this is a non-limiting and non-exhaustive list of apparatuses.
  • the at least one processor 202 may comprise one or more of: a microprocessor, a microcontroller, and an integrated circuit.
  • the memory 204 may comprise volatile memory, such as random access memory (RAM), for use as temporary memory, and/or non-volatile memory such as Flash, read only memory (ROM), or electrically erasable programmable ROM (EEPROM), for storing data, programs, or instructions, for example.
  • the present techniques provide higher quality and/or more efficient personalized or non-personalized speech enhancement systems, with fewer artifacts, yielding improved telephony and/or downstream automatic speech recognition (ASR) performance.
  • the present techniques enable the development of speech enhancement systems that can be deployed in a wider range of devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)
  • Filters That Use Time-Delay Elements (AREA)

Abstract

Embodiments of the present techniques provide a method for training a machine learning, ML, model to reduce or remove noise in audio signals that comprise speech, and a method of using a trained ML model to clean noisy speech signals and thereby improve telephony and/or automatic speech recognition performance. Advantageously, the present techniques use a ML model having a single neural network to perform both amplitude and phase enhancement of corrupted (noisy) audio signals. This is advantageous because it ensures that both important parts of an audio signal are enhanced, without increasing the model size or complexity, which results in enhanced audio signals even on resource-constrained devices, such as smartphones.

Description

METHOD AND APPARATUS FOR SPEECH ENHANCEMENT
The present application generally relates to a method, apparatus and system for speech enhancement. In particular, the present application provides a method for training a machine learning, ML, model to reduce or remove noise in audio signals that comprise speech, and a method of using a trained ML model to clean noisy speech signals and thereby improve telephony and/or automatic speech recognition performance.
Speech recognition systems are ubiquitous, and most often used on personal devices, such as mobile phones or voice assistant devices. Currently, machine learning models are the state-of-the-art in speech recognition. Machine learning relies on using training data to train a machine learning model, such as a deep neural network, to perform a specific task. Solving complex tasks may require large amounts of data and possibly large neural networks that demand substantial computational resources, both for training and at inference time.
The applicant has therefore identified the need for an improved method to perform speech enhancement.
Solution to Problem
In a first approach of the present techniques, there is provided a computer-implemented method for training, on a server, a machine learning, ML, model for speech enhancement, the method comprising: obtaining a training dataset comprising a plurality of corrupted audio signals, the corrupted audio signals containing speech of individual speakers; and training the ML model to enhance an amplitude (also referred to herein as a ‘norm’) and a phase of each corrupted audio signal of the training dataset, by modelling phase correction.
Advantageously, the present techniques use a ML model having a single neural network to perform both amplitude and phase enhancement of corrupted (noisy) audio signals. This is advantageous because it ensures that both important parts of an audio signal are enhanced, without increasing the model size or complexity, which results in enhanced audio signals even on resource-constrained devices, such as smartphones.
The plurality of corrupted audio signals may be time domain signals. In this case, the method may comprise transforming the time domain signals into frequency domain signals prior to using the corrupted audio signals to train the ML model. This is useful because there are advantages of modelling phase in the frequency domain.
The method may comprise: inputting, into the ML model, the amplitude, cosine of the phase and sine of the phase of each corrupted audio signal. Specifically, the inputting may comprise inputting a tensor into the ML model, the tensor comprising the amplitude, cosine of the phase and sine of the phase.
The amplitude, cosine of the phase and sine of the phase may be interleaved prior to input into the ML model. Compared to situations where the inputs into a ML model are concatenated, the interleaving of the inputs makes more effective use of spatial operations, and has particularly important advantages for, e.g., grouped convolution operations.
The method may further comprise: outputting, from the ML model, an enhanced amplitude, a cosine correction to the phase and a sine correction to the phase. Thus, the output of the ML model is not an enhanced audio signal, but the parameters required to generate the enhanced audio signal. This is advantageous because it reduces modelling difficulty and complexity.
The training dataset may comprise a plurality of noise samples, a plurality of clean audio samples that each contain speech of individual speakers, and a speaker embedding vector for each individual speaker. Thus, obtaining the training dataset may comprise: generating, using the clean audio samples, the corrupted audio samples by adding at least one noise sample to each clean audio sample.
Training the ML model may comprise: training neural networks of the ML model, using the training dataset, to remove the noise from the corrupted audio samples while maintaining the speech of the individual speakers and thereby generate enhanced audio samples; comparing each enhanced audio sample with the corresponding clean audio sample and determining how well the generated enhanced audio sample matches the corresponding clean audio sample; calculating, using a result of the comparing, at least one loss function; and updating the neural networks to minimise the at least one loss function.
By using the ML model to operate on the amplitude, cosine of the phase and sine of the phase of the corrupted signals, there are advantageously more loss functions available compared to other techniques. Thus, training the ML model may use one or more of these loss functions.
In some cases, calculating at least one loss function may comprise calculating a loss function using an amplitude of the enhanced audio sample and an amplitude of the corresponding clean audio sample.
More generally, calculating at least one loss function may comprise calculating a loss function using an amplitude, a phase, a cosine phase and/or a sine phase of the enhanced audio sample and an amplitude, a phase, a cosine phase and/or a sine phase of the corresponding clean audio sample.
Training the ML model may comprise combining multiple loss functions using coefficients that are obtained deterministically or stochastically. Advantageously, this means that the ML model may be trained to generate enhanced audio signals that are suitable for multiple applications. For example, training the ML model may comprise combining multiple loss functions and training the ML model to generate enhanced audio signals for both telephony and automatic speech recognition (ASR).
In a second approach of the present techniques, there is provided a server for training a machine learning, ML, model for speech enhancement, the server comprising: at least one processor coupled to memory and arranged to: obtain a training dataset comprising a plurality of corrupted audio signals, the corrupted audio signals containing speech of individual speakers; and train the ML model to enhance an amplitude and a phase of each corrupted audio signal of the training dataset, by modelling phase correction.
The features described above with respect to the first approach apply equally to the second approach and therefore, for the sake of conciseness, are not repeated.
In a third approach of the present techniques, there is provided a computer-implemented method for using a trained machine learning, ML, model to perform speech enhancement for a target user, the method comprising: obtaining a corrupted audio signal comprising speech of the target user and noise; inputting the corrupted audio signal and a speaker embedding vector for the target user into the trained ML model; and using the trained ML model to enhance an amplitude and a phase of the corrupted audio signal.
The corrupted audio signal may be a time domain signal and the method may comprise: transforming the time domain signal into a frequency domain signal prior to input into the trained ML model.
The method may comprise: inputting, into the trained ML model, the amplitude, cosine of the phase and sine of the phase of the corrupted audio signal. The amplitude, cosine of the phase and sine of the phase may be interleaved prior to input into the trained ML model.
The method may comprise: outputting, from the trained ML model, an enhanced amplitude, a cosine correction to the phase and a sine correction to the phase.
The method may further comprise: using the cosine correction and the sine correction to the phase, and trigonometric identities to compute a cosine of the phase and a sine of the phase of an enhanced audio signal; and generating the enhanced audio signal.
Advantageously, the enhanced amplitude may be higher than the amplitude of the corrupted audio signal.
The method may further comprise: transforming the enhanced audio signal from a frequency domain signal into a time domain signal.
In a fourth approach of the present techniques, there is provided an apparatus for using a trained machine learning, ML, model to perform speech enhancement for a target user, the apparatus comprising: an audio capture device; and at least one processor coupled to memory and arranged to: obtain a corrupted audio signal comprising speech of the target user and noise; input the corrupted audio signal and a speaker embedding vector for the target user into the trained ML model; and use the trained ML model to enhance an amplitude and a phase of the corrupted audio signal, and thereby generate an enhanced audio signal.
The features described above with respect to the third approach apply equally to the fourth approach and therefore, for the sake of conciseness, are not repeated.
In some cases, the corrupted audio signal may be obtained during an audio call and the at least one processor may be configured to transmit the enhanced audio signal to another participant in the audio call.
The at least one processor may be configured to input the enhanced audio signal into an automatic speech recognition (ASR) system.
The apparatus may be a constrained-resource device that nevertheless has the minimum hardware capabilities to use a trained neural network. The apparatus may be any one of: a smartphone, tablet, laptop, computer or computing device, virtual assistant device, a vehicle, an autonomous vehicle, a robot or robotic device, a robotic assistant, image capture system or device, an augmented reality system or device, a virtual reality system or device, a gaming system, an Internet of Things device, or a smart consumer device (such as a smart fridge). It will be understood that this is a non-exhaustive and non-limiting list of example apparatus.
In a related approach of the present techniques, there is provided a computer-readable storage medium comprising instructions which, when executed by a processor, causes the processor to carry out any of the methods described herein.
As will be appreciated by one skilled in the art, the present techniques may be embodied as a system, method or computer program product. Accordingly, present techniques may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
Furthermore, the present techniques may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present techniques may be written in any combination of one or more programming languages, including object oriented programming languages and conventional procedural programming languages. Code components may be embodied as procedures, methods or the like, and may comprise sub-components which may take the form of instructions or sequences of instructions at any of the levels of abstraction, from the direct machine instructions of a native instruction set to high-level compiled or interpreted language constructs.
Embodiments of the present techniques also provide a non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out any of the methods described herein.
The techniques further provide processor control code to implement the above-described methods, for example on a general purpose computer system or on a digital signal processor (DSP). The techniques also provide a carrier carrying processor control code to, when running, implement any of the above methods, in particular on a non-transitory data carrier. The code may be provided on a carrier such as a disk, a microprocessor, CD- or DVD-ROM, programmed memory such as non-volatile memory (e.g. Flash) or read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. Code (and/or data) to implement embodiments of the techniques described herein may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as Python, C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog (RTM) or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, such code and/or data may be distributed between a plurality of coupled components in communication with one another. The techniques may comprise a controller which includes a microprocessor, working memory and program memory coupled to one or more of the components of the system.
It will also be clear to one of skill in the art that all or part of a logical method according to embodiments of the present techniques may suitably be embodied in a logic apparatus comprising logic elements to perform the steps of the above-described methods, and that such logic elements may comprise components such as logic gates in, for example a programmable logic array or application-specific integrated circuit. Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored and transmitted using fixed or transmittable carrier media.
In an embodiment, the present techniques may be realised in the form of a data carrier having functional data thereon, said functional data comprising functional computer data structures to, when loaded into a computer system or network and operated upon thereby, enable said computer system to perform all the steps of the above-described method.
The method described above may be wholly or partly performed on an apparatus, i.e. an electronic device, using a machine learning or artificial intelligence model. The model may be processed by an artificial intelligence-dedicated processor designed in a hardware structure specified for artificial intelligence model processing. The artificial intelligence model may be obtained by training. Here, "obtained by training" means that a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence model with multiple pieces of training data by a training algorithm. The artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values and performs neural network computation by computation between a result of computation by a previous layer and the plurality of weight values.
As mentioned above, the present techniques may be implemented using an AI model. A function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor. The processor may include one or a plurality of processors. At this time, one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU). The one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence model is provided through training or learning. Here, being provided through learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule or AI model of a desired characteristic is made. The learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and performs a layer operation through computation between a result of computation by a previous layer and the plurality of weight values. Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann Machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
The learning algorithm is a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
Implementations of the present techniques will now be described, by way of example only, with reference to the accompanying drawings.
Fig.1
is a schematic diagram of an existing speech enhancement method.
Fig.2
is a schematic diagram of the speech enhancement method of the present techniques.
Fig.3
is a flowchart of example steps to train an ML model according to the present techniques.
Fig.4
is a flowchart of example steps to use a trained ML model to perform speech enhancement according to the present techniques.
Fig.5
is a block diagram of a system for training and using a ML model to perform speech enhancement.
Broadly speaking, embodiments of the present techniques provide a method for training a machine learning, ML, model to reduce or remove noise in audio signals that comprise speech, and a method of using a trained ML model to clean noisy speech signals and thereby improve telephony and/or automatic speech recognition performance. Advantageously, the present techniques use a ML model having a single neural network to perform both amplitude and phase enhancement of corrupted (noisy) audio signals. This is advantageous because it ensures that both important parts of an audio signal are enhanced, without increasing the model size or complexity, which results in enhanced audio signals even on resource-constrained devices, such as smartphones.
In the context of speech enhancement, the importance of modelling phase, in addition to norm (amplitude), is well recognized as a potential solution to increase speech quality for telephony applications and decrease artifacts that adversely affect the performance of downstream tasks such as ASR.
Unfortunately, the fact that phase varies much more rapidly over time than norm has prevented its widespread adoption to improve speech enhancement systems in the frequency domain, with only a few exceptions in recent years.
Indeed, this difficulty has led researchers and engineers to take advantage of phase information indirectly, by enhancing the speech in the time domain rather than in the frequency domain. This is disadvantageous in terms of model size, latency and energy requirements. In particular, requiring more outputs per time unit naturally increases computational requirements; there may be no access to specialized support for the Short-Time Fourier Transform (STFT) and inverse STFT (iSTFT) that could allow for speeding up computation; and/or the STFT computation, which may in any case need to be computed for downstream ASR systems, may be discarded.
Given the advantages of modelling phase in the frequency domain, one key observation of recent work is that it often consists of architecture-dependent innovations and usually solves a harder problem than required. For these reasons, the present applicant proposes an architecture-independent modelling strategy, by considering as inputs to the network the norm, the cosine of the phase and the sine of the phase, interleaved or channelized rather than concatenated, for effective use of spatial operations, with important advantages for grouped convolutions, for instance. Furthermore, the present applicant decreases the modelling difficulty by considering as outputs of the network the norm, the cosine correction and the sine correction rather than the signals themselves, and leveraging the corrections via known trigonometric identities, which reduces the cost of computing the phase since the identities do not need to be learned.
Most approaches for real-time on-device speech enhancement are discriminative. For example, Complex U-Net introduces complex convolutional layers that operate on the real and imaginary representation of the signal in the frequency domain, i.e., with complex-valued input and output signals. This is achieved by proposing hand-crafted versions of what could be complex counterparts to traditional layers, e.g., Conv1D. In comparison, the present techniques advantageously do not require the introduction of any specialized layer type or architecture, and therefore can take advantage of all optimizations for existing types of layers that operate on real signals. Furthermore, whereas the Complex U-Net work operates on the real and imaginary parts of the signal, the present techniques instead operate on the norm, the cosine of the correction to the phase and the sine of the correction to the phase, since those present fewer discontinuities (computing the predicted cosine and sine via trigonometric identities of sums of angles, thereby removing that effort from the network). This incurs an increased input dimension (3 rather than 2 times) but yields better performance. That is, the present techniques trade off between quality and input dimension.
DCCRN is another discriminative technique, which proposes another hand-crafted architecture for complex neural networks that operate on the real and imaginary parts of the signal, but in this particular case with convolutional and recurrent blocks in an encoder (conv), filter (rnn) and decoder (conv) manner. In contrast, the present techniques are advantageous over DCCRN because they provide a general solution built from any layer type that is already supported in software/hardware, and that yields better performance due to modelling the norm, the cosine of the correction to the phase and the sine of the correction to the phase rather than the real and imaginary parts.
PHASEN is another discriminative technique, which predicts the norm, the cosine of the phase and the sine of the phase. The PHASEN architecture is hand-crafted as a two-stream network, where an amplitude stream and a phase stream are dedicated to amplitude and phase prediction respectively, with cross connections at engineered points. In contrast to PHASEN, rather than modelling the phase directly, the present techniques model the phase correction. This is possible by leveraging trigonometric identities of phase addition, and importantly allows the network of the present techniques to focus on an easier modelling problem, which therefore yields better performance at lower computational cost. In addition, the present techniques only require one network for both amplitude and phase, which lends itself naturally to any form factor and architecture search technique. They work with inputs and outputs encoded as spatially-structured tensors, which allows the use of Conv2D, ConvLSTM2D and Conv2DTranspose, with advantages in terms of efficiency (energy/latency) that are otherwise not natural to obtain. Furthermore, modelling the cosine and the sine of the phase rather than the real and imaginary parts of the signal decreases discontinuities in the signal and yields better performance, i.e., SDRi for telephony applications and WER for ASR applications.
Generative approaches take advantage of the fact that measurements of phase are highly sensitive to sampling time, and cast speech enhancement as a generative task, most often in the time domain. However, they are often too expensive for real-time on-device deployment.
Figure 1 is a schematic diagram of an existing speech enhancement method. The standard approach in Figure 1 consists in taking an audio signal in the time domain, converting it to the frequency domain, and learning a model, e.g., a neural network, whose task is to predict a mask with values in [0, 1]. This is so that the enhanced signal, when converted back to the time domain for a telephony application or used as-is for a downstream ASR system, contains only the audio signal that matters, i.e., the one of the device owner, without any ambient noise (such as fan noise) or babble noise (e.g., other people talking in the background).
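For concreteness, a minimal sketch of this baseline is given below, assuming a generic trained masking network; the `model` callable, the STFT parameters and the `librosa` dependency are illustrative choices, not prescribed by the present disclosure:

```python
import numpy as np
import librosa  # any STFT/iSTFT implementation may be substituted

def mask_based_enhance(x, model, n_fft=512, hop=128):
    # Baseline of Figure 1: enhance the magnitude only, via a [0, 1]-valued mask.
    X = librosa.stft(x, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(X), np.angle(X)
    mask = model(mag)                          # mask values in [0, 1]
    S_hat = (mask * mag) * np.exp(1j * phase)  # corrupted phase is reused unchanged
    return librosa.istft(S_hat, hop_length=hop)
```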
The traditional approach in Figure 1 sometimes works reasonably well, but yields two issues in practice: the enhanced signal sounds unnatural, i.e., robotic, which degrades the quality of telephony applications; or it contains artefacts that deteriorate the performance of downstream tasks such as ASR, e.g., incorrect substitutions, deletions or insertions in transcripts. This can be observed in Figure 1: by the limitations of the approach, the enhanced signal has to lie at the same angle as the original signal, i.e., on the dashed line shown in the image.
Figure 2 is a schematic diagram of the speech enhancement method of the present techniques. The present techniques solve the above-mentioned issues by enhancing the phase in addition to the norm of the signal. Furthermore, the present techniques allow for increasing gain (not just decreasing) across both time and frequency. In detail, whereas the approach illustrated in Figure 1 defines the enhanced signal by applying a [0, 1]-valued mask multiplicatively to the corrupted signal, the approach of the present techniques, as depicted in Figure 2, instead defines the enhanced signal through an enhanced norm and an enhanced phase, using a composition of the sigmoid and inverse sigmoid functions for the norm and an additive correction for the phase. Note that the phase is enhanced in the proposed approach, and that in the existing techniques the enhanced norm can never exceed the corrupted norm, whereas no such limitation exists for the proposed approach of the present techniques. As can be seen, unlike traditional methods, the approach of the present techniques allows the enhanced signal to move anywhere in the circle as required to improve telephony or ASR performance.
One advantage of the present techniques is that they are architecture agnostic, which is made possible by assembling the input in a spatially aware form based on the norm, the cosine of the phase and the sine of the phase, whereas previous approaches rely on hand-crafted custom layers to learn how to process this information. Another advantage is that the present techniques focus on modelling the output as phase corrections rather than the phase itself, by taking advantage of trigonometric identities, which are not learned and therefore lead to better and/or smaller speech enhancement models.
Table 1 below presents a comparison of the benefits that can be gained from enhancing the phase in two personalized sound enhancement systems. Table 1 shows an example of performance metrics on two personalized speech enhancement systems without and with phase enhancement. As can be seen from the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) (higher is better) and the Source-to-Distortion Ratio improvement (SDRi) (higher is better), there is a significant improvement in signal quality when enhancing the phase in addition to the norm of audio signals.
Method | SI-SNR | SDRi
Figure 1: Enhancing norm but not phase | 15.91 | 16.84
Figure 2: Enhancing both norm and phase | 16.53 | 17.50
In personalized speech enhancement systems, i.e., those aimed at removing babble noise (other people speaking) in addition to ambient noise (music, for example), enrolment data is collected once at setup time and converted to an embedding or reference vector. At training and inference time, the objective of the system is to enhance the noised signal into the de-noised signal. During training, the model typically learns in a supervised manner by comparing the aforementioned de-noised signal against the ground-truth signal.
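As an illustration only (the present disclosure does not prescribe a particular encoder or pooling strategy), enrolment may be sketched as follows, where `speaker_encoder` is a hypothetical pretrained embedding network:

```python
import numpy as np

def enroll(utterances, speaker_encoder):
    # Compute the reference embedding once at setup time by averaging
    # per-utterance embeddings and unit-normalizing the result.
    embs = [speaker_encoder(u) for u in utterances]
    e = np.mean(embs, axis=0)
    return e / (np.linalg.norm(e) + 1e-12)
```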
Input pre-processing: Given that phase is notoriously discontinuous when represented in a compact interval, say [-π, π), and that phase unwrapping techniques are neither perfect nor ideal given they result in outputs which may increase without bound as time increases, the present applicant instead considers the cosine and the sine of the phase. Whereas this representation increases the input dimension, that disadvantage is greatly offset by the continuous nature of the representation.
Thus, rather than modelling the input as norm and phase, or as real and imaginary parts, the present applicant proposes that the input to the neural network should be based on the norm, the cosine of the phase and the sine of the phase, in the form of a tensor in which the three components are interleaved along the frequency axis for Conv1D, or stacked as channels for Conv2D and Conv2DTranspose.
Note that while there is essentially only one option for creating the latter tensor, there are many for creating the former. In particular, when using grouped convolutions or other layers that act locally, it is crucial that rather than stacking the vectors, they are instead interleaved, as shown in the sketch below.
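A minimal sketch of the two assembly options, assuming (T, F) feature maps (function name and shapes are illustrative):

```python
import numpy as np

def assemble_inputs(norm, cos_ph, sin_ph, layout="conv2d"):
    # norm, cos_ph, sin_ph: (T, F) maps derived from the STFT of the corrupted signal.
    if layout == "conv2d":
        # Channelized: a (T, F, 3) tensor, natural for Conv2D/ConvLSTM2D/Conv2DTranspose.
        return np.stack([norm, cos_ph, sin_ph], axis=-1)
    # Conv1D: interleave along frequency so that each (norm, cos, sin) triplet
    # stays spatially local, i.e., [n0, c0, s0, n1, c1, s1, ...] rather than
    # the stacked [n0..nF, c0..cF, s0..sF], which would defeat grouped convolutions.
    T, F = norm.shape
    out = np.empty((T, 3 * F), dtype=norm.dtype)
    out[:, 0::3], out[:, 1::3], out[:, 2::3] = norm, cos_ph, sin_ph
    return out
```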
Output post-processing: Given a neural network, its outputs are un-interleaved, scaled, normalized and combined via sum formulas for trigonometric functions, such as, for example, cos(θ + δ) = cos(θ)cos(δ) - sin(θ)sin(δ) and sin(θ + δ) = sin(θ)cos(δ) + cos(θ)sin(δ), with the sigmoid and inverse sigmoid functions used in the scaling. Note that whereas the input to the network models the phase θ, the output of the network models the phase correction offset δ.
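The following sketch illustrates the post-processing under stated assumptions: the network emits an amplitude map and raw (cosine, sine) correction maps; the correction pair is normalized onto the unit circle before the sum identities are applied; and an exponential gain stands in for the sigmoid/inverse-sigmoid scaling, whose exact composition is not reproduced here:

```python
import numpy as np

def postprocess(amp_raw, cos_raw, sin_raw, norm_in, cos_in, sin_in):
    # Normalize the predicted correction so that cos^2 + sin^2 = 1.
    scale = np.sqrt(cos_raw**2 + sin_raw**2) + 1e-12
    cos_d, sin_d = cos_raw / scale, sin_raw / scale
    # Sum identities: the output phase is the input phase plus the correction,
    # computed without ever materializing (or learning) the angles themselves.
    cos_out = cos_in * cos_d - sin_in * sin_d
    sin_out = sin_in * cos_d + cos_in * sin_d
    # Illustrative positive gain (an assumption, not the disclosed formula):
    # it allows the enhanced norm to exceed the corrupted norm, unlike a [0, 1] mask.
    norm_out = np.exp(amp_raw) * norm_in
    return norm_out, cos_out, sin_out
```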
Loss functions: Given that the present techniques consider the data representation as norm, cosine of the phase and sine of the phase, rather than the obvious approach of norm and phase or of real and imaginary parts, the present applicants have more loss functions readily available than otherwise is possible, namely losses defined on the amplitude, on the cosine and sine of the phase, and on the complex-valued signal itself, which are aggregated into a weighted combination.
Deterministic: Note that particular choices of the aggregation coefficients recover the complex Mean Squared Error (MSE), whereas other combinations may also be used. The model may be trained using a deterministic combination, i.e. using coefficients that are determined deterministically.
Stochastic: In addition to training with deterministic combinations of the aforementioned loss functions, it is also possible for the coefficients to be stochastically determined. For example, each coefficient may be the product of a sample drawn from a Bernoulli distribution (i.e., whether or not to include the corresponding loss term) and a sample drawn from a Gamma distribution (by how much to weight the corresponding loss term). Given the inherent disconnect between the objectives of improving telephony and ASR quality, i.e., perceptual quality versus performance of downstream tasks, the model is trained using the proposed stochastic combination.
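A minimal sketch of the stochastic aggregation, with illustrative Bernoulli and Gamma parameters (the disclosure does not fix these values):

```python
import numpy as np

def stochastic_loss(loss_terms, rng, p=0.5, shape=2.0, scale=0.5):
    # Each coefficient is (Bernoulli sample) * (Gamma sample): the Bernoulli
    # decides whether the term participates at this step, the Gamma decides
    # by how much it is weighted.
    total = 0.0
    for term in loss_terms:  # e.g., amplitude, cosine, sine and complex-MSE terms
        total += rng.binomial(1, p) * rng.gamma(shape, scale) * term
    return total

# Usage: rng = np.random.default_rng(0); loss = stochastic_loss([l_amp, l_cos, l_sin], rng)
```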
Figure 3 is a flowchart of example steps to train an ML model according to the present techniques. The method may comprise: obtaining a training dataset comprising a plurality of corrupted audio signals, the corrupted audio signals containing speech of individual speakers (step S100); and training the ML model to enhance an amplitude (also referred to herein as a 'norm') and a phase of each corrupted audio signal of the training dataset, by modelling phase correction (step S102).
The loss functions mentioned above, aggregated in a deterministic or stochastic manner, may be used to train the ML model in a supervised manner, using the data described above.
Specifically, the training dataset may comprise a plurality of noise samples, a plurality of clean audio samples that each contain speech of individual speakers, and a speaker embedding vector for each individual speaker. Thus, the step S100 of obtaining the training dataset may comprise: generating, using the clean audio samples, the corrupted audio samples by adding at least one noise sample to each clean audio sample. Then, step S102 of training the ML model may comprise: training neural networks of the ML model, using the training dataset, to remove the noise from the corrupted audio samples while maintaining the speech of the individual speakers and thereby generate enhanced audio samples; comparing each enhanced audio sample with the corresponding clean audio sample and determining how well the generated enhanced audio sample matches the corresponding clean audio sample; calculating, using a result of the comparing, at least one loss function; and updating the neural networks to minimise the at least one loss function.
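A single supervised update may be sketched as follows (TensorFlow is an illustrative choice; the model interface, loss aggregation and tensor layouts follow the assumptions above):

```python
import tensorflow as tf

@tf.function
def train_step(model, optimizer, noisy_feats, clean_feats, embedding, loss_fn):
    with tf.GradientTape() as tape:
        # The model consumes interleaved/channelized features plus the speaker
        # embedding, and emits amplitude and phase-correction maps.
        pred = model([noisy_feats, embedding], training=True)
        loss = loss_fn(clean_feats, pred)  # deterministic or stochastic aggregation
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```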
As noted, the training requires speech and noise datasets. Examples of publicly available datasets for training include LibriSpeech (Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: an ASR corpus based on public domain audio books. In ICASSP, 2015) for spoken text. This is suitable for generating appropriate datasets for training and testing. For example, the datasets may be generated by taking 100h and 360h of clean speech from LibriSpeech, which are recorded using close-talk microphones without any background noise. Another dataset is DEMAND, which may be suitable for providing the noise samples (Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent. DEMAND: a collection of multi-channel recordings of acoustic noise in diverse environments, June 2013. Supported by Inria under the Associate Team Program VERSAMUS).
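For example, corrupted samples may be generated by mixing a clean LibriSpeech utterance with a DEMAND noise clip at a target signal-to-noise ratio; the SNR policy below is an assumption, not prescribed by the disclosure:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Loop the noise if it is shorter than the utterance, then scale it so the
    # mixture attains the requested SNR relative to the clean speech.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    clean_pow = np.mean(clean**2) + 1e-12
    noise_pow = np.mean(noise**2) + 1e-12
    target_pow = clean_pow / (10.0 ** (snr_db / 10.0))
    return clean + noise * np.sqrt(target_pow / noise_pow)
```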
Not shown in Figure 3 are the pre-processing and post-processing steps described above.
Figure 4 is a flowchart of example steps to use a trained ML model to perform speech enhancement according to the present techniques. The method may comprise: obtaining a corrupted audio signal comprising speech of the target user and noise (step S200); inputting the corrupted audio signal and a speaker embedding vector for the target user into the trained ML model (step S202); and using the trained ML model to enhance an amplitude and a phase of the corrupted audio signal (step S204).
Not shown in Figure 4 are the pre-processing and post-processing steps described above.
After step S204, the method may comprise using the cosine correction and the sine correction to the phase, together with trigonometric identities, to compute a cosine of the phase and a sine of the phase of an enhanced audio signal; and generating the enhanced audio signal. This is described in more detail above. Advantageously, the enhanced amplitude may be higher than the amplitude of the corrupted audio signal.
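Putting the pieces together, inference may be sketched end-to-end by reusing the illustrative helpers above (all names, shapes and STFT parameters remain assumptions):

```python
import numpy as np
import librosa

def enhance(x, model, embedding, n_fft=512, hop=128):
    X = librosa.stft(x, n_fft=n_fft, hop_length=hop)
    norm, phase = np.abs(X).T, np.angle(X).T              # (T, F) maps
    feats = assemble_inputs(norm, np.cos(phase), np.sin(phase))
    amp_raw, cos_raw, sin_raw = model([feats[None], embedding[None]])
    n_out, c_out, s_out = postprocess(
        amp_raw[0], cos_raw[0], sin_raw[0], norm, np.cos(phase), np.sin(phase))
    S_hat = (n_out * (c_out + 1j * s_out)).T              # back to (F, T)
    return librosa.istft(S_hat, hop_length=hop)
```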
Figure 5 is a block diagram of a system for training and using a ML model to perform speech enhancement.
The system comprises a server 100 for training a machine learning, ML, model 106 for speech enhancement. The server comprises: at least one processor 102 coupled to memory 104 and arranged to: obtain a training dataset 108 comprising a plurality of corrupted audio signals, the corrupted audio signals containing speech of individual speakers; and train the ML model 106 to enhance an amplitude and a phase of each corrupted audio signal of the training dataset, by modelling phase correction.
The system comprises an apparatus 200 for using a trained machine learning, ML, model 206 (i.e. the model trained by the server 100) to perform speech enhancement for a target user. The apparatus 200 comprises: an audio capture device 208 for capturing audio signals, including the corrupted audio signals.
The apparatus 200 comprises at least one processor 202 coupled to memory 204 and arranged to: obtain a corrupted audio signal comprising speech of the target user and noise; input the corrupted audio signal (e.g. that captured by the audio capture device 208) and a speaker embedding vector 210 for the target user into the trained ML model; and use the trained ML model 206 to enhance an amplitude and a phase of the corrupted audio signal, and thereby generate an enhanced audio signal.
The apparatus 200 may be any one of: a smartphone, tablet, laptop, computer or computing device, virtual assistant device, a robot or robotic device, a robotic assistant, image capture system or device, an Internet of Things device, and a smart consumer device. It will be understood that this is a non-limiting and non-exhaustive list of apparatuses.
The at least one processor 202 may comprise one or more of: a microprocessor, a microcontroller, and an integrated circuit. The memory 204 may comprise volatile memory, such as random access memory (RAM), for use as temporary memory, and/or non-volatile memory such as Flash, read only memory (ROM), or electrically erasable programmable ROM (EEPROM), for storing data, programs, or instructions, for example.
Thus, the present techniques provide higher quality and/or more efficient personalized or non-personalized speech enhancement systems, with fewer artifacts, yielding improved telephony and/or downstream automatic speech recognition (ASR) performance. At a high level this is achieved by:
Enhancing phase in addition to norm information in an architecture agnostic way, i.e., via more efficient modelling strategies rather than custom layers;
Allowing the norm of the enhanced signal to be higher than that of the original signal, by modelling the norm directly rather than through masks;
Modelling the phase continuously, rather than discontinuously, as an offset of the original phase signal, making the speech enhancement problem easier, thereby decreasing model size, latency and energy requirements;
By jointly using norm and phase information to enhance both norm and phase signals, there is an improvement not only on the phase itself, but on the norm as well, which yields additional improvements for telephony and ASR;
Considering new deterministic and stochastic loss functions aimed at training systems with a better performance at both telephony and ASR rather than optimizing different systems for each one;
By having an architecture-agnostic approach solving an easier problem, the present techniques enable the development of speech enhancement systems that can be deployed in a wider range of devices.
Those skilled in the art will appreciate that while the foregoing has described what is considered to be the best mode and where appropriate other modes of performing present techniques, the present techniques should not be limited to the specific configurations and methods disclosed in this description of the preferred embodiment. Those skilled in the art will recognise that present techniques have a broad range of applications, and that the embodiments may take a wide range of modifications without departing from any inventive concept as defined in the appended claims.

Claims (15)

  1. A computer-implemented method for training, on a server, a machine learning, ML, model for speech enhancement, the method comprising:
    obtaining a training dataset comprising a plurality of corrupted audio signals, the corrupted audio signals containing speech of individual speakers; and
    training the ML model to enhance an amplitude and a phase of each corrupted audio signal of the training dataset, by modelling phase correction.
  2. The method as claimed in claim 1, wherein the plurality of corrupted audio signals are time domain signals, and wherein the method comprises transforming the time domain signals into frequency domain signals prior to using the corrupted audio signals to train the ML model.
  3. The method as claimed in claim 1 further comprising:
    inputting, into the ML model, the amplitude, cosine of the phase and sine of the phase of each corrupted audio signal.
  4. The method as claimed in claim 3 wherein the inputting comprises inputting a tensor into the ML model, the tensor comprising the amplitude, cosine of the phase and sine of the phase.
  5. The method as claimed in claim 4 wherein the amplitude, cosine of the phase and sine of the phase are interleaved.
  6. The method as claimed in claim 1, further comprising:
    outputting, from the ML model, an enhanced amplitude, a cosine correction to the phase and a sine correction to the phase.
  7. The method as claimed in claim 4 wherein the training dataset comprises a plurality of noise samples, a plurality of clean audio samples that each contain speech of individual speakers, and a speaker embedding vector for each individual speaker, and wherein obtaining the training dataset comprises:
    generating, using the clean audio samples, the corrupted audio samples by adding at least one noise sample to each clean audio sample.
  8. The method as claimed in claim 7 wherein training the ML model comprises:
    training neural networks of the ML model, using the training dataset, to remove the noise from the corrupted audio samples while maintaining the speech of the individual speakers and thereby generate enhanced audio samples;
    comparing each enhanced audio sample with the corresponding clean audio sample and determining how well the generated enhanced audio sample matches the corresponding clean audio sample;
    calculating, using a result of the comparing, at least one loss function; and
    updating the neural networks to minimise the at least one loss function.
  9. The method as claimed in claim 8 wherein calculating at least one loss function comprises calculating a loss function using an amplitude of the enhanced audio sample and an amplitude of the corresponding clean audio sample.
  10. The method as claimed in claim 8 wherein calculating at least one loss function comprises calculating a loss function using an amplitude, a phase, a cosine phase and/or a sine phase of the enhanced audio sample and an amplitude, a phase, a cosine phase and/or a sine phase of the corresponding clean audio sample.
  11. The method as claimed in claim 8 wherein training the ML model comprises combining multiple loss functions using coefficients that are obtained deterministically or stochastically.
  12. The method as claimed in claim 11 wherein training the ML model comprises combining multiple loss functions and training the ML model to generate enhanced audio signals for both telephony and automatic speech recognition.
  13. A server for training a machine learning, ML, model for speech enhancement, the server comprising:
    at least one processor coupled to memory and arranged to:
    obtain a training dataset comprising a plurality of corrupted audio signals, the corrupted audio signals containing speech of individual speakers; and
    train the ML model to enhance an amplitude and a phase of each corrupted audio signal of the training dataset, by modelling phase correction.
  14. A computer-implemented method for using a trained machine learning, ML, model to perform speech enhancement for a target user, the method comprising:
    obtaining a corrupted audio signal comprising speech of the target user and noise;
    inputting the corrupted audio signal and a speaker embedding vector for the target user into the trained ML model; and
    using the trained ML model to enhance an amplitude and a phase of the corrupted audio signal.
  15. The method as claimed in claim 14 wherein the corrupted audio signal is a time domain signal and the method comprises:
    transforming the time domain signal into a frequency domain signal prior to input into the trained ML model.
PCT/IB2023/057347 2022-07-19 2023-07-19 Method and apparatus for speech enhancement WO2024018390A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB2210575.3A GB2620747B (en) 2022-07-19 2022-07-19 Method and apparatus for speech enhancement
GB2210575.3 2022-07-19

Publications (1)

Publication Number Publication Date
WO2024018390A1 true WO2024018390A1 (en) 2024-01-25

Family

ID=84540234

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2023/057347 WO2024018390A1 (en) 2022-07-19 2023-07-19 Method and apparatus for speech enhancement

Country Status (2)

Country Link
GB (1) GB2620747B (en)
WO (1) WO2024018390A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060031066A1 (en) * 2004-03-23 2006-02-09 Phillip Hetherington Isolating speech signals utilizing neural networks
CN112599147A (en) * 2021-03-04 2021-04-02 北京嘉诚至盛科技有限公司 Audio noise reduction transmission method and device, electronic equipment and computer readable medium
CN113808602A (en) * 2021-01-29 2021-12-17 北京沃东天骏信息技术有限公司 Speech enhancement method, model training method and related equipment
CN114171041A (en) * 2021-11-30 2022-03-11 深港产学研基地(北京大学香港科技大学深圳研修院) Voice noise reduction method, device and equipment based on environment detection and storage medium
CN113113039B (en) * 2019-07-08 2022-03-18 广州欢聊网络科技有限公司 Noise suppression method and device and mobile terminal

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3726529A1 (en) * 2019-04-16 2020-10-21 Fraunhofer Gesellschaft zur Förderung der Angewand Method and apparatus for determining a deep filter
CN114242099A (en) * 2021-12-15 2022-03-25 南京邮电大学 Speech enhancement algorithm based on improved phase spectrum compensation and full convolution neural network


Also Published As

Publication number Publication date
GB2620747B (en) 2024-10-02
GB202210575D0 (en) 2022-08-31
GB2620747A (en) 2024-01-24


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23842537

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE