WO2022103290A1 - Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems - Google Patents

Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems

Info

Publication number
WO2022103290A1
WO2022103290A1 (PCT/RU2020/000600)
Authority
WO
WIPO (PCT)
Prior art keywords
signal
noise
training
signals
speech
Prior art date
Application number
PCT/RU2020/000600
Other languages
French (fr)
Inventor
Marina Viktorovna VOLKOVA
Sergey Aleksandrovich NOVOSYOLOV
Galina Mihaylovna LAVRENTYEVA
Tseren Vladimirovich ANDZHUKAEV
Aleksey Evgenyevich GUSEV
Original Assignee
"Stc"-Innovations Limited"
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by STC-Innovations Limited
Priority to PCT/RU2020/000600
Publication of WO2022103290A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques for measuring the quality of voice signals
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166: Microphone arrays; Beamforming


Abstract

The present invention relates to the field of automatic speech signal quality evaluation and, in particular, to a method for training neural networks to evaluate a signal-to-noise ratio, a reverberation time and the type of noise present in a recording, and to output an overall quality estimation as a function of said evaluations for a whole speech signal or a fragment thereof. The method for training a neural network to evaluate quality characteristics of an input speech signal comprises the steps of: preparing a set of training speech signals, wherein for each of the training speech signals the following is known: a signal-to-noise ratio, a reverberation time and a noise class selected from a predetermined plurality of noise classes; applying a voice activity detector to each of the training speech signals so as to extract training features from said training speech signal; and training the neural network to simultaneously evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on the extracted training features.

Description

METHOD FOR AUTOMATIC QUALITY EVALUATION OF SPEECH SIGNALS USING NEURAL NETWORKS FOR SELECTING A CHANNEL IN MULTIMICROPHONE SYSTEMS
Field of the invention
The present invention relates to the field of automatic evaluation of speech signal quality and, in particular, to a method for training neural networks to evaluate a signal-to-noise ratio, a reverberation time and the type of noise present in a recording, and to output an overall quality estimation as a function of said evaluations for a whole speech signal or a fragment thereof.
Speech signal quality evaluation can be used in various speech processing applications, such as automatically selecting the best microphone in a multimicrophone sound recording system. In voice biometrics, it can be used for determining the speech segments of the highest quality in recordings taken in various acoustic conditions, in order to develop a voice pattern of a speaker based on the selected fragments.
Background of the invention
In the field of signal processing, quantitative characteristics, such as a signal-to-noise ratio and a reverberation time, are usually used for evaluating a certain distortion.
A signal-to-noise ratio (SNR) is defined as the ratio of signal power to noise power and can be represented through the following mathematical expression:

SNR = P_signal / P_noise = (A_signal / A_noise)^2,

where SNR is the signal-to-noise ratio,
P_signal is the signal mean power,
P_noise is the noise mean power,
A_signal is the signal root-mean-square amplitude,
A_noise is the noise root-mean-square amplitude.

The reverberation time (RT) is considered the main parameter defining the acoustic environment of the area where the speech was recorded. In most cases, it is determined as the time the sound pressure level takes to decrease by 60 dB (a factor of 1,000,000 in terms of power, or 1,000 in terms of sound pressure). In the literature, the reverberation time defined in this way is usually referred to as RT60 or T60. There are established methods to determine RT60 based on a known room impulse response, but in real-life scenarios of working with sound recordings obtained from random sources, the impulse response is not available. Thus, the task of approximately evaluating the reverberation time from a given sound recording alone, without any additional data on the acoustic conditions, becomes relevant.
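As a non-limiting illustration, the SNR defined above can be computed as follows when the speech and noise components are available as separate arrays (a minimal Python sketch; the function name and the use of NumPy are illustrative assumptions, not part of the described method):

```python
import numpy as np

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-noise ratio in decibels, computed from separate speech and noise arrays."""
    p_signal = np.mean(speech.astype(np.float64) ** 2)  # mean power of the signal
    p_noise = np.mean(noise.astype(np.float64) ** 2)    # mean power of the noise
    return 10.0 * np.log10(p_signal / p_noise)
```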
The most common method for evaluating the quality of speech signals is the mean opinion score (MOS), a subjective statistical test performed by a group of expert listeners. Although the reliability of this evaluation is adequately high with a large number of listeners, it is the most resource-consuming approach. Among objective methods that do not rely on experts, evaluations based on a comparison between the original (reference) and encoded (distorted) signals are often used, for example, perceptual evaluation of speech quality (PESQ) and its improved version, perceptual objective listening quality assessment (POLQA). Although widely used in telecommunication systems, such evaluations are of limited use in real-time speech processing applications, since the reference signal is not available in such cases.
Automatic signal quality evaluation methods that do not require a reference sample for comparison are known from the prior art. These solutions use machine learning methods, typically neural networks, to predict a MOS estimation (CN104581758A, EP3494575A1) or PESQ and POLQA estimations (CN108346434A, CN108322346A).
However, overall estimations obtained through such methods are difficult to interpret in terms of quantitative characteristics of the acoustic environment, such as the signal-to-noise ratio and the reverberation time. Knowing these parameters is necessary for predicting the operational conditions of speech processing systems: for example, SNR ≥ 15 dB and RT60 < 600 ms are considered good conditions for generating a speaker's voice pattern. That is why, besides an overall signal quality estimation, it is important to have some information about the quantitative characteristics of the environment in which the recording was taken. US9396738B2 discloses evaluation of quantitative characteristics of signal quality using neural networks; in particular, a signal-to-noise ratio, spectral clarity, skew, kurtosis and pitch average are determined. However, the main goal of these measurements is to evaluate a distorted signal transmitted via a network in telecommunications applications. Thus, acoustic environment factors, such as the reverberation time, are not accounted for in this evaluation.
In most cases, the process of calculating the signal reverberation time itself is based on the use of an impulse response of a room where the recording was taken (EP2238590A1, WO2015010983). It is evident that such methods can be used only if the room impulse response is known, which is virtually impossible in real-life automatic speech processing applications.
When the room impulse response is not known (the term "blind estimation" is also used), reverberation time evaluation is usually reduced to traditional signal processing methods. For example, US 9558757B1 provides determination of the rate of sound decay by generating an autocorrelogram of the signal intensity as a function of time. However, only reverberation time values that exceed a certain threshold are determined. The disadvantages of this method include low sensitivity to signals having small reverberation time values, the inability to be used on short speech fragments, and poor robustness to noisy data.
A multichannel microphone-based reverberation time estimation method using deep neural networks is disclosed in US 20200082843A1. According to this method, signals obtained through a multichannel microphone are analyzed. However, using this method for speech signal processing from one microphone input is quite problematic.
A method for determining characteristics, selecting and adapting training acoustic signals for an automatic speech recognition system is known from US 9922664B2. This method comprises preparing training data that imitates target environment conditions, including noise and reverberation levels, which can later be used for training a deep neural network. The trained neural network can later be used for classifying speech data samples to simulate codecs corresponding to the speech data samples. However, this method does not allow, in particular, training a neural network to simultaneously predict or evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on the extracted training data. Therefore, there is a need for methods for training neural networks to simultaneously evaluate input speech signal characteristics without using additional data on the acoustic environment in which the speech signal was recorded.
Summary of invention
According to one embodiment of the present invention, a method for training a neural network to evaluate quality characteristics of an input speech signal is provided, the method comprising the steps of: preparing a set of training speech signals, wherein for each of the training speech signals the following is known: a signal-to-noise ratio, a reverberation time and a noise class selected from a predetermined plurality of noise classes; applying a voice activity detector to each of the training speech signals so as to extract training features from said training speech signal; and training the neural network to simultaneously evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on the extracted training features.
When this method is used, the quality characteristics of the signal can be evaluated only based on a given input signal, without the need to compare with a reference or undistorted signal, and for evaluating the reverberation time, knowing an impulse response of a room where the speech was recorded is not required.
According to one embodiment, preparing the set of training speech signals may comprise: providing a plurality of clear speech signals having minimum values for the signal-to-noise ratio and the reverberation time, a plurality of stationary noise signals of various types, and a plurality of impulse responses corresponding to various rooms for which the reverberation time is known; performing a convolution operation on each of the clear speech signals with an impulse response from said plurality of impulse responses to generate a plurality of reverberated signals; combining the generated reverberated signals with the stationary noise signals of various types to generate a plurality of noise-distorted signals having a varying signal-to-noise ratio; generating a final set of training speech signals from the noise-distorted signals, the final set of training speech signals being balanced in terms of the signal-to-noise ratio, the reverberation time and the noise class; and calculating an integral quality estimation for each of the noise-distorted signals as a function of the signal-to-noise ratio and the reverberation time. According to another embodiment, each of the noise signals is reverberated using an impulse response of the same room that was selected for reverberating the corresponding clear speech signal. Alternatively, each of the noise signals is reverberated using a room impulse response different from the impulse response that was selected for reverberating the corresponding clear speech signal.
According to another embodiment, the method comprises the step of using a regression predictor model trained with a cost function based on a mean squared error so as to evaluate the signal-to-noise ratio, the reverberation time and the overall quality estimation.
The noise class is evaluated using a classifier trained with the help of binary cross-entropy.
In a further embodiment, a method for automatically selecting a channel in a multimicrophone system using a neural network trained based on training features is provided, the method comprising the steps of: receiving input speech signals from a plurality of channels of the multimicrophone system; applying a voice activity detector to each of the input speech signals so as to extract therefrom features characterizing the input speech signal and corresponding to the training features; providing the extracted features characterizing the input speech signal to an input of the neural network, while simultaneously evaluating the extracted features; receiving, from an output of the neural network, the following for each of the input speech signals: evaluated values of a signal-to-noise ratio, a reverberation time, an overall quality estimation and a predicted noise class; and selecting a channel from the plurality of channels of the multimicrophone system, wherein the channel has yielded an input speech signal having evaluated values that satisfy a predetermined condition.
Furthermore, the predetermined condition is a maximum value of the overall quality estimation.
Brief description of the drawings
The suggested invention will be described in further detail below with a reference to accompanying drawings, in which:
Fig. 1 is a sequence of operations for preparing the training data set; Fig. 2 is a flow chart of generating speech signal quality estimations from an original audio recording with the help of a trained model.
Detailed description of the preferred embodiments
According to one embodiment, a method for training neural networks to evaluate quality characteristics of an input speech signal is provided. The method comprises preparing a set of training speech signals, wherein for each of the training speech signals the following is known: a signal-to-noise ratio, a reverberation time and a noise class selected from a predetermined plurality of noise classes. The method further comprises applying a voice activity detector to each of the training speech signals so as to extract training features from said training speech signal, and training the neural network to simultaneously evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on the extracted training features.
As illustrated in Fig. 1, at the first stage of preparing a training data set, the following are taken: a plurality of clear speech signals 101 having minimum values for the signal-to-noise ratio and the reverberation time, a plurality of stationary noise signals 102 of various classes, and a plurality of impulse responses 103 corresponding to various rooms for which the reverberation time (T60) is known. Existing speech and noise databases can be used as a source of clear speech signals and of stationary noise signals. The required impulse responses can be generated using special utility software.
In one embodiment, a database of 79 noise classes, such as typing noise, rain noise, the hum of a crowd of people, manufacturing machinery, etc., was used as the stationary noise signals. A specifically generated impulse response database imitating 40,000 rooms of various sizes with reverberation times of 0 to 2 seconds was used as the plurality of impulse responses; for each room, 4 impulse responses were generated imitating various positions of the acoustic source inside that room.
At the second stage of preparing a training data set, a convolution operation is performed on each clear speech signal with an impulse response of an arbitrarily selected room to generate a plurality of reverberated speech signals 104. Several impulse responses can correspond to each room depending on the position of the acoustic source in this room.
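As a non-limiting illustration, the convolution of a clear speech signal with a room impulse response can be sketched as follows (assuming SciPy is available; the peak normalization is an illustrative choice, not taken from the description):

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a clear speech signal with a room impulse response (RIR)."""
    wet = fftconvolve(speech, rir, mode="full")[: len(speech)]  # keep the original length
    # Rescale to the original peak level so later mixing stages see comparable amplitudes
    return wet * (np.max(np.abs(speech)) / (np.max(np.abs(wet)) + 1e-12))
```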
In a further embodiment, each noise signal can also be reverberated 105. Reverberation can be performed using an impulse response of the same room that was selected for reverberating the corresponding clear speech signal. If several impulse responses are available for a room, an impulse response that differs from the one used for reverberating the clear speech signal can be used. This makes it possible to imitate various spatial positions of speech and noise sources and to generate a more realistic database.
At the third stage of preparing a training data set, each reverberated signal from a plurality of reverberated signals generated at the previous stage is combined with stationary noises of various types to result in a plurality of noised signals 106 with various signal-to-noise ratio values.
In one embodiment, in order to provide an accurate signal-to-noise ratio, the power of a speech signal is calculated only on speech segments, with pauses not taken into account; for this, a voice activity detector (VAD) is applied to the speech signal.
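As a non-limiting illustration, mixing a reverberated speech signal with noise at a target SNR, with the speech power measured only on VAD-selected segments, can be sketched as follows (the boolean per-sample vad_mask is assumed to come from an external VAD; all names are illustrative):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray,
               vad_mask: np.ndarray, target_snr_db: float) -> np.ndarray:
    """Scale the noise and add it to the speech so that the SNR, with speech power
    measured on speech segments only (per the VAD mask), equals target_snr_db."""
    noise = np.resize(noise, len(speech))       # loop or trim the noise to the speech length
    p_speech = np.mean(speech[vad_mask] ** 2)   # speech power, pauses excluded
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    return speech + scale * noise
```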
At the fourth stage of preparing a training data set, a final balanced training and test plurality 107 is formed from the prepared speech signals distorted by noise and reverberation. The parameters SNR and RT60 and the noise class are known for each of these signals.
At the fifth stage, an overall or integral quality estimation (QE) is calculated for each speech signal distorted by reverberation and noise, as a certain function of the distortion parameters SNR and RT60. According to a particular embodiment, the QE is calculated using mathematical expressions of the following form:

[the expressions are given as an image in the original document: they map SNR_dB to the estimation S_SNR, map RT60_ms to the estimation S_RT60, and combine S_SNR and S_RT60 into OQ]

where S_SNR is a speech segment SNR level estimation,
S_RT60 is a speech segment reverberation level estimation,
OQ is an integral speech segment quality estimation,
SNR_dB is a speech segment SNR value in decibels,
RT60_ms is a speech segment reverberation time value in milliseconds.
When preparing a training data set, an existing data set that satisfies the balance condition in terms of the SNR and T60 ranges and in terms of noise classes can also be used.
Then, training features are extracted from the prepared speech signals. In particular, a voice activity detector (VAD) is applied to the prepared speech signals so as to extract and save only the speech segments. Then, training features are extracted from the resulting signals, such as mel-frequency cepstral coefficients (MFCC), filter bank features (FBANK) or other known features characterizing an audio signal. The resulting training features are further processed, such as by performing cepstral mean normalization (CMN).
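As a non-limiting illustration, MFCC extraction with cepstral mean normalization can be sketched as follows (assuming the librosa library; the sampling rate and the number of coefficients are illustrative choices not specified in the description):

```python
import librosa
import numpy as np

def extract_features(wav_path: str, sr: int = 16000, n_mfcc: int = 23) -> np.ndarray:
    """MFCC training features followed by cepstral mean normalization (CMN)."""
    audio, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    return mfcc - mfcc.mean(axis=1, keepdims=True)              # per-coefficient mean removal
```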
The extracted training features are then used to train a convolutional neural network in a multitask mode. The convolutional neural network is simultaneously trained to evaluate a signal-to-noise ratio (SNR), a reverberation time (RT60), a noise class and an overall quality estimation (OQ) on the input features generated at the previous stage. This is achieved by using four outputs in the neural network architecture and a combined loss function based on the sum of four cost functions with different weight coefficients.
In one of the embodiments, a regression predictor model trained using a cost function based on a mean squared error is used for automatic SNR, RT60 and OQ evaluation. Automatic noise class estimation can be based on the use of a classifier trained using binary cross-entropy (BCE).
A formula for calculating a combined loss function is given below as a non-limiting example:
L = 10 · MSE(OQ) + 0.001 · MSE(RT60_ms) + MSE(SNR_dB) + 10 · BCE(noise class),

where L is the combined loss function,
MSE(OQ) is a loss function based on a mean squared error for the integral quality estimation,
MSE(RT60_ms) is a loss function based on a mean squared error for the RT60 estimation,
MSE(SNR_dB) is a loss function based on a mean squared error for the SNR estimation,
BCE(noise class) is a binary cross-entropy loss function for noise classification.
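As a non-limiting illustration, this combined loss can be sketched in PyTorch as follows (the dictionary keys and the one-hot noise targets are illustrative assumptions; the weights follow the formula above):

```python
import torch.nn.functional as F

def combined_loss(pred: dict, target: dict):
    """Weighted sum of the four task losses with the weights given above."""
    return (10.0 * F.mse_loss(pred["oq"], target["oq"])
            + 0.001 * F.mse_loss(pred["rt60_ms"], target["rt60_ms"])
            + F.mse_loss(pred["snr_db"], target["snr_db"])
            # BCE over one-hot class targets, following the text's use of binary cross-entropy
            + 10.0 * F.binary_cross_entropy_with_logits(pred["noise_logits"],
                                                        target["noise_onehot"]))
```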
Since it is contemplated that the developed non-linear speech signal quality prediction model should evaluate the quality on short speech fragments (1 to 2 seconds), in one of the embodiments the model is also trained on short speech fragments. In real life, human speech and natural noises are not strictly stationary. This means that the global signal-to-noise ratio value, which was obtained at the data preparation stage and is singular for a whole file, should be corrected for each short segment of that file. As a non-limiting example, below is a formula for calculating a local signal-to-noise ratio:
SNR_local = 10 · log10( (α² · E_speech) / (β² · E_noise) ),

where E_speech is the energy of the reverberated speech signal before being noised, and E_noise is the energy of the reverberated noise. The coefficients α and β for each signal are determined by solving a linear equation system on four signal fragments:

X_aug(i) = α · X_speech(i) + β · X_noise(i), i = 1, ..., 4,

where X_aug(i) is the i-th fragment of the augmented signal, and X_speech(i) and X_noise(i) are its reverberated speech and noise parts.
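As a non-limiting illustration, under the reading above the coefficients α and β can be estimated by least squares over the four fragments (a sketch; the exact equation system in the original is given as an image, so this is one consistent interpretation):

```python
import numpy as np

def solve_alpha_beta(x_aug_frags, x_speech_frags, x_noise_frags):
    """Least-squares estimate of alpha and beta from four signal fragments,
    assuming X_aug(i) ~ alpha * X_speech(i) + beta * X_noise(i)."""
    A = np.stack([np.concatenate(x_speech_frags),
                  np.concatenate(x_noise_frags)], axis=1)  # (n_samples, 2) design matrix
    b = np.concatenate(x_aug_frags)
    (alpha, beta), *_ = np.linalg.lstsq(A, b, rcond=None)
    return alpha, beta
```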
A neural network architecture that can be used for evaluating speech signal quality characteristics is given below as a non-limiting example.
The residual network ResNet18 comprises 8 ResNet blocks, each formed by two convolutional layers with 64 filters of size 3×3 and a skip connection around the two layers. This connection is implemented by simply combining, element by element, the block input and the output of the last layer of the block if the dimensions match, or by using a convolution operation to match the dimensions. The top level is formed by a global average pooling layer, the 512-dimensional output of which can be referred to as a quality vector ("quality embedding"). This vector is then provided to three linear layers for predicting the signal-to-noise ratio (SNR), the reverberation time (RT60) and the quality estimate (OQ). The sigmoid activation function is used for the quality evaluation.
For classifying noise, an additional two-layer classifier is used with a softmax activation function (or one of its modifications) over the number of noise classes (79 in this embodiment).
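As a non-limiting illustration, the four-output architecture can be sketched in PyTorch as follows (the encoder below is a small stand-in, not the exact eight-block ResNet18 described above; all layer sizes other than the 512-dimensional embedding, the sigmoid OQ head and the 79-class noise head are illustrative):

```python
import torch
import torch.nn as nn

class QualityNet(nn.Module):
    """Quality-estimation network with a shared encoder and four outputs."""
    def __init__(self, embed_dim: int = 512, n_noise_classes: int = 79):
        super().__init__()
        self.encoder = nn.Sequential(                  # stand-in for the 8 ResNet blocks
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, embed_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),     # global average pooling -> quality embedding
        )
        self.snr_head = nn.Linear(embed_dim, 1)        # SNR regression
        self.rt60_head = nn.Linear(embed_dim, 1)       # RT60 regression
        self.oq_head = nn.Sequential(nn.Linear(embed_dim, 1), nn.Sigmoid())  # OQ in [0, 1]
        self.noise_head = nn.Sequential(               # two-layer noise classifier
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, n_noise_classes))

    def forward(self, x: torch.Tensor) -> dict:        # x: (batch, 1, freq, time) feature maps
        emb = self.encoder(x)
        return {"snr_db": self.snr_head(emb).squeeze(-1),
                "rt60_ms": self.rt60_head(emb).squeeze(-1),
                "oq": self.oq_head(emb).squeeze(-1),
                "noise_logits": self.noise_head(emb)}
```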
Accordingly, the suggested method for training a neural network allows obtaining both an overall speech signal quality estimation and its specific acoustic characteristics (a signal-to-noise ratio and a reverberation time), which can be used both in voice biometrics applications and for selecting the best channel in multimicrophone systems according to a predetermined criterion.
In another embodiment, a method for automatically selecting a channel in a multimicrophone system is provided, which is implemented using the trained neural network described above.
According to this method, input speech signals are received from a plurality of channels of a multimicrophone system. Then, as illustrated in Fig. 2, a voice activity detector 202 is applied to each individual speech signal 201 so as to extract features characterizing this input speech signal and corresponding to the training features that were used for training the neural network, such as mel-frequency cepstral coefficients (MFCC), filter bank features (FBANK) or other known features characterizing an audio signal.
Then, the obtained features 203 characterizing the input speech signal are provided to a neural network input 204 and are simultaneously evaluated to obtain, at the neural network output for each input speech signal, evaluated values of a signal-to-noise ratio 205, a reverberation time 206, an overall quality estimation 207 and a predicted noise class 208. Based on these evaluated values, the channel from which an input speech signal having evaluated values that satisfy the predetermined condition was received is selected from the plurality of channels of the multimicrophone system. A maximum value of the overall quality evaluation can be used as the predetermined condition; however, a person skilled in the art will readily appreciate that other parameters known from the prior art can be used as the predetermined condition. The present invention is not limited to the specific embodiments disclosed in the specification for illustration purposes and encompasses all possible modifications and alternatives at each step of implementation.
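As a non-limiting illustration, the channel selection step with the maximum-OQ condition can be sketched as follows (featurize and the model output dictionary refer to the hypothetical helpers in the sketches above):

```python
import torch

def select_best_channel(channel_signals, model, featurize) -> int:
    """Return the index of the channel whose signal gets the highest overall quality (OQ)."""
    scores = []
    with torch.no_grad():
        for signal in channel_signals:
            feats = featurize(signal)         # VAD + MFCC/FBANK, matching the training features
            out = model(feats)
            scores.append(float(out["oq"]))   # overall quality estimation for this channel
    return int(max(range(len(scores)), key=scores.__getitem__))
```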

Claims

1. A method for training a neural network to evaluate quality characteristics of an input speech signal, the method comprising the steps of: preparing a set of training speech signals, wherein for each of the training speech signals the following is known: a signal-to-noise ratio, a reverberation time and a noise class selected from a predetermined plurality of noise classes; applying a voice activity detector to each of the training speech signals so as to extract training features from said training speech signal; and training the neural network to simultaneously evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on the extracted training features.
2. The method according to claim 1, wherein preparing the set of training speech signals comprises: providing a plurality of clear speech signals having minimal values for the signal-to-noise ratio and the reverberation time, further providing a plurality of stationary noise signals of various types, and further providing a plurality of impulse responses corresponding to various rooms for which the reverberation time is known; performing a convolution operation on each of the clear speech signals with an impulse response from said plurality of impulse responses to generate a plurality of reverberated signals; combining the generated reverberated signals with the stationary noise signals of various types to generate a plurality of noise-distorted signals having a varying signal-to-noise ratio; generating a final set of training speech signals from the noise-distorted signals, the final set of training speech signals being balanced in terms of the signal-to-noise ratio, the reverberation time and the noise class; and calculating an integral quality estimation for each of the noise-distorted signals as a function of the signal-to-noise ratio and the reverberation time.
3. The method according to claim 2, wherein each of the noise signals is reverberated using an impulse response of the same room which was selected for reverberating a corresponding clear speech signal.
4. The method according to claim 2, wherein each of the noise signals is reverberated using a room impulse response that differs from the impulse response that was selected for reverberating the corresponding clear speech signal.
5. The method according to claim 1, comprising the step of using a regression predictor model trained with a cost function based on a mean squared error so as to evaluate the signal-to-noise ratio, the reverberation time and the overall quality estimation.
6. The method according to claim 5, wherein the noise class is evaluated using a classifier trained with the help of binary cross-entropy.
7. A method for automatically selecting a channel in a multimicrophone system using a neural network trained based on training features, the method comprising the steps of: receiving input speech signals from a plurality of channels of the multimicrophone system; applying a voice activity detector to each of the input speech signals so as to extract therefrom features characterizing the input speech signal and corresponding to the training features; providing the extracted features characterizing the input speech signal to an input of the neural network, while simultaneously evaluating the extracted features; receiving, from an output of the neural network, the following for each of the input speech signals: evaluated values of a signal-to-noise ratio, a reverberation time, an overall quality estimation and a predicted noise class; and selecting a channel from the plurality of channels of the multimicrophone system, wherein the channel has yielded an input speech signal having evaluated values that satisfy a predetermined condition.
8. The method according to claim 7, wherein the predetermined condition is a maximum value of the overall quality estimation.
PCT/RU2020/000600 2020-11-12 2020-11-12 Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems WO2022103290A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/RU2020/000600 WO2022103290A1 (en) 2020-11-12 2020-11-12 Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU2020/000600 WO2022103290A1 (en) 2020-11-12 2020-11-12 Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems

Publications (1)

Publication Number Publication Date
WO2022103290A1 2022-05-19

Family

ID=76305976

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2020/000600 WO2022103290A1 (en) 2020-11-12 2020-11-12 Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems

Country Status (1)

Country Link
WO (1) WO2022103290A1 (en)


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2456296A (en) * 2007-12-07 2009-07-15 Hamid Sepehr Audio enhancement and hearing protection by producing a noise reduced signal
EP2238590A1 (en) 2008-01-31 2010-10-13 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for computing filter coefficients for echo suppression
US9396738B2 (en) 2013-05-31 2016-07-19 Sonus Networks, Inc. Methods and apparatus for signal quality analysis
WO2015010983A1 (en) 2013-07-22 2015-01-29 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Method for processing an audio signal in accordance with a room impulse response, signal processing unit, audio encoder, audio decoder, and binaural renderer
CN104581758A (en) 2013-10-25 2015-04-29 中国移动通信集团广东有限公司 Voice quality estimation method and device as well as electronic equipment
US9558757B1 (en) 2015-02-20 2017-01-31 Amazon Technologies, Inc. Selective de-reverberation using blind estimation of reverberation level
US9922664B2 (en) 2016-03-28 2018-03-20 Nuance Communications, Inc. Characterizing, selecting and adapting audio and acoustic training data for automatic speech recognition systems
US9972339B1 (en) * 2016-08-04 2018-05-15 Amazon Technologies, Inc. Neural network based beam selection
EP3494575A1 (en) 2016-08-09 2019-06-12 Huawei Technologies Co., Ltd. Devices and methods for evaluating speech quality
US20200082843A1 (en) 2016-12-15 2020-03-12 Industry-University Cooperation Foundation Hanyang University Multichannel microphone-based reverberation time estimation method and device which use deep neural network
CN108346434A (en) 2017-01-24 2018-07-31 中国移动通信集团安徽有限公司 A kind of method and apparatus of speech quality evaluation
CN108322346A (en) 2018-02-09 2018-07-24 山西大学 A kind of voice quality assessment method based on machine learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANDERSON R AVILA ET AL: "Non-intrusive speech quality assessment using neural networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 16 March 2019 (2019-03-16), XP081154240 *
JÜRGEN TCHORZ ET AL: "SNR Estimation Based on Amplitude Modulation Analysis With Applications to Noise Suppression", vol. 11, no. 3, 1 May 2003 (2003-05-01), XP011079710, ISSN: 1063-6676, Retrieved from the Internet <URL:https://ieeexplore.ieee.org/document/1208288> DOI: 10.1109/TSA.2003.811542 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116564351A (en) * 2023-04-03 2023-08-08 湖北经济学院 Voice dialogue quality evaluation method and system and portable electronic equipment
CN116564351B (en) * 2023-04-03 2024-01-23 湖北经济学院 Voice dialogue quality evaluation method and system and portable electronic equipment

Similar Documents

Publication Publication Date Title
Avila et al. Non-intrusive speech quality assessment using neural networks
Eaton et al. Estimation of room acoustic parameters: The ACE challenge
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
Falk et al. Single-ended speech quality measurement using machine learning methods
Cauchi et al. Non-intrusive speech quality prediction using modulation energies and lstm-network
Dong et al. An attention enhanced multi-task model for objective speech assessment in real-world environments
Su et al. HiFi-GAN-2: Studio-quality speech enhancement via generative adversarial networks conditioned on acoustic features
Fu et al. MetricGAN-U: Unsupervised speech enhancement/dereverberation based only on noisy/reverberated speech
Dubey et al. Non-intrusive speech quality assessment using several combinations of auditory features
Valentini-Botinhao et al. Speech enhancement of noisy and reverberant speech for text-to-speech
CN109313893A (en) Characterization, selection and adjustment are used for the audio and acoustics training data of automatic speech recognition system
Williams et al. Comparison of speech representations for automatic quality estimation in multi-speaker text-to-speech synthesis
Poorjam et al. Automatic quality control and enhancement for voice-based remote Parkinson’s disease detection
Lavrentyeva et al. Blind Speech Signal Quality Estimation for Speaker Verification Systems.
US20230245674A1 (en) Method for learning an audio quality metric combining labeled and unlabeled data
Karbasi et al. Non-intrusive speech intelligibility prediction using automatic speech recognition derived measures
WO2022103290A1 (en) Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems
Sharma et al. Non-intrusive estimation of speech signal parameters using a frame-based machine learning approach
Huber et al. Single-ended speech quality prediction based on automatic speech recognition
Pirhosseinloo et al. A new feature set for masking-based monaural speech separation
Chen et al. InQSS: a speech intelligibility assessment model using a multi-task learning network
Karbasi et al. Blind Non-Intrusive Speech Intelligibility Prediction Using Twin-HMMs.
EA043719B1 (en) METHOD FOR AUTOMATIC ASSESSMENT OF THE QUALITY OF SPEECH SIGNALS USING NEURAL NETWORKS FOR CHANNEL SELECTION IN MULTI-MICROPHONE SYSTEMS
Islam GFCC-based robust gender detection
Ahmed et al. Channel and channel subband selection for speaker diarization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20897637

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20897637

Country of ref document: EP

Kind code of ref document: A1
