WO2022103290A1 - Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems - Google Patents

Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems

Info

Publication number
WO2022103290A1
WO2022103290A1 (PCT/RU2020/000600)
Authority
WO
WIPO (PCT)
Prior art keywords
signal
noise
training
signals
speech
Prior art date
Application number
PCT/RU2020/000600
Other languages
French (fr)
Inventor
Marina Viktorovna VOLKOVA
Sergey Aleksandrovich NOVOSYOLOV
Galina Mihaylovna LAVRENTYEVA
Tseren Vladimirovich ANDZHUKAEV
Aleksey Evgenyevich GUSEV
Original Assignee
"Stc"-Innovations Limited"
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by STC-Innovations Limited
Priority to PCT/RU2020/000600
Publication of WO2022103290A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/60: Speech or voice analysis techniques for measuring the quality of voice signals
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166: Microphone arrays; Beamforming


Abstract

The present invention relates to the field of automatic speech signal quality evaluation and, in particular, to a method for training neural networks to evaluate a signal-to-noise ratio, a reverberation time and the type of noise present in a recording, and to output an overall quality estimation as a function of said evaluations for a whole speech signal or a fragment thereof. The method for training a neural network to evaluate quality characteristics of an input speech signal comprises the steps of: preparing a set of training speech signals, wherein for each of the training speech signals the following is known: a signal-to-noise ratio, a reverberation time and a noise class selected from a predetermined plurality of noise classes; applying a voice activity detector to each of the training speech signals so as to extract training features from said training speech signal; and training the neural network to simultaneously evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on the extracted training features.

Description

METHOD FOR AUTOMATIC QUALITY EVALUATION OF SPEECH SIGNALS USING NEURAL NETWORKS FOR SELECTING A CHANNEL IN MULTIMICROPHONE SYSTEMS
Field of the invention
The present invention relates to the field of automatic evaluation of speech signal quality and, in particular, to a method for training neural networks to evaluate a signal-to-noise ratio, a reverberation time and the type of noise present in a recording, and to output an overall quality estimation as a function of said evaluations for a whole speech signal or a fragment thereof.
Speech signal quality evaluation can be used in various speech processing applications, such as automatically selecting the best microphone in a multimicrophone sound recording system. In voice biometrics, it can be used for determining the speech segments of the highest quality in recordings taken in various acoustic conditions, in order to develop a voice pattern of a speaker based on the selected fragments.
Background of the invention
In the field of signal processing, quantitative characteristics, such as a signal-to-noise ratio and a reverberation time, are usually used for evaluating a certain distortion.
A signal-to-noise ratio (SNR) is defined as the ratio of signal power to noise power and can be represented through the following mathematical expression:

SNR = P_signal / P_noise = (A_signal / A_noise)^2,

where SNR is the signal-to-noise ratio,
P_signal is the signal mean power,
P_noise is the noise mean power,
A_signal is the signal root-mean-square amplitude,
A_noise is the noise root-mean-square amplitude.

The reverberation time (RT) is considered the main parameter defining the acoustic environment of the area where the speech was recorded. In most cases, it is determined as the time the sound pressure level takes to decrease by 60 dB (a factor of 1,000,000 in terms of power, or 1,000 in terms of sound pressure). In the literature, the reverberation time defined in this way is usually referred to as RT60 or T60. There are established methods to determine RT60 based on a known room impulse response, but in real-life scenarios of working with sound recordings obtained from random sources, the impulse response is not available. Thus, the task of approximately evaluating the reverberation time from a given sound recording alone, without any additional data on the acoustic conditions, becomes relevant.
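As a non-limiting illustration, the SNR defined above can be computed as follows when the speech and noise components are available as separate arrays (a minimal Python sketch; the function name and the use of NumPy are illustrative assumptions, not part of the described method):

```python
import numpy as np

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-noise ratio in decibels, computed from separate speech and noise arrays."""
    p_signal = np.mean(speech.astype(np.float64) ** 2)  # mean power of the signal
    p_noise = np.mean(noise.astype(np.float64) ** 2)    # mean power of the noise
    return 10.0 * np.log10(p_signal / p_noise)
```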
The most common method for evaluating the quality of speech signals is the mean opinion score (MOS), a subjective statistical test performed by a group of expert listeners. Although the reliability of this evaluation is adequately high with a large number of listeners, it is the most resource-consuming approach. Among objective methods that do not rely on experts, evaluations based on a comparison between the original (reference) and encoded (distorted) signals are often used, for example, perceptual evaluation of speech quality (PESQ) and its improved version, perceptual objective listening quality assessment (POLQA). Although widely used in telecommunication systems, such evaluations are of limited use in real-time speech processing applications, since the reference signal is not available in such cases.
Automatic signal quality evaluation methods that do not require a reference sample for comparison are known from the prior art. These solutions use machine learning methods, typically neural networks, to predict a MOS estimation (CN104581758A, EP3494575A1) or PESQ and POLQA estimations (CN108346434A, CN108322346A).
However, overall estimations obtained through such methods are difficult to interpret in terms of quantitative characteristics of the acoustic environment, such as the signal-to-noise ratio and the reverberation time. Knowing these parameters is necessary for predicting the operational conditions of speech processing systems: for example, SNR ≥ 15 dB and RT60 < 600 ms are considered good conditions for generating a speaker's voice pattern. That is why, besides an overall signal quality estimation, it is important to have some information about the quantitative characteristics of the environment in which the recording was taken. US9396738B2 discloses evaluation of quantitative characteristics of signal quality using neural networks; in particular, a signal-to-noise ratio, spectral clarity, skew, kurtosis and pitch average are determined. However, the main goal of these measurements is to evaluate a distorted signal transmitted via a network in telecommunications applications. Thus, acoustic environment factors, such as the reverberation time, are not accounted for in this evaluation.
In most cases, the process of calculating the signal reverberation time itself is based on the use of an impulse response of a room where the recording was taken (EP2238590A1, WO2015010983). It is evident that such methods can be used only if the room impulse response is known, which is virtually impossible in real-life automatic speech processing applications.
When the room impulse response is not known (the term "blind estimation" is also used), reverberation time evaluation is usually reduced to traditional signal processing methods. For example, US 9558757B1 provides determination of the rate of sound decay by generating an autocorrelogram of the signal intensity as a function of time. However, only reverberation time values that exceed a certain threshold are determined. The disadvantages of this method include low sensitivity to signals having small reverberation time values, the inability to be used on short speech fragments, and poor robustness to noisy data.
A multichannel microphone-based reverberation time estimation method using deep neural networks is disclosed in US 20200082843A1. According to this method, signals obtained through a multichannel microphone are analyzed. However, using this method for speech signal processing from one microphone input is quite problematic.
A method for determining characteristics, selecting and adapting training acoustic signals for an automatic speech recognition system is known from US 9922664B2. This method comprises preparing training data that imitates target environment conditions, including noise and reverberation levels, which can later be used for training a deep neural network. The trained neural network can later be used for classifying speech data samples to simulate codecs corresponding to the speech data samples. However, this method does not allow, in particular, training a neural network to simultaneously predict or evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on the extracted training data. Therefore, there is a need for methods for training neural networks to simultaneously evaluate input speech signal characteristics without using additional data on the acoustic environment in which the speech signal was recorded.
Summary of invention
According to one embodiment of the present invention, a method for training a neural network to evaluate quality characteristics of an input speech signal is provided, the method comprising the steps of: preparing a set of training speech signals, wherein for each of the training speech signals the following is known: a signal-to-noise ratio, a reverberation time and a noise class selected from a predetermined plurality of noise classes; applying a voice activity detector to each of the training speech signals so as to extract training features from said training speech signal; and training the neural network to simultaneously evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on the extracted training features.
When this method is used, the quality characteristics of the signal can be evaluated only based on a given input signal, without the need to compare with a reference or undistorted signal, and for evaluating the reverberation time, knowing an impulse response of a room where the speech was recorded is not required.
According to one embodiment, preparing the set of training speech signals may comprise: providing a plurality of clear speech signals having minimum values for the signal-to-noise ratio and the reverberation time, a plurality of stationary noise signals of various types, and a plurality of impulse responses corresponding to various rooms for which the reverberation time is known; performing a convolution operation on each of the clear speech signals with an impulse response from said plurality of impulse responses to generate a plurality of reverberated signals; combining the generated reverberated signals with the stationary noise signals of various types to generate a plurality of noise-distorted signals having a varying signal-to-noise ratio; generating a final set of training speech signals from the noise-distorted signals, the final set of training speech signals being balanced in terms of the signal-to-noise ratio, the reverberation time and the noise class; and calculating an integral quality estimation for each of the noise-distorted signals as a function of the signal-to-noise ratio and the reverberation time. According to another embodiment, each of the noise signals is reverberated using an impulse response of the same room that was selected for reverberating the corresponding clear speech signal. Alternatively, each of the noise signals is reverberated using a room impulse response different from the impulse response that was selected for reverberating the corresponding clear speech signal.
According to another embodiment, the method comprises the step of using a regression predictor model trained with a cost function based on a mean squared error so as to evaluate the signal-to-noise ratio, the reverberation time and the overall quality estimation.
The noise class is evaluated using a classifier trained with the help of binary cross-entropy.
In a further embodiment, a method for automatically selecting a channel in a multimicrophone system using a neural network trained based on training features is provided, the method comprising the steps of: receiving input speech signals from a plurality of channels of the multimicrophone system; applying a voice activity detector to each of the input speech signals so as to extract therefrom features characterizing the input speech signal and corresponding to the training features; providing the extracted features characterizing the input speech signal to an input of the neural network, while simultaneously evaluating the extracted features; receiving, from an output of the neural network, the following for each of the input speech signals: evaluated values of a signal-to-noise ratio, a reverberation time, an overall quality estimation and a predicted noise class; and selecting a channel from the plurality of channels of the multimicrophone system, wherein the channel has yielded an input speech signal having evaluated values that satisfy a predetermined condition.
Furthermore, the predetermined condition is a maximum value of the overall quality estimation.
Brief description of the drawings
The suggested invention will be described in further detail below with a reference to accompanying drawings, in which:
Fig. 1 is a sequence of operations for preparing the training data set; Fig. 2 is a flow chart of generating speech signal quality estimations from an original audio recording with the help of a trained model.
Detailed description of the preferred embodiments
According to one embodiment, a method for training neural networks to evaluate quality characteristics of an input speech signal is provided. The method comprises preparing a set of training speech signals, wherein for each of the training speech signals the following is known: a signal-to-noise ratio, a reverberation time and a noise class selected from a predetermined plurality of noise classes. The method further comprises applying a voice activity detector to each of the training speech signals so as to extract training features from said training speech signal, and training the neural network to simultaneously evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on the extracted training features.
As illustrated in Fig. 1, at the first stage of preparing a training data set, the following are taken: a plurality of clear speech signals 101 having minimum values for the signal-to-noise ratio and the reverberation time, a plurality of stationary noise signals 102 of various classes, and a plurality of impulse responses 103 corresponding to various rooms for which the reverberation time (T60) is known. Existing speech and noise databases can be used as a source of clear speech signals and of stationary noise signals. The required impulse responses can be generated using special utility software.
In one embodiment, a database of 79 noise classes, such as typing noise, rain noise, the hum of a crowd of people, manufacturing machinery, etc., was used as the stationary noise signals. A specifically generated impulse response database imitating 40,000 rooms of various sizes with reverberation times of 0 to 2 seconds was used as the plurality of impulse responses; for each room, 4 impulse responses were generated imitating various positions of the acoustic source inside that room.
At the second stage of preparing a training data set, a convolution operation is performed on each clear speech signal with an impulse response of an arbitrarily selected room to generate a plurality of reverberated speech signals 104. Several impulse responses can correspond to each room depending on the position of the acoustic source in this room.
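As a non-limiting illustration, the convolution of a clear speech signal with a room impulse response can be sketched as follows (assuming SciPy is available; the peak normalization is an illustrative choice, not taken from the description):

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a clear speech signal with a room impulse response (RIR)."""
    wet = fftconvolve(speech, rir, mode="full")[: len(speech)]  # keep the original length
    # Rescale to the original peak level so later mixing stages see comparable amplitudes
    return wet * (np.max(np.abs(speech)) / (np.max(np.abs(wet)) + 1e-12))
```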
In a further embodiment, each noise signal can also be reverberated 105. Reverberation can be performed using an impulse response of the same room that was selected for reverberating the corresponding clear speech signal. If several impulse responses are available for a room, an impulse response that differs from the one used for reverberating the clear speech signal can be used. This makes it possible to imitate various spatial positions of speech and noise sources and to generate a more realistic database.
At the third stage of preparing a training data set, each reverberated signal from a plurality of reverberated signals generated at the previous stage is combined with stationary noises of various types to result in a plurality of noised signals 106 with various signal-to-noise ratio values.
In one embodiment, in order to provide an accurate signal-to-noise ratio, the power of a speech signal is calculated only on speech segments, with pauses not taken into account; for this, a voice activity detector (VAD) is applied to the speech signal.
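As a non-limiting illustration, mixing a reverberated speech signal with noise at a target SNR, with the speech power measured only on VAD-selected segments, can be sketched as follows (the boolean per-sample vad_mask is assumed to come from an external VAD; all names are illustrative):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray,
               vad_mask: np.ndarray, target_snr_db: float) -> np.ndarray:
    """Scale the noise and add it to the speech so that the SNR, with speech power
    measured on speech segments only (per the VAD mask), equals target_snr_db."""
    noise = np.resize(noise, len(speech))       # loop or trim the noise to the speech length
    p_speech = np.mean(speech[vad_mask] ** 2)   # speech power, pauses excluded
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    return speech + scale * noise
```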
At the fourth stage of preparing a training data set, a final balanced training and test plurality 107 is formed from the prepared speech signals distorted by noise and reverberation. The parameters SNR and RT60 and the noise class are known for each of these signals.
At the fifth stage, an overall or integral quality estimation (QE) is calculated for each speech signal distorted by reverberation and noise, as a certain function of the distortion parameters SNR and RT60. According to a particular embodiment, the QE is calculated using mathematical expressions of the following form:

[the expressions are given as an image in the original document: they map SNR_dB to the estimation S_SNR, map RT60_ms to the estimation S_RT60, and combine S_SNR and S_RT60 into OQ]

where S_SNR is a speech segment SNR level estimation,
S_RT60 is a speech segment reverberation level estimation,
OQ is an integral speech segment quality estimation,
SNR_dB is a speech segment SNR value in decibels,
RT60_ms is a speech segment reverberation time value in milliseconds.
When preparing a training data set, an existing data set that satisfies the balance condition in terms of the SNR and T60 ranges and in terms of noise classes can also be used.
Then, training features are extracted from the prepared speech signals. In particular, a voice activity detector (VAD) is applied to the prepared speech signals so as to extract and save only the speech segments. Then, training features are extracted from the resulting signals, such as mel-frequency cepstral coefficients (MFCC), filter bank features (FBANK) or other known features characterizing an audio signal. The resulting training features are further processed, such as by performing cepstral mean normalization (CMN).
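As a non-limiting illustration, MFCC extraction with cepstral mean normalization can be sketched as follows (assuming the librosa library; the sampling rate and the number of coefficients are illustrative choices not specified in the description):

```python
import librosa
import numpy as np

def extract_features(wav_path: str, sr: int = 16000, n_mfcc: int = 23) -> np.ndarray:
    """MFCC training features followed by cepstral mean normalization (CMN)."""
    audio, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    return mfcc - mfcc.mean(axis=1, keepdims=True)              # per-coefficient mean removal
```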
The extracted training features are then used to train a convolutional neural network in a multitask mode. The convolutional neural network is simultaneously trained to evaluate a signal-to-noise ratio (SNR), a reverberation time (RT60), a noise class and an overall quality estimation (OQ) on the input features generated at the previous stage. This is achieved by using four outputs in the neural network architecture and a combined loss function based on the sum of four cost functions with different weight coefficients.
In one of the embodiments, a regression predictor model trained using a cost function based on a mean squared error is used for automatic SNR, RT60 and OQ evaluation. Automatic noise class estimation can be based on the use of a classifier trained using binary cross-entropy (BCE).
A formula for calculating a combined loss function is given below as a non-limiting example:
L = 10 · MSE(OQ) + 0.001 · MSE(RT60_ms) + MSE(SNR_dB) + 10 · BCE(noise class),

where L is the combined loss function,
MSE(OQ) is a loss function based on a mean squared error for the integral quality estimation,
MSE(RT60_ms) is a loss function based on a mean squared error for the RT60 estimation,
MSE(SNR_dB) is a loss function based on a mean squared error for the SNR estimation,
BCE(noise class) is a binary cross-entropy loss function for noise classification.
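As a non-limiting illustration, this combined loss can be sketched in PyTorch as follows (the dictionary keys and the one-hot noise targets are illustrative assumptions; the weights follow the formula above):

```python
import torch.nn.functional as F

def combined_loss(pred: dict, target: dict):
    """Weighted sum of the four task losses with the weights given above."""
    return (10.0 * F.mse_loss(pred["oq"], target["oq"])
            + 0.001 * F.mse_loss(pred["rt60_ms"], target["rt60_ms"])
            + F.mse_loss(pred["snr_db"], target["snr_db"])
            # BCE over one-hot class targets, following the text's use of binary cross-entropy
            + 10.0 * F.binary_cross_entropy_with_logits(pred["noise_logits"],
                                                        target["noise_onehot"]))
```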
Since it is contemplated that the developed non-linear speech signal quality prediction model should evaluate the quality on short speech fragments (1 to 2 seconds), in one of the embodiments the model is also trained on short speech fragments. In real life, human speech and natural noises are not strictly stationary. This means that the global signal-to-noise ratio value, which was obtained at the data preparation stage and is singular for a whole file, should be corrected for each short segment of that file. As a non-limiting example, below is a formula for calculating a local signal-to-noise ratio:
SNR_local = 10 · log10( (α² · E_speech) / (β² · E_noise) ),

where E_speech is the energy of the reverberated speech signal before being noised, and E_noise is the energy of the reverberated noise. The coefficients α and β for each signal are determined by solving a linear equation system on four signal fragments:

X_aug(i) = α · X_speech(i) + β · X_noise(i), i = 1, ..., 4,

where X_aug(i) is the i-th fragment of the augmented signal, and X_speech(i) and X_noise(i) are its reverberated speech and noise parts.
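As a non-limiting illustration, under the reading above the coefficients α and β can be estimated by least squares over the four fragments (a sketch; the exact equation system in the original is given as an image, so this is one consistent interpretation):

```python
import numpy as np

def solve_alpha_beta(x_aug_frags, x_speech_frags, x_noise_frags):
    """Least-squares estimate of alpha and beta from four signal fragments,
    assuming X_aug(i) ~ alpha * X_speech(i) + beta * X_noise(i)."""
    A = np.stack([np.concatenate(x_speech_frags),
                  np.concatenate(x_noise_frags)], axis=1)  # (n_samples, 2) design matrix
    b = np.concatenate(x_aug_frags)
    (alpha, beta), *_ = np.linalg.lstsq(A, b, rcond=None)
    return alpha, beta
```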
A neural network architecture that can be used for evaluating speech signal quality characteristics is given below as a non-limiting example.
The residual network ResNet18 comprises 8 ResNet blocks, each formed by two convolutional layers with 64 filters of size 3×3 and a skip connection around the two layers. This connection is implemented by simply combining, element by element, the block input and the output of the last layer of the block if the dimensions match, or by using a convolution operation to match the dimensions. The top level is formed by a global average pooling layer, the 512-dimensional output of which can be referred to as a quality vector ("quality embedding"). This vector is then provided to three linear layers for predicting the signal-to-noise ratio (SNR), the reverberation time (RT60) and the quality estimate (OQ). The sigmoid activation function is used for the quality evaluation.
For classifying noise, an additional two-layer classifier is used with a softmax activation function (or one of its modifications) over the number of noise classes (79 in this embodiment).
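As a non-limiting illustration, the four-output architecture can be sketched in PyTorch as follows (the encoder below is a small stand-in, not the exact eight-block ResNet18 described above; all layer sizes other than the 512-dimensional embedding, the sigmoid OQ head and the 79-class noise head are illustrative):

```python
import torch
import torch.nn as nn

class QualityNet(nn.Module):
    """Quality-estimation network with a shared encoder and four outputs."""
    def __init__(self, embed_dim: int = 512, n_noise_classes: int = 79):
        super().__init__()
        self.encoder = nn.Sequential(                  # stand-in for the 8 ResNet blocks
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, embed_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),     # global average pooling -> quality embedding
        )
        self.snr_head = nn.Linear(embed_dim, 1)        # SNR regression
        self.rt60_head = nn.Linear(embed_dim, 1)       # RT60 regression
        self.oq_head = nn.Sequential(nn.Linear(embed_dim, 1), nn.Sigmoid())  # OQ in [0, 1]
        self.noise_head = nn.Sequential(               # two-layer noise classifier
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, n_noise_classes))

    def forward(self, x: torch.Tensor) -> dict:        # x: (batch, 1, freq, time) feature maps
        emb = self.encoder(x)
        return {"snr_db": self.snr_head(emb).squeeze(-1),
                "rt60_ms": self.rt60_head(emb).squeeze(-1),
                "oq": self.oq_head(emb).squeeze(-1),
                "noise_logits": self.noise_head(emb)}
```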
Accordingly, the suggested method for training a neural network allows obtaining both an overall speech signal quality estimation and its specific acoustic characteristics (a signal-to-noise ratio and a reverberation time), which can be used both in voice biometrics applications and for selecting the best channel in multimicrophone systems according to a predetermined criterion.
In another embodiment, a method for automatically selecting a channel in a multimicrophone system is provided, which is implemented using the trained neural network described above.
According to this method, input speech signals are received from a plurality of channels of a multimicrophone system. Then, as illustrated in Fig. 2, a voice activity detector 202 is applied to each individual speech signal 201 so as to extract features characterizing this input speech signal and corresponding to the training features that were used for training the neural network, such as mel-frequency cepstral coefficients (MFCC), filter bank features (FBANK) or other known features characterizing an audio signal.
Then, the obtained features 203 characterizing the input speech signal are provided to a neural network input 204 and are simultaneously evaluated to obtain, at the neural network output for each input speech signal, evaluated values of a signal-to-noise ratio 205, a reverberation time 206, an overall quality estimation 207 and a predicted noise class 208. Based on these evaluated values, the channel from which an input speech signal having evaluated values that satisfy the predetermined condition was received is selected from the plurality of channels of the multimicrophone system. A maximum value of the overall quality evaluation can be used as the predetermined condition; however, a person skilled in the art will readily appreciate that other parameters known from the prior art can be used as the predetermined condition. The present invention is not limited to the specific embodiments disclosed in the specification for illustration purposes and encompasses all possible modifications and alternatives at each step of implementation.
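As a non-limiting illustration, the channel selection step with the maximum-OQ condition can be sketched as follows (featurize and the model output dictionary refer to the hypothetical helpers in the sketches above):

```python
import torch

def select_best_channel(channel_signals, model, featurize) -> int:
    """Return the index of the channel whose signal gets the highest overall quality (OQ)."""
    scores = []
    with torch.no_grad():
        for signal in channel_signals:
            feats = featurize(signal)         # VAD + MFCC/FBANK, matching the training features
            out = model(feats)
            scores.append(float(out["oq"]))   # overall quality estimation for this channel
    return int(max(range(len(scores)), key=scores.__getitem__))
```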

Claims

1. A method for training a neural network to evaluate quality characteristics of an input speech signal, the method comprising the steps of: preparing a set of training speech signals, wherein for each of the training speech signals the following is known: a signal-to-noise ratio, a reverberation time and a noise class selected from a predetermined plurality of noise classes; applying a voice activity detector to each of the training speech signals so as to extract training features from said training speech signal; and training the neural network to simultaneously evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on the extracted training features.
2. The method according to claim 1, wherein preparing the set of training speech signals comprises: providing a plurality of clear speech signals having minimal values for the signal-to-noise ratio and the reverberation time, further providing a plurality of stationary noise signals of various types, and further providing a plurality of impulse responses corresponding to various rooms for which the reverberation time is known; performing a convolution operation on each of the clear speech signals with an impulse response from said plurality of impulse responses to generate a plurality of reverberated signals; combining the generated reverberated signals with the stationary noise signals of various types to generate a plurality of noise-distorted signals having a varying signal-to-noise ratio; generating a final set of training speech signals from the noise-distorted signals, the final set of training speech signals being balanced in terms of the signal-to-noise ratio, the reverberation time and the noise class; and calculating an integral quality estimation for each of the noise-distorted signals as a function of the signal-to-noise ratio and the reverberation time.
3. The method according to claim 2, wherein each of the noise signals is reverberated using an impulse response of the same room which was selected for reverberating a corresponding clear speech signal.
4. The method according to claim 2, wherein each of the noise signals is reverberated using a room impulse response that differs from the impulse response that was selected for reverberating the corresponding clear speech signal.
5. The method according to claim 1, comprising the step of using a regression predictor model trained with a cost function based on a mean squared error so as to evaluate the signal-to-noise ratio, the reverberation time and the overall quality estimation.
6. The method according to claim 5, wherein the noise class is evaluated using a classifier trained with the help of binary cross-entropy.
7. A method for automatically selecting a channel in a multimicrophone system using a neural network trained based on training features, the method comprising the steps of: receiving input speech signals from a plurality of channels of the multimicrophone system; applying a voice activity detector to each of the input speech signals so as to extract therefrom features characterizing the input speech signal and corresponding to the training features; providing the extracted features characterizing the input speech signal to an input of the neural network, while simultaneously evaluating the extracted features; receiving, from an output of the neural network, the following for each of the input speech signals: evaluated values of a signal-to-noise ratio, a reverberation time, an overall quality estimation and a predicted noise class; and selecting a channel from the plurality of channels of the multimicrophone system, wherein the channel has yielded an input speech signal having evaluated values that satisfy a predetermined condition.
8. The method according to claim 7, wherein the predetermined condition is a maximum value of the overall quality estimation.
PCT/RU2020/000600 2020-11-12 2020-11-12 Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems WO2022103290A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/RU2020/000600 WO2022103290A1 (en) 2020-11-12 2020-11-12 Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU2020/000600 WO2022103290A1 (en) 2020-11-12 2020-11-12 Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems

Publications (1)

Publication Number Publication Date
WO2022103290A1 2022-05-19

Family

ID=76305976

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2020/000600 WO2022103290A1 (en) 2020-11-12 2020-11-12 Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems

Country Status (1)

Country Link
WO (1) WO2022103290A1 (en)


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2456296A (en) * 2007-12-07 2009-07-15 Hamid Sepehr Audio enhancement and hearing protection by producing a noise reduced signal
EP2238590A1 (en) 2008-01-31 2010-10-13 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for computing filter coefficients for echo suppression
US9396738B2 (en) 2013-05-31 2016-07-19 Sonus Networks, Inc. Methods and apparatus for signal quality analysis
WO2015010983A1 (en) 2013-07-22 2015-01-29 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Method for processing an audio signal in accordance with a room impulse response, signal processing unit, audio encoder, audio decoder, and binaural renderer
CN104581758A (en) 2013-10-25 2015-04-29 中国移动通信集团广东有限公司 Voice quality estimation method and device as well as electronic equipment
US9558757B1 (en) 2015-02-20 2017-01-31 Amazon Technologies, Inc. Selective de-reverberation using blind estimation of reverberation level
US9922664B2 (en) 2016-03-28 2018-03-20 Nuance Communications, Inc. Characterizing, selecting and adapting audio and acoustic training data for automatic speech recognition systems
US9972339B1 (en) * 2016-08-04 2018-05-15 Amazon Technologies, Inc. Neural network based beam selection
EP3494575A1 (en) 2016-08-09 2019-06-12 Huawei Technologies Co., Ltd. Devices and methods for evaluating speech quality
US20200082843A1 (en) 2016-12-15 2020-03-12 Industry-University Cooperation Foundation Hanyang University Multichannel microphone-based reverberation time estimation method and device which use deep neural network
CN108346434A (en) 2017-01-24 2018-07-31 中国移动通信集团安徽有限公司 A kind of method and apparatus of speech quality evaluation
CN108322346A (en) 2018-02-09 2018-07-24 山西大学 A kind of voice quality assessment method based on machine learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANDERSON R AVILA ET AL: "Non-intrusive speech quality assessment using neural networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 16 March 2019 (2019-03-16), XP081154240 *
JÜRGEN TCHORZ ET AL: "SNR Estimation Based on Amplitude Modulation Analysis With Applications to Noise Suppression", vol. 11, no. 3, 1 May 2003 (2003-05-01), XP011079710, ISSN: 1063-6676, Retrieved from the Internet <URL:https://ieeexplore.ieee.org/document/1208288> DOI: 10.1109/TSA.2003.811542 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116564351A (en) * 2023-04-03 2023-08-08 湖北经济学院 Voice dialogue quality evaluation method and system and portable electronic equipment
CN116564351B (en) * 2023-04-03 2024-01-23 湖北经济学院 Voice dialogue quality evaluation method and system and portable electronic equipment

Similar Documents

Publication Publication Date Title
Avila et al. Non-intrusive speech quality assessment using neural networks
Eaton et al. Estimation of room acoustic parameters: The ACE challenge
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
Falk et al. Single-ended speech quality measurement using machine learning methods
Cauchi et al. Non-intrusive speech quality prediction using modulation energies and lstm-network
Dong et al. An attention enhanced multi-task model for objective speech assessment in real-world environments
Su et al. HiFi-GAN-2: Studio-quality speech enhancement via generative adversarial networks conditioned on acoustic features
Fu et al. MetricGAN-U: Unsupervised speech enhancement/dereverberation based only on noisy/reverberated speech
Dubey et al. Non-intrusive speech quality assessment using several combinations of auditory features
Valentini-Botinhao et al. Speech enhancement of noisy and reverberant speech for text-to-speech
CN109313893A (en) Characterization, selection and adjustment are used for the audio and acoustics training data of automatic speech recognition system
Williams et al. Comparison of speech representations for automatic quality estimation in multi-speaker text-to-speech synthesis
Poorjam et al. Automatic quality control and enhancement for voice-based remote Parkinson’s disease detection
Lavrentyeva et al. Blind Speech Signal Quality Estimation for Speaker Verification Systems.
US20230245674A1 (en) Method for learning an audio quality metric combining labeled and unlabeled data
Karbasi et al. Non-intrusive speech intelligibility prediction using automatic speech recognition derived measures
WO2022103290A1 (en) Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems
Sharma et al. Non-intrusive estimation of speech signal parameters using a frame-based machine learning approach
Huber et al. Single-ended speech quality prediction based on automatic speech recognition
Pirhosseinloo et al. A new feature set for masking-based monaural speech separation
Chen et al. InQSS: a speech intelligibility assessment model using a multi-task learning network
Karbasi et al. Blind Non-Intrusive Speech Intelligibility Prediction Using Twin-HMMs.
EA043719B1 (en) METHOD FOR AUTOMATIC ASSESSMENT OF THE QUALITY OF SPEECH SIGNALS USING NEURAL NETWORKS FOR CHANNEL SELECTION IN MULTI-MICROPHONE SYSTEMS
Islam GFCC-based robust gender detection
Ahmed et al. Channel and channel subband selection for speaker diarization

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20897637

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20897637

Country of ref document: EP

Kind code of ref document: A1
