WO2022103290A1 - Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems
- Publication number: WO2022103290A1 (application PCT/RU2020/000600)
- Authority: WIPO (PCT)
- Prior art keywords: signal, noise, training, signals, speech
Classifications

- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Abstract
The present invention relates to the field of automatic speech signal quality evaluation and, in particular, to a method for training neural networks to evaluate a signal-to-noise ratio, a reverberation time and a type of noise present in a recording, and to output an overall quality estimation as a function of these evaluations for a whole speech signal or a fragment thereof. The method for training a neural network to evaluate quality characteristics of an input speech signal comprises the steps of: preparing a set of training speech signals, wherein for each of the training speech signals the following is known: a signal-to-noise ratio, a reverberation time and a noise class selected from a predetermined plurality of noise classes; applying a voice activity detector to each of the training speech signals so as to extract training features from said training speech signal; and training the neural network to simultaneously evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on the extracted training features.
Description
METHOD FOR AUTOMATIC QUALITY EVALUATION OF SPEECH SIGNALS USING NEURAL NETWORKS FOR SELECTING A CHANNEL IN
MULTIMICROPHONE SYSTEMS
Field of the invention
The present invention relates to the field of automatic evaluation of speech signal quality and, in particular, to a method for training neural networks to evaluate a signal-to-noise ratio, a reverberation time and a type of noise present in a recording, and to output an overall quality estimation as a function of these evaluations for a whole speech signal or a fragment thereof.
Speech signal quality evaluation can be used in various speech processing applications, for example to automatically select the best microphone in a multimicrophone sound recording system. In voice biometrics, it can be used to determine the highest-quality speech segments in recordings made under various acoustic conditions, so that a speaker's voice pattern can be developed from the selected fragments.
Background of the invention
In the field of signal processing, quantitative characteristics such as the signal-to-noise ratio and the reverberation time are usually used to evaluate distortion.
A signal-to-noise ratio (SNR) is defined as the ratio of the signal power to the noise power and can be represented through the following mathematical expression:

SNR = P_signal / P_noise = (A_signal / A_noise)^2,

where SNR is the signal-to-noise ratio,
P_signal is the mean signal power,
P_noise is the mean noise power,
A_signal is the root-mean-square signal amplitude,
A_noise is the root-mean-square noise amplitude.
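As a non-limiting illustration, a minimal Python sketch of this definition is given below; the function name and the assumption that the signal and noise components are available as separate numpy arrays are illustrative only:

```python
import numpy as np

def snr_db(signal: np.ndarray, noise: np.ndarray) -> float:
    """SNR in decibels from mean powers; since power is the square of the
    root-mean-square amplitude, this equals 20*log10(A_signal / A_noise)."""
    p_signal = np.mean(signal.astype(np.float64) ** 2)  # mean signal power
    p_noise = np.mean(noise.astype(np.float64) ** 2)    # mean noise power
    return 10.0 * np.log10(p_signal / p_noise)
```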
The reverberation time (RT) is considered the main parameter characterizing the acoustic environment of the area where the speech was recorded. In most cases, it is defined as the time it takes the sound pressure level to decrease by 60 dB (a factor of 1,000,000 in power, or 1,000 in sound pressure). In the literature, the reverberation time defined in this way is usually referred to as RT60 or T60. There are established methods to determine RT60 from a known room impulse response, but in real-life scenarios involving sound recordings obtained from arbitrary sources, the impulse response is not available. Thus, the task of approximately evaluating the reverberation time from a given sound recording alone, without any additional data on the acoustic conditions, becomes relevant.
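One such established (non-blind) method is Schroeder backward integration of a known room impulse response; the minimal sketch below fits the energy decay curve over its -5 to -25 dB region and extrapolates the decay rate to 60 dB (variable names and the fit region are illustrative):

```python
import numpy as np

def rt60_from_rir(h: np.ndarray, fs: int) -> float:
    """RT60 from a known room impulse response via Schroeder backward
    integration: fit the energy decay curve between -5 and -25 dB and
    extrapolate the decay rate to a 60 dB drop."""
    energy = h.astype(np.float64) ** 2
    edc = np.cumsum(energy[::-1])[::-1]            # Schroeder integral
    edc_db = 10.0 * np.log10(edc / edc[0])         # normalized decay curve, dB
    t = np.arange(len(h)) / fs
    fit = (edc_db <= -5.0) & (edc_db >= -25.0)     # region used for the fit
    slope, _ = np.polyfit(t[fit], edc_db[fit], 1)  # decay rate, dB per second
    return -60.0 / slope                           # time to decay by 60 dB
```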
The most common method for evaluating speech signal quality is mean opinion score (MOS) estimation, a subjective statistical test performed by a group of expert listeners. Although the reliability of this evaluation is adequately high when the number of listeners is large, it is also the most resource-consuming approach. Among objective methods that do not rely on experts, evaluations based on comparing the original (reference) signal with the encoded (distorted) one are often used, for example perceptual evaluation of speech quality (PESQ) and its improved version, perceptual objective listening quality assessment (POLQA). Although widely used in telecommunication systems, such evaluations are of limited use in real-time speech processing applications, since no reference signal is available in such cases.
Automatic signal quality evaluation methods that do not require a reference sample for comparison are known from the prior art. These solutions use machine learning methods, typically neural networks, to predict a MOS estimation (CN104581758A, EP3494575A1) or PESQ and POLQA estimations (CN108346434A, CN108322346A).
However, overall estimations obtained through such methods are difficult to interpret in terms of quantitative characteristics of the acoustic environment, such as the signal-to-noise ratio and the reverberation time. Knowing these parameters is necessary for predicting the operational conditions of speech processing systems: for example, SNR ≥ 15 dB and RT60 < 600 ms are considered good conditions for generating a speaker's voice pattern. That is why, besides an overall signal quality estimation, it is important to have information about the quantitative characteristics of the environment in which the recording was made.
US9396738B2 discloses the evaluation of quantitative signal quality characteristics using neural networks; in particular, a signal-to-noise ratio, spectral clarity, skew, kurtosis and average pitch are determined. However, the main goal of these measurements is to evaluate a distorted signal transmitted over a network in telecommunications applications. Thus, acoustic environment factors, such as the reverberation time, are not accounted for in this evaluation.
In most cases, calculating the reverberation time of a signal is itself based on using an impulse response of the room where the recording was made (EP2238590A1, WO2015010983A1). Evidently, such methods can be used only if the room impulse response is known, which is virtually impossible in real-life automatic speech processing applications.
When the room impulse response is not known (the term "blind estimation" is also used), reverberation time evaluation is usually reduced to traditional signal processing methods. For example, US9558757B1 determines the rate of sound decay by generating an autocorrelogram of signal intensity as a function of time. However, only reverberation time values exceeding a certain threshold are determined. The disadvantages of this method include low sensitivity to signals with small reverberation time values, the inability to operate on short speech fragments, and instability on noisy data.
A multichannel microphone-based reverberation time estimation method using deep neural networks is disclosed in US20200082843A1. According to this method, signals obtained through a multichannel microphone are analyzed. However, applying this method to speech signals from a single microphone input is quite problematic.
A method for characterizing, selecting and adapting training acoustic signals for an automatic speech recognition system is known from US9922664B2. This method comprises preparing training data that imitates target environment conditions, including noise and reverberation levels, which can later be used for training a deep neural network. The trained neural network can then be used to classify speech data samples, for example to simulate codecs corresponding to those samples. However, this method does not allow, in particular, training a neural network to simultaneously evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on extracted training features.
Therefore, there is a need for methods of training neural networks to simultaneously evaluate input speech signal characteristics without using additional data on the acoustic environment in which the speech signal was recorded.
Summary of invention
According to one embodiment of the present invention, a method for training a neural network to evaluate quality characteristics of an input speech signal is provided, the method comprising the steps of: preparing a set of training speech signals, wherein for each of the training speech signals the following is known: a signal-to-noise ratio, a reverberation time and a noise class selected from a predetermined plurality of noise classes; applying a voice activity detector to each of the training speech signals so as to extract training features from said training speech signal; and training the neural network to simultaneously evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on the extracted training features.
When this method is used, the quality characteristics of the signal can be evaluated from the given input signal alone, without comparison to a reference or undistorted signal; moreover, evaluating the reverberation time does not require knowing the impulse response of the room where the speech was recorded.
According to one embodiment, preparing the set of training speech signals may comprise: providing a plurality of clean speech signals having minimal noise and reverberation levels, further providing a plurality of stationary noise signals of various types, and further providing a plurality of impulse responses corresponding to various rooms for which the reverberation time is known; performing a convolution operation on each of the clean speech signals with an impulse response from said plurality of impulse responses to generate a plurality of reverberated signals; combining the generated reverberated signals with the stationary noise signals of various types to generate a plurality of noise-distorted signals having varying signal-to-noise ratios; generating from the noise-distorted signals a final set of training speech signals balanced in terms of the signal-to-noise ratio, the reverberation time and the noise class; and calculating an integral quality estimation for each of the noise-distorted signals as a function of the signal-to-noise ratio and the reverberation time.
According to another embodiment, each of the noise signals is reverberated using an impulse response of the same room that was selected for reverberating the corresponding clean speech signal. Alternatively, each of the noise signals may be reverberated using a room impulse response different from the one selected for reverberating the corresponding clean speech signal.
According to another embodiment, the method comprises the step of using a regression predictor model, trained with a cost function based on a mean squared error, to evaluate the signal-to-noise ratio, the reverberation time and the overall quality estimation.
The noise class is evaluated using a classifier trained with the help of binary cross-entropy.
In a further embodiment, a method for automatically selecting a channel in a multimicrophone system using a neural network trained on training features is provided, the method comprising the steps of: receiving input speech signals from a plurality of channels of the multimicrophone system; applying a voice activity detector to each of the input speech signals so as to extract therefrom features characterizing the input speech signal and corresponding to the training features; providing the extracted features characterizing the input speech signal to an input of the neural network and simultaneously evaluating the extracted features; receiving from an output of the neural network, for each of the input speech signals, evaluated values of a signal-to-noise ratio, a reverberation time, an overall quality estimation and a predicted noise class; and selecting, from the plurality of channels of the multimicrophone system, the channel that has yielded an input speech signal having evaluated values that satisfy a predetermined condition.
Furthermore, the predetermined condition is a maximum value of the overall quality estimation.
Brief description of the drawings
The suggested invention will be described in further detail below with reference to the accompanying drawings, in which:
Fig. 1 shows a sequence of operations for preparing the training data set;
Fig. 2 is a flow-chart of generating speech signal quality estimations from an original audio recording with the help of a trained model.
Detailed description of the preferred embodiments
According to one embodiment, a method for training neural networks to evaluate quality characteristics of an input speech signal is provided. The method comprises preparing a set of training speech signals, wherein for each of the training speech signals the following is known: a signal-to-noise ratio, a reverberation time and a noise class selected from a predetermined plurality of noise classes. The method further comprises applying a voice activity detector to each of the training speech signals so as to extract training features from said training speech signal, and training the neural network to simultaneously evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on the extracted training features.
As illustrated in Fig. 1, at the first stage of preparing the training data set, the following are taken: a plurality of clean speech signals 101 having minimal noise and reverberation levels, a plurality of stationary noise signals 102 of various classes, and a plurality of impulse responses 103 corresponding to various rooms for which the reverberation time (T60) is known. Existing speech and noise databases can be used as a source of clean speech signals and of stationary noise signals. The required impulse responses can be generated using dedicated utility software.
In one embodiment, a database of 79 noise classes, such as typing noise, rain noise, the hum of a crowd, manufacturing machinery, etc., was used as the stationary noise signals. A specially generated impulse response database imitating 40,000 rooms of various sizes with reverberation times of 0–2 seconds was used as the plurality of impulse responses, wherein for each room 4 impulse responses were generated, imitating various positions of the acoustic source inside that room.
At the second stage of preparing the training data set, a convolution operation is performed on each clean speech signal with an impulse response of an arbitrarily selected room to generate a plurality of reverberated speech signals 104.
Several impulse responses can correspond to each room depending on the position of the acoustic source in this room.
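A minimal sketch of this reverberation stage is given below as a non-limiting example; it assumes the clean speech signal and the impulse response are available as numpy arrays at the same sampling rate:

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a clean speech signal with a room impulse response,
    trimming the reverberant tail so the output matches the input length."""
    return fftconvolve(speech, rir, mode="full")[: len(speech)]
```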
In a further embodiment, each noise signal can also be reverberated 105. The reverberation can be performed using an impulse response of the same room that was selected for reverberating the corresponding clean speech signal. If several impulse responses are available for a room, an impulse response different from the one used for reverberating the clean speech signal can be used instead. This makes it possible to imitate various spatial positions of the speech and noise sources and to generate a more realistic database.
At the third stage of preparing the training data set, each reverberated signal generated at the previous stage is combined with stationary noises of various types, resulting in a plurality of noised signals 106 with various signal-to-noise ratio values.
In one embodiment, in order to set the signal-to-noise ratio accurately, the power of the speech signal is calculated only on speech segments, with pauses not taken into account; for this purpose, a voice activity detector (VAD) is applied to the speech signal.
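As a non-limiting sketch of this mixing stage, the noise can be scaled so that the mixture reaches a target SNR measured on speech samples only; the boolean per-sample VAD mask is an assumed interface, and any detector producing such a mask would serve:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray,
               vad_mask: np.ndarray, target_snr_db: float) -> np.ndarray:
    """Add noise to reverberated speech at a target SNR, with speech power
    measured only on samples the VAD marked as speech (pauses excluded)."""
    p_speech = np.mean(speech[vad_mask] ** 2)        # power on speech only
    p_noise = np.mean(noise[: len(speech)] ** 2)
    # choose a gain g so that p_speech / (g^2 * p_noise) = 10^(SNR/10)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    return speech + gain * noise[: len(speech)]
```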
At the fourth stage of preparing the training data set, a final balanced training and test set 107 is formed from the prepared speech signals distorted by noise and reverberation. The parameters SNR, RT60 and the noise class are known for each of these signals.
At the fifth stage, an overall, or integral, quality estimation (QE) is calculated for each speech signal distorted by reverberation and noise, as a certain function of the distortion parameters SNR and RT60. According to a particular embodiment, the QE is calculated using mathematical expressions in which:

SSNR is a speech segment SNR level estimation,
SRT60 is a speech segment reverberation level estimation,
OQ is an integral speech segment quality estimation,
SNRdB is a speech segment SNR value in decibels,
RT60ms is a speech segment reverberation time value in milliseconds.
When preparing a training data set, an existing data set that satisfies the balance condition in terms of the SNR and T60 ranges and in terms of noise classes can also be used.
Then, training features are extracted from the prepared speech signals. In particular, a voice activity detector (VAD) is applied to the prepared speech signals so that only speech segments are extracted and saved. Training features are then extracted from the resulting signals, such as mel-frequency cepstral coefficients (MFCC), filter bank features (FBANK) or other known features characterizing an audio signal. The resulting training features are further processed, for example by cepstral mean normalization (CMN).
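A minimal sketch of such feature extraction is given below as a non-limiting example; the use of the librosa library and the number of coefficients are illustrative assumptions:

```python
import numpy as np
import librosa

def extract_features(wav: np.ndarray, sr: int, n_mfcc: int = 40) -> np.ndarray:
    """MFCC extraction followed by cepstral mean normalization (CMN);
    n_mfcc=40 is an illustrative choice, not taken from this disclosure."""
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, T)
    return mfcc - mfcc.mean(axis=1, keepdims=True)            # CMN over time
```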
The extracted training features are then used to train a convolutional neural network in a multi-task mode. The convolutional neural network is simultaneously trained to evaluate a signal-to-noise ratio (SNR), a reverberation time (RT60), a noise class and an overall quality estimation (OQ) from the input features generated at the previous stage. This is achieved by using four outputs in the neural network architecture and a combined loss function based on the sum of four cost functions with different weight coefficients.
In one of the embodiments, a regression predictor model trained using a cost function based on a mean squared error is used for automatic SNR, RT60 and OQ evaluation. Automatic noise class estimation can be based on a classifier trained using binary cross-entropy (BCE).
A formula for calculating the combined loss function is given below as a non-limiting example:

L = 10 · MSE(OQ) + 0.001 · MSE(RT60ms) + MSE(SNRdB) + 10 · BCE(noise class),

where L is the combined loss function,
MSE(OQ) is a loss function based on a mean squared error for the integral quality estimation,
MSE(RT60ms) is a loss function based on a mean squared error for the RT60 estimation,
MSE(SNRdB) is a loss function based on a mean squared error for the SNR estimation,
BCE(noise class) is a binary cross-entropy loss function for noise classification.
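A minimal PyTorch sketch of this combined loss is given below as a non-limiting example; the dictionary layout of the predictions and targets is an assumption made for illustration, and one-hot noise-class targets are assumed for the BCE term:

```python
import torch
import torch.nn.functional as F

def combined_loss(pred: dict, target: dict) -> torch.Tensor:
    """L = 10*MSE(OQ) + 0.001*MSE(RT60ms) + MSE(SNRdB) + 10*BCE(noise)."""
    loss = 10.0 * F.mse_loss(pred["oq"], target["oq"])
    loss = loss + 0.001 * F.mse_loss(pred["rt60_ms"], target["rt60_ms"])
    loss = loss + F.mse_loss(pred["snr_db"], target["snr_db"])
    # binary cross-entropy over noise-class logits vs. one-hot targets
    loss = loss + 10.0 * F.binary_cross_entropy_with_logits(
        pred["noise"], target["noise"])
    return loss
```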
Since it is contemplated that the developed non-linear speech signal quality prediction model should evaluate quality on short speech fragments (1 to 2 seconds), in one of the embodiments the model is also trained on short speech fragments. In real life, human speech and natural noises are not strictly stationary. This means that the global signal-to-noise ratio value, which was obtained at the data preparation stage and which is singular for a whole file, should be corrected for each short segment of that file. As a non-limiting example, the local signal-to-noise ratio can be calculated as

SNR_local = 10 · log10((α² · E_x) / (β² · E_n)),

where E_x is the energy of the reverberated speech signal before being noised, and E_n is the energy of the reverberated noise. The coefficients α and β for each signal are determined by solving a linear equation system on four signal fragments:

X_aug(i) = α · x_rev(i) + β · n_rev(i), i = 1, …, 4,

where X_aug(i) is the i-th fragment of the augmented signal, and x_rev(i) and n_rev(i) are its reverberated speech and noise parts.
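As a non-limiting sketch, such an overdetermined system (four fragments, two unknowns) can be solved in the least-squares sense over the concatenated fragment samples; this exact formulation of the system is an assumption made for illustration:

```python
import numpy as np

def solve_alpha_beta(x_aug: np.ndarray, x_rev: np.ndarray,
                     n_rev: np.ndarray) -> tuple:
    """Least-squares fit of x_aug ~ alpha * x_rev + beta * n_rev over the
    concatenated samples of the four fragments (two unknowns, many
    equations); this formulation of the system is an assumption."""
    A = np.column_stack([x_rev, n_rev])           # (samples, 2) design matrix
    coeffs, *_ = np.linalg.lstsq(A, x_aug, rcond=None)
    alpha, beta = coeffs
    return float(alpha), float(beta)
```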
A neural network architecture that can be used for evaluating speech signal quality characteristics is given below as a non-limiting example.
The residual network ResNet18 comprises 8 ResNet blocks, each formed by two convolutional layers with 64 filters of size 3×3 and a skip connection bridging the two layers. This connection is implemented either by element-wise addition of the block input and the output of the block's last layer, when the dimensions match, or by a convolution operation that matches the dimensions.
The top level is formed by a global average pooling layer, whose 512-dimensional output can be referred to as a quality vector ("quality embedding"). This vector is then provided to three linear layers for predicting the signal-to-noise ratio (SNR), the reverberation time (RT60) and the quality estimate (OQ). A sigmoid activation function is used for the quality estimate.
For classifying noise, an additional two-layer classifier is used, with a softmax activation function (or a modification thereof) over the number of noise classes (79 in this embodiment).
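A simplified, non-limiting PyTorch sketch of such an architecture is given below; the channel widths, the 1×1 projection used to reach the 512-dimensional embedding, and the input layout are illustrative assumptions rather than the exact disclosed configuration:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual block as described: two 3x3 conv layers with 64
    filters and an identity skip connection (dimensions matching)."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)   # element-wise skip connection

class QualityNet(nn.Module):
    """Sketch of the four-headed estimator: residual blocks, global average
    pooling into a quality embedding, and heads for SNR, RT60, OQ (sigmoid)
    and the noise class (logits over 79 classes; softmax at inference)."""
    def __init__(self, emb_dim: int = 512, n_noise_classes: int = 79):
        super().__init__()
        self.stem = nn.Conv2d(1, 64, 3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(64) for _ in range(8)])
        self.proj = nn.Conv2d(64, emb_dim, 1)    # lift to embedding width
        self.pool = nn.AdaptiveAvgPool2d(1)      # global average pooling
        self.snr_head = nn.Linear(emb_dim, 1)
        self.rt60_head = nn.Linear(emb_dim, 1)
        self.oq_head = nn.Sequential(nn.Linear(emb_dim, 1), nn.Sigmoid())
        self.noise_head = nn.Sequential(         # two-layer classifier
            nn.Linear(emb_dim, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, n_noise_classes))

    def forward(self, x):                        # x: (batch, 1, freq, time)
        emb = self.pool(self.proj(self.blocks(self.stem(x)))).flatten(1)
        return {"snr_db": self.snr_head(emb), "rt60_ms": self.rt60_head(emb),
                "oq": self.oq_head(emb), "noise": self.noise_head(emb)}
```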
Accordingly, the suggested method for training a neural network makes it possible to obtain both an overall speech signal quality estimation and specific acoustic characteristics of the signal (a signal-to-noise ratio and a reverberation time), which can be used both in voice biometrics applications and for selecting the best channel in a multimicrophone system according to a predetermined criterion.
In another embodiment, a method for automatically selecting a channel in a multimicrophone system is provided, which is implemented using the trained neural network described above.
According to this method, input speech signals are received from a plurality of channels of a multimicrophone system. Then, as illustrated in Fig. 2, a voice activity detector 202 is applied to each individual speech signal 201 so as to extract features characterizing this input speech signal and corresponding to the training features used for training the neural network, such as mel-frequency cepstral coefficients (MFCC), filter bank features (FBANK) or other known features characterizing an audio signal.
The obtained features 203 characterizing the input speech signal are then provided to the neural network input 204 and evaluated simultaneously to obtain, at the neural network output for each input speech signal, evaluated values of a signal-to-noise ratio 205, a reverberation time 206, an overall quality estimation 207 and a predicted noise class 208. Based on these evaluated values, the channel from which an input speech signal with evaluated values satisfying the predetermined condition was received is selected from the plurality of channels of the multimicrophone system. A maximum value of the overall quality estimation can be used as the predetermined condition; however, a person skilled in the art will readily appreciate that other parameters known from the prior art can be used as the predetermined condition.
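As a non-limiting sketch, channel selection under the maximum-OQ condition can be expressed as follows, assuming the model and feature layout from the sketches above:

```python
import torch

def select_best_channel(model, channel_features) -> int:
    """Evaluate each channel's features with the trained estimator and
    return the index of the channel with the highest overall quality (OQ).
    `channel_features`: list of (1, 1, freq, time) tensors, one per
    microphone channel (an assumed layout matching the sketches above)."""
    model.eval()
    with torch.no_grad():
        scores = [model(f)["oq"].item() for f in channel_features]
    return max(range(len(scores)), key=lambda i: scores[i])
```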
The present invention is not limited to specific embodiments disclosed in the specification for illustration purposes and encompasses all possible modifications and alternatives at each step of implementation.
Claims
1. A method for training a neural network to evaluate quality characteristics of an input speech signal, the method comprising the steps of: preparing a set of training speech signals, wherein for each of the training speech signals the following is known: a signal-to-noise ratio, a reverberation time and a noise class selected from a predetermined plurality of noise classes; applying a voice activity detector to each of the training speech signals so as to extract training features from said training speech signal; and training the neural network to simultaneously evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on the extracted training features.
2. The method according to claim 1, wherein preparing the set of training speech signals comprises: providing a plurality of clean speech signals having minimal noise and reverberation levels, further providing a plurality of stationary noise signals of various types, and further providing a plurality of impulse responses corresponding to various rooms for which the reverberation time is known; performing a convolution operation on each of the clean speech signals with an impulse response from said plurality of impulse responses to generate a plurality of reverberated signals; combining the generated reverberated signals with the stationary noise signals of various types to generate a plurality of noise-distorted signals having varying signal-to-noise ratios; generating from the noise-distorted signals a final set of training speech signals balanced in terms of the signal-to-noise ratio, the reverberation time and the noise class; and calculating an integral quality estimation for each of the noise-distorted signals as a function of the signal-to-noise ratio and the reverberation time.
3. The method according to claim 2, wherein each of the noise signals is reverberated using an impulse response of the same room that was selected for reverberating the corresponding clean speech signal.
4. The method according to claim 2, wherein each of the noise signals is reverberated using a room impulse response that differs from the impulse response selected for reverberating the corresponding clean speech signal.
5. The method according to claim 1, comprising the step of using a regression predictor model trained with a cost function based on a mean squared error so as to evaluate the signal-to-noise ratio, the reverberation time and the overall quality estimation.
6. The method according to claim 5, wherein the noise class is evaluated using a classifier trained with the help of binary cross-entropy.
7. A method for automatically selecting a channel in a multimicrophone system using a neural network trained on training features, the method comprising the steps of: receiving input speech signals from a plurality of channels of the multimicrophone system; applying a voice activity detector to each of the input speech signals so as to extract therefrom features characterizing the input speech signal and corresponding to the training features; providing the extracted features characterizing the input speech signal to an input of the neural network and simultaneously evaluating the extracted features; receiving from an output of the neural network, for each of the input speech signals, evaluated values of a signal-to-noise ratio, a reverberation time, an overall quality estimation and a predicted noise class; and selecting, from the plurality of channels of the multimicrophone system, the channel that has yielded an input speech signal having evaluated values that satisfy a predetermined condition.
8. The method according to claim 7, wherein the predetermined condition is a maximum value of the overall quality estimation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/RU2020/000600 WO2022103290A1 (en) | 2020-11-12 | 2020-11-12 | Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022103290A1 true WO2022103290A1 (en) | 2022-05-19 |
Family
ID=76305976
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2022103290A1 (en) |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2456296A (en) * | 2007-12-07 | 2009-07-15 | Hamid Sepehr | Audio enhancement and hearing protection by producing a noise reduced signal |
EP2238590A1 (en) | 2008-01-31 | 2010-10-13 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Apparatus and method for computing filter coefficients for echo suppression |
US9396738B2 (en) | 2013-05-31 | 2016-07-19 | Sonus Networks, Inc. | Methods and apparatus for signal quality analysis |
WO2015010983A1 (en) | 2013-07-22 | 2015-01-29 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Method for processing an audio signal in accordance with a room impulse response, signal processing unit, audio encoder, audio decoder, and binaural renderer |
CN104581758A (en) | 2013-10-25 | 2015-04-29 | 中国移动通信集团广东有限公司 | Voice quality estimation method and device as well as electronic equipment |
US9558757B1 (en) | 2015-02-20 | 2017-01-31 | Amazon Technologies, Inc. | Selective de-reverberation using blind estimation of reverberation level |
US9922664B2 (en) | 2016-03-28 | 2018-03-20 | Nuance Communications, Inc. | Characterizing, selecting and adapting audio and acoustic training data for automatic speech recognition systems |
US9972339B1 (en) * | 2016-08-04 | 2018-05-15 | Amazon Technologies, Inc. | Neural network based beam selection |
EP3494575A1 (en) | 2016-08-09 | 2019-06-12 | Huawei Technologies Co., Ltd. | Devices and methods for evaluating speech quality |
US20200082843A1 (en) | 2016-12-15 | 2020-03-12 | Industry-University Cooperation Foundation Hanyang University | Multichannel microphone-based reverberation time estimation method and device which use deep neural network |
CN108346434A (en) | 2017-01-24 | 2018-07-31 | 中国移动通信集团安徽有限公司 | A kind of method and apparatus of speech quality evaluation |
CN108322346A (en) | 2018-02-09 | 2018-07-24 | 山西大学 | A kind of voice quality assessment method based on machine learning |
Non-Patent Citations (2)
Title |
---|
ANDERSON R AVILA ET AL: "Non-intrusive speech quality assessment using neural networks", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 16 March 2019 (2019-03-16), XP081154240 * |
JÜRGEN TCHORZ ET AL: "SNR Estimation Based on Amplitude Modulation Analysis With Applications to Noise Suppression", vol. 11, no. 3, 1 May 2003 (2003-05-01), XP011079710, ISSN: 1063-6676, Retrieved from the Internet <URL:https://ieeexplore.ieee.org/document/1208288> DOI: 10.1109/TSA.2003.811542 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20897637; Country of ref document: EP; Kind code of ref document: A1 |