AU2021101586A4 - A System and a Method for Non-Intrusive Speech Quality and Intelligibility Evaluation Measures using FLANN Model
- Publication number
- AU2021101586A4
- Authority
- AU
- Australia
- Prior art keywords
- speech
- flann
- model
- values
- evaluation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/69—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/01—Assessment or evaluation of speech recognition systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
Abstract
The present disclosure relates to a system and a method for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model. It involves three main stages. The first stage consists of the formation of a speech database and the calculation of the corresponding PESQ and STOI values for each speech sample. A total of two databases consisting of speech samples at different noise levels have been used. In the second stage, the speech files are labelled into four categories: very poor, poor, average, and good quality. Then, considering it as a four-class classification problem, feature selection algorithms are used to select the best six audio features. The FLANN prediction method is dealt with in stage three to find out the relation between these six features and the corresponding PESQ and STOI values. For training the proposed model, 80% of the speech files (800) are used and the remaining 20% of the files are used for validation purposes.
Figure 3: Block diagram of the FLANN model during training (input speech signal, pre-processing with framing and windowing, voice activity detection, dynamic range compressor, feature extraction, and the FLANN model for prediction of the evaluation measures).
Figure 4: Implementation details of the proposed method.
Description
A System and a Method for Non-Intrusive Speech Quality and Intelligibility Evaluation Measures using FLANN Model
The present disclosure relates to a system and a method for non-intrusive speech quality and intelligibility evaluation measures using the Functional Link Artificial Neural Network (FLANN) model.
Speech enhancement algorithms are widely used in speech communication systems, hearing aid devices, and speech recognition applications. The performance evaluation of these algorithms plays a crucial role in the selection of the best algorithm for the desired application. The performance evaluation can be broadly categorized into two methods: subjective listening tests and objective evaluation measures. Subjective evaluation requires a group of listeners and is time-consuming. Objective evaluation is based on the mathematical interpretation of the subjective tests and has a high correlation with the subjective ones. But in all these evaluations, the original reference clean speech and the processed speech data are required. This is known as intrusive evaluation. In reality, intrusive evaluation is not feasible where the original reference speech is not available at the receiver end. To overcome this problem, several non-intrusive (NI) evaluation measures have been suggested. The NI evaluation measures can predict speech quality and intelligibility directly from the processed speech signal. Several attempts have been made in the literature at the effective design of NI measures.
A data-driven method for NI quality and intelligibility assessment, denoted NISA, has been reported for determining the quality of service in telecommunication. It uses a combination of audio features, namely fundamental frequency, zero-crossing rate, long-term average speech spectrum deviation, Hilbert envelope-based features, LPC-based features, and importance-weighted signal-to-noise ratio, for classification and regression tree analysis. The NISA algorithm has a Spearman correlation of 0.93 with PESQ (Perceptual Evaluation of Speech Quality) and 0.95 with STOI (Short-Time Objective Intelligibility), with an RMS error of 0.08 on the additive noise database. An NI-STOI evaluation measure is discussed in another solution. In that case, the clean speech is estimated from the noisy speech by passing it through an FFT, eigenvalue decomposition, and principal component selection. The performance of that method was compared with one NI measure (Speech to Reverberation Modulation energy Ratio) and one intrusive measure (STOI); it was observed that its performance is better than the NI measure, but not good enough compared with the intrusive measure. In yet another solution, an optimal linear combination of objective mean opinion scores has been used, calculated by using multiple time-scale features in a probabilistic method. A bidirectional long short-term memory-based NI quality evaluation measure called Quality-Net has also been proposed; the quality estimation is carried out through training on the speech magnitude spectrogram and the quality score. The simulation results show that Quality-Net provides a high correlation with the PESQ measure.
An NI speech intelligibility estimation method has been designed using a recurrent neural network with a long short-term memory structure. The MFCC features are taken as the input and the STOI measurement is taken as the output of this prediction model. This model provides better results than P.563 in different noisy and reverberant conditions. A classification approach at the utterance level has been used for quality estimation: the frame-wise magnitude spectrogram is the input to a multi-layered Convolutional Neural Network, which is used to find twenty quality rankings of noisy speech. For mean opinion score (MOS) estimation, three neural network models are employed: a Convolutional Neural Network (CNN) with constant-Q spectral components, a Deep Neural Network (DNN) using i-vectors, and a DNN with MFCC features. The experimental results demonstrate that the DNN with MFCC features performs better than the NI standard P.563 on a realistic audio data set with crowd-sourced labelling. Another NI method uses a recurrent neural network (RNN) on a large training data set to estimate the Perceptual Objective Listening Quality Analysis score (an intrusive measure) from speech MFCC and modulation-domain features.
It is found that several estimation/prediction methods along with different sets of features have been used in the NI evaluation task. The features mainly used in this task are perceptual and statistical features. It is very difficult to select appropriate features from all the audio features available, and a thorough feature selection analysis is required for this NI task. In order to overcome the above-mentioned drawbacks, a method for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model needs to be developed.
SUMMARY OF THE INVENTION
The present disclosure relates to a system and a method for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model. The present disclosure involves three main stages. The first stage consists of the formation of a speech database and the calculation of the corresponding PESQ and STOI values for each speech sample. A total of two databases consisting of speech samples at different noise levels have been used. In the second stage, the speech files are labelled into four categories: very poor quality, poor quality, average quality, and good quality. Then, considering it as a four-class classification problem, feature selection algorithms are used to select the best six audio features. The FLANN prediction method is dealt with in stage three to find out the relation between these six features and the corresponding PESQ and STOI values. For training the proposed model, 80% of the speech files (800) are used and the remaining 20% of the files are used for validation purposes.
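The labelling and 80/20 split described above can be sketched as follows. The PESQ class thresholds and the file names are illustrative assumptions, since the disclosure does not specify them:

```python
import random

def label_quality(pesq):
    """Map a PESQ score (roughly -0.5 to 4.5) onto the four quality
    classes used in the disclosure. These threshold values are
    assumptions for illustration, not taken from the patent."""
    if pesq < 1.5:
        return "very poor"
    elif pesq < 2.5:
        return "poor"
    elif pesq < 3.5:
        return "average"
    return "good"

def train_val_split(files, train_fraction=0.8, seed=0):
    """Shuffle the file list and split 80/20, as described for the
    1000-file database (800 training / 200 validation)."""
    rng = random.Random(seed)
    shuffled = files[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file names standing in for the two speech databases
files = [f"speech_{i:04d}.wav" for i in range(1000)]
train, val = train_val_split(files)
print(len(train), len(val))  # 800 200
```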
In an embodiment, a system for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model is proposed and developed. The system 100 comprises: a database 102 consisting of two different databases with different sampling rates; a pre-processing module 104 for windowing and framing speech signals received from the two databases; a voice activity detection (VAD) unit 106 for selecting frames containing the voiced part of the speech signal by employing a voice activity detection (VAD) technique; a dynamic range compressor (DRC) 108 consisting of an expander which boosts the low signal levels and a compressor which reduces the high levels of peaks, for normalizing the amplitude of the speech signals; a feature extraction module 110 for extracting six audio features, including spectral centroid, spectral skewness, spectral spread, spectral tonal power ratio, entropy, and standard deviation, from each voiced frame; and a FLANN model 112 for predicting NI measures upon calculating and storing perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) values for each speech signal of the two databases along with the feature values, wherein the FLANN model is designed for non-intrusive evaluation of the quality and intelligibility using the PESQ and STOI values as the intrusive measures.
In an embodiment, a method for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model is proposed and developed. The method 200 comprises the following steps: At step 202, pre-processing speech signals received from two different databases with different sampling rates for windowing and framing; At step 204, selecting frames containing the voiced part of the speech signal by employing a voice activity detection (VAD) technique; At step 206, normalizing the amplitude of the speech signals through a dynamic range compressor (DRC) consisting of an expander which boosts the low signal levels and a compressor which reduces the high levels of peaks; At step 208, extracting six audio features, including spectral centroid, spectral skewness, spectral spread, spectral tonal power ratio, entropy, and standard deviation, from each voiced frame; At step 210, calculating and storing perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) values for each speech signal of the two databases along with the feature values, and thereby designing non-intrusive (NI) evaluation of the quality and intelligibility using the PESQ and STOI values as the intrusive measures; At step 212, obtaining an empirical relationship between the selected features and the PESQ and STOI values from a trained FLANN model; and At step 214, executing performance evaluation of the FLANN model using evaluation metrics.
The objectives of this disclosure are listed below:
1. To classify the available speech files into four groups based on their quality and intelligibility.
2. To apply appropriate features as inputs to the FLANN model to obtain the desired PESQ and STOI values.
3. To train a FLANN model employing different types of functional expansions by taking the input-output data.
4. To evaluate the correlation between estimated NI values obtained from the FLANN model and the corresponding intrusive measure values, and to assess the superiority of the proposed model.
To further clarify the advantages and features of the present disclosure, a more particular description of the invention will be rendered by reference to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail through the accompanying drawings.
These and other features, aspects, and advantages of the present disclosure will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
Figure 1 illustrates a system for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model in accordance with an embodiment of the present disclosure.
Figure 2 illustrates a method for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model in accordance with an embodiment of the present disclosure.
Figure 3 illustrates the block diagram of the FLANN model during the training period in accordance with an embodiment of the present disclosure.
Figure 4 illustrates the implementation details of the proposed method in accordance with an embodiment of the present disclosure.
Further, skilled artisans will appreciate that elements in the drawings are illustrated for simplicity and may not necessarily have been drawn to scale. For example, the flow charts illustrate the method in terms of the most prominent steps involved to help improve understanding of aspects of the present disclosure. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the drawings by conventional symbols, and the drawings may show only those specific details that are pertinent to understanding the embodiments of the present disclosure, so as not to obscure the drawings with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
DETAILED DESCRIPTION
For the purpose of promoting an understanding of the principles of the invention, reference will now be made to the embodiment illustrated in the drawings, and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended, such alterations and further modifications in the illustrated system, and such further applications of the principles of the invention as illustrated therein, being contemplated as would normally occur to one skilled in the art to which the invention relates.
It will be understood by those skilled in the art that the foregoing general description and the following detailed description are exemplary and explanatory of the invention and are not intended to be restrictive thereof.
Reference throughout this specification to "an aspect", "another aspect" or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, appearances of the phrase "in an embodiment", "in another embodiment" and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
The terms "comprises", "comprising", or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such process or method. Similarly, one or more devices or sub-systems or elements or structures or components proceeded by "comprises...a" does not, without more constraints, preclude the existence of other devices or other sub-systems or other elements or other structures or other components or additional devices or additional sub-systems or additional elements or additional structures or additional components.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The system, methods, and examples provided herein are illustrative only and not intended to be limiting.
Embodiments of the present disclosure will be described below in detail with reference to the accompanying drawings.
Figure 1 illustrates a system for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model in accordance with an embodiment of the present disclosure. The system 100 comprises: a database 102 consisting of two different databases with different sampling rates; a pre-processing module 104 for windowing and framing speech signals received from the two databases; a voice activity detection (VAD) unit 106 for selecting frames containing the voiced part of the speech signal by employing a voice activity detection (VAD) technique; a dynamic range compressor (DRC) 108 consisting of an expander which boosts the low signal levels and a compressor which reduces the high levels of peaks, for normalizing the amplitude of the speech signals; a feature extraction module 110 for extracting six audio features, including spectral centroid, spectral skewness, spectral spread, spectral tonal power ratio, entropy, and standard deviation, from each voiced frame; and a FLANN model 112 for predicting NI measures upon calculating and storing perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) values for each speech signal of the two databases along with the feature values, wherein the FLANN model is designed for non-intrusive evaluation of the quality and intelligibility using the PESQ and STOI values as the intrusive measures.
Figure 2 illustrates a method for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model in accordance with an embodiment of the present disclosure. The method 200 comprises the following steps: at step 202, pre-processing speech signals received from two different databases with different sampling rates for windowing and framing; at step 204, selecting frames containing the voiced part of the speech signal by employing a voice activity detection (VAD) technique; at step 206, normalizing the amplitude of the speech signals through a dynamic range compressor (DRC) consisting of an expander which boosts the low signal levels and a compressor which reduces the high levels of peaks; at step 208, extracting six audio features, including spectral centroid, spectral skewness, spectral spread, spectral tonal power ratio, entropy, and standard deviation, from each voiced frame; at step 210, calculating and storing perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) values for each speech signal of the two databases along with the feature values, and thereby designing non-intrusive (NI) evaluation of the quality and intelligibility using the PESQ and STOI values as the intrusive measures; at step 212, obtaining an empirical relationship between the selected features and the PESQ and STOI values from a trained FLANN model; and at step 214, executing performance evaluation of the FLANN model using evaluation metrics.
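The feature-extraction step (208) can be illustrated with a minimal per-frame sketch. The disclosure does not give the exact feature definitions or normalizations, so the formulas below follow common audio-feature conventions and should be read as assumptions:

```python
import numpy as np

def frame_features(frame, fs=8000):
    """Compute the six per-frame audio features named in the disclosure:
    spectral centroid, spectral skewness, spectral spread, spectral tonal
    power ratio, entropy, and standard deviation (assumed definitions)."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    power = spec ** 2
    p = power / (power.sum() + 1e-12)           # normalized power spectrum

    centroid = np.sum(freqs * p)                 # spectral centroid (Hz)
    spread = np.sqrt(np.sum(((freqs - centroid) ** 2) * p))
    skewness = np.sum(((freqs - centroid) ** 3) * p) / (spread ** 3 + 1e-12)

    # tonal power ratio: share of power in the strongest 5% of bins (assumed)
    k = max(1, len(power) // 20)
    tonal_ratio = np.sort(power)[-k:].sum() / (power.sum() + 1e-12)

    entropy = -np.sum(p * np.log2(p + 1e-12))    # spectral entropy (bits)
    std = np.std(frame)                          # time-domain standard deviation
    return np.array([centroid, skewness, spread, tonal_ratio, entropy, std])

# Example: a 25 ms voiced-like frame (200 samples at 8 kHz), 200 Hz tone + noise
rng = np.random.default_rng(0)
t = np.arange(200) / 8000.0
frame = np.sin(2 * np.pi * 200 * t) + 0.1 * rng.standard_normal(200)
feats = frame_features(frame)
print(feats.shape)
```

The feature vector from each voiced frame would then be stacked to form the input to the feature selection and FLANN stages.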
Figure 3 illustrates the block diagram of the FLANN model during the training period in accordance with an embodiment of the present disclosure. The Functional Link Artificial Neural Network (FLANN) based model is a single-layer non-linear network. It is simple to implement and possesses low complexity, yet provides fast convergence as well as reasonably good performance compared to multi-layer perceptron-based structures. In FLANN, each input undergoes functional expansion by a set of basis functions, and the goal is to determine the optimum values of the weight parameters (W) so that the best possible approximation between inputs and outputs can be found. In the proposed implementation, three popular basis functions, Trigonometric (TFLANN), Chebyshev (CFLANN), and Polynomial (PFLANN), are chosen for the FLANN model. The corresponding block diagram during the training period is shown in Figure 3. The extracted input features of the speech signals are fi(k) and the predicted values are PESQ and STOI. The objective is to select the weights W(k) during the training phase so that the desired predicted values are obtained.
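A minimal sketch of a TFLANN of this kind is given below, assuming a trigonometric functional expansion and an LMS-style weight update; the patent does not specify the exact learning rule, expansion order, or learning rate, so all of those are illustrative assumptions:

```python
import numpy as np

def trig_expand(x, order=2):
    """Trigonometric functional expansion (TFLANN): each input feature x_i
    is expanded to [x_i, sin(n*pi*x_i), cos(n*pi*x_i)] for n = 1..order."""
    terms = [x]
    for n in range(1, order + 1):
        terms.append(np.sin(n * np.pi * x))
        terms.append(np.cos(n * np.pi * x))
    return np.concatenate(terms)

def train_flann(X, y, order=2, lr=0.05, epochs=200, seed=0):
    """Single-layer FLANN trained with the LMS rule; a sketch of the
    training loop implied by Figure 3, not the patented implementation."""
    rng = np.random.default_rng(seed)
    dim = X.shape[1] * (2 * order + 1) + 1        # expanded features + bias
    w = rng.normal(scale=0.1, size=dim)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            phi = np.append(trig_expand(xi, order), 1.0)
            err = yi - phi @ w                    # prediction error
            w += lr * err * phi                   # LMS weight update
    return w

def predict(w, X, order=2):
    return np.array([np.append(trig_expand(xi, order), 1.0) @ w for xi in X])

# Toy demo: learn a smooth nonlinear map from 2-D features to a score
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = 0.5 * np.sin(np.pi * X[:, 0]) + 0.3 * X[:, 1]
w = train_flann(X, y)
fit_rmse = np.sqrt(np.mean((predict(w, X) - y) ** 2))
print(round(fit_rmse, 3))
```

In the actual system, the expanded inputs would be the six selected audio features and the targets would be the stored PESQ and STOI values.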
Figure 4 illustrates (a) a working model for non-intrusive speech quality and intelligibility evaluation using the FLANN model and (b) authorisation processing in accordance with an embodiment of the present disclosure. The experimental simulation of the proposed NI method is carried out in eight steps, as mentioned below. The details of the simulated block diagram are shown in Figure 4. The first step is to pre-process the speech signals, which includes windowing and framing. The speech signals are taken from two different databases with different sampling rates. To make the sampling rate uniform, all the speech signals are down-sampled to 8 kHz. Then, the Hanning window is used with a non-overlapping frame size of 25 ms. In the second step, the Voice Activity Detection (VAD) technique is applied to select the frames where the voiced part of the speech signal is present. The VAD implementation is similar to the P.56 standard. In the third step, the normalization of the amplitude of the speech signals is performed by using a dynamic range compressor (DRC). It consists of an expander that boosts the low signal levels and a compressor that reduces the high levels of peaks. Basically, the DRC reduces the overall dynamic range so that the feature extraction is independent of the amplitude level of different speech databases, speakers, and recording environments. The DRC depends on two parameters: a compression parameter and a filter parameter. In this implementation, both parameters are taken as 0.5. In the fourth step, the six audio features, spectral centroid, spectral skewness, spectral spread, spectral tonal power ratio, entropy, and standard deviation, are extracted from each voiced frame. In the fifth step, for designing the NI evaluation of the quality and intelligibility, the PESQ and STOI values are used as the intrusive performance measures. These values are calculated for each speech signal of the two databases along with the feature values.
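The pre-processing chain of the first three steps can be sketched as follows, assuming the input is already at 8 kHz. The energy-threshold VAD and the envelope-based DRC law below are simplified stand-ins for the P.56-style VAD and the DRC actually used, so their details are assumptions:

```python
import numpy as np

def frame_signal(x, fs=8000, frame_ms=25):
    """Split into non-overlapping 25 ms frames and apply a Hanning window."""
    n = int(fs * frame_ms / 1000)                 # 200 samples at 8 kHz
    n_frames = len(x) // n
    frames = x[: n_frames * n].reshape(n_frames, n)
    return frames * np.hanning(n)

def simple_vad(frames, rel_threshold=0.1):
    """Energy-based VAD: keep frames whose RMS energy exceeds a fraction
    of the loudest frame. The patent follows the P.56 method; this
    threshold rule is a simplified stand-in."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return frames[rms > rel_threshold * rms.max()]

def drc(x, comp=0.5, alpha=0.5):
    """Toy dynamic range compressor with compression parameter `comp`
    and a one-pole envelope filter parameter `alpha` (both 0.5 as in
    the disclosure); the exact expander/compressor law is assumed."""
    env = np.zeros_like(x)
    level = 0.0
    for i, s in enumerate(x):
        level = alpha * level + (1 - alpha) * abs(s)   # smoothed envelope
        env[i] = level
    gain = (env + 1e-6) ** (comp - 1.0)           # boost quiet, cut loud
    return x * gain

fs = 8000
t = np.arange(fs) / fs                            # 1 s of synthetic test audio
x = np.sin(2 * np.pi * 300 * t) * (t > 0.5)       # silence, then a tone
voiced = simple_vad(frame_signal(x))
print(voiced.shape)
```

Only the frames retained by the VAD would be passed on to the DRC and feature-extraction stages.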
These values are determined and stored in the sixth step of the implementation. In the seventh step, an empirical relationship between the selected features and the PESQ and STOI values is obtained from the trained FLANN model. Out of the different combinations of the three types of functional expansions, the TFLANN with the sigmoid function is selected because it provides the lowest training error and faster convergence. In the last step, the performance evaluation of the proposed algorithm is carried out with the four evaluation metrics.
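Two of the evaluation metrics named in the claims, the Spearman correlation and the RMSE, can be computed as below. These are the standard definitions (no tie correction); the disclosure does not detail its exact formulation:

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation between predicted and true scores
    (standard rank-based definition, no tie correction)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def rmse(a, b):
    """Root mean square error between predicted and true scores."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

# Hypothetical predicted vs. true PESQ values for four files
pred = [2.1, 3.0, 1.2, 3.8]
true = [2.0, 3.1, 1.0, 4.0]
print(spearman(pred, true), rmse(pred, true))
```

A high Spearman value with a low RMSE indicates that the NI predictions track the intrusive measures both in rank order and in absolute value.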
The drawings and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, the orders of processes described herein may be changed and are not limited to the manner described herein. Moreover, the actions of any flow diagram need not be implemented in the order shown; nor do all of the acts necessarily need to be performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of embodiments is by no means limited by these specific examples. Numerous variations, whether explicitly given in the specification or not, such as differences in structure, dimension, and use of material, are possible. The scope of embodiments is at least as broad as given by the following claims.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems and any component(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature or component of any or all the claims.
Claims (10)
1. A method for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model, the method comprises:
pre-processing speech signals received from two different databases with different sampling rates for windowing and framing; selecting frames containing the voiced part of the speech signal by employing a voice activity detection (VAD) technique; normalizing the amplitude of the speech signals through a dynamic range compressor (DRC) consisting of an expander which boosts the low signal levels and a compressor which reduces the high levels of peaks; extracting six audio features including spectral centroid, spectral skewness, spectral spread, spectral tonal power ratio, entropy, and standard deviation from each voiced frame; calculating and storing perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) values for each speech signal of the two databases along with the feature values, and thereby designing non-intrusive (NI) evaluation of the quality and intelligibility using the PESQ and STOI values as the intrusive measures; obtaining an empirical relationship between the selected features and the PESQ and STOI values from a trained FLANN model; and executing performance evaluation of the FLANN model using evaluation metrics.
2. The method as claimed in claim 1, wherein the steps for converting the different sampling rates to a uniform sampling rate comprise: down-sampling all the speech signals to 8 kHz; and using a Hanning window with a non-overlapping frame size of 25 ms.
3. The method as claimed in claim 1, wherein the DRC reduces the overall dynamic range so that the feature extraction will be independent of the amplitude level of different speech databases, speakers, and recording environments.
4. The method as claimed in claim 3, wherein the DRC depends on two parameters, a compression parameter and a filter parameter, both of which are taken as 0.5.
5. The method as claimed in claim 1, wherein out of the different combinations of the three types of functional expansions, the TFLANN with the sigmoid function is selected because it provides the lowest training error and the fastest convergence.
6. The method as claimed in claim 1, wherein the performances obtained from three FLANN models, Trigonometric (TFLANN), Chebyshev (CFLANN), and Polynomial (PFLANN), are compared.
7. The method as claimed in claim 1, wherein the evaluation metrics are the Spearman correlation coefficient (SCC) and the root mean square error (RMSE), wherein the SCC gives a numerical value of the monotonic relationship between two ranked variables and the RMSE gives information about the estimation accuracy of the NI techniques.
8. The method as claimed in claim 1, wherein in the FLANN, each input undergoes functional expansion by a set of basis functions to determine the optimum values of the weight parameters (W) for the best possible approximation between inputs and outputs.
9. A system for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model, the system comprises:
A database consisting of two different databases with different sampling rates;
A pre-processing module for windowing and framing the speech signals received from the two databases;
A voice activity detection (VAD) unit for selecting frames containing the voiced part of the speech signal;
A dynamic range compressor (DRC), consisting of an expander which boosts low signal levels and a compressor which reduces high-level peaks, for normalizing the amplitude of the speech signals;
A feature extraction module for extracting six audio features, namely spectral centroid, spectral skewness, spectral spread, spectral tonal power ratio, entropy, and standard deviation, from each voiced frame; and
A FLANN model for predicting the NI measures upon calculating and storing perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) values for each speech signal of the two databases along with the feature values, wherein the FLANN model is designed for non-intrusive evaluation of quality and intelligibility using the PESQ and STOI values as the intrusive measures.
10. The system as claimed in claim 9, wherein the empirical relationship between the selected features and the PESQ and STOI values is obtained from a trained FLANN model.
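The uniform-rate framing of claim 2 might be sketched as follows; `resample_poly` and the exact frame layout are implementation choices assumed here, not specified by the patent.

```python
import numpy as np
from scipy.signal import resample_poly

def frame_signal(x, orig_sr, target_sr=8000, frame_ms=25):
    """Resample to a uniform rate, then cut into non-overlapping Hanning-windowed frames."""
    # Down-sample to the common 8 kHz rate (polyphase resampling).
    x = resample_poly(x, target_sr, orig_sr)
    frame_len = int(target_sr * frame_ms / 1000)    # 200 samples per 25 ms at 8 kHz
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    return frames * np.hanning(frame_len)           # apply the window to every frame
```

One second of 16 kHz audio, for example, becomes 40 windowed frames of 200 samples each.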
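An illustrative dynamic range compressor in the spirit of claims 1 and 4: a one-pole filter (filter parameter `alpha`) tracks the signal envelope, and a power-law gain with exponent `rho - 1` boosts quiet segments while attenuating loud ones. Both parameters are set to 0.5 as in claim 4, but the exact DRC topology is not disclosed in the patent, so this form is an assumption.

```python
import numpy as np

def drc(x, rho=0.5, alpha=0.5):
    """Sketch of an expander/compressor pair via a smoothed-envelope power-law gain."""
    env = np.zeros_like(x, dtype=float)
    level = 1e-4
    for n, s in enumerate(x):
        level = alpha * level + (1 - alpha) * abs(s)   # smoothed amplitude envelope
        env[n] = level
    gain = np.power(env + 1e-12, rho - 1.0)            # > 1 for quiet input, < 1 for loud
    return x * gain
```

With `rho = 0.5` a steady level of 4.0 is pulled down toward 2.0, while a level of 0.01 is pushed up toward 0.1, compressing the overall dynamic range as claim 3 describes.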
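The per-frame feature extraction of claim 1 can be sketched in Python. The patent does not give closed-form definitions, so standard textbook formulas for spectral centroid, spread, skewness, and entropy are assumed here (the tonal power ratio is omitted for brevity), with the frame length and 8 kHz rate of claim 2.

```python
import numpy as np

def frame_features(frame, sr=8000):
    """Five of the six claimed features for one voiced frame (illustrative formulas)."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    p = mag / (mag.sum() + 1e-12)                  # normalised spectrum as a distribution
    centroid = (freqs * p).sum()                   # spectral centroid (mean frequency)
    spread = np.sqrt(((freqs - centroid) ** 2 * p).sum())            # spectral spread
    skew = (((freqs - centroid) / (spread + 1e-12)) ** 3 * p).sum()  # spectral skewness
    entropy = -(p * np.log2(p + 1e-12)).sum()      # spectral entropy
    return centroid, spread, skew, entropy, np.std(frame)            # + time-domain std
```

A 1 kHz sine frame, for instance, yields a spectral centroid at roughly 1 kHz and a near-zero spread.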
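The trigonometric functional expansion and weight training of claims 5 and 8 can be sketched with an LMS-style gradient update on a sigmoid (here tanh) output node. The expansion order, learning rate, and epoch count are illustrative choices, not values from the patent.

```python
import numpy as np

def trig_expand(x, order=2):
    """TFLANN expansion: each input x_i maps to [x_i, sin(k*pi*x_i), cos(k*pi*x_i), ...]."""
    terms = [x]
    for k in range(1, order + 1):
        terms.append(np.sin(k * np.pi * x))
        terms.append(np.cos(k * np.pi * x))
    return np.concatenate(terms)

def train_flann(X, y, order=2, lr=0.05, epochs=200):
    """LMS-style training of the weight vector W on the expanded inputs."""
    W = np.zeros(X.shape[1] * (2 * order + 1))
    for _ in range(epochs):
        for xi, ti in zip(X, y):
            phi = trig_expand(xi, order)
            out = np.tanh(W @ phi)                     # sigmoid output node
            W += lr * (ti - out) * (1 - out ** 2) * phi  # gradient step on squared error
    return W
```

In the full method the inputs would be the six frame features and the targets the stored PESQ/STOI scores; here a small synthetic regression stands in for them.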
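The evaluation metrics of claim 7 are standard; assuming SciPy is available, they reduce to a few lines comparing the non-intrusive predictions against the intrusive PESQ/STOI scores.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(predicted, measured):
    """Spearman rank correlation (monotonic agreement) and RMSE (estimation accuracy)."""
    scc = spearmanr(predicted, measured)[0]
    rmse = np.sqrt(np.mean((np.asarray(predicted) - np.asarray(measured)) ** 2))
    return scc, rmse
```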
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021101586A AU2021101586A4 (en) | 2021-03-28 | 2021-03-28 | A System and a Method for Non-Intrusive Speech Quality and Intelligibility Evaluation Measures using FLANN Model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021101586A AU2021101586A4 (en) | 2021-03-28 | 2021-03-28 | A System and a Method for Non-Intrusive Speech Quality and Intelligibility Evaluation Measures using FLANN Model |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2021101586A4 true AU2021101586A4 (en) | 2021-05-20 |
Family
ID=75911246
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2021101586A Ceased AU2021101586A4 (en) | 2021-03-28 | 2021-03-28 | A System and a Method for Non-Intrusive Speech Quality and Intelligibility Evaluation Measures using FLANN Model |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2021101586A4 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113345420A (en) * | 2021-06-07 | 2021-09-03 | 河海大学 | Countermeasure audio generation method and system based on firefly algorithm and gradient evaluation |
CN117787509A (en) * | 2024-02-23 | 2024-03-29 | 西安热工研究院有限公司 | Wind speed prediction method and system for energy storage auxiliary black start |
CN117787509B (en) * | 2024-02-23 | 2024-05-14 | 西安热工研究院有限公司 | Wind speed prediction method and system for energy storage auxiliary black start |
Similar Documents
Publication | Title |
---|---|
US10878823B2 | Voiceprint recognition method, device, terminal apparatus and storage medium |
US7133826B2 | Method and apparatus using spectral addition for speaker recognition |
Dubey et al. | Non-intrusive speech quality assessment using several combinations of auditory features |
TR201810466T4 | Apparatus and method for processing an audio signal to improve speech using feature extraction |
AU2021101586A4 | A System and a Method for Non-Intrusive Speech Quality and Intelligibility Evaluation Measures using FLANN Model |
CN106997765B | Quantitative characterization method for human voice timbre |
Yu et al. | Metricnet: Towards improved modeling for non-intrusive speech quality assessment |
Sebastian et al. | Group delay based music source separation using deep recurrent neural networks |
Moore et al. | Say What? A Dataset for Exploring the Error Patterns That Two ASR Engines Make |
Hasan et al. | Preprocessing of continuous bengali speech for feature extraction |
Rahman et al. | Dynamic time warping assisted svm classifier for bangla speech recognition |
US20230245674A1 | Method for learning an audio quality metric combining labeled and unlabeled data |
Dash et al. | Multi-objective approach to speech enhancement using tunable Q-factor-based wavelet transform and ANN techniques |
CN117409761B | Method, device, equipment and storage medium for synthesizing voice based on frequency modulation |
Bansal et al. | Low bit-rate speech coding based on multicomponent AFM signal model |
Kumar | Real-time implementation and performance evaluation of speech classifiers in speech analysis-synthesis |
Hess | Pitch and voicing determination of speech with an extension toward music signals |
Fathima et al. | Gammatone cepstral coefficient for speaker Identification |
Huber et al. | Single-ended speech quality prediction based on automatic speech recognition |
Martin et al. | Cepstral modulation ratio regression (CMRARE) parameters for audio signal analysis and classification |
Cabrera et al. | PsySound3: a program for the analysis of sound recordings |
CN116230018A | Synthetic voice quality evaluation method for voice synthesis system |
KR102042344B1 | Apparatus for judging the similiarity between voices and the method for judging the similiarity between voices |
KR20190125078A | Apparatus for judging the similiarity between voices and the method for judging the similiarity between voices |
CN111599345B | Speech recognition algorithm evaluation method, system, mobile terminal and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |