AU2021101586A4 - A System and a Method for Non-Intrusive Speech Quality and Intelligibility Evaluation Measures using FLANN Model
- Publication number
- AU2021101586A4
- Authority
- AU
- Australia
- Prior art keywords
- speech
- flann
- model
- values
- evaluation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/69—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/01—Assessment or evaluation of speech recognition systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
Abstract
The present disclosure relates to a system and a method for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model. It involves three main stages. The first stage consists of the formation of a speech database and the calculation of the corresponding PESQ and STOI values for each speech sample. A total of two databases consisting of speech samples at different noise levels have been used. In the second stage, the speech files are labelled into four categories: very poor, poor, average, and good quality. Then, considering it as a four-class classification problem, feature selection algorithms are used to select the best six audio features. The FLANN prediction method is dealt with in stage three to find out the relation between these six features and the corresponding PESQ and STOI values. For training the proposed model, 80% of the speech files (800) are used and the remaining 20% of the files are used for validation purposes.
Figure 3: Block diagram of the FLANN model during training (input speech signal, pre-processing with framing and windowing, voice activity detection, dynamic range compressor, feature extraction, and the FLANN model for prediction of the evaluation measures).
Figure 4: Implementation details of the proposed method.
Description
A System and a Method for Non-Intrusive Speech Quality and Intelligibility Evaluation Measures using FLANN Model
The present disclosure relates to a system and a method for non-intrusive speech quality and intelligibility evaluation measures using the Functional Link Artificial Neural Network (FLANN) model.
Speech enhancement algorithms are widely used in speech communication systems, hearing aid devices, and speech recognition applications. The performance evaluation of these algorithms plays a crucial role in the selection of the best algorithm for the desired application. The performance evaluation can be broadly categorized into two methods: subjective listening tests and objective evaluation measures. Subjective evaluation requires a group of listeners and is time-consuming. Objective evaluation is based on the mathematical interpretation of the subjective tests and has a high correlation with the subjective ones. But in all these evaluations, the original reference clean speech and the processed speech data are required. This is known as intrusive evaluation. In reality, intrusive evaluation is not feasible where the original reference speech is not available at the receiver end. To overcome this problem, several non-intrusive (NI) evaluation measures have been suggested. The NI evaluation measures can predict speech quality and intelligibility directly from the processed speech signal. Several attempts have been made in the literature at the effective design of NI measures.
A data-driven method for NI quality and intelligibility assessment, denoted NISA, has been reported for determining the quality of service in telecommunication. It uses a combination of audio features, namely fundamental frequency, zero-crossing rate, long-term average speech spectrum deviation, Hilbert envelope-based features, LPC-based features, and importance-weighted signal-to-noise ratio, for classification and regression tree analysis. The NISA algorithm has a Spearman correlation of 0.93 with PESQ (Perceptual Evaluation of Speech Quality) and 0.95 with STOI (Short-Time Objective Intelligibility), with an RMS error of 0.08 on the additive noise database. An NI-STOI evaluation measure is discussed in another solution. In that case, the clean speech is estimated from the noisy speech by passing it through an FFT, eigenvalue decomposition, and principal component selection. The performance of that method was compared with one NI measure (Speech to Reverberation Modulation energy Ratio) and one intrusive measure (STOI); it was observed that its performance is better than the NI measure, but not good enough compared with the intrusive measure. In yet another solution, an optimal linear combination of objective mean opinion scores has been used, calculated by using multiple time-scale features in a probabilistic method. A bidirectional long short-term memory-based NI quality evaluation measure called Quality-Net has also been proposed; the quality estimation is carried out through training on the speech magnitude spectrogram and the quality score. The simulation results show that Quality-Net provides a high correlation with the PESQ measure.
An NI speech intelligibility estimation method has been designed using a recurrent neural network with a long short-term memory structure. The MFCC features are taken as the input and the STOI measurement is taken as the output of this prediction model. This model provides better results than P.563 in different noisy and reverberant conditions. A classification approach at the utterance level has been used for quality estimation: the frame-wise magnitude spectrogram is the input to a multi-layered Convolutional Neural Network, which is used to find twenty quality rankings of noisy speech. For mean opinion score (MOS) estimation, three neural network models are employed: a Convolutional Neural Network (CNN) with constant-Q spectral components, a Deep Neural Network (DNN) using i-vectors, and a DNN with MFCC features. The experimental results demonstrate that the DNN with MFCC features performs better than the NI standard P.563 on a realistic audio data set with crowd-sourced labelling. Another NI method uses a recurrent neural network (RNN) on a large training data set to estimate the Perceptual Objective Listening Quality Analysis score (an intrusive measure) from speech MFCC and modulation-domain features.
It is found that several estimation/prediction methods along with different sets of features have been used in the NI evaluation task. The features mainly used in this task are perceptual and statistical features. It is very difficult to select appropriate features from all the audio features available, and a thorough feature selection analysis is required for this NI task. In order to overcome the above-mentioned drawbacks, a method for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model needs to be developed.
SUMMARY OF THE INVENTION
The present disclosure relates to a system and a method for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model. The present disclosure involves three main stages. The first stage consists of the formation of a speech database and the calculation of the corresponding PESQ and STOI values for each speech sample. A total of two databases consisting of speech samples at different noise levels have been used. In the second stage, the speech files are labelled into four categories: very poor quality, poor quality, average quality, and good quality. Then, considering it as a four-class classification problem, feature selection algorithms are used to select the best six audio features. The FLANN prediction method is dealt with in stage three to find out the relation between these six features and the corresponding PESQ and STOI values. For training the proposed model, 80% of the speech files (800) are used and the remaining 20% of the files are used for validation purposes.
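The labelling and 80/20 split described above can be sketched as follows. The PESQ class thresholds and the file names are illustrative assumptions, since the disclosure does not specify them:

```python
import random

def label_quality(pesq):
    """Map a PESQ score (roughly -0.5 to 4.5) onto the four quality
    classes used in the disclosure. These threshold values are
    assumptions for illustration, not taken from the patent."""
    if pesq < 1.5:
        return "very poor"
    elif pesq < 2.5:
        return "poor"
    elif pesq < 3.5:
        return "average"
    return "good"

def train_val_split(files, train_fraction=0.8, seed=0):
    """Shuffle the file list and split 80/20, as described for the
    1000-file database (800 training / 200 validation)."""
    rng = random.Random(seed)
    shuffled = files[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file names standing in for the two speech databases
files = [f"speech_{i:04d}.wav" for i in range(1000)]
train, val = train_val_split(files)
print(len(train), len(val))  # 800 200
```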
In an embodiment, a system for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model is proposed and developed. The system 100 comprises: a database 102 consisting of two different databases with different sampling rates; a pre-processing module 104 for windowing and framing speech signals received from the two databases; a voice activity detection (VAD) unit 106 for selecting frames containing the voiced part of the speech signal by employing a voice activity detection (VAD) technique; a dynamic range compressor (DRC) 108 consisting of an expander which boosts the low signal levels and a compressor which reduces the high levels of peaks, for normalizing the amplitude of the speech signals; a feature extraction module 110 for extracting six audio features, including spectral centroid, spectral skewness, spectral spread, spectral tonal power ratio, entropy, and standard deviation, from each voiced frame; and a FLANN model 112 for predicting NI measures upon calculating and storing perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) values for each speech signal of the two databases along with the feature values, wherein the FLANN model is designed for non-intrusive evaluation of the quality and intelligibility using the PESQ and STOI values as the intrusive measures.
In an embodiment, a method for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model is proposed and developed. The method 200 comprises the following steps: At step 202, pre-processing speech signals received from two different databases with different sampling rates for windowing and framing; At step 204, selecting frames containing the voiced part of the speech signal by employing a voice activity detection (VAD) technique; At step 206, normalizing the amplitude of the speech signals through a dynamic range compressor (DRC) consisting of an expander which boosts the low signal levels and a compressor which reduces the high levels of peaks; At step 208, extracting six audio features, including spectral centroid, spectral skewness, spectral spread, spectral tonal power ratio, entropy, and standard deviation, from each voiced frame; At step 210, calculating and storing perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) values for each speech signal of the two databases along with the feature values, and thereby designing non-intrusive (NI) evaluation of the quality and intelligibility using the PESQ and STOI values as the intrusive measures; At step 212, obtaining an empirical relationship between the selected features and the PESQ and STOI values from a trained FLANN model; and At step 214, executing performance evaluation of the FLANN model using evaluation metrics.
The objectives of this disclosure are listed below:
1. To classify the available speech files into four groups based on their quality and intelligibility.
2. To apply appropriate features as inputs to the FLANN model to obtain the desired PESQ and STOI values.
3. To train a FLANN model employing different types of functional expansions by taking the input-output data.
4. To evaluate the correlation between estimated NI values obtained from the FLANN model and the corresponding intrusive measure values, and to assess the superiority of the proposed model.
To further clarify the advantages and features of the present disclosure, a more particular description of the invention will be rendered by reference to specific embodiments thereof, which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail through the accompanying drawings.
These and other features, aspects, and advantages of the present disclosure will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
Figure 1 illustrates a system for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model in accordance with an embodiment of the present disclosure.
Figure 2 illustrates a method for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model in accordance with an embodiment of the present disclosure.
Figure 3 illustrates the block diagram of the FLANN model during the training period in accordance with an embodiment of the present disclosure.
Figure 4 illustrates the implementation details of the proposed method in accordance with an embodiment of the present disclosure.
Further, skilled artisans will appreciate that elements in the drawings are illustrated for simplicity and may not necessarily have been drawn to scale. For example, the flow charts illustrate the method in terms of the most prominent steps involved to help improve understanding of aspects of the present disclosure. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the drawings by conventional symbols, and the drawings may show only those specific details that are pertinent to understanding the embodiments of the present disclosure, so as not to obscure the drawings with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
DETAILED DESCRIPTION
For the purpose of promoting an understanding of the principles of the invention, reference will now be made to the embodiment illustrated in the drawings, and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended, such alterations and further modifications in the illustrated system, and such further applications of the principles of the invention as illustrated therein, being contemplated as would normally occur to one skilled in the art to which the invention relates.
It will be understood by those skilled in the art that the foregoing general description and the following detailed description are exemplary and explanatory of the invention and are not intended to be restrictive thereof.
Reference throughout this specification to "an aspect", "another aspect" or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, appearances of the phrase "in an embodiment", "in another embodiment" and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
The terms "comprises", "comprising", or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such process or method. Similarly, one or more devices or sub-systems or elements or structures or components proceeded by "comprises...a" does not, without more constraints, preclude the existence of other devices or other sub-systems or other elements or other structures or other components or additional devices or additional sub-systems or additional elements or additional structures or additional components.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The system, methods, and examples provided herein are illustrative only and not intended to be limiting.
Embodiments of the present disclosure will be described below in detail with reference to the accompanying drawings.
Figure 1 illustrates a system for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model in accordance with an embodiment of the present disclosure. The system 100 comprises: a database 102 consisting of two different databases with different sampling rates; a pre-processing module 104 for windowing and framing speech signals received from the two databases; a voice activity detection (VAD) unit 106 for selecting frames containing the voiced part of the speech signal by employing a voice activity detection (VAD) technique; a dynamic range compressor (DRC) 108 consisting of an expander which boosts the low signal levels and a compressor which reduces the high levels of peaks, for normalizing the amplitude of the speech signals; a feature extraction module 110 for extracting six audio features, including spectral centroid, spectral skewness, spectral spread, spectral tonal power ratio, entropy, and standard deviation, from each voiced frame; and a FLANN model 112 for predicting NI measures upon calculating and storing perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) values for each speech signal of the two databases along with the feature values, wherein the FLANN model is designed for non-intrusive evaluation of the quality and intelligibility using the PESQ and STOI values as the intrusive measures.
Figure 2 illustrates a method for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model in accordance with an embodiment of the present disclosure. The method 200 comprises the following steps: at step 202, pre-processing speech signals received from two different databases with different sampling rates for windowing and framing; at step 204, selecting frames containing the voiced part of the speech signal by employing a voice activity detection (VAD) technique; at step 206, normalizing the amplitude of the speech signals through a dynamic range compressor (DRC) consisting of an expander which boosts the low signal levels and a compressor which reduces the high levels of peaks; at step 208, extracting six audio features, including spectral centroid, spectral skewness, spectral spread, spectral tonal power ratio, entropy, and standard deviation, from each voiced frame; at step 210, calculating and storing perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) values for each speech signal of the two databases along with the feature values, and thereby designing non-intrusive (NI) evaluation of the quality and intelligibility using the PESQ and STOI values as the intrusive measures; at step 212, obtaining an empirical relationship between the selected features and the PESQ and STOI values from a trained FLANN model; and at step 214, executing performance evaluation of the FLANN model using evaluation metrics.
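The feature-extraction step (208) can be illustrated with a minimal per-frame sketch. The disclosure does not give the exact feature definitions or normalizations, so the formulas below follow common audio-feature conventions and should be read as assumptions:

```python
import numpy as np

def frame_features(frame, fs=8000):
    """Compute the six per-frame audio features named in the disclosure:
    spectral centroid, spectral skewness, spectral spread, spectral tonal
    power ratio, entropy, and standard deviation (assumed definitions)."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    power = spec ** 2
    p = power / (power.sum() + 1e-12)           # normalized power spectrum

    centroid = np.sum(freqs * p)                 # spectral centroid (Hz)
    spread = np.sqrt(np.sum(((freqs - centroid) ** 2) * p))
    skewness = np.sum(((freqs - centroid) ** 3) * p) / (spread ** 3 + 1e-12)

    # tonal power ratio: share of power in the strongest 5% of bins (assumed)
    k = max(1, len(power) // 20)
    tonal_ratio = np.sort(power)[-k:].sum() / (power.sum() + 1e-12)

    entropy = -np.sum(p * np.log2(p + 1e-12))    # spectral entropy (bits)
    std = np.std(frame)                          # time-domain standard deviation
    return np.array([centroid, skewness, spread, tonal_ratio, entropy, std])

# Example: a 25 ms voiced-like frame (200 samples at 8 kHz), 200 Hz tone + noise
rng = np.random.default_rng(0)
t = np.arange(200) / 8000.0
frame = np.sin(2 * np.pi * 200 * t) + 0.1 * rng.standard_normal(200)
feats = frame_features(frame)
print(feats.shape)
```

The feature vector from each voiced frame would then be stacked to form the input to the feature selection and FLANN stages.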
Figure 3 illustrates the block diagram of the FLANN model during the training period in accordance with an embodiment of the present disclosure. The Functional Link Artificial Neural Network (FLANN) based model is a single-layer non-linear network. It is simple to implement and possesses low complexity, yet provides fast convergence as well as reasonably good performance compared to multi-layer perceptron-based structures. In FLANN, each input undergoes functional expansion by a set of basis functions, and the goal is to determine the optimum values of the weight parameters (W) so that the best possible approximation between inputs and outputs can be found. In the proposed implementation, three popular basis functions, Trigonometric (TFLANN), Chebyshev (CFLANN), and Polynomial (PFLANN), are chosen for the FLANN model. The corresponding block diagram during the training period is shown in Figure 3. The extracted input features of the speech signals are fi(k) and the predicted values are PESQ and STOI. The objective is to select the weights W(k) during the training phase so that the desired predicted values are obtained.
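A minimal sketch of a TFLANN of this kind is given below, assuming a trigonometric functional expansion and an LMS-style weight update; the patent does not specify the exact learning rule, expansion order, or learning rate, so all of those are illustrative assumptions:

```python
import numpy as np

def trig_expand(x, order=2):
    """Trigonometric functional expansion (TFLANN): each input feature x_i
    is expanded to [x_i, sin(n*pi*x_i), cos(n*pi*x_i)] for n = 1..order."""
    terms = [x]
    for n in range(1, order + 1):
        terms.append(np.sin(n * np.pi * x))
        terms.append(np.cos(n * np.pi * x))
    return np.concatenate(terms)

def train_flann(X, y, order=2, lr=0.05, epochs=200, seed=0):
    """Single-layer FLANN trained with the LMS rule; a sketch of the
    training loop implied by Figure 3, not the patented implementation."""
    rng = np.random.default_rng(seed)
    dim = X.shape[1] * (2 * order + 1) + 1        # expanded features + bias
    w = rng.normal(scale=0.1, size=dim)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            phi = np.append(trig_expand(xi, order), 1.0)
            err = yi - phi @ w                    # prediction error
            w += lr * err * phi                   # LMS weight update
    return w

def predict(w, X, order=2):
    return np.array([np.append(trig_expand(xi, order), 1.0) @ w for xi in X])

# Toy demo: learn a smooth nonlinear map from 2-D features to a score
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = 0.5 * np.sin(np.pi * X[:, 0]) + 0.3 * X[:, 1]
w = train_flann(X, y)
fit_rmse = np.sqrt(np.mean((predict(w, X) - y) ** 2))
print(round(fit_rmse, 3))
```

In the actual system, the expanded inputs would be the six selected audio features and the targets would be the stored PESQ and STOI values.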
Figure 4 illustrates (a) a working model for non-intrusive speech quality and intelligibility evaluation using the FLANN model and (b) authorisation processing in accordance with an embodiment of the present disclosure. The experimental simulation of the proposed NI method is carried out in eight steps, as mentioned below. The details of the simulated block diagram are shown in Figure 4. The first step is to pre-process the speech signals, which includes windowing and framing. The speech signals are taken from two different databases with different sampling rates. To make the sampling rate uniform, all the speech signals are down-sampled to 8 kHz. Then, the Hanning window is used with a non-overlapping frame size of 25 ms. In the second step, the Voice Activity Detection (VAD) technique is applied to select the frames where the voiced part of the speech signal is present. The VAD implementation is similar to the P.56 standard. In the third step, the normalization of the amplitude of the speech signals is performed by using a dynamic range compressor (DRC). It consists of an expander that boosts the low signal levels and a compressor that reduces the high levels of peaks. Basically, the DRC reduces the overall dynamic range so that the feature extraction is independent of the amplitude level of different speech databases, speakers, and recording environments. The DRC depends on two parameters: a compression parameter and a filter parameter. In this implementation, both parameters are taken as 0.5. In the fourth step, the six audio features, spectral centroid, spectral skewness, spectral spread, spectral tonal power ratio, entropy, and standard deviation, are extracted from each voiced frame. In the fifth step, for designing the NI evaluation of the quality and intelligibility, the PESQ and STOI values are used as the intrusive performance measures. These values are calculated for each speech signal of the two databases along with the feature values.
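The pre-processing chain of the first three steps can be sketched as follows, assuming the input is already at 8 kHz. The energy-threshold VAD and the envelope-based DRC law below are simplified stand-ins for the P.56-style VAD and the DRC actually used, so their details are assumptions:

```python
import numpy as np

def frame_signal(x, fs=8000, frame_ms=25):
    """Split into non-overlapping 25 ms frames and apply a Hanning window."""
    n = int(fs * frame_ms / 1000)                 # 200 samples at 8 kHz
    n_frames = len(x) // n
    frames = x[: n_frames * n].reshape(n_frames, n)
    return frames * np.hanning(n)

def simple_vad(frames, rel_threshold=0.1):
    """Energy-based VAD: keep frames whose RMS energy exceeds a fraction
    of the loudest frame. The patent follows the P.56 method; this
    threshold rule is a simplified stand-in."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return frames[rms > rel_threshold * rms.max()]

def drc(x, comp=0.5, alpha=0.5):
    """Toy dynamic range compressor with compression parameter `comp`
    and a one-pole envelope filter parameter `alpha` (both 0.5 as in
    the disclosure); the exact expander/compressor law is assumed."""
    env = np.zeros_like(x)
    level = 0.0
    for i, s in enumerate(x):
        level = alpha * level + (1 - alpha) * abs(s)   # smoothed envelope
        env[i] = level
    gain = (env + 1e-6) ** (comp - 1.0)           # boost quiet, cut loud
    return x * gain

fs = 8000
t = np.arange(fs) / fs                            # 1 s of synthetic test audio
x = np.sin(2 * np.pi * 300 * t) * (t > 0.5)       # silence, then a tone
voiced = simple_vad(frame_signal(x))
print(voiced.shape)
```

Only the frames retained by the VAD would be passed on to the DRC and feature-extraction stages.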
These values are determined and stored in the sixth step of the implementation. In the seventh step, an empirical relationship between the selected features and the PESQ and STOI values is obtained from the trained FLANN model. Out of the different combinations of the three types of functional expansions, the TFLANN with the sigmoid function is selected because it provides the lowest training error and faster convergence. In the last step, the performance evaluation of the proposed algorithm is carried out with the four evaluation metrics.
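Two of the evaluation metrics named in the claims, the Spearman correlation and the RMSE, can be computed as below. These are the standard definitions (no tie correction); the disclosure does not detail its exact formulation:

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation between predicted and true scores
    (standard rank-based definition, no tie correction)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def rmse(a, b):
    """Root mean square error between predicted and true scores."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

# Hypothetical predicted vs. true PESQ values for four files
pred = [2.1, 3.0, 1.2, 3.8]
true = [2.0, 3.1, 1.0, 4.0]
print(spearman(pred, true), rmse(pred, true))
```

A high Spearman value with a low RMSE indicates that the NI predictions track the intrusive measures both in rank order and in absolute value.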
The drawings and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, the orders of processes described herein may be changed and are not limited to the manner described herein. Moreover, the actions of any flow diagram need not be implemented in the order shown; nor do all of the acts necessarily need to be performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of embodiments is by no means limited by these specific examples. Numerous variations, whether explicitly given in the specification or not, such as differences in structure, dimension, and use of material, are possible. The scope of embodiments is at least as broad as given by the following claims.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems and any component(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature or component of any or all the claims.
Claims (10)
1. A method for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model, the method comprises:
pre-processing speech signals received from two different databases with different sampling rates for windowing and framing; selecting frames containing the voiced part of the speech signal by employing a voice activity detection (VAD) technique; normalizing the amplitude of the speech signals through a dynamic range compressor (DRC) consisting of an expander which boosts the low signal levels and a compressor which reduces the high levels of peaks; extracting six audio features including spectral centroid, spectral skewness, spectral spread, spectral tonal power ratio, entropy, and standard deviation from each voiced frame; calculating and storing perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) values for each speech signal of the two databases along with the feature values, and thereby designing non-intrusive (NI) evaluation of the quality and intelligibility using the PESQ and STOI values as the intrusive measures; obtaining an empirical relationship between the selected features and the PESQ and STOI values from a trained FLANN model; and executing performance evaluation of the FLANN model using evaluation metrics.
2. The method as claimed in claim 1, wherein the steps for converting the different sampling rates to a uniform sampling rate comprise: down-sampling all the speech signals to 8 kHz; and using a Hanning window with a non-overlapping frame size of 25 ms.
3. The method as claimed in claim 1, wherein the DRC reduces the overall dynamic range so that the feature extraction will be independent of the amplitude level of different speech databases, speakers, and recording environments.
4. The method as claimed in claim 3, wherein the DRC depends on two parameters, a compression parameter and a filter parameter, both of which are taken as 0.5.
5. The method as claimed in claim 1, wherein out of the different combinations of the three types of functional expansions, the TFLANN with the sigmoid function is selected because it provides the lowest training error and the fastest convergence.
6. The method as claimed in claim 1, wherein the performances obtained from three FLANN models, Trigonometric (TFLANN), Chebyshev (CFLANN), and Polynomial (PFLANN), are compared.
7. The method as claimed in claim 1, wherein the evaluation metrics are the Spearman correlation coefficient (SCC) and the root mean square error (RMSE), wherein the SCC gives a numerical value of the monotonic relationship between two ranked variables and the RMSE gives information about the estimation accuracy of the NI techniques.
8. The method as claimed in claim 1, wherein in the FLANN, each input undergoes functional expansion by a set of basis functions to determine the optimum values of the weight parameters (W) for the best possible approximation between inputs and outputs.
9. A system for non-intrusive speech quality and intelligibility evaluation measures using the FLANN model, the system comprises:
A database consisting of two different databases with different sampling rates;
A pre-processing module for windowing and framing the speech signals received from the two databases;
A voice activity detection (VAD) unit for selecting frames containing the voiced part of the speech signal;
A dynamic range compressor (DRC), consisting of an expander which boosts low signal levels and a compressor which reduces high-level peaks, for normalizing the amplitude of the speech signals;
A feature extraction module for extracting six audio features, namely spectral centroid, spectral skewness, spectral spread, spectral tonal power ratio, entropy, and standard deviation, from each voiced frame; and
A FLANN model for predicting the NI measures upon calculating and storing perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) values for each speech signal of the two databases along with the feature values, wherein the FLANN model is designed for non-intrusive evaluation of quality and intelligibility using the PESQ and STOI values as the intrusive measures.
10. The system as claimed in claim 9, wherein the empirical relationship between the selected features and the PESQ and STOI values is obtained from a trained FLANN model.
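The uniform-rate framing of claim 2 might be sketched as follows; `resample_poly` and the exact frame layout are implementation choices assumed here, not specified by the patent.

```python
import numpy as np
from scipy.signal import resample_poly

def frame_signal(x, orig_sr, target_sr=8000, frame_ms=25):
    """Resample to a uniform rate, then cut into non-overlapping Hanning-windowed frames."""
    # Down-sample to the common 8 kHz rate (polyphase resampling).
    x = resample_poly(x, target_sr, orig_sr)
    frame_len = int(target_sr * frame_ms / 1000)    # 200 samples per 25 ms at 8 kHz
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    return frames * np.hanning(frame_len)           # apply the window to every frame
```

One second of 16 kHz audio, for example, becomes 40 windowed frames of 200 samples each.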
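An illustrative dynamic range compressor in the spirit of claims 1 and 4: a one-pole filter (filter parameter `alpha`) tracks the signal envelope, and a power-law gain with exponent `rho - 1` boosts quiet segments while attenuating loud ones. Both parameters are set to 0.5 as in claim 4, but the exact DRC topology is not disclosed in the patent, so this form is an assumption.

```python
import numpy as np

def drc(x, rho=0.5, alpha=0.5):
    """Sketch of an expander/compressor pair via a smoothed-envelope power-law gain."""
    env = np.zeros_like(x, dtype=float)
    level = 1e-4
    for n, s in enumerate(x):
        level = alpha * level + (1 - alpha) * abs(s)   # smoothed amplitude envelope
        env[n] = level
    gain = np.power(env + 1e-12, rho - 1.0)            # > 1 for quiet input, < 1 for loud
    return x * gain
```

With `rho = 0.5` a steady level of 4.0 is pulled down toward 2.0, while a level of 0.01 is pushed up toward 0.1, compressing the overall dynamic range as claim 3 describes.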
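The per-frame feature extraction of claim 1 can be sketched in Python. The patent does not give closed-form definitions, so standard textbook formulas for spectral centroid, spread, skewness, and entropy are assumed here (the tonal power ratio is omitted for brevity), with the frame length and 8 kHz rate of claim 2.

```python
import numpy as np

def frame_features(frame, sr=8000):
    """Five of the six claimed features for one voiced frame (illustrative formulas)."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    p = mag / (mag.sum() + 1e-12)                  # normalised spectrum as a distribution
    centroid = (freqs * p).sum()                   # spectral centroid (mean frequency)
    spread = np.sqrt(((freqs - centroid) ** 2 * p).sum())            # spectral spread
    skew = (((freqs - centroid) / (spread + 1e-12)) ** 3 * p).sum()  # spectral skewness
    entropy = -(p * np.log2(p + 1e-12)).sum()      # spectral entropy
    return centroid, spread, skew, entropy, np.std(frame)            # + time-domain std
```

A 1 kHz sine frame, for instance, yields a spectral centroid at roughly 1 kHz and a near-zero spread.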
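The trigonometric functional expansion and weight training of claims 5 and 8 can be sketched with an LMS-style gradient update on a sigmoid (here tanh) output node. The expansion order, learning rate, and epoch count are illustrative choices, not values from the patent.

```python
import numpy as np

def trig_expand(x, order=2):
    """TFLANN expansion: each input x_i maps to [x_i, sin(k*pi*x_i), cos(k*pi*x_i), ...]."""
    terms = [x]
    for k in range(1, order + 1):
        terms.append(np.sin(k * np.pi * x))
        terms.append(np.cos(k * np.pi * x))
    return np.concatenate(terms)

def train_flann(X, y, order=2, lr=0.05, epochs=200):
    """LMS-style training of the weight vector W on the expanded inputs."""
    W = np.zeros(X.shape[1] * (2 * order + 1))
    for _ in range(epochs):
        for xi, ti in zip(X, y):
            phi = trig_expand(xi, order)
            out = np.tanh(W @ phi)                     # sigmoid output node
            W += lr * (ti - out) * (1 - out ** 2) * phi  # gradient step on squared error
    return W
```

In the full method the inputs would be the six frame features and the targets the stored PESQ/STOI scores; here a small synthetic regression stands in for them.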
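The evaluation metrics of claim 7 are standard; assuming SciPy is available, they reduce to a few lines comparing the non-intrusive predictions against the intrusive PESQ/STOI scores.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(predicted, measured):
    """Spearman rank correlation (monotonic agreement) and RMSE (estimation accuracy)."""
    scc = spearmanr(predicted, measured)[0]
    rmse = np.sqrt(np.mean((np.asarray(predicted) - np.asarray(measured)) ** 2))
    return scc, rmse
```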
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021101586A AU2021101586A4 (en) | 2021-03-28 | 2021-03-28 | A System and a Method for Non-Intrusive Speech Quality and Intelligibility Evaluation Measures using FLANN Model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021101586A AU2021101586A4 (en) | 2021-03-28 | 2021-03-28 | A System and a Method for Non-Intrusive Speech Quality and Intelligibility Evaluation Measures using FLANN Model |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2021101586A4 true AU2021101586A4 (en) | 2021-05-20 |
Family
ID=75911246
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2021101586A Ceased AU2021101586A4 (en) | 2021-03-28 | 2021-03-28 | A System and a Method for Non-Intrusive Speech Quality and Intelligibility Evaluation Measures using FLANN Model |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2021101586A4 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113345420A (en) * | 2021-06-07 | 2021-09-03 | 河海大学 | Countermeasure audio generation method and system based on firefly algorithm and gradient evaluation |
CN117787509A (en) * | 2024-02-23 | 2024-03-29 | 西安热工研究院有限公司 | Wind speed prediction method and system for energy storage auxiliary black start |
CN117787509B (en) * | 2024-02-23 | 2024-05-14 | 西安热工研究院有限公司 | Wind speed prediction method and system for energy storage auxiliary black start |
Similar Documents
Publication | Title |
---|---|
US10878823B2 | Voiceprint recognition method, device, terminal apparatus and storage medium |
US7133826B2 | Method and apparatus using spectral addition for speaker recognition |
Dubey et al. | Non-intrusive speech quality assessment using several combinations of auditory features |
TR201810466T4 | Apparatus and method for processing an audio signal to improve speech using feature extraction |
AU2021101586A4 | A System and a Method for Non-Intrusive Speech Quality and Intelligibility Evaluation Measures using FLANN Model |
CN106997765B | Quantitative characterization method for human voice timbre |
Yu et al. | Metricnet: Towards improved modeling for non-intrusive speech quality assessment |
Sebastian et al. | Group delay based music source separation using deep recurrent neural networks |
Moore et al. | Say What? A Dataset for Exploring the Error Patterns That Two ASR Engines Make |
Hasan et al. | Preprocessing of continuous bengali speech for feature extraction |
Rahman et al. | Dynamic time warping assisted svm classifier for bangla speech recognition |
US20230245674A1 | Method for learning an audio quality metric combining labeled and unlabeled data |
Dash et al. | Multi-objective approach to speech enhancement using tunable Q-factor-based wavelet transform and ANN techniques |
CN117409761B | Method, device, equipment and storage medium for synthesizing voice based on frequency modulation |
Bansal et al. | Low bit-rate speech coding based on multicomponent AFM signal model |
Kumar | Real-time implementation and performance evaluation of speech classifiers in speech analysis-synthesis |
Hess | Pitch and voicing determination of speech with an extension toward music signals |
Fathima et al. | Gammatone cepstral coefficient for speaker Identification |
Huber et al. | Single-ended speech quality prediction based on automatic speech recognition |
Martin et al. | Cepstral modulation ratio regression (CMRARE) parameters for audio signal analysis and classification |
Cabrera et al. | PsySound3: a program for the analysis of sound recordings |
CN116230018A | Synthetic voice quality evaluation method for voice synthesis system |
KR102042344B1 | Apparatus for judging the similiarity between voices and the method for judging the similiarity between voices |
KR20190125078A | Apparatus for judging the similiarity between voices and the method for judging the similiarity between voices |
CN111599345B | Speech recognition algorithm evaluation method, system, mobile terminal and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |