CN114360583A - Voice quality evaluation method based on neural network - Google Patents
Voice quality evaluation method based on neural network
- Publication number: CN114360583A
- Application number: CN202210004522.3A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Abstract
The invention discloses a voice quality evaluation method based on a neural network, comprising: an audio conversion module for converting audio into a format that the neural network model can process; a noise adding module for generating noisy speech paired with the clean speech; a feature extraction module for extracting the features input to the neural network; a neural network module for evaluating the voice quality score corresponding to the input features; and a loss function for training the neural network. The voice quality score is evaluated by extracting time-frequency features of the speech and feeding them to the neural network. No clean reference speech is required when performing voice quality assessment.
Description
Technical Field
The invention relates to the technical field of audio, and in particular to a voice quality evaluation method based on a neural network.
Background
Speech is the fastest and most efficient way for people to communicate in daily life. In real life, however, speech signals are often corrupted by various kinds of noise, which degrades speech quality. Evaluating the speech quality of noisy and denoised signals therefore becomes important.
Speech quality evaluation methods fall mainly into two types: manual subjective evaluation, and objective evaluation against a reference signal. Manual subjective evaluation is time-consuming and labor-intensive, incurs a large labor cost, and carries information-security risks; evaluation with a reference signal suits a laboratory environment, but is impractical in real-life scenarios because paired reference signals are usually unavailable.
In real life, the voice data to be trained on and evaluated is stored in a variety of data formats, and different data formats require different, mutually incompatible processing.
Disclosure of Invention
In view of the above problems, the present invention provides a speech quality assessment method based on a neural network, the method comprising:
an audio conversion module for converting the audio signals to be trained on and evaluated into a format that the neural network module can process;
a noise adding module, connected to the audio conversion module, for adding noise to the converted clean speech to generate training data for the neural network model;
a feature extraction module for extracting time-frequency features of the speech to be trained on or evaluated, for input to the neural network module;
a neural network module, connected to the output of the feature extraction module, for predicting the evaluation score corresponding to the input speech features;
and a loss function for training the neural network.
In this scheme, the audio conversion module converts audio in different formats into the specific format required by the method, improving its practicality. The noise adding module generates noisy speech corresponding to the input clean speech, and the noisy speech is labeled with the PESQ algorithm to produce the neural network's training data. The feature extraction module extracts features from the training data in batches and feeds them into the neural network module.
Preferably, the neural network module comprises a pooling layer, a grouped long short-term memory (LSTM) layer, a fully connected layer, a dropout layer, and the like.
The pooling layer is an adaptive average pooling layer used to compress the feature dimension.
The grouped LSTM layer adopts a grouping strategy and a representation rearrangement strategy to efficiently extract contextual features along the time dimension and generate intermediate features.
Preferably, the grouping strategy of the grouped LSTM layer divides the input features and the hidden states into K groups, denoted {a_1, ..., a_n, ..., a_K}, and all hidden states are concatenated at the output layer. The representation rearrangement strategy first reshapes the output feature of dimension N into shape (K, N/K), where N is the feature dimension; the two dimensions are then swapped, giving (N/K, K); finally the feature is flattened back to dimension N.
In this scheme, the grouping strategy in the grouped LSTM layer reduces the model's complexity, and the representation rearrangement strategy recovers the cross-group contextual correlation that grouping would otherwise lose.
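The reshape-swap-flatten rearrangement described above can be sketched in NumPy; the function name and the interleaving interpretation are illustrative assumptions, not the patent's code:

```python
import numpy as np

# Representation rearrangement: a feature of dimension N produced by K
# grouped LSTMs is reshaped to (K, N/K), transposed to (N/K, K), and
# flattened back to N, interleaving elements across the K groups so that
# cross-group context mixes in subsequent layers.
def rearrange(feature: np.ndarray, k: int) -> np.ndarray:
    n = feature.shape[-1]
    assert n % k == 0, "feature dimension must be divisible by the group count"
    return feature.reshape(*feature.shape[:-1], k, n // k) \
                  .swapaxes(-1, -2) \
                  .reshape(*feature.shape[:-1], n)

x = np.arange(8.0)        # N = 8 features from K = 2 groups of 4
y = rearrange(x, 2)       # elements interleaved across the two groups
```

With K = 2 groups of 4 elements each, the first half and second half of the vector end up interleaved, which is exactly the effect of the transpose.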
Preferably, when the parameters of the grouped LSTM layer are initialized, the forget gate bias of each LSTM is set to -3 and the other parameters are set to 0.
In this scheme, initializing the forget gate bias of each LSTM in this way makes the LSTM attend more to the context of adjacent time steps.
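A minimal sketch of this initialization, assuming the common (input, forget, cell, output) gate layout of a concatenated LSTM bias vector; the gate ordering is an assumption, as the patent does not state it:

```python
import numpy as np

# Build an LSTM bias vector of length 4*hidden_size with the forget-gate
# slice set to -3 and every other parameter set to 0, as the scheme
# describes. Gate order assumed: input | forget | cell | output.
def init_lstm_bias(hidden_size: int, forget_bias: float = -3.0) -> np.ndarray:
    bias = np.zeros(4 * hidden_size)                  # all other parameters: 0
    bias[hidden_size:2 * hidden_size] = forget_bias   # forget-gate slice
    return bias

b = init_lstm_bias(8)
```

A negative forget-gate bias drives the forget gate's sigmoid toward 0 at the start of training, so the cell state decays quickly and the unit initially emphasizes nearby time steps.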
The fully connected layer maps the intermediate features generated by the grouped LSTM layer to the training target.
The dropout layer alleviates overfitting of the neural network.
Preferably, the loss function is:

L = (1/N) Σ_{n=1}^{N} w(Q_n) [ (Q_n − Q̂_n)² + (1/T_n) Σ_{t=1}^{T_n} (Q_n − q̂_{n,t})² ]

where w(Q_n) is a weighting factor expressed as a function of the sentence-level PESQ score; Q_n and Q̂_n are the real and predicted sentence-level PESQ scores respectively; N is the total number of training sentences; T_n is the number of frames in the n-th sentence of speech; and q̂_{n,t} is the frame-level PESQ prediction for the t-th frame of the n-th sentence.
The weighting factor of the loss function in this scheme is symmetric, which gives the neural network model a better prediction effect.
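A sketch of a loss of this kind: a weighted sum of a sentence-level and a frame-level squared error against the true PESQ score. The symmetric weight w(Q) used below, and the choice of the frame mean as the sentence-level prediction, are illustrative assumptions; the patent does not disclose their exact form:

```python
import numpy as np

# Weighted sentence+frame PESQ loss sketch. q_true holds true sentence-level
# PESQ scores Q_n; frame_preds holds one array of frame-level predictions
# q-hat_{n,t} per sentence.
def pesq_loss(q_true, frame_preds):
    total = 0.0
    for q, frames in zip(q_true, frame_preds):
        q_hat = frames.mean()        # sentence prediction from frame scores (assumed)
        w = 1.0 + abs(q - 2.5)       # assumed symmetric weighting factor w(Q)
        total += w * ((q - q_hat) ** 2 + np.mean((q - frames) ** 2))
    return total / len(q_true)

perfect = pesq_loss([3.0], [np.array([3.0, 3.0])])   # zero when frames match Q
noisy = pesq_loss([3.0], [np.array([2.0, 4.0])])
```

The weight rises symmetrically as the true score moves away from the mid-scale, so very good and very bad utterances contribute more to training.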
Preferably, when the neural network module is trained in batches, the feature extraction module aligns the lengths of the speech in each batch, then performs the short-time Fourier transform and takes magnitudes, and finally normalizes the spectra to produce batch features for input to the neural network module.
In this scheme, aligning the speech durations and normalizing the magnitude spectra lets the feature extraction module train the neural network module better and improves the generalization of the trained model.
Drawings
The drawings described here provide a further understanding of the technical solutions of the embodiments of the present invention; they form a part of the present application and do not limit the embodiments of the present invention.
Fig. 1 is a flowchart of a method for training a neural network module for speech quality assessment according to an embodiment of the present invention.
Fig. 2 is a flowchart of a speech quality assessment method according to an embodiment of the present invention.
Fig. 3 is a flowchart of the operation of the feature extraction module according to an embodiment of the present invention.
Fig. 4 is a block diagram of a neural network module according to an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention, and the described embodiments are merely some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Referring to fig. 1, a neural network training method for speech quality assessment according to an embodiment of the present invention includes the following steps:
step S11, the pure speech is input into the audio format conversion module, and audio data with a specific format suitable for the method is generated for training the neural network module.
In this embodiment, the audio format conversion module converts the incoming clean speech to data in ". wav" format at 16k sample rate, single channel.
Step S12: noise is added to the specific-format audio output by the audio format conversion module to generate the paired noisy speech.
In this embodiment, noise is added randomly over multiple noise types and multiple signal-to-noise ratios, generating noisy speech under a variety of noise conditions.
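Mixing at a chosen signal-to-noise ratio can be sketched as follows; the function name and power-based scaling are assumptions for illustration, not the patent's implementation:

```python
import numpy as np

# Scale a noise signal so that clean + scaled noise has the target SNR
# (in dB), then mix. The noise is trimmed to the utterance length.
def add_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12        # avoid division by zero
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

clean = np.ones(1000)
noise = np.tile([1.0, -1.0], 500)
noisy = add_noise(clean, noise, 10.0)            # mix at 10 dB SNR
```

Drawing the noise type and the SNR at random per utterance, as step S12 describes, then yields training data covering many noise conditions.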
Step S13: PESQ values are computed between the noisy speech and the clean speech to label the noisy speech.
Step S14: the noisy speech is input to the feature extraction module for time-frequency feature extraction.
Step S15: the time-frequency features generated by the feature extraction module are input to the neural network module to predict the PESQ value.
In this embodiment, the PESQ value uses the 16 kHz wideband mode, with range 1.04-4.64; the 16 kHz narrowband mode, with range -0.5-4.5, may be selected instead.
Step S16: the predicted value output by the neural network module and the labeled data are input to the loss function, which the neural network module uses for further learning.
Referring to fig. 2, a speech quality evaluation process according to an embodiment of the present invention includes the following steps:
in step S21, the speech to be evaluated passes through the audio format conversion module to generate audio data in a specific format suitable for the method.
And step S22, the audio data after format conversion passes through a feature extraction module to generate time-frequency features.
And step S23, inputting the time-frequency characteristics generated by the characteristic extraction module into the neural network module to generate corresponding evaluation scores.
Referring to fig. 3, a workflow of a feature extraction module according to an embodiment of the present invention includes the following steps:
and step S31, in order to train the neural network module better, dividing the voices to be processed into batches, finding out the longest voice in each batch, and performing zero filling alignment on other voices according to the length of the longest voice.
And step S32, sequentially carrying out short-time Fourier transform on the zero-filling aligned voice to generate time-frequency characteristics.
And step S33, amplitude values of the generated time-frequency characteristics are taken to generate a magnitude spectrum.
In step S34, the generated amplitude spectrum is normalized.
And step S35, inputting the batch quantities of the normalized amplitude spectrum into a neural network module.
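The steps above can be sketched as a small NumPy pipeline. The frame length, hop size, Hann window, and global normalization are assumptions; the patent does not specify STFT parameters:

```python
import numpy as np

# S32-S33: windowed STFT frames, then magnitude spectrum per frame.
def stft_magnitude(x: np.ndarray, n_fft: int = 512, hop: int = 256) -> np.ndarray:
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=-1))   # (num_frames, n_fft//2 + 1)

# S31: zero-pad every utterance to the longest in the batch;
# S34: normalize the stacked magnitude spectra.
def batch_features(batch):
    longest = max(len(x) for x in batch)
    padded = [np.pad(x, (0, longest - len(x))) for x in batch]
    mags = np.stack([stft_magnitude(p) for p in padded])
    return (mags - mags.mean()) / (mags.std() + 1e-8)

feats = batch_features([np.ones(1024), np.ones(600)])  # S35: feed to the network
```

With these assumed parameters, a 1024-sample utterance yields three 257-bin magnitude frames, and the whole batch is normalized to zero mean and unit variance.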
Further, in some embodiments, the feature extraction module may extract other features, such as the zero-crossing rate, the log power spectrum, and mel-frequency cepstral coefficients, in addition to the magnitude spectrum, for use by the neural network module.
Referring to fig. 4, the structure of a neural network module according to an embodiment of the present invention comprises the following:
Step S41: the audio features are compressed in the frequency dimension by pooling layer 1 to match the number of dimensions the grouped LSTM layer requires for grouping.
Step S42: the outputs of pooling layer 1 are grouped and fed into the corresponding LSTM networks, and the outputs of those LSTM networks are finally concatenated.
Step S43: the output of the grouped LSTM layer is down-sampled using fully connected layer 1.
Step S44: a dropout strategy is applied to the features output by fully connected layer 1 to alleviate overfitting of the network model.
In this embodiment, the number of groups in the grouped LSTM network is even, and the dropout probability of the dropout layer is set to 0.3.
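The dropout of step S44 with p = 0.3 can be sketched as inverted dropout, a standard formulation assumed here for illustration:

```python
import numpy as np

# Inverted dropout: during training each activation is zeroed with
# probability p and the survivors are rescaled by 1/(1-p), so the
# expected activation is unchanged; at evaluation time input passes through.
def dropout(x: np.ndarray, p: float = 0.3, training: bool = True,
            rng=None) -> np.ndarray:
    if not training or p == 0.0:
        return x
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones(1000)
y = dropout(x, 0.3, rng=np.random.default_rng(0))
```

Because the rescaling happens at training time, no correction is needed at evaluation time, which keeps the inference path simple.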
Step S45: the features passed through the dropout layer are input to fully connected layer 2, which performs a down-sampling operation and outputs frame-level evaluation scores.
Step S46: the output of fully connected layer 2 is down-sampled by pooling layer 2 to a 1 × 1 vector and output.
In this embodiment, the pooling layers are adaptive average pooling layers.
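Adaptive average pooling, as used by both pooling layers, can be sketched in NumPy; the bin-edge scheme below mirrors the usual behavior and is an assumption, not the patent's code:

```python
import numpy as np

# Adaptive average pooling over the last axis: split it into out_size
# nearly equal bins and average each bin, so any input length maps to a
# fixed output length (e.g. 1 for the final 1 x 1 output of pooling layer 2).
def adaptive_avg_pool1d(x: np.ndarray, out_size: int) -> np.ndarray:
    n = x.shape[-1]
    edges = np.linspace(0, n, out_size + 1).astype(int)
    return np.stack([x[..., a:b].mean(axis=-1)
                     for a, b in zip(edges[:-1], edges[1:])], axis=-1)

scores = np.arange(8.0)                   # e.g. eight frame-level scores
pooled = adaptive_avg_pool1d(scores, 2)   # compress to two values
final = adaptive_avg_pool1d(scores, 1)    # compress to a single score
```

This is why the module can accept utterances of any length: the pooling output size is fixed regardless of the number of input frames.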
To demonstrate the effectiveness and feasibility of this embodiment, a model using a bidirectional long short-term memory (BLSTM) layer is used for comparison.
The comparison model is trained with the loss function above without its weighting factor.
the experimental results are shown in table one: BLSTM represents a model that employs a bidirectional long-short-time memory layer; GLSTM represents a model using the packet long-and-short memory layer described in this example; the two models both adopt the same loss function; GLSTM + loss represents the model of the grouping long-time memory layer and the loss function provided by the example; testing the noisy speech using 4900 different noise conditions, wherein the noise type is different from the noise type used by the training set; the experiment was evaluated from three indices: mean Square Error (MSE), Linear Correlation Coefficient (LCC), Spearman Rank Correlation Coefficient (SRCC), wherein a lower MSE value indicates a smaller error between the estimated value and the true value; the range of LCC, SRCC is: 0-1, closer to 1 indicates a higher correlation of the evaluation value with the true value. The experimental result shows that the method has a certain gain in the evaluation of the voice quality.
Table 1: experimental comparison.

Index | MSE | LCC | SRCC
---|---|---|---
BLSTM | 0.1257 | 0.9182 | 0.9252
GLSTM | 0.0693 | 0.9554 | 0.9589
GLSTM+loss | 0.0601 | 0.9617 | 0.9626
The above-mentioned embodiments further illustrate the objects, technical solutions, and advantages of the present invention. They are only preferred embodiments and should not be construed as limiting the present invention; any modifications, improvements, and the like made within the spirit and principle of the present invention fall within its protection scope.
Claims (7)
1. A speech quality assessment method based on a neural network, characterized by comprising:
an audio conversion module for converting the audio signals to be trained on and evaluated into a format that the neural network module can process;
a noise adding module for adding noise to the clean speech to generate training data for the neural network model;
a feature extraction module for extracting time-frequency features of the speech for input to the neural network module;
a neural network module for predicting the evaluation score corresponding to the input speech features;
and a loss function for training the neural network.
2. The method as claimed in claim 1, wherein the neural network module comprises a pooling layer, a grouped long short-term memory (grouped LSTM) layer, a fully connected layer, a dropout layer, and the like.
3. The neural-network-based speech quality assessment method according to claim 2, wherein the pooling layer is an adaptive average pooling layer for compressing feature dimensions;
the grouped LSTM layer adopts a grouping strategy and a representation rearrangement strategy to efficiently extract contextual features along the time dimension and generate intermediate features;
the fully connected layer maps the intermediate features generated by the grouped LSTM layer to the training target;
the dropout layer alleviates overfitting of the neural network.
4. The method according to claim 2, wherein the grouped LSTM layer adopts a grouping strategy and a representation rearrangement strategy;
the grouping strategy divides the input features and the hidden states into K groups, denoted {a_1, ..., a_n, ..., a_K}, and all hidden states are concatenated at the output layer;
the representation rearrangement strategy reshapes the feature of dimension N into (K, N/K), where N is the feature dimension; the dimensions are then swapped, giving (N/K, K); finally the feature is flattened back to dimension N.
5. The method according to claim 2, wherein the forget gate bias in the parameters of each long short-term memory (LSTM) unit of the grouped LSTM layer is initialized to -3 and the other parameters are initialized to 0.
6. The neural-network-based speech quality assessment method according to claim 1, wherein the loss function is:

L = (1/N) Σ_{n=1}^{N} w(Q_n) [ (Q_n − Q̂_n)² + (1/T_n) Σ_{t=1}^{T_n} (Q_n − q̂_{n,t})² ]

where w(Q_n) is a weighting factor expressed as a function of the sentence-level PESQ score; Q_n and Q̂_n are the real and predicted sentence-level PESQ scores respectively; N is the total number of training sentences; T_n is the number of frames in the n-th sentence of speech; and q̂_{n,t} is the frame-level PESQ prediction for the t-th frame of the n-th sentence.
7. The neural-network-based speech quality assessment method according to claim 1, wherein the audio conversion module converts audio into a specific format; and when the neural network module is trained in batches, the feature extraction module aligns the lengths of the speech in each batch, then performs the short-time Fourier transform and takes magnitudes, and then normalizes the spectra to generate batch features for input to the neural network module.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210004522.3A | 2022-01-05 | 2022-01-05 | Voice quality evaluation method based on neural network
Publications (1)

Publication Number | Publication Date
---|---
CN114360583A | 2022-04-15
Family ID: 81107481
Cited By (1)

Publication Number | Priority Date | Publication Date | Assignee | Title
---|---|---|---|---
CN115620748A | 2022-12-06 | 2023-01-17 | 北京远鉴信息技术有限公司 | Comprehensive training method and device for speech synthesis and false discrimination evaluation
Legal Events
- PB01: Publication