CN114360583A - Voice quality evaluation method based on neural network - Google Patents


Info

Publication number
CN114360583A
Authority
CN
China
Prior art keywords: neural network, layer, voice, module, time
Prior art date
Legal status
Pending
Application number
CN202210004522.3A
Other languages
Chinese (zh)
Inventor
卢晨华
黄志华
郭创建
Current Assignee
Xinjiang University
Original Assignee
Xinjiang University
Priority date
Filing date
Publication date
Application filed by Xinjiang University
Priority to CN202210004522.3A
Publication of CN114360583A
Legal status: Pending

Abstract

The invention discloses a speech quality evaluation method based on a neural network, which comprises: an audio conversion module for converting audio into a format that the neural network model can process; a noise addition module for generating noisy speech paired with the clean speech; a feature extraction module for extracting the features input to the neural network; a neural network module for evaluating the speech quality score corresponding to the input features; and a loss function for training the neural network. The method extracts time-frequency features of the speech and uses the neural network to evaluate the speech quality score. Clean reference speech is not required when performing the speech quality assessment.

Description

Voice quality evaluation method based on neural network
Technical Field
The invention relates to the technical field of audio, in particular to a voice quality evaluation method based on a neural network.
Background
Speech is the fastest and most efficient way for people to communicate in daily life. In real environments, however, speech signals are often corrupted by various kinds of noise, which degrades speech quality. Evaluating the speech quality of noisy and denoised signals therefore becomes important.
Speech signal quality evaluation methods fall mainly into two categories: subjective evaluation by human listeners and objective evaluation against a reference signal. Subjective evaluation is time-consuming and labor-intensive, incurs a large labor cost, and carries information-security risks; reference-based evaluation suits laboratory environments but is impractical in real-world scenarios, where paired reference signals are usually unavailable.
Moreover, in real life the speech data to be trained on and evaluated is stored in a variety of data formats, and not every processing method is compatible with every format.
Disclosure of Invention
In view of the above problems, the present invention provides a speech quality assessment method based on a neural network, the method comprising:
an audio conversion module, for converting the audio signals to be trained on and evaluated into a format that can be processed in the neural network module;
a noise addition module, connected with the audio conversion module, for adding noise to the converted clean speech to generate training data for the neural network model;
a feature extraction module, for extracting time-frequency features of the speech to be trained on or evaluated, for input into the neural network module;
a neural network module, connected to the output of the feature extraction module, for predicting the evaluation score corresponding to the input speech features;
and a loss function, for training the neural network.
In this scheme, the audio conversion module converts audio in different formats into the specific format used by the method, which improves the method's practicality. Clean speech is fed into the noise addition module to generate the corresponding noisy speech, which is labeled with the PESQ algorithm to produce training data for the neural network. The feature extraction module extracts features from the data to be trained on in batches and inputs them into the neural network module.
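By way of illustration, a minimal sketch of this training-data generation step, assuming the third-party soundfile and pesq Python packages; the SNR handling and function names are illustrative assumptions, not part of the disclosure:

```python
import numpy as np
import soundfile as sf
from pesq import pesq  # ITU-T P.862 implementation from the 'pesq' package

def make_training_pair(clean_path, noise, snr_db, fs=16000):
    """Mix noise into clean speech at a given SNR, then label the result with PESQ."""
    clean, sr = sf.read(clean_path)
    assert sr == fs, "audio is assumed to be converted to 16 kHz beforehand"
    noise = noise[:len(clean)]  # assumes the noise clip is at least as long as the speech
    # scale the noise so the mixture reaches the requested signal-to-noise ratio
    scale = np.sqrt(np.sum(clean ** 2) / (np.sum(noise ** 2) * 10 ** (snr_db / 10)))
    noisy = clean + scale * noise
    label = pesq(fs, clean, noisy, 'wb')  # wideband mode, range roughly 1.04 to 4.64
    return noisy, label
```

The 'wb' mode here corresponds to the 16 kHz wideband PESQ described in the embodiments below; 'nb' would select the narrowband mode.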
Preferably, the neural network module comprises a pooling layer, a grouped long short-term memory (LSTM) layer, a fully connected layer, a dropout layer, and the like.
The pooling layer is an adaptive average pooling layer, used to compress the feature dimension.
The grouped LSTM layer adopts a grouping strategy and a representation-rearrangement strategy, used to efficiently extract the contextual features along the time dimension and generate intermediate features.
Preferably, the grouping strategy of the grouped LSTM layer divides the input features and the hidden states into K groups, denoted $\{x_1, \ldots, x_K\}$ and $\{h_1, \ldots, h_K\}$ respectively (the group notation is reconstructed; the original formula images are not preserved). All hidden states are concatenated at the output layer. The representation-rearrangement strategy adds a dimension to the output feature, reshaping it to (K, N/K), where N denotes the feature dimension; a dimension exchange then converts it to (N/K, K); finally, the feature is reshaped back to N dimensions.
In this scheme, the grouping strategy reduces the complexity of the model, and the representation-rearrangement strategy recovers the cross-group contextual correlation that grouping would otherwise lose.
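A minimal PyTorch sketch of such a grouped LSTM layer with representation rearrangement (the class and parameter names are assumptions; the patent provides no code):

```python
import torch
import torch.nn as nn

class GroupedLSTM(nn.Module):
    """Grouped LSTM: K parallel small LSTMs plus a channel-shuffle rearrangement."""
    def __init__(self, input_size: int, hidden_size: int, groups: int):
        super().__init__()
        assert input_size % groups == 0 and hidden_size % groups == 0
        self.groups = groups
        self.lstms = nn.ModuleList(
            nn.LSTM(input_size // groups, hidden_size // groups, batch_first=True)
            for _ in range(groups)
        )

    def forward(self, x):                       # x: (B, T, N)
        chunks = x.chunk(self.groups, dim=-1)   # K groups of (B, T, N/K)
        outs = [lstm(c)[0] for lstm, c in zip(self.lstms, chunks)]
        y = torch.cat(outs, dim=-1)             # concatenate hidden states: (B, T, N)
        # rearrangement: (B,T,N) -> (B,T,K,N/K) -> (B,T,N/K,K) -> (B,T,N)
        B, T, N = y.shape
        K = self.groups
        return y.view(B, T, K, N // K).transpose(2, 3).reshape(B, T, N)
```

The rearrangement is the shuffle step that mixes information across groups, which is what recovers the cross-group context mentioned above.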
Preferably, when the parameters of the grouped LSTM layer are initialized, the forget gate bias of each LSTM is set to -3 and the other parameters are set to 0.
In this scheme, initializing the forget gate bias of each LSTM in this way makes the LSTM attend more closely to the context of adjacent time steps.
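One way to realize this initialization in PyTorch, assuming "other parameters" refers to the remaining bias terms (zeroing the weights as well would leave the network unable to learn):

```python
import torch
import torch.nn as nn

def init_forget_bias(lstm: nn.LSTM, value: float = -3.0):
    """Zero the LSTM bias vectors, then set the forget-gate slice of bias_ih.
    PyTorch packs gate parameters in [input, forget, cell, output] order."""
    H = lstm.hidden_size
    with torch.no_grad():
        for name, param in lstm.named_parameters():
            if "bias" in name:
                param.zero_()
        lstm.bias_ih_l0[H:2 * H] = value  # forget-gate bias

# e.g. applied to every LSTM inside the grouped layer sketched above:
# for lstm in grouped_lstm.lstms:
#     init_forget_bias(lstm)
```

A negative forget bias keeps the forget gate mostly closed at the start of training, so the cell state decays quickly and the unit initially emphasizes nearby time steps.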
The fully connected layer maps the intermediate features generated by the grouped LSTM layer to the training target.
The dropout layer mitigates the overfitting problem of the neural network.
Preferably, the loss function is a weighted loss over sentence-level and frame-level PESQ predictions (the original formula images are not preserved), wherein $w(Q_n)$ denotes a weighting factor that is a function of the sentence-level PESQ score; $Q_n$ and $\hat{Q}_n$ denote the real and predicted sentence-level PESQ scores, respectively; $N$ denotes the total number of training sentences; $T_n$ denotes the number of frames in the $n$-th utterance; and $\hat{q}_{n,t}$ denotes the frame-level PESQ prediction for the $t$-th frame of the $n$-th utterance.
In this scheme, the weighting coefficients of the loss function are symmetric, which allows the neural network model to achieve a better prediction effect.
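Only the symbol definitions survive the lost formula images. A weighted sentence- and frame-level form consistent with those definitions (stated here as an assumption, not as the original equation) would be:

$$\mathcal{L} = \frac{1}{N}\sum_{n=1}^{N} w(Q_n)\left[\left(\hat{Q}_n - Q_n\right)^2 + \frac{1}{T_n}\sum_{t=1}^{T_n}\left(\hat{q}_{n,t} - Q_n\right)^2\right]$$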
Preferably, when training the neural network module in batches, the feature extraction module aligns the lengths of the utterances to be processed within each batch, then performs a short-time Fourier transform and takes magnitudes, and then normalizes the spectra to generate the batch features input into the neural network module.
In this scheme, by aligning the utterance lengths and normalizing the magnitude spectra, the feature extraction module allows the neural network module to be trained more effectively and improves the generalization of the trained model.
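A sketch of this batch pipeline in PyTorch; the FFT size, hop length, and per-utterance mean/variance normalization are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def batch_features(waves, n_fft=512, hop=256):
    """Zero-pad each utterance to the longest in the batch, then
    STFT -> magnitude -> per-utterance normalization."""
    T = max(w.numel() for w in waves)
    x = torch.stack([F.pad(w, (0, T - w.numel())) for w in waves])    # (B, T)
    spec = torch.stft(x, n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    mag = spec.abs()                                                   # (B, F, T')
    mag = (mag - mag.mean(dim=(1, 2), keepdim=True)) / \
          (mag.std(dim=(1, 2), keepdim=True) + 1e-8)
    return mag.transpose(1, 2)                                         # (B, T', F) for the LSTM
```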
Drawings
To aid understanding of the technical solutions of the embodiments of the present invention, the drawings are described here; the drawings form a part of the present application and do not limit the embodiments of the present invention.
Fig. 1 is a flowchart of a method for training a neural network module for speech quality assessment according to an embodiment of the present invention.
Fig. 2 is a flowchart of a speech quality assessment method according to an embodiment of the present invention.
Fig. 3 is a flowchart of the operation of the feature extraction module according to an embodiment of the present invention.
Fig. 4 is a block diagram of a neural network module according to an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention, and the described embodiments are merely some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Referring to fig. 1, a neural network training method for speech quality assessment according to an embodiment of the present invention includes the following steps:
In step S11, clean speech is input into the audio format conversion module, which generates audio data in the specific format used by the method for training the neural network module.
In this embodiment, the audio format conversion module converts the incoming clean speech into single-channel ".wav" data at a 16 kHz sampling rate.
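For example, with the librosa and soundfile packages (an assumed toolchain; the embodiment does not name one):

```python
import librosa
import soundfile as sf

def to_wav_16k_mono(src_path, dst_path):
    """Resample any supported audio file to 16 kHz mono and save it as '.wav'."""
    audio, _ = librosa.load(src_path, sr=16000, mono=True)
    sf.write(dst_path, audio, 16000)
```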
In step S12, noise is added to the specific-format audio output by the audio format conversion module to generate paired noisy speech.
In this embodiment, random noise addition is performed with multiple noise types and multiple signal-to-noise ratios, generating noisy speech under a variety of noise conditions.
In step S13, the PESQ value of the noisy speech is computed against the clean speech to label the noisy speech.
In step S14, the noisy speech is input into the feature extraction module for time-frequency feature extraction.
In step S15, the time-frequency features generated by the feature extraction module are input into the neural network module to predict the PESQ value.
In this embodiment, the PESQ value uses the 16 kHz wideband mode, whose range is 1.04 to 4.64; alternatively, the 16 kHz narrowband mode, whose range is -0.5 to 4.5, can be selected.
In step S16, the value predicted by the neural network module and the labeled data are input into the loss function for further learning by the neural network module.
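Putting steps S14 to S16 together, a hedged sketch of one training step, assuming the loss form given earlier and a model that outputs both frame-level and utterance-level scores (see the module sketch under fig. 4 below); the weighting function w is left abstract because its definition is not preserved:

```python
import torch

def train_step(model, optimizer, feats, pesq_labels, frame_counts, w):
    """One training step: predict frame- and sentence-level scores, apply the assumed weighted loss."""
    model.train()
    frame_pred, utt_pred = model(feats)          # (B, T), (B,)
    loss = 0.0
    for n in range(feats.size(0)):
        T_n = frame_counts[n]                    # ignore zero-padded frames
        q = pesq_labels[n]
        frame_term = ((frame_pred[n, :T_n] - q) ** 2).mean()
        loss = loss + w(q) * ((utt_pred[n] - q) ** 2 + frame_term)
    loss = loss / feats.size(0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```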
Referring to fig. 2, a speech quality evaluation process according to an embodiment of the present invention includes the following steps:
In step S21, the speech to be evaluated passes through the audio format conversion module to generate audio data in the specific format used by the method.
In step S22, the format-converted audio data passes through the feature extraction module to generate time-frequency features.
In step S23, the time-frequency features generated by the feature extraction module are input into the neural network module to generate the corresponding evaluation score.
Referring to fig. 3, a workflow of a feature extraction module according to an embodiment of the present invention includes the following steps:
and step S31, in order to train the neural network module better, dividing the voices to be processed into batches, finding out the longest voice in each batch, and performing zero filling alignment on other voices according to the length of the longest voice.
And step S32, sequentially carrying out short-time Fourier transform on the zero-filling aligned voice to generate time-frequency characteristics.
And step S33, amplitude values of the generated time-frequency characteristics are taken to generate a magnitude spectrum.
In step S34, the generated amplitude spectrum is normalized.
And step S35, inputting the batch quantities of the normalized amplitude spectrum into a neural network module.
Further, in some embodiments, in addition to the magnitude spectrum, the feature extraction module may extract other features, such as the zero-crossing rate, the log power spectrum, and mel-frequency cepstral coefficients, for use by the neural network module.
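These additional features could be extracted with librosa, for instance (an assumed toolchain; parameter values are illustrative):

```python
import librosa
import numpy as np

def extra_features(wave, fs=16000, n_fft=512, hop=256):
    """Zero-crossing rate, log power spectrum, and MFCCs as optional extra inputs."""
    zcr = librosa.feature.zero_crossing_rate(wave, frame_length=n_fft, hop_length=hop)
    power = np.abs(librosa.stft(wave, n_fft=n_fft, hop_length=hop)) ** 2
    log_power = librosa.power_to_db(power)
    mfcc = librosa.feature.mfcc(y=wave, sr=fs, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    return zcr, log_power, mfcc
```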
Referring to fig. 4, the structure of a neural network module according to an embodiment of the present invention involves the following steps:
In step S41, the audio features are compressed in the frequency dimension by pooling layer 1 to a dimensionality suitable for the grouping performed by the grouped LSTM layer.
In step S42, the outputs of pooling layer 1 are grouped and fed into the corresponding LSTM networks, and the outputs of these LSTM networks are finally concatenated.
In step S43, the output of the grouped LSTM layer is down-sampled using fully connected layer 1.
In step S44, dropout is applied to the features output by fully connected layer 1 to keep the network model from overfitting.
In this embodiment, the number of groups in the grouped LSTM network is an even number, and the dropout probability of the dropout layer is set to 0.3.
In step S45, the features passed through the dropout layer are input into fully connected layer 2, which performs a down-sampling operation and outputs frame-level evaluation scores.
In step S46, the output of fully connected layer 2 is down-sampled by pooling layer 2 into a 1 × 1 vector and output.
In this embodiment, the pooling layer is an adaptive average pooling layer.
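Combining steps S41 to S46 with the GroupedLSTM sketch above gives a module outline like the following; all layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SpeechQualityNet(nn.Module):
    """Sketch of the described stack: pool -> grouped LSTM -> FC -> dropout -> FC -> pool."""
    def __init__(self, pooled=128, hidden=128, groups=4):
        super().__init__()
        self.pool1 = nn.AdaptiveAvgPool1d(pooled)   # S41: compress the frequency dimension
        self.glstm = GroupedLSTM(pooled, hidden, groups)  # S42
        self.fc1 = nn.Linear(hidden, 64)            # S43: down-sample features
        self.drop = nn.Dropout(0.3)                 # S44
        self.fc2 = nn.Linear(64, 1)                 # S45: frame-level score
        self.pool2 = nn.AdaptiveAvgPool1d(1)        # S46: frames -> one utterance score

    def forward(self, x):                           # x: (B, T, F)
        x = self.pool1(x)                           # pools the last (frequency) dim
        x = self.glstm(x)
        frame = self.fc2(self.drop(torch.relu(self.fc1(x)))).squeeze(-1)  # (B, T)
        utt = self.pool2(frame.unsqueeze(1)).squeeze(-1).squeeze(-1)      # (B,)
        return frame, utt
```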
To demonstrate the effectiveness and feasibility of this embodiment, a model using a bidirectional long short-term memory layer was taken as the comparison model.
The comparison model uses a loss function of the same form, with the weighting factor defined differently (the original formula images for this definition are not preserved).
the experimental results are shown in table one: BLSTM represents a model that employs a bidirectional long-short-time memory layer; GLSTM represents a model using the packet long-and-short memory layer described in this example; the two models both adopt the same loss function; GLSTM + loss represents the model of the grouping long-time memory layer and the loss function provided by the example; testing the noisy speech using 4900 different noise conditions, wherein the noise type is different from the noise type used by the training set; the experiment was evaluated from three indices: mean Square Error (MSE), Linear Correlation Coefficient (LCC), Spearman Rank Correlation Coefficient (SRCC), wherein a lower MSE value indicates a smaller error between the estimated value and the true value; the range of LCC, SRCC is: 0-1, closer to 1 indicates a higher correlation of the evaluation value with the true value. The experimental result shows that the method has a certain gain in the evaluation of the voice quality.
Table 1. Experimental comparison.

Model        MSE      LCC      SRCC
BLSTM        0.1257   0.9182   0.9252
GLSTM        0.0693   0.9554   0.9589
GLSTM+loss   0.0601   0.9617   0.9626
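For reference, the three indices can be computed as follows (a sketch using numpy and scipy):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(pred, true):
    """MSE (lower is better); LCC and SRCC (closer to 1 is better)."""
    pred, true = np.asarray(pred), np.asarray(true)
    mse = np.mean((pred - true) ** 2)
    lcc = pearsonr(pred, true)[0]     # linear correlation coefficient
    srcc = spearmanr(pred, true)[0]   # Spearman rank correlation coefficient
    return mse, lcc, srcc
```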
The above embodiments further illustrate the objects, technical solutions and advantages of the present invention. They are only preferred embodiments of the present invention and should not be construed as limiting it; any modifications, improvements and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (7)

1. A speech quality assessment method based on a neural network, characterized by comprising:
an audio conversion module, for converting the audio signals to be trained on and evaluated into a format that can be processed in the neural network module;
a noise addition module, for adding noise to clean speech to generate training data for the neural network model;
a feature extraction module, for extracting time-frequency features of the speech for input into the neural network module;
a neural network module, for predicting the evaluation score corresponding to the input speech features;
and a loss function, for training the neural network.
2. The neural network-based speech quality assessment method as claimed in claim 1, characterized in that the neural network module comprises a pooling layer, a grouped long short-term memory (grouped LSTM) layer, a fully connected layer, a dropout layer, and the like.
3. The neural network-based speech quality assessment method according to claim 2, characterized in that the pooling layer is an adaptive average pooling layer for compressing feature dimensions;
the grouped LSTM layer adopts a grouping strategy and a representation-rearrangement strategy for efficiently extracting contextual features along the time dimension to generate intermediate features;
the fully connected layer maps the intermediate features generated by the grouped LSTM layer to the training target;
the dropout layer mitigates the overfitting problem of the neural network.
4. The neural network-based speech quality assessment method according to claim 2, characterized in that the grouped LSTM layer adopts a grouping strategy and a representation-rearrangement strategy;
the grouping strategy divides the input features and the hidden states into K groups, denoted $\{x_1, \ldots, x_K\}$ and $\{h_1, \ldots, h_K\}$ respectively (the original formula images are not preserved), and all hidden states are concatenated at the output layer;
the representation-rearrangement strategy adds a dimension to the output feature, reshaping it to (K, N/K), where N denotes the feature dimension; a dimension exchange then converts it to (N/K, K); finally, the feature is reshaped back to N dimensions.
5. The neural network-based speech quality assessment method according to claim 2, characterized in that the forget gate bias in the parameters of each long short-term memory (LSTM) unit is initialized to -3, and the other parameters are initialized to 0.
6. The neural network-based speech quality assessment method according to claim 1, characterized in that the loss function is a weighted loss over sentence-level and frame-level PESQ (Perceptual Evaluation of Speech Quality) predictions (the original formula images are not preserved), wherein $w(Q_n)$ denotes a weighting factor that is a function of the sentence-level PESQ score; $Q_n$ and $\hat{Q}_n$ denote the real and predicted sentence-level PESQ scores, respectively; $N$ denotes the total number of training sentences; $T_n$ denotes the number of frames in the $n$-th utterance; and $\hat{q}_{n,t}$ denotes the frame-level PESQ prediction for the $t$-th frame of the $n$-th utterance.
7. The neural network-based speech quality assessment method according to claim 1, characterized in that the audio conversion module converts audio into a specific format; and when training the neural network module in batches, the feature extraction module aligns the lengths of the utterances to be processed within each batch, then performs a short-time Fourier transform and takes magnitudes, and then normalizes the spectra to generate the batch features input into the neural network module.
CN202210004522.3A 2022-01-05 2022-01-05 Voice quality evaluation method based on neural network Pending CN114360583A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210004522.3A CN114360583A (en) 2022-01-05 2022-01-05 Voice quality evaluation method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210004522.3A CN114360583A (en) 2022-01-05 2022-01-05 Voice quality evaluation method based on neural network

Publications (1)

Publication Number Publication Date
CN114360583A 2022-04-15

Family

ID=81107481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210004522.3A Pending CN114360583A (en) 2022-01-05 2022-01-05 Voice quality evaluation method based on neural network

Country Status (1)

Country Link
CN (1) CN114360583A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620748A (en) * 2022-12-06 2023-01-17 北京远鉴信息技术有限公司 Comprehensive training method and device for speech synthesis and false discrimination evaluation



Legal Events

Date Code Title Description
PB01 Publication