CN114360583A - Voice quality evaluation method based on neural network - Google Patents


Info

Publication number
CN114360583A
Authority
CN
China
Prior art keywords: neural network, layer, voice, module, time
Prior art date
Legal status
Pending
Application number
CN202210004522.3A
Other languages
Chinese (zh)
Inventor
卢晨华
黄志华
郭创建
Current Assignee
Xinjiang University
Original Assignee
Xinjiang University
Priority date
Filing date
Publication date
Application filed by Xinjiang University
Priority to CN202210004522.3A
Publication of CN114360583A
Legal status: Pending

Abstract

The invention discloses a speech quality evaluation method based on a neural network, which comprises: an audio conversion module for converting audio into a format that the neural network model can process; a noise addition module for generating noisy speech paired with the clean speech; a feature extraction module for extracting the features input to the neural network; a neural network module for evaluating the speech quality score corresponding to the input features; and a loss function for training the neural network. The method extracts time-frequency features of the speech and uses the neural network to evaluate the speech quality score. Clean reference speech is not required when performing the speech quality assessment.

Description

Voice quality evaluation method based on neural network
Technical Field
The invention relates to the technical field of audio, in particular to a voice quality evaluation method based on a neural network.
Background
Speech is the fastest and most efficient way for people to communicate in daily life. In real environments, however, speech signals are often corrupted by various kinds of noise, which degrades speech quality. Evaluating the speech quality of noisy and denoised signals therefore becomes important.
Speech signal quality evaluation methods fall mainly into two categories: subjective evaluation by human listeners and objective evaluation against a reference signal. Subjective evaluation is time-consuming and labor-intensive, incurs a large labor cost, and carries information-security risks; reference-based evaluation suits laboratory environments but is impractical in real-world scenarios, where paired reference signals are usually unavailable.
Moreover, in real life the speech data to be trained on and evaluated is stored in a variety of data formats, and not every processing method is compatible with every format.
Disclosure of Invention
In view of the above problems, the present invention provides a speech quality assessment method based on a neural network, the method comprising:
an audio conversion module, for converting the audio signals to be trained on and evaluated into a format that can be processed in the neural network module;
a noise addition module, connected with the audio conversion module, for adding noise to the converted clean speech to generate training data for the neural network model;
a feature extraction module, for extracting time-frequency features of the speech to be trained on or evaluated, for input into the neural network module;
a neural network module, connected to the output of the feature extraction module, for predicting the evaluation score corresponding to the input speech features;
and a loss function, for training the neural network.
In this scheme, the audio conversion module converts audio in different formats into the specific format used by the method, which improves the method's practicality. Clean speech is fed into the noise addition module to generate the corresponding noisy speech, which is labeled with the PESQ algorithm to produce training data for the neural network. The feature extraction module extracts features from the data to be trained on in batches and inputs them into the neural network module.
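By way of illustration, a minimal sketch of this training-data generation step, assuming the third-party soundfile and pesq Python packages; the SNR handling and function names are illustrative assumptions, not part of the disclosure:

```python
import numpy as np
import soundfile as sf
from pesq import pesq  # ITU-T P.862 implementation from the 'pesq' package

def make_training_pair(clean_path, noise, snr_db, fs=16000):
    """Mix noise into clean speech at a given SNR, then label the result with PESQ."""
    clean, sr = sf.read(clean_path)
    assert sr == fs, "audio is assumed to be converted to 16 kHz beforehand"
    noise = noise[:len(clean)]  # assumes the noise clip is at least as long as the speech
    # scale the noise so the mixture reaches the requested signal-to-noise ratio
    scale = np.sqrt(np.sum(clean ** 2) / (np.sum(noise ** 2) * 10 ** (snr_db / 10)))
    noisy = clean + scale * noise
    label = pesq(fs, clean, noisy, 'wb')  # wideband mode, range roughly 1.04 to 4.64
    return noisy, label
```

The 'wb' mode here corresponds to the 16 kHz wideband PESQ described in the embodiments below; 'nb' would select the narrowband mode.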
Preferably, the neural network module comprises a pooling layer, a grouped long short-term memory (LSTM) layer, a fully connected layer, a dropout layer, and the like.
The pooling layer is an adaptive average pooling layer, used to compress the feature dimension.
The grouped LSTM layer adopts a grouping strategy and a representation-rearrangement strategy, used to efficiently extract the contextual features along the time dimension and generate intermediate features.
Preferably, the grouping strategy of the grouped LSTM layer divides the input features and the hidden states into K groups, denoted $\{x_1, \ldots, x_K\}$ and $\{h_1, \ldots, h_K\}$ respectively (the group notation is reconstructed; the original formula images are not preserved). All hidden states are concatenated at the output layer. The representation-rearrangement strategy adds a dimension to the output feature, reshaping it to (K, N/K), where N denotes the feature dimension; a dimension exchange then converts it to (N/K, K); finally, the feature is reshaped back to N dimensions.
In this scheme, the grouping strategy reduces the complexity of the model, and the representation-rearrangement strategy recovers the cross-group contextual correlation that grouping would otherwise lose.
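A minimal PyTorch sketch of such a grouped LSTM layer with representation rearrangement (the class and parameter names are assumptions; the patent provides no code):

```python
import torch
import torch.nn as nn

class GroupedLSTM(nn.Module):
    """Grouped LSTM: K parallel small LSTMs plus a channel-shuffle rearrangement."""
    def __init__(self, input_size: int, hidden_size: int, groups: int):
        super().__init__()
        assert input_size % groups == 0 and hidden_size % groups == 0
        self.groups = groups
        self.lstms = nn.ModuleList(
            nn.LSTM(input_size // groups, hidden_size // groups, batch_first=True)
            for _ in range(groups)
        )

    def forward(self, x):                       # x: (B, T, N)
        chunks = x.chunk(self.groups, dim=-1)   # K groups of (B, T, N/K)
        outs = [lstm(c)[0] for lstm, c in zip(self.lstms, chunks)]
        y = torch.cat(outs, dim=-1)             # concatenate hidden states: (B, T, N)
        # rearrangement: (B,T,N) -> (B,T,K,N/K) -> (B,T,N/K,K) -> (B,T,N)
        B, T, N = y.shape
        K = self.groups
        return y.view(B, T, K, N // K).transpose(2, 3).reshape(B, T, N)
```

The rearrangement is the shuffle step that mixes information across groups, which is what recovers the cross-group context mentioned above.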
Preferably, when the parameters of the grouped LSTM layer are initialized, the forget gate bias of each LSTM is set to -3 and the other parameters are set to 0.
In this scheme, initializing the forget gate bias of each LSTM in this way makes the LSTM attend more closely to the context of adjacent time steps.
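One way to realize this initialization in PyTorch, assuming "other parameters" refers to the remaining bias terms (zeroing the weights as well would leave the network unable to learn):

```python
import torch
import torch.nn as nn

def init_forget_bias(lstm: nn.LSTM, value: float = -3.0):
    """Zero the LSTM bias vectors, then set the forget-gate slice of bias_ih.
    PyTorch packs gate parameters in [input, forget, cell, output] order."""
    H = lstm.hidden_size
    with torch.no_grad():
        for name, param in lstm.named_parameters():
            if "bias" in name:
                param.zero_()
        lstm.bias_ih_l0[H:2 * H] = value  # forget-gate bias

# e.g. applied to every LSTM inside the grouped layer sketched above:
# for lstm in grouped_lstm.lstms:
#     init_forget_bias(lstm)
```

A negative forget bias keeps the forget gate mostly closed at the start of training, so the cell state decays quickly and the unit initially emphasizes nearby time steps.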
The fully connected layer maps the intermediate features generated by the grouped LSTM layer to the training target.
The dropout layer mitigates the overfitting problem of the neural network.
Preferably, the loss function is a weighted loss over sentence-level and frame-level PESQ predictions (the original formula images are not preserved), wherein $w(Q_n)$ denotes a weighting factor that is a function of the sentence-level PESQ score; $Q_n$ and $\hat{Q}_n$ denote the real and predicted sentence-level PESQ scores, respectively; $N$ denotes the total number of training sentences; $T_n$ denotes the number of frames in the $n$-th utterance; and $\hat{q}_{n,t}$ denotes the frame-level PESQ prediction for the $t$-th frame of the $n$-th utterance.
In this scheme, the weighting coefficients of the loss function are symmetric, which allows the neural network model to achieve a better prediction effect.
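Only the symbol definitions survive the lost formula images. A weighted sentence- and frame-level form consistent with those definitions (stated here as an assumption, not as the original equation) would be:

$$\mathcal{L} = \frac{1}{N}\sum_{n=1}^{N} w(Q_n)\left[\left(\hat{Q}_n - Q_n\right)^2 + \frac{1}{T_n}\sum_{t=1}^{T_n}\left(\hat{q}_{n,t} - Q_n\right)^2\right]$$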
Preferably, when training the neural network module in batches, the feature extraction module aligns the lengths of the utterances to be processed within each batch, then performs a short-time Fourier transform and takes magnitudes, and then normalizes the spectra to generate the batch features input into the neural network module.
In this scheme, by aligning the utterance lengths and normalizing the magnitude spectra, the feature extraction module allows the neural network module to be trained more effectively and improves the generalization of the trained model.
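A sketch of this batch pipeline in PyTorch; the FFT size, hop length, and per-utterance mean/variance normalization are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def batch_features(waves, n_fft=512, hop=256):
    """Zero-pad each utterance to the longest in the batch, then
    STFT -> magnitude -> per-utterance normalization."""
    T = max(w.numel() for w in waves)
    x = torch.stack([F.pad(w, (0, T - w.numel())) for w in waves])    # (B, T)
    spec = torch.stft(x, n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    mag = spec.abs()                                                   # (B, F, T')
    mag = (mag - mag.mean(dim=(1, 2), keepdim=True)) / \
          (mag.std(dim=(1, 2), keepdim=True) + 1e-8)
    return mag.transpose(1, 2)                                         # (B, T', F) for the LSTM
```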
Drawings
To aid understanding of the technical solutions of the embodiments of the present invention, the drawings are described here; the drawings form a part of the present application and do not limit the embodiments of the present invention.
Fig. 1 is a flowchart of a method for training a neural network module for speech quality assessment according to an embodiment of the present invention.
Fig. 2 is a flowchart of a speech quality assessment method according to an embodiment of the present invention.
Fig. 3 is a flowchart of the operation of the feature extraction module according to an embodiment of the present invention.
Fig. 4 is a block diagram of a neural network module according to an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention, and the described embodiments are merely some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Referring to fig. 1, a neural network training method for speech quality assessment according to an embodiment of the present invention includes the following steps:
In step S11, clean speech is input into the audio format conversion module, which generates audio data in the specific format used by the method for training the neural network module.
In this embodiment, the audio format conversion module converts the incoming clean speech into single-channel ".wav" data at a 16 kHz sampling rate.
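For example, with the librosa and soundfile packages (an assumed toolchain; the embodiment does not name one):

```python
import librosa
import soundfile as sf

def to_wav_16k_mono(src_path, dst_path):
    """Resample any supported audio file to 16 kHz mono and save it as '.wav'."""
    audio, _ = librosa.load(src_path, sr=16000, mono=True)
    sf.write(dst_path, audio, 16000)
```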
In step S12, noise is added to the specific-format audio output by the audio format conversion module to generate paired noisy speech.
In this embodiment, random noise addition is performed with multiple noise types and multiple signal-to-noise ratios, generating noisy speech under a variety of noise conditions.
In step S13, the PESQ value of the noisy speech is computed against the clean speech to label the noisy speech.
In step S14, the noisy speech is input into the feature extraction module for time-frequency feature extraction.
In step S15, the time-frequency features generated by the feature extraction module are input into the neural network module to predict the PESQ value.
In this embodiment, the PESQ value uses the 16 kHz wideband mode, whose range is 1.04 to 4.64; alternatively, the 16 kHz narrowband mode, whose range is -0.5 to 4.5, can be selected.
In step S16, the value predicted by the neural network module and the labeled data are input into the loss function for further learning by the neural network module.
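Putting steps S14 to S16 together, a hedged sketch of one training step, assuming the loss form given earlier and a model that outputs both frame-level and utterance-level scores (see the module sketch under fig. 4 below); the weighting function w is left abstract because its definition is not preserved:

```python
import torch

def train_step(model, optimizer, feats, pesq_labels, frame_counts, w):
    """One training step: predict frame- and sentence-level scores, apply the assumed weighted loss."""
    model.train()
    frame_pred, utt_pred = model(feats)          # (B, T), (B,)
    loss = 0.0
    for n in range(feats.size(0)):
        T_n = frame_counts[n]                    # ignore zero-padded frames
        q = pesq_labels[n]
        frame_term = ((frame_pred[n, :T_n] - q) ** 2).mean()
        loss = loss + w(q) * ((utt_pred[n] - q) ** 2 + frame_term)
    loss = loss / feats.size(0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```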
Referring to fig. 2, a speech quality evaluation process according to an embodiment of the present invention includes the following steps:
In step S21, the speech to be evaluated passes through the audio format conversion module to generate audio data in the specific format used by the method.
In step S22, the format-converted audio data passes through the feature extraction module to generate time-frequency features.
In step S23, the time-frequency features generated by the feature extraction module are input into the neural network module to generate the corresponding evaluation score.
Referring to fig. 3, a workflow of a feature extraction module according to an embodiment of the present invention includes the following steps:
and step S31, in order to train the neural network module better, dividing the voices to be processed into batches, finding out the longest voice in each batch, and performing zero filling alignment on other voices according to the length of the longest voice.
And step S32, sequentially carrying out short-time Fourier transform on the zero-filling aligned voice to generate time-frequency characteristics.
And step S33, amplitude values of the generated time-frequency characteristics are taken to generate a magnitude spectrum.
In step S34, the generated amplitude spectrum is normalized.
And step S35, inputting the batch quantities of the normalized amplitude spectrum into a neural network module.
Further, in some embodiments, in addition to the magnitude spectrum, the feature extraction module may extract other features, such as the zero-crossing rate, the log power spectrum, and mel-frequency cepstral coefficients, for use by the neural network module.
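These additional features could be extracted with librosa, for instance (an assumed toolchain; parameter values are illustrative):

```python
import librosa
import numpy as np

def extra_features(wave, fs=16000, n_fft=512, hop=256):
    """Zero-crossing rate, log power spectrum, and MFCCs as optional extra inputs."""
    zcr = librosa.feature.zero_crossing_rate(wave, frame_length=n_fft, hop_length=hop)
    power = np.abs(librosa.stft(wave, n_fft=n_fft, hop_length=hop)) ** 2
    log_power = librosa.power_to_db(power)
    mfcc = librosa.feature.mfcc(y=wave, sr=fs, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    return zcr, log_power, mfcc
```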
Referring to fig. 4, the structure of a neural network module according to an embodiment of the present invention involves the following steps:
In step S41, the audio features are compressed in the frequency dimension by pooling layer 1 to a dimensionality suitable for the grouping performed by the grouped LSTM layer.
In step S42, the outputs of pooling layer 1 are grouped and fed into the corresponding LSTM networks, and the outputs of these LSTM networks are finally concatenated.
In step S43, the output of the grouped LSTM layer is down-sampled using fully connected layer 1.
In step S44, dropout is applied to the features output by fully connected layer 1 to keep the network model from overfitting.
In this embodiment, the number of groups in the grouped LSTM network is an even number, and the dropout probability of the dropout layer is set to 0.3.
In step S45, the features passed through the dropout layer are input into fully connected layer 2, which performs a down-sampling operation and outputs frame-level evaluation scores.
In step S46, the output of fully connected layer 2 is down-sampled by pooling layer 2 into a 1 × 1 vector and output.
In this embodiment, the pooling layer is an adaptive average pooling layer.
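Combining steps S41 to S46 with the GroupedLSTM sketch above gives a module outline like the following; all layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SpeechQualityNet(nn.Module):
    """Sketch of the described stack: pool -> grouped LSTM -> FC -> dropout -> FC -> pool."""
    def __init__(self, pooled=128, hidden=128, groups=4):
        super().__init__()
        self.pool1 = nn.AdaptiveAvgPool1d(pooled)   # S41: compress the frequency dimension
        self.glstm = GroupedLSTM(pooled, hidden, groups)  # S42
        self.fc1 = nn.Linear(hidden, 64)            # S43: down-sample features
        self.drop = nn.Dropout(0.3)                 # S44
        self.fc2 = nn.Linear(64, 1)                 # S45: frame-level score
        self.pool2 = nn.AdaptiveAvgPool1d(1)        # S46: frames -> one utterance score

    def forward(self, x):                           # x: (B, T, F)
        x = self.pool1(x)                           # pools the last (frequency) dim
        x = self.glstm(x)
        frame = self.fc2(self.drop(torch.relu(self.fc1(x)))).squeeze(-1)  # (B, T)
        utt = self.pool2(frame.unsqueeze(1)).squeeze(-1).squeeze(-1)      # (B,)
        return frame, utt
```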
To demonstrate the effectiveness and feasibility of this embodiment, a model using a bidirectional long short-term memory layer was taken as the comparison model.
The comparison model uses a loss function of the same form, with the weighting factor defined differently (the original formula images for this definition are not preserved).
the experimental results are shown in table one: BLSTM represents a model that employs a bidirectional long-short-time memory layer; GLSTM represents a model using the packet long-and-short memory layer described in this example; the two models both adopt the same loss function; GLSTM + loss represents the model of the grouping long-time memory layer and the loss function provided by the example; testing the noisy speech using 4900 different noise conditions, wherein the noise type is different from the noise type used by the training set; the experiment was evaluated from three indices: mean Square Error (MSE), Linear Correlation Coefficient (LCC), Spearman Rank Correlation Coefficient (SRCC), wherein a lower MSE value indicates a smaller error between the estimated value and the true value; the range of LCC, SRCC is: 0-1, closer to 1 indicates a higher correlation of the evaluation value with the true value. The experimental result shows that the method has a certain gain in the evaluation of the voice quality.
Table 1. Experimental comparison.

Model        MSE      LCC      SRCC
BLSTM        0.1257   0.9182   0.9252
GLSTM        0.0693   0.9554   0.9589
GLSTM+loss   0.0601   0.9617   0.9626
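For reference, the three indices can be computed as follows (a sketch using numpy and scipy):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(pred, true):
    """MSE (lower is better); LCC and SRCC (closer to 1 is better)."""
    pred, true = np.asarray(pred), np.asarray(true)
    mse = np.mean((pred - true) ** 2)
    lcc = pearsonr(pred, true)[0]     # linear correlation coefficient
    srcc = spearmanr(pred, true)[0]   # Spearman rank correlation coefficient
    return mse, lcc, srcc
```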
The above embodiments further illustrate the objects, technical solutions and advantages of the present invention. They are only preferred embodiments of the present invention and should not be construed as limiting it; any modifications, improvements and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (7)

1. A speech quality assessment method based on a neural network, characterized by comprising:
an audio conversion module, for converting the audio signals to be trained on and evaluated into a format that can be processed in the neural network module;
a noise addition module, for adding noise to clean speech to generate training data for the neural network model;
a feature extraction module, for extracting time-frequency features of the speech for input into the neural network module;
a neural network module, for predicting the evaluation score corresponding to the input speech features;
and a loss function, for training the neural network.
2. The neural network-based speech quality assessment method as claimed in claim 1, characterized in that the neural network module comprises a pooling layer, a grouped long short-term memory (grouped LSTM) layer, a fully connected layer, a dropout layer, and the like.
3. The neural network-based speech quality assessment method according to claim 2, characterized in that the pooling layer is an adaptive average pooling layer for compressing feature dimensions;
the grouped LSTM layer adopts a grouping strategy and a representation-rearrangement strategy for efficiently extracting contextual features along the time dimension to generate intermediate features;
the fully connected layer maps the intermediate features generated by the grouped LSTM layer to the training target;
the dropout layer mitigates the overfitting problem of the neural network.
4. The neural network-based speech quality assessment method according to claim 2, characterized in that the grouped LSTM layer adopts a grouping strategy and a representation-rearrangement strategy;
the grouping strategy divides the input features and the hidden states into K groups, denoted $\{x_1, \ldots, x_K\}$ and $\{h_1, \ldots, h_K\}$ respectively (the original formula images are not preserved), and all hidden states are concatenated at the output layer;
the representation-rearrangement strategy adds a dimension to the output feature, reshaping it to (K, N/K), where N denotes the feature dimension; a dimension exchange then converts it to (N/K, K); finally, the feature is reshaped back to N dimensions.
5. The neural network-based speech quality assessment method according to claim 2, characterized in that the forget gate bias in the parameters of each long short-term memory (LSTM) unit is initialized to -3, and the other parameters are initialized to 0.
6. The neural network-based speech quality assessment method according to claim 1, characterized in that the loss function is a weighted loss over sentence-level and frame-level PESQ (Perceptual Evaluation of Speech Quality) predictions (the original formula images are not preserved), wherein $w(Q_n)$ denotes a weighting factor that is a function of the sentence-level PESQ score; $Q_n$ and $\hat{Q}_n$ denote the real and predicted sentence-level PESQ scores, respectively; $N$ denotes the total number of training sentences; $T_n$ denotes the number of frames in the $n$-th utterance; and $\hat{q}_{n,t}$ denotes the frame-level PESQ prediction for the $t$-th frame of the $n$-th utterance.
7. The neural network-based speech quality assessment method according to claim 1, characterized in that the audio conversion module converts audio into a specific format; and when training the neural network module in batches, the feature extraction module aligns the lengths of the utterances to be processed within each batch, then performs a short-time Fourier transform and takes magnitudes, and then normalizes the spectra to generate the batch features input into the neural network module.
CN202210004522.3A 2022-01-05 2022-01-05 Voice quality evaluation method based on neural network Pending CN114360583A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210004522.3A CN114360583A (en) 2022-01-05 2022-01-05 Voice quality evaluation method based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210004522.3A CN114360583A (en) 2022-01-05 2022-01-05 Voice quality evaluation method based on neural network

Publications (1)

Publication Number Publication Date
CN114360583A 2022-04-15

Family

ID=81107481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210004522.3A Pending CN114360583A (en) 2022-01-05 2022-01-05 Voice quality evaluation method based on neural network

Country Status (1)

Country Link
CN (1) CN114360583A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115620748A (en) * 2022-12-06 2023-01-17 北京远鉴信息技术有限公司 Comprehensive training method and device for speech synthesis and false discrimination evaluation



Legal Events

Date Code Title Description
PB01 Publication