CN113744749A

CN113744749A - Voice enhancement method and system based on psychoacoustic domain weighting loss function

Info

Publication number: CN113744749A
Application number: CN202111098146.0A
Authority: CN
Inventors: 贾海蓉; 梅淑琳; 吴永强; 申陈宁; 王鲜霞; 张雪英; 王峰
Original assignee: Taiyuan University of Technology
Current assignee: Taiyuan University of Technology
Priority date: 2021-09-18
Filing date: 2021-09-18
Publication date: 2021-12-03
Anticipated expiration: 2041-09-18
Also published as: CN113744749B

Abstract

The invention relates to a voice enhancement method and a system based on a weighted loss function of a psychoacoustic domain, belonging to the field of voice enhancement and comprising the steps of obtaining a training voice set and a test voice set; preprocessing a voice sample in a training voice set to obtain a voice characteristic with noise and a time-frequency masking value; inputting the voice characteristics with noise and the time-frequency masking value into a neural network voice enhancement model for training, and outputting a primary enhanced voice; simultaneously converting the preliminary enhanced voice and the pure voice to a psychoacoustics Bark domain, and calculating a loudness spectrum of the voice signal; calculating a speech distortion error and a residual noise error according to the loudness spectrum, and constructing a Bark domain weighting loss function according to the speech distortion error and the residual noise error; performing optimization training on the neural network speech enhancement model by adopting an error back propagation algorithm; the noisy speech in the test speech set is input into the optimized neural network speech enhancement model, reconstructed enhanced speech is output, and the tone quality and intelligibility of speech signals can be improved.

Description

Voice enhancement method and system based on psychoacoustic domain weighting loss function

Technical Field

The invention relates to the field of voice enhancement, in particular to a voice enhancement method and system based on a psychoacoustic domain weighting loss function.

Background

As a key technology for suppressing background noise and improving speech intelligibility, speech enhancement is often applied to the front end of intelligent equipment to improve the success rate of speech recognition of human-computer interaction. In recent years, neural networks based on deep learning have been successfully applied to the field of speech enhancement, and the speech quality of the neural network speech enhancement method after enhancement is remarkably improved compared with the traditional speech enhancement method. The speech enhancement algorithm can be broadly classified into two categories, unsupervised speech enhancement and supervised speech enhancement, depending on whether the training sample is known or not. Classical unsupervised speech enhancement algorithms include spectral subtraction, wiener filtering and speech enhancement methods based on minimum mean square error estimation. Although the algorithms are simple to implement and can suppress stationary noise, the estimation requirements on the noise spectrum are strict, the problems of serious speech distortion and excessive residual noise can be caused by overestimation and underestimation of the noise, the suppression performance on non-stationary noise is poor, and the phenomenon of speech distortion enhancement is more in the low signal-to-noise ratio environment. Therefore, the speech enhancement signal obtained by reconstruction of the existing speech enhancement method based on the neural network still has the problems of poor speech quality and unobvious enhancement effect caused by speech distortion and excessive residual noise.

Therefore, a more effective speech enhancement method is needed to solve the problems of poor speech quality and unobvious enhancement effect caused by excessive speech distortion and residual noise in the existing neural network speech enhancement method.

Disclosure of Invention

The invention aims to provide a speech enhancement method and a speech enhancement system based on a psychoacoustic domain weighting loss function, so as to enhance speech quality, improve speech intelligibility and solve the problems of poor speech quality and unobvious enhancement effect caused by speech distortion and excessive residual noise in the existing neural network speech enhancement method.

In order to achieve the purpose, the invention provides the following scheme:

in one aspect, the present invention provides a speech enhancement method based on a weighted loss function in a psychoacoustic domain, including:

acquiring a training voice set and a test voice set; the training voice set comprises pure voice, noise and a part of voice with noise, and the testing voice set comprises another part of voice with noise;

preprocessing the voice samples in the training voice set to obtain voice characteristics with noise and a time-frequency masking value;

inputting the voice characteristics with noise and the time-frequency masking value into a neural network voice enhancement model for pre-training, and outputting a primary enhanced voice;

simultaneously converting the preliminary enhanced speech and the clean speech to a psychoacoustic Bark domain, and calculating a loudness spectrum of the speech signal in the psychoacoustic Bark domain;

calculating a speech distortion error and a residual noise error according to the loudness spectrum, and constructing a Bark domain weighting loss function according to the speech distortion error and the residual noise error;

training the neural network speech enhancement model by adopting an error back propagation algorithm according to the Bark domain weighting loss function to obtain an optimized neural network speech enhancement model;

and inputting the noisy speech in the test speech set into the optimized neural network speech enhancement model, and outputting reconstructed enhanced speech.

Optionally, the preprocessing is performed on the voice samples in the training voice set to obtain noisy voice features and a time-frequency masking value, and the method specifically includes:

respectively carrying out short-time Fourier transform on the voice with noise, the pure voice and the noise in the training voice set to obtain a voice frequency spectrum with noise, a pure voice frequency spectrum and a noise frequency spectrum;

carrying out voice feature extraction on the voice frequency spectrum with the noise to obtain the voice feature with the noise;

and respectively carrying out time domain decomposition on the pure voice frequency spectrum and the noise frequency spectrum to obtain the time-frequency masking value.

Optionally, the simultaneously converting the preliminary enhanced speech and the clean speech to a psychoacoustic Bark domain, and calculating a loudness spectrum of the speech signal in the psychoacoustic Bark domain specifically includes:

converting the preliminary enhanced speech into the psychoacoustics Bark domain by utilizing a Bark domain change matrix, and calculating a loudness spectrum of the preliminary enhanced speech signal on the psychoacoustics Bark domain to obtain an enhanced speech sound level spectrum;

and converting the pure speech to the psychoacoustics Bark domain by utilizing a Bark domain change matrix, and calculating a loudness spectrum of the pure speech signal in the psychoacoustics Bark domain to obtain a pure speech sound degree spectrum.

Optionally, the calculating a speech distortion error and a residual noise error according to the loudness spectrum, and constructing a Bark domain weighting loss function according to the speech distortion error and the residual noise error specifically includes:

calculating according to the enhanced speech sound loudness spectrum and the pure speech loudness spectrum to obtain the speech distortion error and the residual noise error;

and introducing a weighting factor, and combining the speech distortion error and the residual noise error through the weighting factor to obtain the Bark domain weighting loss function.

Optionally, the training of the neural network speech enhancement model by using an error back propagation algorithm according to the Bark domain weighted loss function to obtain an optimized neural network speech enhancement model specifically includes:

training the neural network speech enhancement model, and determining whether the training is finished according to the magnitude relation between the Bark domain weighting loss function value and a preset threshold;

when the Bark domain weighting loss function value is smaller than the preset threshold value, stopping training and storing network parameters to obtain the optimized neural network speech enhancement model;

and when the Bark domain weighting loss function value is greater than or equal to the preset threshold value, adjusting the network parameters by adopting an error back propagation algorithm to continue training until the Bark domain weighting loss function result is less than the preset threshold value.

Optionally, the inputting the noisy speech in the test speech set into the optimized neural network speech enhancement model, and outputting the reconstructed enhanced speech specifically includes:

extracting the noise-carrying voice characteristics of the noise-carrying voice in the test voice set, inputting the noise-carrying voice characteristics into the optimized neural network voice enhancement model, and outputting a prediction time-frequency masking value;

and carrying out waveform synthesis on the noisy speech in the test speech set according to the predicted time-frequency masking value to obtain the reconstructed enhanced speech.

Optionally, the extracting the noisy speech feature of the noisy speech in the test speech set specifically includes:

carrying out short-time Fourier transform on the voice with noise in the test voice set to obtain a voice frequency spectrum with noise;

and performing voice feature extraction on the voice frequency spectrum with the noise to obtain the voice feature with the noise.

Optionally, the performing waveform synthesis on the noisy speech in the test speech set according to the predicted time-frequency masking value to obtain the reconstructed enhanced speech specifically includes:

multiplying the predicted time-frequency masking value with the amplitude spectrum of the noisy speech in the test speech set to obtain the amplitude spectrum of the enhanced speech;

and combining the amplitude spectrum of the enhanced voice with the phase spectrum of the noisy voice to obtain the reconstructed enhanced voice.

Optionally, the neural network speech enhancement model is a deep neural network speech enhancement model formed by stacking two limited boltzmann machines.

In another aspect, the present invention further provides a speech enhancement system based on a psychoacoustic domain weighting loss function, including:

the training voice set and test voice set acquisition module is used for acquiring a training voice set and a test voice set; the training voice set comprises pure voice, noise and a part of voice with noise, and the testing voice set comprises another part of voice with noise;

the training voice set preprocessing module is used for preprocessing the voice samples in the training voice set to obtain voice characteristics with noise and a time-frequency masking value;

the neural network speech enhancement model training module is used for inputting the noisy speech features and the time-frequency masking values into the neural network speech enhancement model for pre-training and outputting a primary enhanced speech;

a psychoacoustic Bark domain conversion and loudness spectrum calculation module, configured to simultaneously convert the preliminary enhanced speech and the clean speech into a psychoacoustic Bark domain, and calculate a loudness spectrum of the speech signal in the psychoacoustic Bark domain;

a Bark domain weighting loss function building module, configured to calculate a speech distortion error and a residual noise error according to the loudness spectrum, and build a Bark domain weighting loss function according to the speech distortion error and the residual noise error;

the neural network speech enhancement model optimization module is used for training the neural network speech enhancement model by adopting an error back propagation algorithm according to the Bark domain weighting loss function to obtain an optimized neural network speech enhancement model;

and the enhanced voice reconstruction module is used for inputting the noisy voices in the test voice set into the optimized neural network voice enhancement model and outputting reconstructed enhanced voices.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

the invention provides a speech enhancement method based on a psychoacoustic domain weighting loss function, which comprises the steps of simultaneously converting enhanced speech and pure speech to a psychoacoustic Bark domain, calculating a loudness spectrum of a corresponding speech signal on the psychoacoustic Bark domain, calculating by using the loudness spectrum to obtain a speech distortion error and a residual noise error, and constructing the Bark domain weighting loss function according to the speech distortion error and the residual noise error, so that the speech distortion error and the residual noise error are combined together by using a weighting factor in the Bark domain weighting loss function and participate in a parameter adjusting process in model training, the simultaneous minimization of the speech distortion error and the residual noise error in the model training process is realized, the influence of two aspects of the speech distortion error and the residual noise error on the speech quality of the enhanced speech is considered at the same time, and the reconstructed enhanced speech effect of an optimized neural network speech enhancement model is better, the voice quality and intelligibility of the enhanced voice are effectively improved, and the problems of poor reconstructed voice auditory perception and low voice intelligibility caused by the fact that the voice distortion problem is not considered in the traditional loss function are solved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts. The following drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.

Fig. 1 is a flowchart of a speech enhancement method based on a psychoacoustic domain weighting loss function according to embodiment 1 of the present invention;

fig. 2 is a schematic diagram of a training process of a deep neural network according to embodiment 1 of the present invention;

FIG. 3 is a flow chart of a process of applying the Bark domain weighting loss function provided in embodiment 1 of the present invention in a round-robin fashion;

FIG. 4 is a flow chart of the process of applying Bark domain weighted loss function to a deep neural network according to embodiment 1 of the present invention;

FIG. 5 is a flowchart of a training phase and a testing phase of a neural network speech enhancement model according to embodiment 1 of the present invention;

FIG. 6 is a graph comparing PESQ and STOI for 4 SNR values for different weighting factor values under White noise provided in example 1 of the present invention;

FIG. 7 is a graph comparing PESQ and STOI for 4 SNR values under different weighting factor values under Babble noise provided by example 1 of the present invention;

fig. 8 is a time domain waveform comparison graph based on the MSE and Bark domain weighting loss functions provided in embodiment 1 of the present invention;

FIG. 9 is a comparison graph of speech spectra based on MSE and Bark domain weighting loss functions provided in example 1 of the present invention;

fig. 10 is a block diagram of a speech enhancement system based on a psychoacoustic domain weighting loss function according to embodiment 2 of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

As used in this disclosure and in the claims, the terms "a," "an," "the," and/or "the" are not intended to be inclusive in the singular, but rather are inclusive in the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that steps and elements are included which are explicitly identified, that the steps and elements do not form an exclusive list, and that a method or apparatus may include other steps or elements.

Although the present invention makes various references to certain modules in a system according to embodiments of the present invention, any number of different modules may be used and run on a user terminal and/or server. The modules are merely illustrative and different aspects of the systems and methods may use different modules.

Flow charts are used in the present invention to illustrate the operations performed by a system according to embodiments of the present invention. It should be understood that the preceding or following operations are not necessarily performed in the exact order in which they are performed. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Meanwhile, other operations may be added to the processes, or a certain step or several steps of operations may be removed from the processes.

Existing neural network-based speech enhancement algorithms enhance speech by minimizing the MSE between reconstructed speech and pure speech, which is not closely related to the auditory perception of the human ear, resulting in poor speech intelligibility of the enhanced speech. Subsequently, researchers proposed using SI-SDR as a loss function of the network, and the network optimized using SI-SDR would produce speech distortion while eliminating noise, resulting in poor perceptual quality of reconstructed speech and poor speech quality. Based on the above, the invention provides a speech enhancement method and system based on a psychoacoustic domain weighting loss function, which are used for optimizing a neural network speech enhancement system, wherein the Bark domain weighting loss function changes the traditional MSE method for directly minimizing errors in the time domain, and converts signals into Bark domains which are closer to auditory perception to optimize the loss function value, so that the speech intelligibility of enhanced speech is improved. In order to simultaneously optimize two aspects of residual noise and voice distortion of reconstructed voice, a new loss function introduces a weighting factor to balance the relationship between the residual noise and the voice distortion, and the new loss function is minimized in the training process, so that the aim of improving the intelligibility of the reconstructed voice of the network is finally fulfilled.

Taking the neural network speech enhancement model adopted by the invention as an example, the neural network speech enhancement model is an additive noise speech model, wherein the additive speech refers to the synthesis process of the noisy speech obtained by adding the pure speech and the noise, and the model training uses the additive noisy speech. The following neural network optimization methods compare various loss functions:

a traditional Mean Square Error (MSE) loss function optimization neural network method comprises the following steps:

it is assumed that the noisy speech is obtained by adding the clean speech and the noise, and in general, the directly read-in signal is converted from the time domain to the frequency domain by short-time fourier transform, and the noisy signal is processed together with the frequency domain information. The frequency domain expression of the speech signal is:

Y(t,f)＝X(t,f)+N(t,f)

where t denotes a time frame, f denotes a frequency frame, and X (t, f), N (t, f), and Y (t, f) denote short-time fourier transforms of clean speech, noise, and noisy speech at t frames, f frequencies, respectively.

MSE represents the average difference between the net value and the output value, and the MSE is taken as the loss function of the neural network, and the specific calculation formula is as follows:

wherein the content of the first and second substances,

representing the enhanced speech predicted by the traditional mean square error loss function optimization neural network, Y representing the pure speech under the traditional mean square error loss function optimization neural network, n representing the number of training samples, zeta_MSERepresenting the value of the MSE loss function.

A method of scale-invariant signal-to-distortion ratio (SI-SDR) loss function optimization neural network:

through a large number of experiments, the SI-SDR is obviously superior to other loss functions on the voice objective evaluation performance index as the loss function of the deep learning network, and the calculation formula is as follows:

wherein the content of the first and second substances,x represents clean speech under scale-invariant signal-to-distortion ratio loss function optimization neural network,

represents the enhanced speech under the scale-invariant signal-to-distortion ratio loss function optimization neural network, alpha is the optimal scale factor,

the final expression is:

therefore, the SI-SDR loss function is defined by the formula: zeta_SI-SDR＝-SI-SDR。

The invention aims to provide a voice enhancement method and a system based on a weighted loss function of a psychoacoustic domain, which convert an enhanced output and a training target into a psychoacoustic Bark domain which accords with auditory perception to calculate a loss function, improve the optimization parameters of a traditional network on a time domain by utilizing the masking effect of human ears and enhance the problem of low voice auditory perception degree; secondly, constructing a new weighting loss function, and further improving the speech intelligibility of the reconstructed speech by simultaneously considering two aspects of noise suppression and speech quality; and finally, taking the newly proposed Bark domain weighting loss function as an adjusting and optimizing criterion of network parameters in the neural network speech enhancement algorithm, and optimizing through a reverse error propagation criterion to obtain an optimal neural network speech enhancement model.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Example 1

As shown in fig. 1, the present embodiment proposes a speech enhancement method based on a psychoacoustic domain weighting loss function, which specifically includes the following steps:

step S1, acquiring a training voice set and a testing voice set; wherein the training speech set comprises clean speech, noise and a portion of noisy speech, and the test speech set comprises another portion of said noisy speech.

Here, the noisy speech is generally understood as a clean speech mixed with noise, and thus, the noisy speech is generally a mixed speech formed by mixing the clean speech and the noise. The ultimate goal of the invention is to enhance the voice of the voice with noise, so that the voice with noise is closer to the pure voice, and the voice with noise is clearer and more intelligible.

It should be noted that step S1 is actually a process of dividing a large number of speech samples into a training speech set and a testing speech set, where the training speech set includes a clean speech sample, a noise sample, and a noisy speech sample, and because the speech enhancement processing is performed on the noisy speech, the testing speech set only includes the noisy speech sample, and in this embodiment, the ratio of the noisy speech sample in the training speech set to the noisy speech sample in the testing speech set is set to 3: 1. it should be understood that this ratio is only a preferred value, and the present invention does not limit the number of specific samples in the training speech set and the testing speech set, so that the specific number and ratio of speech samples are not fixed and unique, and can be set according to practical situations.

And step S2, preprocessing the voice samples in the training voice set to obtain the characteristics of the voice with noise and a time-frequency masking value. The method specifically comprises the following steps:

s2.1, respectively carrying out short-time Fourier transform on the noisy speech, the pure speech and the noise in the training speech set to obtain a noisy speech frequency spectrum, a pure speech frequency spectrum and a noise frequency spectrum;

s2.2, extracting voice characteristics of the voice frequency spectrum with the noise to obtain the voice characteristics with the noise;

speech feature extraction is achieved by converting speech waveforms to parametric representations at relatively minimal data rates for subsequent processing and analysis. In this embodiment, one of the voice extraction techniques such as Mel-frequency cepstrum coefficient (MFCC), Linear Prediction Coefficient (LPC), Linear Prediction Cepstrum Coefficient (LPCC), Line Spectrum Frequency (LSF), Discrete Wavelet Transform (DWT), or Perceptual Linear Prediction (PLP) may be adopted to extract voice features, which are already commonly used and have high reliability and acceptability, and the voice feature extraction is not a key point of the present invention, and therefore, it is not described in detail in this embodiment.

And S2.3, respectively carrying out time domain decomposition on the pure voice frequency spectrum and the noise frequency spectrum to obtain the time-frequency masking value.

The speech spectrum refers to a representation manner of a speech signal in a time domain in a frequency domain, and can be obtained by performing short-time fourier transform on the speech signal. While spectral analysis is a technique that decomposes complex speech signals into simpler speech signals. For example, a speech signal can be represented as the sum of a number of simple signals at different frequencies, and a method of finding information (e.g., amplitude, power, strength, or phase, etc.) of a speech signal at different frequencies is spectral analysis.

The method comprises the steps of firstly carrying out short-time Fourier transform on a voice signal to obtain a corresponding frequency spectrum, and then directly carrying out time domain decomposition on the frequency spectrum to obtain a time-frequency masking value. The time-frequency masking value is a learning target of a neural network speech enhancement model based on deep learning, and after training is completed, the model outputs a predicted time-frequency masking value.

And step S3, inputting the noisy speech characteristics and the time-frequency masking value into a neural network speech enhancement model for pre-training, and outputting a preliminary enhanced speech.

It should be noted that the preliminary enhanced speech here is an enhanced speech result that is output preliminarily when the noisy speech feature and the time-frequency masking value are input into the neural network speech enhancement model for the pre-training stage, and may be an enhanced speech that is output for the first time, or an enhanced speech that is output for one time in the previous training, because the preliminary enhanced speech at this time is not an output result of the optimized neural network speech enhancement model obtained after the training is completed, the definition of the reconstructed speech output by the final model after the training is poor, and the intelligibility is poor.

The performance of the deep learning based speech enhancement algorithm is mainly affected by three categories of speech features, learning objectives and loss functions of the input network. The loss function is also called an error function, plays a role in optimizing network parameters in the neural network speech enhancement model, and can represent the fitting degree of the model to data. In the whole training process of the enhancement algorithm, the process of training the network by using the Error Back Propagation (BP) algorithm is as follows: (1) calculating the error between the output value of the sample data and the current net value by using the current parameters; (2) calculating the gradient value of the network output layer according to the obtained error; (3) propagating the gradient value to a network hidden layer along a direction different from the input data, and then calculating the gradient value of a neuron node of the hidden layer; (4) and updating the weight matrix and the bias value of the connection between the neurons in each layer of the network according to the numerical value calculated in the last step. The purpose of fine tuning by using the BP algorithm is to minimize the error between the enhanced speech predicted by the network and the actual pure speech, which can be realized by minimizing the value of the loss function, the invention combines with the psycho-acoustic domain to provide a new loss function which is more in line with the auditory perception of human ears to optimize the network, and fig. 2 shows the specific training process of the deep neural network. In the invention, the deep neural network is formed by stacking a plurality of layers of Restricted Boltzmann Machines (RBMs), wherein neurons in an inner layer of an RBM hidden layer and an inner layer of a visible layer are not connected, and the visible layer comprises an input layer and an output layer.

The invention uses a deep neural network formed by stacking two RBMs to carry out voice enhancement, and the training network adopts a method of unsupervised pre-training and supervised reverse fine tuning. In the unsupervised pre-training stage, RBMs are pre-trained by using a contrast Divergence algorithm (CD), the output of each RBM is used as the input of the next RBM, after the layer-by-layer training of a plurality of RBMs is finished, an output layer is added at last, the number of nodes which is the same as that of the input layer is set, and the weight and the offset value are initialized randomly, so that the initialized neural network voice enhancement model is obtained, and at the moment, the output is primary enhanced voice, namely, the enhanced voice which is output first after the neural network voice enhancement model is pre-trained and initialized. In the stage of supervised reverse fine tuning, after unsupervised pre-training, a method of supervised reverse fine tuning is adopted to finely tune the deep neural network, so that the error value between loss functions of the enhanced voice and the original pure voice obtained by training is smaller than a preset threshold value, the training of the neural network voice enhanced model is ended at the moment, the optimized neural network voice enhanced model is obtained, and the enhanced voice is obtained by testing and outputting.

In the actual operation process, if the selected loss function is not proper, even if the training times are many, the neural network speech enhancement model with good enhancement effect can not be obtained. Research shows that the loss function using SI-SDR as deep learning network is obviously superior to other loss functions in voice objective evaluation performance index. However, most speech enhancement algorithms suppress noise at the cost of speech distortion, and both the residual noise and speech intelligibility enhancement of speech are not optimal at the same time. Based on this, the invention adopts a weighted loss function based on the psychoacoustic Bark domain, and the specific construction process of the loss function is as follows:

and step S4, simultaneously converting the preliminary enhanced voice and the pure voice to a psychoacoustic Bark domain, and calculating the loudness spectrum of the voice signal in the psychoacoustic Bark domain. The method specifically comprises the following steps:

s4.1, converting the preliminary enhanced speech into the psychoacoustics Bark domain by utilizing a Bark domain change matrix, and calculating a loudness spectrum of the preliminary enhanced speech signal on the psychoacoustics Bark domain to obtain an enhanced speech sound degree spectrum;

and S4.2, converting the pure voice to the psychoacoustics Bark domain by using a Bark domain change matrix, and calculating a loudness spectrum of the pure voice signal in the psychoacoustics Bark domain to obtain a pure voice sound degree spectrum.

Fig. 3 shows a cyclic application process of the Bark domain weighting loss function, which specifically includes three main steps of calculating Bark domain loudness spectrum, calculating weighting loss function, and applying Bark domain weighting loss function to a deep neural network speech enhancement scenario.

Firstly, calculating Bark domain loudness spectrum, converting the enhanced voice into Bark frequency scale through a change matrix H, and the calculation formula is as follows:

wherein the content of the first and second substances,

representing an enhanced speech signal, x representing a clean speech signal, H representing a Bark domain transformation matrix,

representing a speech signal for enhanced speech conversion to Bark domain, the expression:

wherein, Q represents the Bark domain frequency band number, and in this embodiment, Q takes the value of 24; t represents the number of frames;

bark domain spectrum representing the enhanced speech in the tth frame, the qth band, Q representing the band, Q being 0, 1, …, Q-1; t denotes transposition.

Secondly, pure voice is converted into Bark frequency scale through a change matrix H, and the calculation formula is as follows:

b＝H·x

where x represents the clean speech signal, H represents the Bark domain transformation matrix, b represents the speech signal converted from clean speech to Bark domain, and the expression:

b＝[B(t,0),B(t,1),…,B(t,Q-1)]^T

wherein, Q represents the Bark domain frequency band number, and in this embodiment, Q takes the value of 24; t represents the time frame, B (t, q) represents the Bark domain spectrum of clean speech in the tth frame, the qth band;

And then converting the Bark domain frequency spectrum into a loudness level through a Zwicker conversion rule, wherein the calculation formula is as follows:

wherein s is_lRepresenting the loudness scale factor, s in this example_lThe value is 0.08; p₀(q) represents the hearing threshold value of the q-th frequency band, and since the human ear cannot perceive it, the present embodiment sets all frequency bands having loudness values smaller than the hearing threshold value to 0; gamma represents a weighted power exponent, and the value of gamma is 0.23 in the embodiment; s (t, q) represents the loudness spectrum of the clean speech in the t frame and the q frequency band; b (t, q) represents the Bark domain spectrum of clean speech in the qth frame, the qth band.

Finally, the above process is applied to the converted Bark domain signal to obtain Bark domain loudness spectrums of the enhanced speech and the pure speech, and the expressions are respectively:

wherein the content of the first and second substances,

a loudness spectrum representing enhanced speech, S represents a loudness spectrum of clean speech; s (t, q) represents the loudness spectrum of the clean speech in the t frame and the q frequency band;

indicating the sound of the enhanced speech in the t-th frame, q-th frequency bandThe degree spectrum, T, represents transposition.

And step S5, calculating a speech distortion error and a residual noise error according to the loudness spectrum, and constructing a Bark domain weighting loss function according to the speech distortion error and the residual noise error.

In constructing the weighting loss function, in order to better balance the two aspects of speech distortion and residual noise, the Bark domain weighting loss function is defined as:

ζ_new-β＝βζ_distortion+(1-β)ζ_noise

therein, ζ_new-βRepresenting a weighted loss function; beta is a weighting factor, and the value of beta is [0, 1]]When beta is less than 0.5, the voice distortion error plays a main role in noise reduction; when β > 0.5, the residual noise error plays a major role in noise reduction, and when β ═ 0.5, the speech distortion error and the residual noise error play a comparable role in noise reduction. Therefore, the present invention can achieve a balance between the speech distortion and the residual noise by setting the weighting factor β to different values. Zeta_distortionRepresenting the speech distortion error, expressed as:

ζ_noiserepresenting the noise residual error, and the expression is:

wherein, λ is a scale factor, and its expression is:

wherein | | | purple hair²Represents the square of the variable in the middle of the double vertical lines;

a loudness spectrum representing enhanced speech, S represents a loudness spectrum of clean speech; mean (-) denotes averaging;

the value of lambda when the square of the variable in the middle of the double vertical lines is minimum is represented; t denotes transposition.

Starting from the perspective of a speech distortion error and a residual noise error, the invention combines the two errors of the speech distortion error and the residual noise error together by using a weighting factor beta in a Bark domain weighting loss function and participates in a parameter adjusting process in model training, so that the speech distortion error and the residual noise error are minimized at the same time.

And S6, training the neural network speech enhancement model by adopting an error back propagation algorithm according to the Bark domain weighted loss function to obtain the optimized neural network speech enhancement model. The method specifically comprises the following steps:

s6.1, training the neural network speech enhancement model, and determining whether the training is finished according to the magnitude relation between a Bark domain weighting loss function value and a preset threshold;

s6.2, when the Bark domain weighting loss function value is smaller than the preset threshold value, stopping training and storing network parameters to obtain the optimized neural network speech enhancement model;

The Bark domain is a psychoacoustic scale of sound specially proposed for a special structure of a human ear cochlea, a Bark domain weighting loss function different from a traditional loss function is constructed by converting a voice signal to the psychoacoustic Bark domain, and the Bark domain weighting loss function is applied to a deep neural network voice enhancement scene for improving the voice quality problem of enhanced voice, so that the reconstructed voice is more consistent with a hearing perception system of a human ear, and the specific process is shown in fig. 4.

In the neural network training stage, a Bark domain weighting loss function is selected as an Error criterion of reverse optimization, An Error Back Propagation (AEBP) algorithm based on a gradient descent strategy is used, network parameters are adjusted in the negative gradient direction of a target, a difference value calculated by a network visible output unit is transmitted Back to a hidden layer unit along a direction different from that of input data, a difference gradient value of a hidden layer neuron is calculated, finally, parameters among nodes of the neural network are adjusted in a small range according to a calculation result, the optimization process is circulated until a loss function value is trained to be lower than a preset threshold epsilon, the training can be finished, and the obtained optimal network model is stored and used in the network testing stage. In FIG. 4, RBM is a multilayer limited Boltzmann machine, and the neural network speech enhancement model of the invention is a deep neural network model formed by stacking two limited Boltzmann machines, wherein W is₀、W₁Is a matrix weight, V, between network connection layers₀、H₀Respectively representing the visible and hidden layers, V, in the first RBM₁、H₁Respectively representing a visible layer and a hidden layer in a second RBM, wherein epsilon is a preset threshold value of an optimized loss function in the training process, when the result of the trained Bark domain weighting loss function is smaller than the preset threshold value, stopping network training and outputting enhanced voice, otherwise, continuously and circularly training until the result of the Bark domain weighting loss function is smaller than the preset threshold value, and stopping training.

And step S7, inputting the noisy speech in the test speech set into the optimized neural network speech enhancement model, and outputting reconstructed enhanced speech.

As shown in fig. 5, the neural network speech enhancement model includes a training stage and a testing stage, wherein in the training stage, the pure speech and noise in the training speech set are respectively subjected to time domain decomposition to obtain a time-frequency masking value, and after the noisy speech feature of the noisy speech is extracted, the noisy speech feature and the time-frequency masking value are input into the neural network speech enhancement model for model training, and meanwhile, a loudness spectrum of a preliminary enhanced speech signal and a loudness spectrum of the pure speech signal are calculated in a psychoacoustic Bark domain to construct a Bark domain weighting loss function, and then a magnitude relation between the Bark domain weighting loss function and a preset threshold is used as a condition for judging whether the training is finished, and an optimized neural network speech enhancement model is obtained after the training is finished, and at this time, the noisy speech feature of the noisy speech in the testing speech set is input into the optimized neural network speech enhancement model, and obtaining a predicted time-frequency masking value, and using the predicted time-frequency masking value for waveform synthesis to obtain the finally reconstructed enhanced voice.

Step S7 specifically includes:

s7.1, extracting the noise voice characteristics of the noise voice in the test voice set, inputting the noise voice characteristics into the optimized neural network voice enhancement model, and outputting a prediction time-frequency masking value, wherein the prediction time-frequency masking value comprises the following steps:

s7.1.1, performing short-time Fourier transform on the noisy voices in the test voice set to obtain noisy voice frequency spectrums;

step S7.1.2, extracting voice characteristics of the voice spectrum with noise to obtain the voice characteristics with noise;

step S7.1.3, inputting the noisy speech features into the optimized neural network speech enhancement model, and outputting a prediction time-frequency masking value;

s7.2, carrying out waveform synthesis on the noisy speech in the test speech set according to the predicted time-frequency masking value to obtain the reconstructed enhanced speech, wherein the waveform synthesis specifically comprises the following steps:

step S7.2.1, multiplying the predicted time-frequency masking value with the amplitude spectrum of the noisy speech in the test speech set to obtain the amplitude spectrum of the enhanced speech;

step S7.2.2, combining the amplitude spectrum of the enhanced speech with the phase spectrum of the noisy speech to obtain the reconstructed enhanced speech.

The invention firstly provides a new loss function which accords with the auditory perception of human ears, namely a weighting loss function based on a psychoacoustics Bark domain, which is called the Bark domain weighting loss function for short. In the training process of a neural network speech enhancement model, enhanced speech and pure speech are simultaneously converted to a psychoacoustic Bark domain, a corresponding loudness spectrum is calculated, a speech distortion error function and a noise residual error function which can represent the speech distortion degree are calculated through the loudness spectrum, a weighting factor lambda is introduced and integrated together, the speech distortion error and the noise residual error are simultaneously minimized in the training process, the residual noise and the speech distortion of the enhanced speech are ensured to be simultaneously minimized, compared with the traditional MSE optimization-based neural network, the speech enhancement effect of the algorithm under different signal-to-noise ratios is obviously improved, and meanwhile, the speech intelligibility is also obviously improved.

The following examples are used to verify the speech enhancement effect of the method of the present invention:

white NOISE, Babble NOISE, F16 NOISE and Factory NOISE in an IEEE voice library and a NOISE-92 NOISE library are selected for experiments. Using 50 clean voices 600 training sets were created with signal to noise ratios of-5 dB, 0dB and 5dB, 10dB at each noise. 120 test sets were created using 10 voices under the same conditions. The Evaluation indexes adopt segment signal-to-noise ratio (SegSNR), subjective speech Quality (PESQ) and Short-Term Objective Intelligibility (STOI). The specific experimental data obtained were as follows:

table 1 shows the performance enhancement contrast of the piecewise signal-to-noise ratio SegSNR for DNN-MSE (Deep Neural Networks-Mean Square Error, DNN-MSE, Deep Neural Networks-Mean-Square Error loss function) and DNN-Bark domain Weighted loss functions (DNN-Weighted loss function in Bark, DNN-BW, Deep Neural Networks-Bark domain Weighted loss function). The data in table 1 are experimental results of three evaluation indexes obtained under four different signal-to-noise ratios of background noise. As can be seen from the analysis of the data in Table 1, the DNN-BW algorithm provided by the invention can better suppress noise, and the SegSNR value is improved by 9.1372dB on average compared with the noise-containing speech, and is improved by 0.3321dB on average compared with the SegSNR obtained by using the traditional MSE loss function.

TABLE 1 SegSNR enhancement Performance comparison of DNN-MSE and DNN-BW

Tables 2 and 3 show the comparison of the enhanced performance of the subjective speech quality PESQ and the short-term objective intelligibility STOI for DNN-MSE and DNN-BW, respectively:

TABLE 2 subjective Speech quality PESQ enhancement Performance comparison of DNN-MSE and DNN-BW

TABLE 3 short-time objective intelligibility STOI enhanced Performance contrast for DNN-MSE and DNN-BW

It can be seen from tables 2 and 3 that the Bark domain loss function proposed by the present invention reconstructs enhanced speech with improved PESQ value 0.5351 and STOI value 0.1002 compared with the original noisy speech. Compared with the PESQ value and the STOI value of the traditional MSE optimized deep neural network, 0.1407 and 0.0208 are respectively improved, and the Bark domain loss function provided by the invention can obviously improve the speech intelligibility and the definition of enhanced speech.

Fig. 6 and 7 show the magnitude comparison of subjective speech quality PESQ and short-term objective intelligibility STOI at 4 signal-to-noise ratios for different weighting factor values under White noise and Babble noise, respectively. Wherein, fig. 6(a) is the PESQ score under different weighting factors under White noise; FIG. 6(b) is the STOI score for different weighting factors under White noise; fig. 7(a) shows PESQ scores for different weighting factors under Babble noise, and fig. 7(b) shows STOI scores for different weighting factors under Babble noise. It can be seen from the figure that, no matter in White noise, Babble noise or other colored noise, when the weighting factor β is 1, the loss function proposed by the present invention can be regarded as MSE, and when the weighting factor β is both 0 and β is 1, the performance index of the neural network speech enhancement model of the present invention under different signal-to-noise ratios is not optimal. Globally, various performance indicators tend to increase when the value of β is between 0, 0.6, while the corresponding PESQ and STOI scores tend to decrease when the value of β is between 0.6, 1. This shows that when the value of β is around 0.6, the balance strategy between noise reduction and speech distortion can be well realized by taking the speech distortion error as the main factor of the newly constructed Bark domain weighting loss function.

In order to further verify the performance of the Bark domain weighting loss function provided by the invention, MFCC (Mel-Frequency Cepstral coeffients, MFCC) voice characteristics and an IRM learning target are adopted to train a neural network, and two groups of comparison experiments are designed to further illustrate the importance of the loss function conforming to the auditory sense of human ears on improving the voice quality of reconstructed voice.

Experiment 1: and training a neural network speech enhancement model by using a loss function based on the traditional MSE.

Experiment 2: the Bark domain weighting loss function of the invention is adopted to train the neural network speech enhancement model.

In order to visually demonstrate the superiority of the Bark domain weighting loss function provided by the present invention, taking a random group of speech signals under a 5dB Babble noise as an example, fig. 8 shows a time domain waveform comparison diagram based on the MSE and the Bark domain weighting loss function. Fig. 8(a) is a time domain waveform diagram of pure speech, fig. 8(b) is a time domain waveform diagram of noisy speech, fig. 8(c) is a time domain waveform diagram of enhanced speech optimized by MSE loss function, and fig. 8(d) is a time domain waveform diagram of enhanced speech optimized by Bark domain weighting loss function according to the present invention. As can be seen from the position marked by the rectangular box in fig. 8, the Bark domain weighting loss function optimized enhanced speech has a more complete waveform for the original pure speech signal and a smaller speech distortion degree than the MSE optimized enhanced speech; in a silent section, the Bark domain weighting loss function adopted by the invention is optimized to enhance the voice and has stronger capability of inhibiting noise.

Fig. 9 shows a speech spectrum comparison graph based on MSE and Bark domain weighted loss functions. Fig. 9(a) is a speech spectrogram of pure speech, fig. 9(b) is a speech spectrogram of noisy speech, fig. 9(c) is a speech spectrogram of enhanced speech optimized by MSE loss function, and fig. 9(d) is a speech spectrogram of enhanced speech optimized by Bark domain weighting loss function according to the present invention. As shown in fig. 9, the speech reconstructed based on the Bark domain weighting loss function proposed in the present invention remains more complete in spectrum structure, and the residual noise in the spectrum is less than that of the speech reconstructed based on the conventional MSE loss function, which is mutually demonstrated by the information reflected by the time domain waveform diagram in fig. 8. Therefore, the voice enhancement method provided by the invention has better voice enhancement effect, higher voice quality and higher intelligibility.

Example 2

As shown in fig. 10, this embodiment provides a speech enhancement system based on a psychoacoustic domain weighting loss function, where the system adopts the speech enhancement method based on the psychoacoustic domain weighting loss function in embodiment 1, and functions of each module of the system are the same as and correspond to each step of the method in embodiment 1, and the system specifically includes:

a training speech set and test speech set obtaining module M1, configured to obtain a training speech set and a test speech set; the training voice set comprises pure voice, noise and a part of voice with noise, and the testing voice set comprises another part of voice with noise;

a training speech set preprocessing module M2, configured to preprocess the speech samples in the training speech set to obtain a noisy speech feature and a time-frequency masking value;

the neural network speech enhancement model training module M3 is used for inputting the noisy speech features and the time-frequency masking values into the neural network speech enhancement model for pre-training and outputting a preliminary enhanced speech;

a psychoacoustic Bark domain conversion and loudness spectrum calculation module M4, configured to simultaneously convert the preliminary enhanced speech and the clean speech into a psychoacoustic Bark domain, and calculate a loudness spectrum of the speech signal in the psychoacoustic Bark domain;

a Bark domain weighting loss function construction module M5, configured to calculate a speech distortion error and a residual noise error according to the loudness spectrum, and construct a Bark domain weighting loss function according to the speech distortion error and the residual noise error;

the neural network speech enhancement model optimization module M6 is used for training the neural network speech enhancement model by adopting an error back propagation algorithm according to the Bark domain weighting loss function to obtain an optimized neural network speech enhancement model;

and the enhanced voice reconstruction module M7 is configured to input the noisy voices in the test voice set into the optimized neural network voice enhancement model, and output reconstructed enhanced voices.

The invention provides a speech enhancement method and a system based on a psychoacoustic domain weighting loss function, wherein a Bark domain weighting loss function is adopted to change the traditional MSE (mean square error) method for directly minimizing errors in a time domain, and an enhanced speech signal and a pure speech signal are converted into a Bark domain which is closer to auditory perception to optimize the value of the loss function, so that the speech intelligibility of the enhanced speech is improved. In addition, in order to simultaneously optimize two aspects of residual noise and voice distortion of reconstructed voice, a weighting factor is introduced into a Bark domain weighting loss function to balance the relation between the residual noise and the voice distortion, the simultaneous minimization of the residual noise and the voice distortion is realized in the training process, and finally the purpose of improving the intelligibility of the reconstructed voice of the network is achieved.

It is to be understood that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

The foregoing is illustrative of the present invention and is not to be construed as limiting thereof. Although a few exemplary embodiments of this invention have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this invention. Accordingly, all such modifications are intended to be included within the scope of this invention as defined in the claims. It is to be understood that the foregoing is illustrative of the present invention and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The invention is defined by the claims and their equivalents.

Claims

1. A method for speech enhancement based on a psychoacoustic domain weighting penalty function, comprising:

2. The method according to claim 1, wherein the preprocessing the speech samples in the training speech set to obtain noisy speech features and a time-frequency masking value specifically includes:

3. The method according to claim 1, wherein the step of simultaneously converting the preliminary enhanced speech and the clean speech into a psychoacoustic Bark domain and calculating a loudness spectrum of the speech signal in the psychoacoustic Bark domain comprises:

4. The speech enhancement method according to claim 3, wherein said calculating a speech distortion error and a residual noise error from said loudness spectrum and constructing a Bark domain weighted loss function from said speech distortion error and said residual noise error comprises:

5. The speech enhancement method according to claim 1, wherein the training of the neural network speech enhancement model by an error back propagation algorithm according to the Bark domain weighting loss function to obtain an optimized neural network speech enhancement model specifically comprises:

6. The method according to claim 1, wherein the inputting the noisy speech in the test speech set into the optimized neural network speech enhancement model and outputting a reconstructed enhanced speech includes:

7. The method according to claim 6, wherein said extracting the noisy speech feature of the noisy speech in the test speech set specifically comprises:

8. The speech enhancement method according to claim 6, wherein the performing waveform synthesis on the noisy speech in the test speech set according to the predicted time-frequency mask value to obtain the reconstructed enhanced speech specifically comprises:

9. The speech enhancement method of claim 1 wherein the neural network speech enhancement model is a deep neural network model stacked by two constrained boltzmann machines.

10. A system for speech enhancement based on a psychoacoustic domain weighting penalty function, comprising: