CN113744749B - Speech enhancement method and system based on psychoacoustic domain weighting loss function - Google Patents


Info

Publication number
CN113744749B
Authority
CN
China
Prior art keywords
voice
speech
training
neural network
spectrum
Prior art date
Legal status
Active
Application number
CN202111098146.0A
Other languages
Chinese (zh)
Other versions
CN113744749A
Inventor
贾海蓉
梅淑琳
吴永强
申陈宁
王鲜霞
张雪英
王峰
Current Assignee
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date
Filing date
Publication date
Application filed by Taiyuan University of Technology
Priority to CN202111098146.0A
Publication of CN113744749A
Application granted
Publication of CN113744749B
Legal status: Active


Classifications

    • G10L 21/0208 — Speech or voice signal processing; speech enhancement, e.g. noise reduction or echo cancellation; noise filtering
    • G10L 25/18 — Speech or voice analysis; the extracted parameters being spectral information of each sub-band
    • G10L 25/24 — Speech or voice analysis; the extracted parameters being the cepstrum
    • G10L 25/30 — Speech or voice analysis; analysis technique using neural networks


Abstract

The invention relates to a speech enhancement method and system based on a psychoacoustic-domain weighted loss function, belonging to the field of speech enhancement. The method acquires a training speech set and a test speech set; preprocesses the speech samples in the training set to obtain noisy-speech features and time-frequency masking values; inputs the noisy-speech features and masking values into a neural-network speech enhancement model for training and outputs preliminary enhanced speech; converts both the preliminary enhanced speech and the clean speech into the psychoacoustic Bark domain and calculates the loudness spectrum of each speech signal; calculates a speech distortion error and a residual noise error from the loudness spectra and constructs a Bark-domain weighted loss function from the two errors; optimizes the neural-network speech enhancement model with the error back-propagation algorithm; and finally inputs the noisy speech of the test set into the optimized model and outputs the reconstructed enhanced speech. The method improves both the quality and the intelligibility of the speech signal.

Description

Speech enhancement method and system based on psychoacoustic domain weighting loss function
Technical Field
The present invention relates to the field of speech enhancement, and in particular to a speech enhancement method and system based on a psychoacoustic-domain weighted loss function.
Background
Speech enhancement, a key technique for suppressing background noise and improving speech intelligibility, is often deployed at the front end of intelligent devices to raise the speech-recognition success rate of human-computer interaction. In recent years, deep-learning neural networks have been applied successfully to speech enhancement, and the speech quality they deliver is markedly better than that of traditional enhancement methods. Depending on whether training samples are available, speech enhancement algorithms fall broadly into two categories: unsupervised and supervised. Classical unsupervised algorithms include spectral subtraction, Wiener filtering, and minimum mean-square-error estimation. Although these algorithms are simple to implement and can suppress stationary noise, they place strict requirements on the noise-spectrum estimate: over- or under-estimating the noise causes severe speech distortion and excessive residual noise, their suppression of non-stationary noise is poor, and distortion of the enhanced speech becomes pronounced in low signal-to-noise-ratio environments. Moreover, the enhanced signal reconstructed by existing neural-network speech enhancement methods still suffers from poor speech quality and a weak enhancement effect caused by excessive speech distortion and residual noise.
A more effective speech enhancement method is therefore needed to address the poor speech quality and weak enhancement effect caused by excessive speech distortion and residual noise in existing neural-network speech enhancement methods.
Disclosure of Invention
The invention aims to provide a speech enhancement method and system based on a psychoacoustic-domain weighted loss function, so as to enhance speech quality, improve speech intelligibility, and solve the problems of poor speech quality and weak enhancement effect caused by excessive speech distortion and residual noise in existing neural-network speech enhancement methods.
In order to achieve the above object, the present invention provides the following solutions:
in one aspect, the present invention provides a method for speech enhancement based on a psychoacoustic domain weighted loss function, comprising:
acquiring a training voice set and a test voice set; wherein the training speech set comprises clean speech, noise and a part of noisy speech, and the test speech set comprises another part of noisy speech;
preprocessing the voice samples in the training voice set to obtain noisy voice characteristics and a time-frequency masking value;
inputting the noisy speech characteristics and the time-frequency masking values into a neural network speech enhancement model for pre-training, and outputting preliminary enhanced speech;
simultaneously converting the preliminary enhanced speech and the clean speech to a psychoacoustic Bark domain, and calculating a loudness spectrum of each speech signal in the psychoacoustic Bark domain;
calculating a voice distortion error and a residual noise error according to the loudness spectrum, and constructing a Bark domain weighting loss function according to the voice distortion error and the residual noise error;
training the neural network voice enhancement model by adopting an error back propagation algorithm according to the Bark domain weighted loss function to obtain an optimized neural network voice enhancement model;
inputting the noisy speech in the test speech set into the optimized neural network speech enhancement model, and outputting the reconstructed enhanced speech.
Optionally, the preprocessing the voice sample in the training voice set to obtain a noisy voice feature and a time-frequency masking value specifically includes:
respectively carrying out short-time Fourier transform on the noisy speech, the clean speech and the noise in the training speech set to obtain a noisy speech spectrum, a clean speech spectrum and a noise spectrum;
extracting voice characteristics from the voice spectrum with noise to obtain the voice characteristics with noise;
And respectively carrying out time domain decomposition on the pure voice spectrum and the noise spectrum to obtain the time-frequency masking value.
Optionally, the converting the preliminary enhanced speech and the pure speech to a psychoacoustic Bark domain simultaneously, and calculating a loudness spectrum of the speech signal in the psychoacoustic Bark domain specifically includes:
converting the preliminary enhanced speech to the psychoacoustic Bark domain by using a Bark-domain transform matrix, and calculating the loudness spectrum of the preliminary enhanced speech signal in the psychoacoustic Bark domain to obtain the enhanced-speech loudness spectrum;
and converting the clean speech to the psychoacoustic Bark domain by using the same Bark-domain transform matrix, and calculating the loudness spectrum of the clean speech signal in the psychoacoustic Bark domain to obtain the clean-speech loudness spectrum.
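As a concrete illustration of this step, the following sketch maps a linear-frequency power spectrogram into Bark bands with a rectangular band-aggregation matrix and applies a power-law compression as a stand-in for the loudness mapping. The Traunmüller Bark approximation, the 24-band layout, and the 0.23 exponent are illustrative assumptions; the patent does not disclose its exact transform matrix or loudness formula.

```python
import numpy as np

def bark_matrix(n_fft_bins, sr, n_bark=24):
    """Build a 0/1 band-aggregation matrix mapping linear FFT bins to Bark bands.
    (Hypothetical rectangular bands; the patent's exact change matrix is not given.)"""
    freqs = np.linspace(0, sr / 2, n_fft_bins)
    # Traunmueller's approximation of the Bark scale
    bark = 26.81 * freqs / (1960 + freqs) - 0.53
    bark = np.clip(bark, 0, n_bark - 1e-6)
    W = np.zeros((n_bark, n_fft_bins))
    W[bark.astype(int), np.arange(n_fft_bins)] = 1.0  # each bin goes to one band
    return W

def loudness_spectrum(power_spec, W, exponent=0.23):
    """Aggregate a power spectrogram (bins x frames) into Bark bands and apply a
    power-law compression as a stand-in for the loudness mapping (assumed exponent)."""
    bark_power = W @ power_spec
    return np.power(bark_power + 1e-12, exponent)
```

The same matrix `W` is applied to both the enhanced-speech and the clean-speech spectrograms, so the two loudness spectra are directly comparable band by band.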
Optionally, the calculating a speech distortion error and a residual noise error according to the loudness spectrum, and constructing a Bark domain weighted loss function according to the speech distortion error and the residual noise error specifically includes:
calculating the speech distortion error and the residual noise error according to the enhanced-speech loudness spectrum and the clean-speech loudness spectrum;
and introducing a weighting factor, and combining the voice distortion error and the residual noise error through the weighting factor to obtain the Bark domain weighting loss function.
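A minimal sketch of such a weighted loss follows, assuming the distortion error is measured where the enhanced loudness falls below the clean loudness and the residual-noise error where it exceeds it; this split rule and the weighting factor `beta` are illustrative, since the patent does not give the exact formulas here.

```python
import numpy as np

def bark_weighted_loss(L_enh, L_clean, beta=0.5):
    """Bark-domain weighted loss: a weighting factor beta balances the
    speech-distortion error (enhanced loudness below clean) against the
    residual-noise error (enhanced loudness above clean)."""
    diff = L_enh - L_clean
    distortion = np.mean(np.maximum(-diff, 0.0) ** 2)  # speech-distortion error
    residual = np.mean(np.maximum(diff, 0.0) ** 2)     # residual-noise error
    return beta * distortion + (1.0 - beta) * residual
```

Sweeping `beta` reproduces the trade-off the patent studies in FIGS. 6-7: larger values penalize speech distortion more, smaller values penalize residual noise more.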
Optionally, training the neural network speech enhancement model by adopting an error back propagation algorithm according to the Bark domain weighted loss function to obtain an optimized neural network speech enhancement model, which specifically includes:
training the neural network voice enhancement model, and determining whether training is completed according to the magnitude relation between the Bark domain weighted loss function value and a preset threshold value;
when the Bark domain weighted loss function value is smaller than the preset threshold value, stopping training and saving network parameters to obtain the optimized neural network voice enhancement model;
and when the Bark-domain weighted loss function value is greater than or equal to the preset threshold value, adopting an error back propagation algorithm to adjust network parameters to continue training until the Bark-domain weighted loss function result is smaller than the preset threshold value.
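The stopping rule above can be sketched as a threshold-controlled loop; a toy gradient-descent update stands in for the full error back-propagation of the network, and all names are illustrative.

```python
import numpy as np

def train_until_threshold(loss_fn, grad_fn, theta, lr=0.1, threshold=1e-3, max_epochs=1000):
    """Threshold-stopped training: adjust parameters (here plain gradient descent
    on a toy parameter) until the loss falls below the preset threshold, then
    stop and return the saved parameters."""
    for epoch in range(max_epochs):
        loss = loss_fn(theta)
        if loss < threshold:                 # training complete: save parameters
            return theta, loss, epoch
        theta = theta - lr * grad_fn(theta)  # BP-style parameter adjustment
    return theta, loss_fn(theta), max_epochs
```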
Optionally, inputting the noisy speech in the test speech set into the optimized neural network speech enhancement model, and outputting the reconstructed enhanced speech, which specifically includes:
extracting the noisy speech characteristics of the noisy speech in the test speech set, inputting the noisy speech characteristics into the optimized neural network speech enhancement model, and outputting a predicted time-frequency masking value;
And carrying out waveform synthesis on the noisy voices in the test voice set according to the predicted time-frequency masking value to obtain the reconstructed enhanced voices.
Optionally, the extracting the noisy speech feature of the noisy speech in the test speech set specifically includes:
performing short-time Fourier transform on the noisy speech in the test speech set to obtain the noisy speech spectrum;
and extracting voice characteristics from the voice spectrum with noise to obtain the voice characteristics with noise.
Optionally, the step of performing waveform synthesis on the noisy speech in the test speech set according to the predicted time-frequency masking value to obtain the reconstructed enhanced speech specifically includes:
multiplying the predicted time-frequency masking value with the amplitude spectrum of the noisy speech in the test speech set to obtain the amplitude spectrum of the enhanced speech;
and combining the amplitude spectrum of the enhanced voice with the phase spectrum of the noisy voice to obtain the reconstructed enhanced voice.
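These two steps amount to masking the noisy magnitude while reusing the noisy phase; a minimal sketch follows (the time-domain waveform would then be recovered with an inverse short-time Fourier transform, e.g. an overlap-add iSTFT):

```python
import numpy as np

def reconstruct_enhanced_stft(mask, noisy_stft):
    """Multiply the predicted time-frequency mask by the noisy magnitude
    spectrum, keep the noisy phase, and return the enhanced complex STFT."""
    magnitude = mask * np.abs(noisy_stft)   # enhanced magnitude spectrum
    phase = np.angle(noisy_stft)            # reuse the noisy-speech phase
    return magnitude * np.exp(1j * phase)
```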
Optionally, the neural network speech enhancement model is a deep neural network speech enhancement model formed by stacking two restricted Boltzmann machines (RBMs).
In another aspect, the present invention also provides a speech enhancement system based on a psychoacoustic domain weighting loss function, including:
The training voice set and test voice set acquisition module is used for acquiring a training voice set and a test voice set; wherein the training speech set comprises clean speech, noise and a part of noisy speech, and the test speech set comprises another part of noisy speech;
the training voice set preprocessing module is used for preprocessing voice samples in the training voice set to obtain noisy voice characteristics and a time-frequency masking value;
the neural network voice enhancement model training module is used for inputting the noisy voice characteristics and the time-frequency masking values into a neural network voice enhancement model for pre-training and outputting preliminary enhanced voice;
the psychoacoustic Bark domain conversion and loudness spectrum calculation module is used for simultaneously converting the preliminary enhanced speech and the clean speech into the psychoacoustic Bark domain and calculating the loudness spectrum of each speech signal in the psychoacoustic Bark domain;
the Bark domain weighted loss function construction module is used for calculating a voice distortion error and a residual noise error according to the loudness spectrum and constructing a Bark domain weighted loss function according to the voice distortion error and the residual noise error;
the neural network voice enhancement model optimization module is used for training the neural network voice enhancement model by adopting an error back propagation algorithm according to the Bark domain weighted loss function to obtain an optimized neural network voice enhancement model;
And the enhanced voice reconstruction module is used for inputting the noisy voices in the test voice set into the optimized neural network voice enhancement model and outputting the reconstructed enhanced voices.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a voice enhancement method based on a psychoacoustic domain weighting loss function, which is characterized in that enhanced voice and pure voice are simultaneously converted to a psychoacoustic Bark domain, the loudness spectrum of a corresponding voice signal is calculated on the psychoacoustic Bark domain, the loudness spectrum is used for calculating to obtain voice distortion errors and residual noise errors, and then the Bark domain weighting loss function is constructed according to the voice distortion errors and the residual noise errors, so that the voice distortion errors and the residual noise errors are combined together by using weighting factors in the Bark domain weighting loss function and participate in a parameter adjustment process in model training, the influence of the voice distortion errors and the residual noise errors on the voice quality of the enhanced voice in the model training process is simultaneously minimized, the enhanced voice effect reconstructed by an optimized neural network voice enhancement model is better, the voice perception and the voice intelligibility of the enhanced voice are effectively improved, and the problems of poor voice perception and low voice intelligibility caused by the fact that the voice distortion is not considered in the traditional loss function are solved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. The following drawings are not intended to be drawn to scale, emphasis instead being placed upon illustrating the principles of the invention.
FIG. 1 is a flowchart of a speech enhancement method based on a psychoacoustic domain weighting loss function according to embodiment 1 of the present invention;
fig. 2 is a schematic diagram of a training process of the deep neural network according to embodiment 1 of the present invention;
FIG. 3 is a flowchart of a cyclic application process of the Bark domain weighted loss function provided in embodiment 1 of the present invention;
FIG. 4 is a flowchart of a process of applying the Bark domain weighted loss function to the deep neural network according to embodiment 1 of the present invention;
FIG. 5 is a flowchart of the training phase and the testing phase of the neural network speech enhancement model according to embodiment 1 of the present invention;
FIG. 6 is a graph comparing PESQ and STOI at 4 signal-to-noise ratios for different weighting-factor values under White noise, provided in embodiment 1 of the present invention;
FIG. 7 is a graph comparing PESQ and STOI at 4 signal-to-noise ratios for different weighting-factor values under Babble noise, provided in embodiment 1 of the present invention;
FIG. 8 is a comparison of time-domain waveforms based on the MSE and Bark-domain weighted loss functions, provided in embodiment 1 of the present invention;
FIG. 9 is a comparison of speech spectrograms based on the MSE and Bark-domain weighted loss functions, provided in embodiment 1 of the present invention;
fig. 10 is a block diagram of a speech enhancement system based on a psychoacoustic domain weight loss function according to embodiment 2 of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As used in the specification and in the claims, the terms "a," "an," and/or "the" are not specific to the singular and may include the plural unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements do not constitute an exclusive list; a method or apparatus may also include other steps or elements.
Although the present invention makes various references to certain modules in a system according to embodiments of the present invention, any number of different modules may be used and run on a user terminal and/or server. The modules are merely illustrative, and different aspects of the systems and methods may use different modules.
A flowchart is used in the present invention to describe the operations performed by a system according to embodiments of the present invention. It should be understood that the preceding or following operations are not necessarily performed in order precisely. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Also, other operations may be added to or removed from these processes.
Existing neural-network speech enhancement algorithms enhance by minimizing the mean square error (MSE) between the reconstructed speech and the clean speech, a criterion that does not match human auditory perception and therefore yields enhanced speech of poor intelligibility. Researchers later proposed the scale-invariant signal-to-distortion ratio (SI-SDR) as the network's loss function, but an SI-SDR-optimized network introduces speech distortion while removing noise, making the reconstructed speech sound worse and lowering its quality. On this basis, the invention provides a speech enhancement method and system based on a psychoacoustic-domain weighted loss function for optimizing a neural-network speech enhancement system. Instead of directly minimizing the error in the time domain as the traditional MSE does, the Bark-domain weighted loss function converts the signals to the Bark domain, which is more closely tied to auditory perception, and optimizes the loss value there, improving the intelligibility of the enhanced speech. To optimize residual noise and speech distortion of the reconstructed speech simultaneously, the new loss function introduces a weighting factor to balance the two and minimizes both during training, finally achieving the goal of improving the intelligibility of the network-reconstructed speech.
The neural-network speech enhancement model adopted by the invention assumes an additive noise model: the noisy speech is synthesized by adding clean speech and noise, and this additive noisy speech is used in model training. Neural-network optimization under several loss functions is compared below:
a method for optimizing a neural network by using a conventional Mean Square Error (MSE) loss function:
it is assumed that the noisy speech is obtained by adding clean speech and noise, and in general, the signal read in directly is converted from the time domain to the frequency domain by short-time fourier transform, and the noisy signal is processed in combination with the time-frequency domain information. The frequency domain expression of the speech signal is:
Y(t,f)=X(t,f)+N(t,f)
where t denotes the time frame, f denotes the frequency bin, and X(t,f), N(t,f) and Y(t,f) denote the short-time Fourier transforms of the clean speech, the noise and the noisy speech at time frame t and frequency f, respectively.
MSE measures the average squared difference between the network output and the target value. Taking MSE as the loss function of the neural network, the calculation formula is:

ξ_MSE = (1/n) Σ_{i=1}^{n} ||Ŷ_i − Y_i||²

where Ŷ denotes the enhanced speech predicted by the neural network optimized with the conventional mean-square-error loss function, Y denotes the corresponding clean speech, n denotes the number of training samples, and ξ_MSE denotes the value of the MSE loss function.
A method for optimizing a neural network with the scale-invariant signal-to-distortion ratio (SI-SDR) loss function:
Extensive experiments show that SI-SDR, used as the loss function of a deep-learning network, is clearly superior to other loss functions on objective speech-evaluation indexes. Its calculation formula is:

SI-SDR = 10 log₁₀ ( ||αX||² / ||αX − X̂||² )

where X denotes the clean speech under the SI-SDR-optimized neural network, X̂ denotes the enhanced speech under the SI-SDR-optimized neural network, and α is the optimal scale factor:

α = X̂ᵀX / ||X||²

Thus, the SI-SDR loss function is defined by the formula: ξ_SI-SDR = −SI-SDR.
The invention therefore converts the enhancement output and the training target into the psychoacoustic Bark domain, which accords with auditory perception, to calculate the loss function; this improves on traditional networks that optimize parameters in the time domain and, by exploiting the masking effect of the human ear, enhances the auditory perception of the speech. Second, a new weighted loss function is constructed that considers noise suppression and speech quality simultaneously, further improving the intelligibility of the reconstructed speech. Finally, the newly proposed Bark-domain weighted loss function serves as the tuning criterion for the network parameters in the neural-network speech enhancement algorithm, which is optimized through the error back-propagation criterion to obtain the optimal neural-network speech enhancement model.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
As shown in fig. 1, the present embodiment proposes a speech enhancement method based on a psychoacoustic domain weighting loss function, which specifically includes the following steps:
step S1, acquiring a training voice set and a test voice set; wherein the training speech set comprises clean speech, noise, and a portion of noisy speech, and the test speech set comprises another portion of noisy speech.
Noisy speech is generally understood to be clean speech contaminated with noise, i.e., a mixture formed by adding clean speech and noise. The final objective of the invention is to enhance the noisy speech so that it more closely resembles the clean speech, becoming clearer and more intelligible.
It should be noted that step S1 is in fact the process of dividing a large number of speech samples into a training speech set and a test speech set. The training set contains clean speech samples, noise samples and noisy speech samples; since it is the noisy speech that undergoes enhancement, the test set contains only noisy speech samples. In this embodiment, the ratio of noisy speech samples in the training set to those in the test set is set to 3:1. It should be understood that this ratio is merely a preferred value; the invention does not limit the specific number of samples in the two sets, so the number and ratio of speech samples are not fixed and unique and may be set according to the actual situation.
And S2, preprocessing the voice samples in the training voice set to obtain noisy voice characteristics and a time-frequency masking value. The method specifically comprises the following steps:
s2.1, respectively performing short-time Fourier transform on the noisy speech, the clean speech and the noise in the training speech set to obtain a noisy speech spectrum, a clean speech spectrum and a noise spectrum;
s2.2, extracting voice characteristics of the voice spectrum with noise to obtain the voice characteristics with noise;
speech feature extraction is accomplished by converting the speech waveform into a parametric representation at a relatively minimal data rate for subsequent processing and analysis. In this embodiment, one of the methods of speech feature extraction such as Mel Frequency Cepstrum Coefficient (MFCC), linear Prediction Coefficient (LPC), linear Prediction Cepstrum Coefficient (LPCC), line Spectral Frequency (LSF), discrete Wavelet Transform (DWT) or Perceptual Linear Prediction (PLP) may be used for speech feature extraction, and these methods have been commonly used, and have high reliability and acceptability, and the speech feature extraction is not the focus of the present invention, so that the details of the present embodiment will not be repeated.
And S2.3, respectively performing time domain decomposition on the pure voice spectrum and the noise spectrum to obtain the time-frequency masking value.
The speech spectrum is the frequency-domain representation of a time-domain speech signal and can be obtained by applying a short-time Fourier transform to the signal. In short, the spectrum shows which sine-wave frequencies make up the speech signal, together with the amplitude, phase and other information of the sine wave at each frequency. Spectral analysis is a technique that decomposes a complex speech signal into simpler signals: a speech signal can be represented as a sum of simple signals of different frequencies, and finding the information (amplitude, power, intensity, phase, etc.) of the signal at those frequencies is spectral analysis.
In the method, the speech signal first undergoes a short-time Fourier transform to obtain the corresponding spectrum, and the spectrum is then decomposed to obtain the time-frequency masking value. The time-frequency masking value is the learning target of the deep-learning-based neural-network speech enhancement model; once training is complete, the model outputs a predicted time-frequency masking value.
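The patent does not name the specific mask; one common time-frequency masking target that fits this description is the ideal ratio mask (IRM), sketched here as an illustrative choice:

```python
import numpy as np

def ideal_ratio_mask(clean_spec, noise_spec, eps=1e-12):
    """Ideal ratio mask (IRM): per-bin square root of the ratio of clean power
    to clean-plus-noise power, computed from the clean and noise spectra.
    An illustrative masking target; the patent's exact mask is unspecified."""
    clean_pow = np.abs(clean_spec) ** 2
    noise_pow = np.abs(noise_spec) ** 2
    return np.sqrt(clean_pow / (clean_pow + noise_pow + eps))
```

The mask lies in [0, 1]: bins dominated by clean speech approach 1 and noise-dominated bins approach 0, which is exactly the per-bin attenuation the enhancement model learns to predict.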
Step S3: input the noisy speech features and the time-frequency masking values into the neural network speech enhancement model for pre-training, and output the preliminary enhanced speech.
It should be noted that the preliminary enhanced speech here is the enhanced speech initially output when the noisy speech features and the time-frequency masking values are input into the neural network speech enhancement model during the pre-training stage; it may be the enhanced speech output the first time, or the output of one of the first several rounds of training.
The performance of a deep-learning-based speech enhancement algorithm is mainly affected by three factors: the speech features input to the network, the learning target, and the loss function. The loss function, also called the error function, optimizes the network parameters in the neural network speech enhancement model and characterizes how well the model fits the data. During training of the enhancement algorithm, the network is trained with the error back propagation (Error Back Propagation, BP) algorithm as follows: (1) with the current parameters, calculate the error between the output value for the sample data and the target clean value; (2) calculate the gradient of the network output layer from the obtained error; (3) propagate the gradient to the hidden layers of the network in the direction opposite to the input data, and calculate the gradients of the hidden-layer neuron nodes; (4) update the weight matrices and bias values connecting the neurons of each layer according to the values calculated in the previous step. The purpose of fine-tuning with the BP algorithm is to minimize the error between the enhanced speech predicted by the network and the actual clean speech, which is achieved by minimizing the value of the loss function. In the invention, the deep neural network is composed of stacked restricted Boltzmann machines (Restricted Boltzmann Machine, RBM); the neurons within the hidden layer and within the visible layer of an RBM are unconnected, and the visible layers include the input layer and the output layer.
The invention performs speech enhancement with a deep neural network formed by stacking two RBMs; the network is trained with unsupervised pre-training followed by supervised reverse fine-tuning. In the unsupervised pre-training stage, each RBM is pre-trained with the contrastive divergence (Contrastive Divergence, CD) algorithm, and the output of each RBM serves as the input of the next. After the stacked RBMs have been trained layer by layer, an output layer with the same number of nodes as the input layer is appended, and its weights and biases are randomly initialized, yielding the initialized neural network speech enhancement model; the preliminary enhanced speech output at this point is the enhanced speech first produced after pre-training and initialization of the model. In the supervised reverse fine-tuning stage, the deep neural network is fine-tuned and optimized until the loss function value between the enhanced speech obtained in training and the original clean speech is smaller than a preset threshold; training of the neural network speech enhancement model is then finished, the optimized model is obtained, and the enhanced speech is output in testing.
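The layer-by-layer CD pre-training of stacked RBMs can be sketched as follows. This is a minimal numpy illustration of CD-1 on Bernoulli units, not the patent's exact network: the layer sizes, learning rate, and data are illustrative, and the Gaussian visible units typically used for real-valued speech features are simplified to binary ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Bernoulli RBM trained with one step of contrastive divergence (CD-1)."""
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible bias
        self.b_h = np.zeros(n_hidden)    # hidden bias
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def cd1_step(self, v0):
        # Positive phase: sample hidden units driven by the data.
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        # Negative phase: one Gibbs step back to the visible layer and up again.
        v1 = sigmoid(h0_sample @ self.W.T + self.b_v)
        h1 = self.hidden_probs(v1)
        # Gradient approximation: <v h>_data - <v h>_model.
        batch = v0.shape[0]
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / batch
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)
        return np.mean((v0 - v1) ** 2)   # reconstruction error

# Stack two RBMs: the hidden activations of the first feed the second.
data = (rng.random((64, 20)) < 0.3).astype(float)
rbm1, rbm2 = RBM(20, 12), RBM(12, 8)
err_first = rbm1.cd1_step(data)
for _ in range(200):
    err_last = rbm1.cd1_step(data)
feats = rbm1.hidden_probs(data)
for _ in range(200):
    rbm2.cd1_step(feats)
```

After this greedy pre-training an output layer would be appended and the whole stack fine-tuned with back propagation, as the description above states.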
In practice, if the selected loss function is inappropriate, a neural network speech enhancement model with a good enhancement effect cannot be obtained even after many training iterations. Studies show that using SI-SDR as the loss function of a deep learning network is clearly superior to other loss functions in objective speech evaluation indexes. Most speech enhancement algorithms suppress noise at the expense of speech distortion, so neither the residual noise nor the intelligibility of the enhanced speech is optimal. Based on the above, the invention adopts a weighted loss function based on the psychoacoustic Bark domain, constructed as follows:
Step S4: convert the preliminary enhanced speech and the clean speech to the psychoacoustic Bark domain simultaneously, and calculate the loudness spectrum of the speech signal in the psychoacoustic Bark domain. The method specifically comprises the following steps:
Step S4.1: convert the preliminary enhanced speech to the psychoacoustic Bark domain with the Bark domain transformation matrix, and calculate the loudness spectrum of the preliminary enhanced speech signal in the psychoacoustic Bark domain to obtain the enhanced speech loudness spectrum;
Step S4.2: convert the clean speech to the psychoacoustic Bark domain with the Bark domain transformation matrix, and calculate the loudness spectrum of the clean speech signal in the psychoacoustic Bark domain to obtain the clean speech loudness spectrum.
Fig. 3 shows the cyclic application of the Bark domain weighted loss function, comprising three main steps: calculating the Bark domain loudness spectrum, calculating the weighted loss function, and applying the Bark domain weighted loss function in the deep neural network speech enhancement scenario.
First, the Bark domain spectrum is calculated: the enhanced speech is converted to the Bark frequency scale through the transformation matrix H,
b̂ = H·x̂
wherein x̂ represents the enhanced speech signal, H represents the Bark domain transformation matrix, and b̂ represents the enhanced speech converted to the Bark domain, expressed as:
b̂ = [B̂(t,0), B̂(t,1), …, B̂(t,Q−1)]^T
wherein Q represents the number of Bark domain frequency bands (in this embodiment Q = 24); t represents the time frame; B̂(t,q) represents the Bark domain spectrum of the enhanced speech in the t-th frame and q-th band, q = 0, 1, …, Q−1; T denotes the transpose.
Next, the clean speech is converted to the Bark frequency scale through the transformation matrix H:
b=H·x
wherein x represents a pure voice signal, H represents a Bark domain transformation matrix, b represents a voice signal converted from pure voice to Bark domain, and the expression is:
b=[B(t,0),B(t,1),…,B(t,Q-1)] T
wherein Q represents the number of Bark domain frequency bands (in this embodiment Q = 24); t represents the time frame; B(t,q) represents the Bark domain spectrum of the clean speech in the t-th frame and q-th band, q = 0, 1, …, Q−1; T denotes the transpose.
Then, the Bark domain spectrum is converted into a loudness spectrum through the Zwicker transformation criterion:
S(t,q) = s_l·(P_0(q)/0.5)^γ·[(0.5 + 0.5·B(t,q)/P_0(q))^γ − 1]
wherein s_l represents the loudness scaling factor (in this embodiment s_l = 0.08); P_0(q) represents the hearing threshold of the q-th frequency band; since the human ear cannot perceive them, this embodiment sets the loudness of all bands whose value is below the hearing threshold to 0; γ represents the weighted power exponent (in this embodiment γ = 0.23); S(t,q) represents the loudness spectrum of the clean speech in the t-th frame and q-th band; B(t,q) represents the Bark domain spectrum of the clean speech in the t-th frame and q-th band. The enhanced speech loudness spectrum is obtained from the enhanced speech Bark domain spectrum in the same way.
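Assuming the PESQ-style Zwicker power law S = s_l·(P_0/0.5)^γ·((0.5 + 0.5·B/P_0)^γ − 1) with the stated constants s_l = 0.08 and γ = 0.23, the loudness conversion can be sketched as follows; the per-band thresholds P_0(q) here are hypothetical placeholder values.

```python
import numpy as np

def zwicker_loudness(B, P0, s_l=0.08, gamma=0.23):
    """Map a Bark domain spectrum B(t,q) to a loudness spectrum S(t,q).

    Assumed PESQ-style Zwicker power law:
        S = s_l * (P0/0.5)**gamma * ((0.5 + 0.5*B/P0)**gamma - 1)
    Bands whose level falls below the hearing threshold P0 are zeroed,
    mirroring the sub-threshold rule stated in the description.
    """
    S = s_l * (P0 / 0.5) ** gamma * ((0.5 + 0.5 * B / P0) ** gamma - 1.0)
    S[B < P0] = 0.0
    return S

P0 = np.full(24, 1e-4)               # hypothetical per-band hearing thresholds
B = np.vstack([np.full(24, 1e-5),    # frame below threshold -> silent
               np.full(24, 1.0)])    # frame with audible band energies
S = zwicker_loudness(B, P0)
```

The power-law compression mimics the nonlinear growth of perceived loudness with intensity, so equal spectral errors no longer contribute equally to the loss.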
Finally, applying this conversion to the converted Bark domain signals yields the Bark domain loudness spectra of the enhanced speech and the clean speech:
Ŝ = [Ŝ(t,0), Ŝ(t,1), …, Ŝ(t,Q−1)]^T
S = [S(t,0), S(t,1), …, S(t,Q−1)]^T
wherein Ŝ represents the loudness spectrum of the enhanced speech and S represents the loudness spectrum of the clean speech; Ŝ(t,q) and S(t,q) are the loudness values of the enhanced and clean speech in the t-th frame and q-th band; T denotes the transpose.
Step S5: calculate the speech distortion error and the residual noise error from the loudness spectra, and construct the Bark domain weighted loss function from the speech distortion error and the residual noise error.
When constructing the weighted loss function, in order to better balance speech distortion against residual noise, the invention defines the Bark domain weighted loss function as:
ζ_new-β = β·ζ_distortion + (1−β)·ζ_noise
wherein ζ_new-β represents the weighted loss function; β is a weighting factor with value in [0,1]. When β > 0.5, the speech distortion error plays the main role in noise reduction; when β < 0.5, the residual noise error plays the main role; when β = 0.5, the two errors contribute equally. Accordingly, the invention achieves a balance between speech distortion and residual noise by setting the weighting factor β to different values. ζ_distortion represents the speech distortion error, expressed as:
ζ_distortion = mean(‖S − λ·S‖²)
ζ_noise represents the residual noise error, expressed as:
ζ_noise = mean(‖Ŝ − λ·S‖²)
wherein λ is a scale factor, expressed as:
λ = argmin_λ mean(‖Ŝ − λ·S‖²) = (Ŝ^T·S)/(S^T·S)
wherein ‖·‖² denotes the square of the variable between the double vertical lines; Ŝ represents the loudness spectrum of the enhanced speech and S represents the loudness spectrum of the clean speech; mean(·) represents the mean; argmin_λ denotes the value of λ that minimizes the quantity between the double vertical lines; T denotes the transpose.
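The exact error expressions are not fully recoverable from the text; under the assumption of an SI-SDR-style scale-projection decomposition of the enhanced loudness spectrum against the clean one (a sketch, not the patent's definitive formula; function name and toy data are illustrative), the weighted loss can be computed as:

```python
import numpy as np

def bark_weighted_loss(S_hat, S, beta=0.6, eps=1e-12):
    """Bark domain weighted loss, assuming an SI-SDR-style decomposition.

    A least-squares scale lambda = <S_hat, S> / ||S||^2 splits the enhanced
    loudness spectrum as S_hat = lambda*S + e, so that:
      - speech distortion error: gap between the scaled target and clean speech,
      - residual noise error:    energy of the component orthogonal to clean speech.
    """
    s_hat, s = S_hat.ravel(), S.ravel()
    lam = np.dot(s_hat, s) / (np.dot(s, s) + eps)
    zeta_distortion = np.mean((s - lam * s) ** 2)   # == (1-lambda)^2 * mean(s^2)
    zeta_noise = np.mean((s_hat - lam * s) ** 2)    # orthogonal residual
    return beta * zeta_distortion + (1.0 - beta) * zeta_noise

rng = np.random.default_rng(2)
S = np.abs(rng.normal(size=(24, 10)))                 # clean loudness spectrum
loss_perfect = bark_weighted_loss(S, S)               # identical spectra
loss_noisy = bark_weighted_loss(S + 0.5 * np.abs(rng.normal(size=S.shape)), S)
```

With identical inputs both error terms vanish, and any additive residual raises the loss, so the term weighting β behaves exactly as described in the text.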
Starting from the speech distortion error and the residual noise error, the invention combines the two errors with the weighting factor β in the Bark domain weighted loss function and lets them jointly drive parameter adjustment during model training, so that the speech distortion error and the residual noise error are minimized simultaneously. Because the influence of both factors on the quality of the enhanced speech is considered at the same time, the enhanced speech reconstructed by the optimized neural network speech enhancement model has a better effect, higher speech quality and higher intelligibility, which solves the problems of poor auditory perception and low intelligibility of reconstructed speech caused by traditional loss functions that ignore speech distortion.
Step S6: train the neural network speech enhancement model with the error back propagation algorithm according to the Bark domain weighted loss function to obtain the optimized neural network speech enhancement model. The method specifically comprises the following steps:
Step S6.1: train the neural network speech enhancement model, and determine whether training is complete according to the relation between the Bark domain weighted loss function value and a preset threshold;
Step S6.2: when the Bark domain weighted loss function value is smaller than the preset threshold, stop training and save the network parameters to obtain the optimized neural network speech enhancement model; when the Bark domain weighted loss function value is greater than or equal to the preset threshold, adjust the network parameters with the error back propagation algorithm and continue training until the Bark domain weighted loss function result is smaller than the preset threshold.
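The threshold-controlled loop of steps S6.1 and S6.2 can be sketched generically; here a toy quadratic loss stands in for the network's Bark domain weighted loss, and all helper names are illustrative.

```python
import numpy as np

def train_until_threshold(loss_fn, params, grad_fn, lr=0.05, eps=1e-3, max_iter=10000):
    """Keep applying negative-gradient updates until the loss drops below eps."""
    history = []
    for _ in range(max_iter):
        loss = loss_fn(params)
        history.append(loss)
        if loss < eps:                           # stop criterion of step S6.2
            break
        params = params - lr * grad_fn(params)   # back-propagation stand-in
    return params, history

# Toy quadratic stand-in for the network: loss = ||p - target||^2.
target = np.array([0.3, -0.7])
loss_fn = lambda p: float(np.sum((p - target) ** 2))
grad_fn = lambda p: 2.0 * (p - target)
p_final, hist = train_until_threshold(loss_fn, np.zeros(2), grad_fn)
```

The loop mirrors the decision in steps S6.1 and S6.2: evaluate the loss, stop and keep the parameters once it falls below the preset threshold, otherwise update in the negative gradient direction and continue.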
The Bark domain is a psychoacoustic scale of sound proposed specifically for the structure of the human cochlea. By converting the speech signal to the psychoacoustic Bark domain, a Bark domain weighted loss function different from traditional loss functions is constructed and applied in the deep neural network speech enhancement scenario to improve the quality of the enhanced speech, so that the reconstructed speech better matches the auditory perception system of the human ear. The specific process is shown in fig. 4.
In the neural network training stage, the Bark domain weighted loss function is selected as the error criterion for reverse tuning, and the accumulated error back propagation (Accumulated Error Back Propagation, AEBP) algorithm based on a gradient descent strategy is used: the network parameters are adjusted in the negative gradient direction of the target, the difference calculated at the visible output units is propagated back to the hidden layer units in the direction opposite to the input data, the error gradients of the hidden layer neurons are calculated, and the parameters between the nodes of the neural network are then slightly adjusted according to the result. This optimization loop continues until the loss function value falls below the preset threshold ε, at which point training ends and the resulting optimal network model is saved for the testing stage. In fig. 4, RBM denotes a restricted Boltzmann machine; the neural network speech enhancement model of the invention is a deep neural network formed by stacking two restricted Boltzmann machines, wherein W_0 and W_1 are the weight matrices between the connected network layers, V_0 and H_0 respectively represent the visible layer and hidden layer of the first RBM, V_1 and H_1 respectively represent the visible layer and hidden layer of the second RBM, and ε is the preset threshold of the optimized loss function during training. When the result of the Bark domain weighted loss function is smaller than the preset threshold, network training stops and the enhanced speech is output; otherwise, training continues in a loop until the result of the Bark domain weighted loss function is smaller than the preset threshold.
Step S7: input the noisy speech in the test speech set into the optimized neural network speech enhancement model, and output the reconstructed enhanced speech.
As shown in fig. 5, the neural network speech enhancement model comprises two phases, a training phase and a testing phase. In the training phase, the clean speech and noise in the training speech set are each decomposed in the time domain to obtain the time-frequency masking value; after the noisy speech features are extracted from the noisy speech, the noisy speech features and the time-frequency masking value are input into the neural network speech enhancement model for training. Meanwhile, the loudness spectra of the preliminary enhanced speech signal and of the clean speech signal are calculated in the psychoacoustic Bark domain to construct the Bark domain weighted loss function, and the relation between this loss function and a preset threshold serves as the condition for judging whether training is finished; the optimized neural network speech enhancement model is obtained once training is complete. In the testing phase, the noisy speech features of the test speech set are input into the optimized model to obtain the predicted time-frequency masking value, which is used for waveform synthesis to obtain the finally reconstructed enhanced speech.
The step S7 specifically comprises the following steps:
Step S7.1: extract the noisy speech features of the noisy speech in the test speech set, input the noisy speech features into the optimized neural network speech enhancement model, and output the predicted time-frequency masking value, comprising:
Step S7.1.1: perform the short-time Fourier transform on the noisy speech in the test speech set to obtain the noisy speech spectrum;
Step S7.1.2: extract speech features from the noisy speech spectrum to obtain the noisy speech features;
Step S7.1.3: input the noisy speech features into the optimized neural network speech enhancement model, and output the predicted time-frequency masking value;
Step S7.2: perform waveform synthesis on the noisy speech in the test speech set according to the predicted time-frequency masking value to obtain the reconstructed enhanced speech, which specifically comprises:
Step S7.2.1: multiply the predicted time-frequency masking value with the amplitude spectrum of the noisy speech in the test speech set to obtain the amplitude spectrum of the enhanced speech;
Step S7.2.2: combine the amplitude spectrum of the enhanced speech with the phase spectrum of the noisy speech to obtain the reconstructed enhanced speech.
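Steps S7.2.1 and S7.2.2 (masked magnitude recombined with the noisy phase) can be sketched with scipy's STFT/ISTFT; the random signal and random mask below merely stand in for real noisy speech and for the mask predicted by the trained network.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(3)
noisy = rng.normal(size=fs)                    # 1 s stand-in for noisy speech

# Analysis: magnitude and phase of the noisy speech (step S7.1.1).
f, t, X = stft(noisy, fs=fs, nperseg=512)
mag, phase = np.abs(X), np.angle(X)

# The predicted mask would come from the trained network; use a dummy in [0, 1].
mask = np.clip(rng.random(mag.shape), 0.0, 1.0)

# Synthesis: masked magnitude recombined with the noisy phase (S7.2.1, S7.2.2).
enhanced_spec = (mask * mag) * np.exp(1j * phase)
_, enhanced = istft(enhanced_spec, fs=fs, nperseg=512)
```

Reusing the noisy phase is the standard choice here: the mask only modifies magnitudes, and the ISTFT's overlap-add resolves any resulting frame-to-frame inconsistency.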
The invention proposes, for the first time, a new loss function that conforms to human auditory perception: the weighted loss function based on the psychoacoustic Bark domain, or Bark domain weighted loss function for short. From the viewpoint of model optimization, the Bark domain weighted loss function markedly improves the speech quality and intelligibility of the enhanced speech. During training of the neural network speech enhancement model, the enhanced speech and the clean speech are converted to the psychoacoustic Bark domain simultaneously and the corresponding loudness spectra are calculated; the speech distortion error function and the noise residual error function, which characterize the degree of speech distortion, are then computed from the loudness spectra, and the weighting factor β is introduced to integrate the two errors, so that the residual noise and speech distortion of the enhanced speech are minimized simultaneously during training. Compared with the traditional MSE-optimized neural network, the enhancement effect of the algorithm is markedly improved under different signal-to-noise ratios.
The following examples are used to verify the speech enhancement effect of the method of the present invention:
The experiment selects White noise, Babble noise, F16 noise and Factory noise from the IEEE speech corpus and the NOISEX-92 noise library. 50 clean utterances were used to create 600 training sets, with signal-to-noise ratios of -5 dB, 0 dB, 5 dB and 10 dB under each noise. 120 test sets were created under the same conditions using 10 utterances. The evaluation indexes are the segmental signal-to-noise ratio (SegSNR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI). The specific experimental data are as follows:
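Creating noisy mixtures at prescribed signal-to-noise ratios, as in the training and test sets above, can be sketched as follows; the scaling derivation assumes the usual power-ratio definition of SNR, and the sine tone merely stands in for a clean utterance.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale noise so that 10*log10(P_clean / P_noise) equals snr_db, then add."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(4)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)   # stand-in utterance
noise = rng.normal(size=16000)
noisy_0db = mix_at_snr(clean, noise, 0.0)
```

The same call with snr_db set to -5, 5 or 10 produces the other mixing conditions used in the experiments.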
Table 1 compares the SegSNR enhancement performance of DNN-MSE (deep neural network with the mean square error loss function, Deep Neural Network-Mean Square Error) and DNN-BW (deep neural network with the Bark domain weighted loss function, DNN-Weighted loss function in Bark). The data in Table 1 are the experimental results obtained at four different background-noise signal-to-noise ratios. Analysis of the data shows that the proposed DNN-BW algorithm suppresses noise better: its SegSNR value is on average 9.1372 dB higher than that of the noisy speech, and on average 0.3321 dB higher than the SegSNR obtained with the traditional MSE loss function.
Table 1 SegSNR enhancement Performance comparison of DNN-MSE and DNN-BW
Tables 2 and 3 show the subjective speech quality PESQ and short-term objective intelligibility, STOI, of DNN-MSE and DNN-BW, respectively, versus enhancement performance comparison:
table 2 subjective speech quality PESQ enhancement Performance contrast for DNN-MSE and DNN-BW
TABLE 3 short-time objective intelligibility STOI enhancement Performance comparison of DNN-MSE and DNN-BW
It can be seen from Tables 2 and 3 that the enhanced speech reconstructed with the proposed Bark domain loss function has a PESQ value 0.5351 higher and an STOI value 0.1002 higher than the original noisy speech. Compared with the traditional MSE-optimized deep neural network, its PESQ and STOI values are improved by 0.1407 and 0.0208 respectively, which proves that the proposed Bark domain loss function can markedly improve the intelligibility and clarity of the enhanced speech.
Figs. 6 and 7 compare subjective speech quality (PESQ) and short-time objective intelligibility (STOI) at 4 signal-to-noise ratios for different weighting factor values under White noise and Babble noise, respectively. Fig. 6 (a) shows the PESQ score under White noise for different weighting factors; fig. 6 (b) shows the STOI score under White noise; fig. 7 (a) shows the PESQ score under Babble noise, and fig. 7 (b) shows the STOI score under Babble noise. It can be seen that under White noise, Babble noise or other colored noise, when the weighting factor β = 1 the proposed loss function can be regarded as MSE; β = 0 and β = 1 are both extreme cases, and in neither case are the performance indexes of the neural network speech enhancement model optimal under different signal-to-noise ratios. Globally, the performance indexes trend upward when β lies in [0, 0.6], and the corresponding PESQ and STOI scores trend downward when β lies in [0.6, 1]. This shows that when β is around 0.6, with the speech distortion error as the main factor of the newly constructed Bark domain weighted loss function, the balance strategy between noise reduction and speech distortion is well realized.
To further verify the performance of the proposed Bark domain weighted loss function, the neural network is trained with MFCC (Mel-Frequency Cepstral Coefficients) speech features and the IRM learning target, and two sets of comparison experiments are designed to further illustrate the importance, for the quality of the reconstructed speech, of a loss function that conforms to human auditory perception.
Experiment 1: the neural network speech enhancement model is trained using a loss function based on conventional MSE.
Experiment 2: the Bark domain weighted loss function is adopted to train the neural network voice enhancement model.
To intuitively demonstrate the superiority of the proposed Bark domain weighted loss function, a random group of speech signals under Babble noise at 5 dB is taken as an example. Fig. 8 shows a comparison of time domain waveforms under the MSE and Bark domain weighted loss functions: fig. 8 (a) is the time domain waveform of the clean speech, fig. 8 (b) that of the noisy speech, fig. 8 (c) that of the enhanced speech optimized with the MSE loss function, and fig. 8 (d) that of the enhanced speech optimized with the proposed Bark domain weighted loss function. As the positions marked by the rectangular boxes in fig. 8 show, the enhanced speech optimized with the Bark domain weighted loss function is more complete than that optimized with MSE and has a smaller degree of speech distortion; in addition, in the silent segments, the enhanced speech optimized with the proposed Bark domain weighted loss function shows stronger noise suppression.
Fig. 9 shows a comparison of spectrograms under the MSE and Bark domain weighted loss functions: fig. 9 (a) is the spectrogram of the clean speech, fig. 9 (b) that of the noisy speech, fig. 9 (c) that of the enhanced speech optimized with the MSE loss function, and fig. 9 (d) that of the enhanced speech optimized with the proposed Bark domain weighted loss function. As shown in fig. 9, the speech reconstructed with the proposed Bark domain weighted loss function preserves the spectral structure more completely, and less noise remains in the spectrum than for the speech reconstructed with the traditional MSE loss function, consistent with the information reflected by the time domain waveforms in fig. 8. The proposed speech enhancement method therefore achieves a better enhancement effect, higher speech quality and higher intelligibility.
Example 2
As shown in fig. 10, this embodiment provides a speech enhancement system based on a psychoacoustic domain weighted loss function. The system adopts the speech enhancement method based on the psychoacoustic domain weighted loss function of embodiment 1, and the function of each module of the system corresponds to a step of the method in embodiment 1. The system specifically comprises:
the training voice set and test voice set acquisition module M1 is used for acquiring a training voice set and a test voice set; wherein the training speech set comprises clean speech, noise and a part of noisy speech, and the test speech set comprises another part of noisy speech;
The training voice set preprocessing module M2 is used for preprocessing voice samples in the training voice set to obtain noisy voice characteristics and a time-frequency masking value;
the neural network voice enhancement model training module M3 is used for inputting the noisy voice characteristics and the time-frequency masking values into a neural network voice enhancement model for pre-training and outputting preliminary enhanced voice;
a psychoacoustic Bark domain conversion and loudness spectrum calculation module M4, configured to convert the preliminary enhanced speech and the pure speech to a psychoacoustic Bark domain simultaneously, and calculate a loudness spectrum of a speech signal in the psychoacoustic Bark domain;
the Bark domain weighted loss function construction module M5 is used for calculating a voice distortion error and a residual noise error according to the loudness spectrum and constructing a Bark domain weighted loss function according to the voice distortion error and the residual noise error;
the neural network voice enhancement model optimization module M6 is used for training the neural network voice enhancement model by adopting an error back propagation algorithm according to the Bark domain weighted loss function to obtain an optimized neural network voice enhancement model;
and the enhanced voice reconstruction module M7 is used for inputting the noisy voices in the test voice set into the optimized neural network voice enhancement model and outputting the reconstructed enhanced voices.
The invention provides a speech enhancement method and system based on a psychoacoustic domain weighted loss function. In addition, in order to optimize both the residual noise and the speech distortion of the reconstructed speech at the same time, the Bark domain weighted loss function introduces a weighting factor to balance the relation between them, so that residual noise and speech distortion are minimized simultaneously during training, finally achieving the goal of improving the intelligibility of the speech reconstructed by the network.
It will be readily understood that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present invention and is not to be construed as limiting thereof. Although a few exemplary embodiments of this invention have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this invention. Accordingly, all such modifications are intended to be included within the scope of this invention as defined in the following claims. It is to be understood that the foregoing is illustrative of the present invention and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The invention is defined by the claims and their equivalents.

Claims (10)

1. A method of speech enhancement based on a psychoacoustic domain weight loss function, comprising:
acquiring a training voice set and a test voice set; wherein the training speech set comprises clean speech, noise and a part of noisy speech, and the test speech set comprises another part of noisy speech;
preprocessing the voice samples in the training voice set to obtain noisy voice characteristics and a time-frequency masking value;
Inputting the noisy speech characteristics and the time-frequency masking values into a neural network speech enhancement model for pre-training, and outputting preliminary enhanced speech;
simultaneously converting the primary enhanced speech and the pure speech to a psychoacoustic Bark domain, and calculating a loudness spectrum of a speech signal on the psychoacoustic Bark domain;
calculating a voice distortion error and a residual noise error according to the loudness spectrum, and constructing a Bark domain weighting loss function according to the voice distortion error and the residual noise error;
training the neural network voice enhancement model by adopting an error back propagation algorithm according to the Bark domain weighted loss function to obtain an optimized neural network voice enhancement model;
inputting the noisy speech in the test speech set into the optimized neural network speech enhancement model, and outputting the reconstructed enhanced speech.
2. The method for enhancing speech according to claim 1, wherein said preprocessing the speech samples in the training speech set to obtain noisy speech features and time-frequency masking values comprises:
respectively carrying out short-time Fourier transform on the noisy speech, the clean speech and the noise in the training speech set to obtain a noisy speech spectrum, a clean speech spectrum and a noise spectrum;
Extracting voice characteristics from the voice spectrum with noise to obtain the voice characteristics with noise;
and respectively carrying out time domain decomposition on the pure voice spectrum and the noise spectrum to obtain the time-frequency masking value.
3. The speech enhancement method according to claim 1, wherein said converting the preliminary enhanced speech and the clean speech simultaneously to the psychoacoustic Bark domain and calculating the loudness spectrum of the speech signal on the psychoacoustic Bark domain specifically comprises:
converting the preliminary enhanced speech to the psychoacoustic Bark domain by using a Bark-domain transform matrix, and calculating the loudness spectrum of the preliminary enhanced speech signal on the psychoacoustic Bark domain to obtain an enhanced-speech loudness spectrum; and
converting the clean speech to the psychoacoustic Bark domain by using the Bark-domain transform matrix, and calculating the loudness spectrum of the clean speech signal on the psychoacoustic Bark domain to obtain a clean-speech loudness spectrum.
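A minimal sketch of the Bark-domain conversion in claim 3. The Traunmüller Hz-to-Bark approximation, the rectangular 24-band aggregation matrix, and the Zwicker-style power-law exponent of 0.23 are all assumptions for illustration — the patent's exact transform matrix and loudness model are not reproduced in the claims.

```python
import numpy as np

def hz_to_bark(f):
    """Traunmüller's approximation of the Hz-to-Bark mapping."""
    return 26.81 * f / (1960.0 + f) - 0.53

def bark_transform_matrix(n_bins, sr=16000, n_bark=24):
    """0/1 aggregation matrix mapping linear FFT bins to Bark bands.

    A simple rectangular band grouping over equal Bark-width intervals;
    the patent's actual matrix may differ.
    """
    freqs = np.linspace(0, sr / 2, n_bins)
    bark = hz_to_bark(freqs)
    edges = np.linspace(bark[0], bark[-1], n_bark + 1)
    band = np.clip(np.digitize(bark, edges) - 1, 0, n_bark - 1)
    T = np.zeros((n_bark, n_bins))
    T[band, np.arange(n_bins)] = 1.0
    return T

def loudness_spectrum(power_spec, T, exponent=0.23):
    """Aggregate a power spectrum into Bark bands, then compress with a
    Zwicker-style power law (the 0.23 exponent is an assumption)."""
    bark_power = power_spec @ T.T  # (frames, n_bark)
    return bark_power ** exponent
```

The same matrix `T` is applied to both the preliminary enhanced speech and the clean speech, so their loudness spectra are directly comparable band by band.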
4. The speech enhancement method according to claim 3, wherein said calculating a speech distortion error and a residual noise error from the loudness spectrum and constructing a Bark-domain weighted loss function from the speech distortion error and the residual noise error specifically comprises:
calculating the speech distortion error and the residual noise error from the enhanced-speech loudness spectrum and the clean-speech loudness spectrum; and
introducing a weighting factor, and combining the speech distortion error and the residual noise error through the weighting factor to obtain the Bark-domain weighted loss function.
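The weighted combination in claim 4 might look like the following sketch. Splitting the loudness difference into an under-estimation (speech distortion) term and an over-estimation (residual noise) term, and the single weighting factor `alpha`, are one plausible reading; the claims do not give the exact error definitions.

```python
import numpy as np

def bark_weighted_loss(L_enh, L_clean, alpha=0.5):
    """Bark-domain weighted loss combining a speech-distortion error and
    a residual-noise error via a weighting factor alpha in [0, 1].

    Assumed split: enhanced loudness below clean loudness counts as
    speech distortion, loudness above it counts as residual noise.
    """
    diff = L_enh - L_clean
    distortion = np.mean(np.minimum(diff, 0.0) ** 2)  # under-estimation
    residual = np.mean(np.maximum(diff, 0.0) ** 2)    # over-estimation
    return alpha * distortion + (1.0 - alpha) * residual
```

Raising `alpha` makes the loss penalize speech distortion more heavily than residual noise, which is the kind of perceptual trade-off the weighting factor exists to control.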
5. The speech enhancement method according to claim 1, wherein said training the neural network speech enhancement model with an error back-propagation algorithm according to the Bark-domain weighted loss function to obtain an optimized neural network speech enhancement model specifically comprises:
training the neural network speech enhancement model, and determining whether training is complete by comparing the Bark-domain weighted loss function value with a preset threshold;
when the Bark-domain weighted loss function value is smaller than the preset threshold, stopping training and saving the network parameters to obtain the optimized neural network speech enhancement model; and
when the Bark-domain weighted loss function value is greater than or equal to the preset threshold, adjusting the network parameters with the error back-propagation algorithm and continuing training until the Bark-domain weighted loss function value is smaller than the preset threshold.
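The threshold-based stopping rule of claim 5 can be illustrated with a toy gradient-descent loop. A linear model under MSE stands in for the neural network and the Bark-domain weighted loss; the learning rate, epoch cap, and threshold are all assumptions.

```python
import numpy as np

def train_until_threshold(w, X, y, threshold=1e-3, lr=0.1, max_epochs=1000):
    """Keep back-propagating and updating parameters until the loss value
    falls below a preset threshold (claim 5's stopping rule)."""
    loss = float("inf")
    for _ in range(max_epochs):
        pred = X @ w
        err = pred - y
        loss = float(np.mean(err ** 2))
        if loss < threshold:
            break                            # stop training, keep parameters
        grad = 2.0 * X.T @ err / len(y)      # error back-propagation
        w = w - lr * grad                    # adjust parameters, continue
    return w, loss
```

The structure mirrors the claim exactly: the loss is compared with the threshold each pass, training stops when it drops below, and otherwise back-propagation adjusts the parameters and training continues.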
6. The speech enhancement method according to claim 1, wherein said inputting the noisy speech in the test speech set into the optimized neural network speech enhancement model and outputting the reconstructed enhanced speech specifically comprises:
extracting the noisy-speech features of the noisy speech in the test speech set, inputting the noisy-speech features into the optimized neural network speech enhancement model, and outputting a predicted time-frequency masking value; and
performing waveform synthesis on the noisy speech in the test speech set according to the predicted time-frequency masking value to obtain the reconstructed enhanced speech.
7. The speech enhancement method according to claim 6, wherein said extracting the noisy-speech features of the noisy speech in the test speech set specifically comprises:
performing a short-time Fourier transform on the noisy speech in the test speech set to obtain the noisy speech spectrum; and
extracting speech features from the noisy speech spectrum to obtain the noisy-speech features.
8. The speech enhancement method according to claim 6, wherein said performing waveform synthesis on the noisy speech in the test speech set according to the predicted time-frequency masking value to obtain the reconstructed enhanced speech specifically comprises:
multiplying the predicted time-frequency masking value by the magnitude spectrum of the noisy speech in the test speech set to obtain the magnitude spectrum of the enhanced speech; and
combining the magnitude spectrum of the enhanced speech with the phase spectrum of the noisy speech to obtain the reconstructed enhanced speech.
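The waveform synthesis of claim 8 — predicted mask times noisy magnitude, noisy phase reattached, then an inverse transform — can be sketched as follows. The frame length, hop size, and plain overlap-add synthesis (with no synthesis window) are assumptions.

```python
import numpy as np

def reconstruct(noisy_stft, predicted_mask, frame_len=512, hop=256):
    """Waveform synthesis per claim 8: mask the noisy magnitude spectrum,
    keep the noisy phase, and overlap-add an inverse STFT."""
    enhanced_mag = predicted_mask * np.abs(noisy_stft)   # masked magnitude
    phase = np.angle(noisy_stft)                         # noisy phase reused
    enhanced_stft = enhanced_mag * np.exp(1j * phase)
    frames = np.fft.irfft(enhanced_stft, n=frame_len, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):                   # overlap-add
        out[i * hop:i * hop + frame_len] += frame
    return out
```

Reusing the noisy phase is what makes the method magnitude-only: the model predicts only a real-valued mask, and phase estimation is avoided entirely.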
9. The speech enhancement method according to claim 1, wherein the neural network speech enhancement model is a deep neural network model formed by stacking two restricted Boltzmann machines.
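The model of claim 9 can be sketched as a forward pass through a network whose two hidden layers would be pre-trained as restricted Boltzmann machines. The layer sizes, the sigmoid output head (so the predicted mask stays in [0, 1]), and plain random initialization in place of actual RBM contrastive-divergence pre-training are all illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class StackedRBMNet:
    """Sketch of claim 9's model: a DNN whose two hidden layers are
    initialized from two stacked RBMs, then fine-tuned by back-propagation.
    Layer sizes are illustrative (257-bin spectra, 512-unit hidden layers)."""

    def __init__(self, sizes=(257, 512, 512, 257), seed=0):
        rng = np.random.default_rng(seed)
        # In a real pipeline these weights would come from RBM pre-training.
        self.W = [rng.normal(0.0, 0.01, (a, b))
                  for a, b in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros(b) for b in sizes[1:]]

    def forward(self, x):
        # Hidden layers use the RBM's sigmoid units; the sigmoid output
        # layer keeps the predicted time-frequency mask in [0, 1].
        for W, b in zip(self.W, self.b):
            x = sigmoid(x @ W + b)
        return x
```

After RBM-based pre-training, the whole stack is fine-tuned end to end against the Bark-domain weighted loss, as in claim 5.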
10. A speech enhancement system based on a psychoacoustic-domain weighted loss function, comprising:
a training and test speech set acquisition module, configured to acquire a training speech set and a test speech set, wherein the training speech set comprises clean speech, noise, and one part of the noisy speech, and the test speech set comprises the other part of the noisy speech;
a training speech set preprocessing module, configured to preprocess the speech samples in the training speech set to obtain noisy-speech features and a time-frequency masking value;
a neural network speech enhancement model training module, configured to input the noisy-speech features and the time-frequency masking value into a neural network speech enhancement model for pre-training and to output a preliminary enhanced speech;
a psychoacoustic Bark-domain conversion and loudness spectrum calculation module, configured to convert the preliminary enhanced speech and the clean speech simultaneously to a psychoacoustic Bark domain and to calculate the loudness spectrum of the speech signal on the psychoacoustic Bark domain;
a Bark-domain weighted loss function construction module, configured to calculate a speech distortion error and a residual noise error from the loudness spectrum and to construct a Bark-domain weighted loss function from the speech distortion error and the residual noise error;
a neural network speech enhancement model optimization module, configured to train the neural network speech enhancement model with an error back-propagation algorithm according to the Bark-domain weighted loss function to obtain an optimized neural network speech enhancement model; and
an enhanced speech reconstruction module, configured to input the noisy speech in the test speech set into the optimized neural network speech enhancement model and to output the reconstructed enhanced speech.
CN202111098146.0A 2021-09-18 2021-09-18 Speech enhancement method and system based on psychoacoustic domain weighting loss function Active CN113744749B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111098146.0A CN113744749B (en) 2021-09-18 2021-09-18 Speech enhancement method and system based on psychoacoustic domain weighting loss function


Publications (2)

Publication Number Publication Date
CN113744749A CN113744749A (en) 2021-12-03
CN113744749B true CN113744749B (en) 2023-09-19

Family

ID=78739931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111098146.0A Active CN113744749B (en) 2021-09-18 2021-09-18 Speech enhancement method and system based on psychoacoustic domain weighting loss function

Country Status (1)

Country Link
CN (1) CN113744749B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113921030B (en) * 2021-12-07 2022-06-07 江苏清微智能科技有限公司 Speech enhancement neural network training method and device based on weighted speech loss
CN114999519A (en) * 2022-07-18 2022-09-02 中邮消费金融有限公司 Voice real-time noise reduction method and system based on double transformation
CN115424628B (en) * 2022-07-20 2023-06-27 荣耀终端有限公司 Voice processing method and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110491407A (en) * 2019-08-15 2019-11-22 广州华多网络科技有限公司 Method, apparatus, electronic equipment and the storage medium of voice de-noising
CN112581973A (en) * 2020-11-27 2021-03-30 深圳大学 Voice enhancement method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11456007B2 (en) * 2019-01-11 2022-09-27 Samsung Electronics Co., Ltd End-to-end multi-task denoising for joint signal distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ) optimization


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Human Auditory Perception Loss Function Using Modified Bark Spectral Distortion for Speech Enhancement; Xiaofeng Shu et al.; Neural Processing Letters; full text *
DNN-based subspace speech enhancement algorithm; Jia Hairong, Wang Dong, Guo Xin; Journal of Taiyuan University of Technology, no. 5; full text *
Speech enhancement algorithm based on dual-channel neural network time-frequency masking; Jia Hairong et al.; Journal of Huazhong University of Science and Technology (Natural Science Edition); full text *

Also Published As

Publication number Publication date
CN113744749A (en) 2021-12-03

Similar Documents

Publication Publication Date Title
Pandey et al. Dense CNN with self-attention for time-domain speech enhancement
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN110619885B (en) Method for generating confrontation network voice enhancement based on deep complete convolution neural network
CN113744749B (en) Speech enhancement method and system based on psychoacoustic domain weighting loss function
CN110120227B (en) Voice separation method of deep stack residual error network
Pandey et al. On cross-corpus generalization of deep learning based speech enhancement
Wang et al. On training targets for supervised speech separation
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
CN110428849B (en) Voice enhancement method based on generation countermeasure network
Pandey et al. Self-attending RNN for speech enhancement to improve cross-corpus generalization
Su et al. Bandwidth extension is all you need
EP3899936B1 (en) Source separation using an estimation and control of sound quality
Braun et al. Effect of noise suppression losses on speech distortion and ASR performance
CN111192598A (en) Voice enhancement method for jump connection deep neural network
CN111899750B (en) Speech enhancement algorithm combining cochlear speech features and hopping deep neural network
CN112885375A (en) Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network
Ali et al. Speech enhancement using dilated wave-u-net: an experimental analysis
Fan et al. Utterance-level permutation invariant training with discriminative learning for single channel speech separation
Wang et al. Harmonic attention for monaural speech enhancement
Li et al. Frame-Level Signal-to-Noise Ratio Estimation Using Deep Learning.
CN111341351B (en) Voice activity detection method, device and storage medium based on self-attention mechanism
Wang et al. Enhanced Spectral Features for Distortion-Independent Acoustic Modeling.
CN114283835A (en) Voice enhancement and detection method suitable for actual communication condition
Elshamy et al. DNN-based cepstral excitation manipulation for speech enhancement
Morales-Cordovilla et al. Feature extraction based on pitch-synchronous averaging for robust speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20211203

Assignee: Shanxi Huibo Communication Engineering Co.,Ltd.

Assignor: Taiyuan University of Technology

Contract record no.: X2024980005626

Denomination of invention: A Speech Enhancement Method and System Based on Psychoacoustic Domain Weighted Loss Function

Granted publication date: 20230919

License type: Common License

Record date: 20240511