CN113129872A

CN113129872A - Voice enhancement method based on deep compressed sensing

Info

Publication number: CN113129872A
Application number: CN202110367869.XA
Authority: CN
Inventors: 康峥; 黄志华; 赖惠成
Original assignee: Xinjiang University
Current assignee: Xinjiang University
Priority date: 2021-04-06
Filing date: 2021-04-06
Publication date: 2021-07-16
Anticipated expiration: 2041-04-06
Also published as: CN113129872B

Abstract

The invention discloses a voice enhancement method based on deep compressed sensing, which comprises the following steps: step 1: preprocessing training data to obtain a time domain voice signal sequence; step 2: constructing a speech enhancement model (SEDCS) based on deep compressed sensing, and carrying out joint training on the SEDCS; and step 3: preprocessing a voice test set with noise, denoising and reconstructing the voice test set by using the trained SEDCS model, and storing a result to complete a voice enhancement task; and 4, step 4: and evaluating the quality and intelligibility of the enhanced voice signal by adopting a plurality of evaluation indexes. The invention combines the compressed sensing with deep learning to realize voice enhancement, can release the sparsity constraint on voice signals in the traditional compressed sensing method, solves the problems of reduced voice intelligibility of reconstructed voice and the like in the traditional compressed sensing method, takes the observation signals of the voice signals as an optimization object, effectively improves the voice enhancement efficiency, reduces the model complexity, and can realize voice enhancement more simply, conveniently and flexibly.

Description

Voice enhancement method based on deep compressed sensing

Technical Field

The invention relates to the technical field of voice enhancement of voice signal processing, in particular to a voice enhancement method based on deep compressed sensing.

Background

Voice is the most natural, fast and efficient way for people to communicate, but in real life voice is often disturbed by various noises, such as environmental noise, mechanical noise, etc. These noises affect the speech quality to different extents, resulting in a degradation of speech intelligibility. To solve these problems, it is necessary to apply to speech enhancement. Speech enhancement is a technique for extracting clean speech from noisy speech, and is an important component of a speech recognition system, and has two main purposes, namely, improving speech quality and improving speech intelligibility.

The existing speech enhancement means mainly comprise two traditional methods such as a spectral subtraction method, a subspace method, a wiener filtering method and the like, although the traditional methods can effectively remove noise and improve speech quality, the traditional methods are generally based on specific hypothesis premise, for example, the noise is stationary, but the speech enhancement effect is poor under low signal-to-noise ratio and non-stationary noise. In view of this problem, a deep learning based speech enhancement method is proposed, and common deep learning speech enhancement methods are a Convolutional Neural Network (CNN) based speech enhancement method, a Recurrent Neural Network (RNN) based speech enhancement method, and a generative countermeasure network (GAN) based speech enhancement method. The voice enhancement method based on the CNN is a common method, and completes a voice enhancement task by training a voice enhancement model, but the method has a large model parameter amount, and if the voice enhancement is performed in a time-frequency domain, there are problems of phase information loss and the like, which causes the voice enhancement quality to be reduced. RNN-based speech enhancement methods are also of interest, but RNN methods have a larger number of parameters and more complex models than CNN methods. The creation of a countermeasure network (GAN) has provided a new approach to speech enhancement that enables end-to-end enhancement of speech signals and that directly accomplishes the speech enhancement task in the time domain. With the development of a compressed sensing technology, a new exploration field is provided for speech enhancement, and although the method can solve the problem that the speech enhancement effect is poor under non-stationary noise in the traditional method, the compressed sensing requires that a speech signal needs to meet a specific structure, for example, the speech signal needs to be sparse, and the speech signal may cause effective information loss in the sparse process, so that the intelligibility of reconstructed speech is reduced.

Most of the existing voice enhancement technologies are realized in a time-frequency domain, and the problems of phase information loss and the like are easily caused after data processing; although speech enhancement is realized in the time domain by many deep learning-based speech enhancement techniques, the models thereof are complex and the original speech signal is taken as an optimization object, resulting in a reduction in the enhancement rate; the speech enhancement method based on the traditional compressed sensing technology is influenced by the sparsity of speech signals, so that the intelligibility of reconstructed speech is reduced.

Disclosure of Invention

The method mainly takes an observation signal of a voice signal as an optimization object to solve the problems of complex model, slow enhancement rate and the like existing in the existing voice enhancement technology and solve the problem of reduced intelligibility of reconstructed voice by the traditional compressed sensing method; the invention aims to provide a compressed sensing voice enhancement method combined with deep learning, which can complete a voice enhancement task, improve the voice enhancement rate and solve the problem of reduced intelligibility of reconstructed voice by a traditional compressed sensing method.

The purpose of the invention is realized by the following technical scheme.

A speech enhancement method based on deep compressed sensing comprises the following steps.

Step 1: preprocessing training data: and pre-emphasis, pairing and framing processing are carried out on the training data to obtain a time domain voice signal sequence.

Step 2: constructing a model and training: establishing a voice enhancement model (SEDCS) based on deep compressed sensing, setting a proper error function, performing combined training on a preprocessed voice training set input model, and deploying the trained SEDCS model into a server.

And step 3: testing the model: and preprocessing the noisy speech test set, denoising the noisy speech test set by using the trained SEDCS model, reconstructing to obtain a denoised speech signal, and completing a speech enhancement task.

And 4, step 4: and (3) evaluating the model: and evaluating the quality and intelligibility of the enhanced voice signal by adopting a plurality of evaluation indexes.

The model is called as an SEDCS model, a voice enhancement task can be completed in a time domain by training the SEDCS model, the problems of phase information loss and the like after time-frequency domain data processing are avoided, and a user can quickly obtain enhanced voice only by providing a voice file with noise; the invention can get rid of the sparsity constraint on the voice signal in the traditional compressed sensing method, solves the problems of the traditional compressed sensing method such as the reduction of the intelligibility of the reconstructed voice and the like, and more conveniently and flexibly realizes the voice enhancement.

Further, the SEDCS model in the step 2 is constructed by two deep neural network models which are respectively called as a generation model

And a measurement model

：

Generating models

Replacing a signal sparse process in compressed sensing, mapping an input voice signal with noise, and reconstructing a generated voice signal related to a clean voice signal;

measurement model

And replacing a measurement matrix in compressed sensing to realize a signal observation dimension reduction process for obtaining a clean voice signal and generating an observation signal of the voice signal, wherein the observation signal is used as an optimization object.

Further, the two models forming the SEDCS model in the step 2 are trained in a joint training mode, the training set is firstly optimized for noisy speech before training, the model can be converged faster in the process, the training period is shortened, the optimization mode adopts a gradient descent method, and the objective function is as follows:

wherein

Represents the optimized noisy speech signal,

which represents a clean speech signal, is,

representing a voice signal with noise, and,

representing a generated speech signal obtained after passing the noisy speech signal through the generative model,

and

representing the observation signal obtained after the voice signal passes through the measurement model,

is a weight coefficient, the generative model and the measurement model are not updated during the optimization of data, and the number of times of optimization can be specified.

Further, inputting the optimized noisy speech signal into a generation model

Then passes through the measurement model

Obtaining an observation signal, taking the observation signal as an optimization object, and performing joint training optimization on the two models, wherein the target functions of the models are respectively as follows:

wherein

Indicating added L₁The weight factor of the regularization term,

which represents a clean speech signal, is,

representing a reconstructed speech signal, a depth-compressed-sensing-based speech enhancement method is aimed at

I.e. minimizing the model global objective function.

The joint training mode takes the observation signal of the signal as an optimization object, utilizes the advantage of compressed sensing, and simultaneously converges two models by using less data, thereby simplifying the training process, enabling the reconstructed voice signal to quickly approach a clean voice signal, and effectively solving the problem of slow enhancement rate of the existing voice enhancement technology.

The reconstructed speech intelligibility is reduced in the traditional compressed sensing method due to the reason of speech signal sparseness, the SEDCS model of the invention utilizes the deep neural network to replace the sparse process and the observation dimension reduction process, and the problem of reduced reconstructed speech intelligibility in the traditional compressed sensing method can be effectively solved without considering which sparse basis and which measurement matrix are selected.

Further, step 3 includes the following substeps.

Step 3-1: and preprocessing the test data, and performing pre-emphasis and framing processing on the noisy speech signals of the test set, wherein the pre-emphasis factor and the size of each frame are the same as those of the processed training data.

Step 3-2: enhancing the voice: inputting the preprocessed voice signals with noise into the trained SEDCS model, denoising each small section of voice by the model, and splicing and reconstructing the small section of voice according to the sequence of the original clean voice signals to obtain the denoised voice signals.

Step 3-3: and (4) storing the result: de-emphasis is carried out on the de-noised voice signal, and the finally obtained de-noised voice signal is stored in a specified directory.

Further, in the data preprocessing process of step 1 and step 3, the pre-emphasis factor is set to 0.95, the frame length is set to 16384 sampling points, the frame overlap is set to 1/2, that is, a window with the size of 16384 × 1 and the sliding step of 0.5 is used to sample and frame the voice signal, and if the difference is less, 0 is supplemented.

Furthermore, when the trained model is used for carrying out voice denoising processing in the step 3, noise conditions of different noisy voice test data may be different, and the model can complete a voice enhancement task when dealing with unknown noise conditions, which shows that the method can adapt to different noise scenes and has practicability.

Further, the various evaluation indexes in step 4 include: indicators for evaluating speech intelligibility: STOI; indicators for evaluating speech quality: PESQ, CSIG, CBAK, COVL, and SSNR. Wherein STOI is short-term objective intelligibility, PESQ is perceptual speech quality assessment, CSIG is mean opinion score for speech signal distortion, CBAK is mean opinion score for assessing background noise interference, COVL is mean opinion score for overall enhancement effect, and SSNR is segmental signal-to-noise ratio. Through the evaluation indexes, the model can be accurately and effectively evaluated.

By adopting the scheme, the invention has the following beneficial effects.

1. The invention provides a voice enhancement method based on deep compressed sensing, which effectively utilizes the respective advantages of a deep learning method and a compressed sensing technology to construct a compressed sensing voice enhancement model combined with deep learning, takes an observation signal of a voice signal as an optimization object, effectively improves the voice enhancement efficiency and reduces the model complexity.

2. The invention adopts a joint training mode to train the model, so that the voice signal with noise can be fitted with a clean voice signal, and the voice enhancement quality and the intelligibility are effectively improved.

3. The invention can complete the voice enhancement under different noise conditions, and has stronger adaptability and certain practicability.

Drawings

To further understand the technical solutions of the embodiments of the present invention, the drawings are described herein, and the drawings herein form a part of the present application and do not form a limitation of the embodiments of the present invention.

In the drawings: fig. 1 is a schematic diagram of a speech enhancement technique according to an embodiment of the present invention.

Detailed Description

The purpose, technical solution and advantages of the embodiments of the present invention will be fully described in detail herein with reference to the accompanying drawings. The embodiments described herein are some, but not all embodiments of the inventions. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

The embodiment of the invention provides a voice enhancement method, which obtains an SEDCS model through a combined training mode, completes a voice enhancement task in a time domain, solves the problems of complex model, low enhancement rate and the like existing in the existing voice enhancement technology, solves the problems of reduced intelligibility of reconstructed voice of a traditional compressed sensing method and the like, and realizes voice enhancement more simply, conveniently and flexibly.

As shown in fig. 1, a route diagram of a speech enhancement technology provided by an embodiment of the present invention includes the following steps.

The main role of pre-emphasis in the data pre-processing is to boost the high frequency components to prevent the reconstructed speech quality from being affected, and correspondingly, to perform de-emphasis at the output. The pre-emphasis factor is set to 0.95, the frame length is set to 16384 samples, the frame overlap is set to 1/2, that is, a window with the size of 16384 × 1 and the sliding step of 0.5 is used to sample and frame the voice signal, and if the frame overlap is insufficient, 0 is complemented.

Step 2: constructing a model and training: and constructing a voice enhancement model based on deep compressed sensing, setting a proper error function, performing joint training on the preprocessed voice training set input model, and deploying the trained SEDCS model into a server.

The SEDCS model is constructed by two deep neural network models which are respectively generation models

And a measurement model

。

Generating models

And replacing the signal sparse process in compressed sensing, mapping the input voice signal with noise, and reconstructing a generated voice signal related to a clean voice signal.

Measurement model

Instead of implementing a signal observation dimension reduction process by using a measurement matrix in compressed sensing, the input of the model comprises a clean speech signal and a generated speech signal, aiming at obtaining the clean speech signal and generating an observation signal of the speech signal, and the observation signal is used as an optimization object.

After the model is built, training is carried out in a joint training mode, the noisy speech of the training set is optimized before training, the generated model and the measurement model are not updated during the period of optimizing the noisy speech, and the optimization times can be specified.

The process can lead the model to be converged more quickly, reduce the training period, adopt a gradient descent method in an optimization mode, and have the following objective functions:

wherein

Represents the optimized noisy speech signal,

which represents a clean speech signal, is,

representing a voice signal with noise, and,

and

are the weight coefficients.

Inputting the optimized noisy speech signal into a generation model

Then passes through the measurement model

wherein

Indicating added L₁The weight factor of the regularization term,

which represents a clean speech signal, is,

I.e. minimizing the model global objective function.

And step 3: testing the model: and preprocessing the voice data with the noise test set, denoising the voice data by using the trained SEDCS model, reconstructing to obtain a denoised voice signal, and completing a voice enhancement task.

The preprocessing mode of the noisy test set is the same as that of the training set, namely the pre-emphasis factor is still selected to be 0.95, the frame length is still set to be 16384 sampling points, the frame overlap is still set to be 1/2, a window with the size of 16384 x 1 and the sliding step of 0.5 is used for sampling and framing the voice signal, and if the window is insufficient, 0 is supplemented.

The noise conditions of different noisy speech test data may be different, and when the model deals with unknown noise conditions, the model can also complete the speech enhancement task.

And evaluating the stored denoised voice so as to evaluate the performance of the model.

The evaluation indexes include: indicators for evaluating speech intelligibility: STOI; indicators for evaluating speech quality: PESQ, CSIG, CBAK, COVL, and SSNR. Wherein STOI is short-term objective intelligibility, PESQ is perceptual speech quality assessment, CSIG is mean opinion score for speech signal distortion, CBAK is mean opinion score for assessing background noise interference, COVL is mean opinion score for overall enhancement effect, and SSNR is segmental signal-to-noise ratio. Through the evaluation indexes, the model can be accurately and effectively evaluated.

In one embodiment of the invention, the evaluation model employs two noisy test sets.

The noise types in the test set I are 5 kinds of environmental noise different from those in the training set, the enhancement effect obtained when the model is used for dealing with unknown environmental noise is simulated, meanwhile, the effectiveness and the feasibility of the model are proved, and the embodiment results are shown in table 1.

The noise types in the test set two are white, volvo and babble, white noise simulates a stationary noise environment, other noises simulate a non-stationary noise environment, the test set is used for evaluating whether the problem that the intelligibility of the denoised voice in the traditional compressed sensing method is reduced is solved, and the embodiment results are shown in table 2.

Table 1 tests the score of the various indexes of set one.

Table 2 tests the scores for both PESQ and STOI.

To demonstrate the effectiveness and feasibility of the present invention, this example was also compared with the results using wiener speech enhancement methods. As shown in table 1, although PESQ of the present embodiment is 0.01 lower than that of the wiener method, scores of other indexes are all better, which indicates that the present embodiment can effectively suppress noise, improve voice quality, and adapt to different noise environments.

As shown in table 2, the scores in this embodiment are all better and improved to some extent, which indicates that this embodiment can solve the problem of reduced intelligibility in the conventional compressed sensing method.

The above-mentioned embodiments are intended to further illustrate the objects, technical lines and advantages of the present invention, and are only preferred embodiments of the present invention, which should not be construed as limiting the present invention, and any modifications, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A speech enhancement method based on deep compressed sensing is characterized by comprising the following steps:

step 1: preprocessing training data: pre-emphasis, pairing and framing processing are carried out on training data to obtain a time domain voice signal sequence;

step 2: constructing a model and training: constructing a voice enhancement model (SEDCS) based on deep compressed sensing, setting a proper error function, inputting the processed training set voice signals into a model for joint training, and deploying the trained SEDCS model into a server;

and step 3: testing the model: preprocessing a voice test set with noise, denoising the voice test set with the trained SEDCS model, reconstructing to obtain a denoised voice signal, and completing a voice enhancement task;

2. The method of claim 1, wherein the SEDCS model of step 2 is constructed by two deep neural network models, namely a generative model

And a measurement model

：

Generating models

measurement model

3. The method of claim 2, wherein the two models forming the SEDCS model are trained in a joint training manner, before training, noisy speech in the training set is first optimized in a gradient descent method, and the objective function is:

wherein

Represents the optimized noisy speech signal,

which represents a clean speech signal, is,

representing a voice signal with noise, and,

and

are weight coefficients.

4. The method of claim 3, wherein the optimized noisy speech signal is input into the generative model

Then passes through the measurement model

Obtaining an observation signal, using the observation signal as an optimization object, and performing joint training on the two modelsOptimizing, wherein the objective functions of the model are respectively as follows:

wherein

Indicating added L₁The weight factor of the regularization term,

which represents a clean speech signal, is,

I.e. minimizing the model global objective function.

5. The method for enhancing speech based on deep compressed sensing of claim 1, wherein the step 3 comprises the following sub-steps:

step 3-1: preprocessing test data, performing pre-emphasis and framing processing on the noisy speech signals of the test set, wherein pre-emphasis factors and the size of each frame are the same as those of the training data;

step 3-2: enhancing the voice: inputting the preprocessed voice signal with noise into a trained SEDCS model, denoising each small segment of voice by the model, and splicing and reconstructing the small segments of voice according to the sequence of the original clean voice signal to obtain a denoised voice signal;

step 3-3: and (4) storing the result: de-emphasis is carried out on the de-noised voice signal, and the finally obtained de-noised voice signal is stored at a specified position.

6. The method as claimed in claims 1 and 5, wherein the pre-emphasis factor in the data pre-processing of step 1 and step 3 is set to 0.95, the frame length is set to 16384 samples, the frame overlap is set to 1/2, that is, the window with size of 16384 × 1 and sliding step of 0.5 is used to sample and frame the speech signal, and if the difference is less than 0, the frame overlap is complemented with 0.

7. The method as claimed in claim 1, wherein when the trained model is used for performing the speech denoising process in step 3, noise conditions of different noisy speech test data may be different, and the model can complete a speech enhancement task when dealing with unknown noise conditions.

8. The method according to claim 1, wherein the evaluating the index in step 4 comprises: indicators for evaluating speech intelligibility: STOI; indicators for evaluating speech quality: PESQ, CSIG, CBAK, COVL, and SSNR.