CN110751957A - Speech enhancement method using stacked multi-scale modules - Google Patents

Speech enhancement method using stacked multi-scale modules

Info

Publication number
CN110751957A
Authority
CN
China
Prior art keywords
speech
voice
enhancement
stoi
sdr
Prior art date
Legal status
Granted
Application number
CN201911182689.3A
Other languages
Chinese (zh)
Other versions
CN110751957B (en)
Inventor
蓝天
李森
吕忆蓝
刘峤
钱宇欣
叶文政
惠国强
李萌
彭川
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Publication of CN110751957A
Application granted
Publication of CN110751957B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0364 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses an end-to-end speech enhancement method using stacked multi-scale modules, which comprises the following steps: S1: constructing a cascaded end-to-end speech enhancement framework and splicing the stacked multi-scale modules into the network structure; S2: in the preprocessing stage, transforming the time-domain signal into a two-dimensional feature; S3: enhancing the two-dimensional feature with a speech enhancement module; S4: in the post-processing stage, transforming the enhanced feature representation back into a one-dimensional time-domain signal by decoding and synthesis. To further improve performance, the speech-enhancement evaluation metrics STOI and SDR are incorporated into the loss function using a training strategy of multi-objective joint optimization. Experiments show that the proposed method significantly improves the speech enhancement effect and exhibits good noise robustness under unseen-noise and low signal-to-noise-ratio conditions.

Description

Speech enhancement method using stacked multi-scale modules
Technical Field
The invention belongs to the technical field of speech enhancement, and in particular relates to an end-to-end speech enhancement method using stacked multi-scale modules.
Background
Speech enhancement is the task of removing or attenuating additive noise in noisy speech. By suppressing and separating the noise, it improves the overall perceptual quality and intelligibility of the speech, and it has wide application in robust speech recognition, hearing-aid design, speaker verification, and related areas. Traditional speech enhancement methods include spectral subtraction, Wiener filtering, statistical-model-based methods, and subspace-based methods. In recent years, supervised speech enhancement based on deep learning has become the main research direction of interest to scholars.
Some researchers process the time-domain speech signal directly, without relying on its frequency-domain representation, which avoids switching back and forth between the time and frequency domains and makes fuller use of the time-domain feature representation of speech. Based on the WaveNet framework, Qian et al. proposed introducing a prior distribution of speech for speech enhancement, and Rethage et al. predicted the target with non-causal dilated convolutions. Pascual et al. proposed SEGAN, which uses convolutional networks to enhance time-domain speech directly; Fu et al. proposed a fully convolutional neural network for whole-sentence time-domain speech enhancement; and Pandey et al. combined a sequence-modeling network with an encoder-decoder architecture to process time-domain signals for real-time speech enhancement.
These end-to-end methods map the one-dimensional time-domain waveform directly to the target speech. However, the time-domain waveform itself does not exhibit an obvious characteristic structure, so modeling the time-domain signal directly is difficult, and the modeling difficulty increases further in low signal-to-noise-ratio environments.
Disclosure of Invention
The present invention provides an end-to-end speech enhancement method using stacked multi-scale modules, which aims to solve the problems described above.
The invention is realized as a speech enhancement method using stacked multi-scale modules, comprising the following steps:
s1: constructing a cascaded end-to-end speech enhancement framework and splicing the stacked multi-scale modules into the network structure;
s2: in the preprocessing stage, transforming the time-domain signal into a two-dimensional feature;
s3: enhancing the two-dimensional feature with a speech enhancement module;
s4: in the post-processing stage, transforming the enhanced feature representation into a one-dimensional time-domain signal by decoding and synthesis.
Further, the cascaded end-to-end speech enhancement architecture comprises speech time-domain signal preprocessing, a speech enhancement module, and target speech synthesis post-processing; the specific steps are as follows:
a. in the time-domain signal preprocessing stage, one-dimensional convolutions are applied to the input speech segment, and the result of each convolution kernel acting on the noisy speech y is stacked row by row to form a two-dimensional real-valued feature Y; inspired by the way convolutional neural networks process image pixel values, the two-dimensional feature is separated into an absolute-value feature and a sgn mask;
b. the absolute-value feature of the noisy speech y is input into the speech enhancement module for enhancement, yielding an estimate Â of the absolute-value feature; multiplying it by the sgn mask synthesizes the feature representation of the target speech:
Ŷ = Â ⊙ sgn(Y)
c. a transposed convolution transforms Ŷ into the time-domain signal x̂.
Further, the multi-scale module includes an average pooling layer, convolutions with 1 × 1 and 3 × 3 kernels, and dilated convolutions with different dilation rates.
Further, the speech-enhancement evaluation metrics STOI and SDR are incorporated into the loss function using a training strategy of multi-objective joint optimization.
Further, the specific steps of incorporating the STOI metric into the loss function include:
1) the STOI inputs are the clean speech x and the degraded speech x̂; silent regions that do not contribute to speech intelligibility are removed, the time-domain signals are then transformed into the time-frequency domain with an STFT, and both signals are divided into 50%-overlapping frames with a Hanning window;
2) a 1/3-octave band analysis is performed, dividing the spectrum into 15 1/3-octave bands whose center frequencies range from 150 Hz to approximately 4.3 kHz; the short-time envelope x_{j,m} of the clean speech is represented as follows:
x_{j,m} = [X_j(m-L+1), X_j(m-L+2), ... X_j(m)]^T
where X is the 1/3-octave band representation obtained from x, M is the total number of frames of an utterance, m is the frame index, j is the 1/3-octave band index, and L corresponds to the analyzed speech length;
3) the speech is normalized and clipped to obtain the temporal envelope representation x̂_{j,m} of the degraded speech; intelligibility is expressed as the correlation coefficient between the two temporal envelopes:
d_{j,m} = (x_{j,m} - μ(x_{j,m}))^T (x̂_{j,m} - μ(x̂_{j,m})) / ( ||x_{j,m} - μ(x_{j,m})||_2 · ||x̂_{j,m} - μ(x̂_{j,m})||_2 )
where ||·||_2 is the L2 norm and μ(·) denotes the mean vector of the corresponding segment;
4) averaging the intelligibility over all bands and frames gives the STOI metric:
STOI = (1 / (15M)) Σ_j Σ_m d_{j,m}
5) substituting the enhanced speech x̂ into the STOI formula gives the STOI term computed during training:
STOI = (1 / (15M)) Σ_j Σ_m d_{j,m}
where d_{j,m} now denotes the correlation coefficient between the temporal envelopes of the enhanced speech and the clean speech.
Further, the specific steps of incorporating the SDR metric into the loss function include:
1) the SDR inputs are the clean speech x and the enhanced speech x̂; the SDR of the enhanced speech is computed as:
SDR = 10 · log10( ||x_target||^2 / ||x̂ - x_target||^2 )
where x_target = ( ⟨x̂, x⟩ / ||x||^2 ) · x is the projection of the enhanced speech onto the clean speech;
2) performing an equivalent transformation of the SDR optimization target to simplify computation gives:
L_SDR = ||x̂||^2 / ⟨x, x̂⟩^2
where the process of maximizing the evaluation metric SDR is equivalent to minimizing L_SDR.
Further, the specific steps of fusing the STOI and SDR evaluation metrics into the loss function include:
1) the conventional root mean square error is calculated as follows:
L_RMSE = sqrt( (1 / (M·N)) · Σ_{n=1}^{N} Σ_{m=1}^{M} (x_{n,m} - x̂_{n,m})^2 )
where M and N are the number of sampling points per utterance and the total number of utterances, respectively;
2) the root mean square error is combined with the STOI- and SDR-based evaluation-metric loss terms:
L = L_RMSE + α · L_STOI + γ · L_SDR
where α and γ are the coefficients of the corresponding parts of the loss function, and the STOI term enters with a sign such that minimizing the loss increases STOI.
Here X ∈ R is the 1/3-octave band representation obtained from x, M is the total number of frames of an utterance, m is the frame index, j ∈ {1, 2, ..., 15} is the 1/3-octave band index, and L = 30 corresponds to an analyzed speech length of 384 ms.
Compared with the prior art, the invention has the following beneficial effects: to improve the ability of neural networks to process time-domain speech signals directly, the invention proposes a novel multi-scale end-to-end speech enhancement framework. In the preprocessing stage, the time-domain signal is transformed into a two-dimensional feature representation; the two-dimensional feature is then enhanced by a speech enhancement module; finally, the enhanced feature representation is transformed into a one-dimensional time-domain signal by decoding and synthesis. To further improve performance, the speech-enhancement evaluation metrics STOI and SDR are incorporated into the loss function using a training strategy of multi-objective joint optimization. Experiments show that the proposed method significantly improves the speech enhancement effect and exhibits good noise robustness under unseen-noise and low signal-to-noise-ratio conditions.
Drawings
FIG. 1 is an overall schematic view of the present invention;
FIG. 2 is a schematic diagram of stacked multi-scale modules of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the description of the present invention, it is to be understood that the terms "length", "width", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on the orientations or positional relationships illustrated in the drawings, and are used merely for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, are not to be construed as limiting the present invention. Further, in the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Examples
Referring to FIGS. 1-2, the present invention provides the following technical solution: a method of end-to-end speech enhancement using stacked multi-scale modules, comprising the steps of:
s1: constructing a cascaded end-to-end speech enhancement framework and splicing the stacked multi-scale modules into the network structure;
s2: in the preprocessing stage, transforming the time-domain signal into a two-dimensional feature;
s3: enhancing the two-dimensional feature with a speech enhancement module;
s4: in the post-processing stage, transforming the enhanced feature representation into a one-dimensional time-domain signal by decoding and synthesis.
The end-to-end speech enhancement framework proposed by the present invention comprises speech time-domain signal preprocessing, a speech enhancement module, and target speech synthesis post-processing, as shown in FIG. 1.
Assuming that the time-domain clean speech is x and the noise signal is n, the noisy speech y can be expressed as:
y=x+n
In the time-domain signal preprocessing stage, one-dimensional convolutions are applied to the input speech segment, and the result of each convolution kernel acting on the noisy speech y is stacked row by row to form a two-dimensional real-valued feature Y. Inspired by the way convolutional neural networks process image pixel values, the two-dimensional feature is separated into an absolute-value feature and a sgn mask, where sgn denotes the sign function, i.e. the sign of Y is taken; the two-dimensional feature Y is expressed as the product of the absolute-value feature and the sgn mask:
Y=abs(Y)⊙sgn(Y)
where ⊙ denotes element-wise multiplication. The absolute-value feature of the noisy speech y is then input into the speech enhancement module for enhancement, yielding an estimate Â of the absolute-value feature; multiplying it by the sgn mask synthesizes the feature representation of the target speech:
Ŷ = Â ⊙ sgn(Y)
Finally, a transposed convolution transforms Ŷ into the time-domain signal x̂.
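The preprocess-enhance-synthesize cascade above can be summarized in a short PyTorch sketch. The kernel length, stride, and channel count below are illustrative assumptions (the patent does not specify them), and the enhancement module is passed in as a stub:

```python
import torch
import torch.nn as nn

class EndToEndEnhancer(nn.Module):
    """Sketch of the cascade: 1-D conv analysis, magnitude/sign separation,
    2-D enhancement, transposed-conv synthesis. Kernel length, stride and
    channel count are illustrative assumptions, not values from the patent."""
    def __init__(self, enhancer: nn.Module, channels: int = 256,
                 kernel: int = 32, stride: int = 16):
        super().__init__()
        self.analysis = nn.Conv1d(1, channels, kernel, stride=stride, bias=False)
        self.synthesis = nn.ConvTranspose1d(channels, 1, kernel, stride=stride, bias=False)
        self.enhancer = enhancer  # any module mapping (B, 1, C, T) -> (B, 1, C, T)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, 1, samples) noisy time-domain speech
        Y = self.analysis(y)                    # two-dimensional feature (B, C, T)
        mag, sign = Y.abs(), torch.sign(Y)      # absolute-value feature and sgn mask
        mag_hat = self.enhancer(mag.unsqueeze(1)).squeeze(1)   # enhanced magnitude
        return self.synthesis(mag_hat * sign)   # recombine and decode to a 1-D signal
```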
The speech enhancement module adopted by this framework is built on a fully convolutional network. In the encoding process, each convolution layer halves the feature size while doubling the number of channels, so the features are encoded into a small, deep representation through multiple convolution layers; correspondingly, the decoding process gradually enlarges the feature size until the original size is restored. When expanding the feature size, higher resolution is obtained by bilinear interpolation upsampling.
Skip connections added between layers at the same level of the speech enhancement module allow high-resolution details to be preserved through the copy operation. Letting low-level information flow directly into high levels effectively guides the model in modeling high-resolution features.
The symmetric structure of the speech enhancement module ensures that its input and output have the same shape, which makes it naturally suitable for dense per-pixel prediction tasks, in particular labeling every pixel of an image.
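A minimal sketch of such a symmetric encoder-decoder with bilinear upsampling and copy-and-concatenate skip connections is given below; the depth and channel widths are assumptions, since the patent does not state them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhancementModule(nn.Module):
    """Each encoder layer halves the spatial size and doubles the channels;
    the decoder upsamples bilinearly and concatenates the same-level encoder
    feature (skip connection) before convolving the channels back down."""
    def __init__(self, depth: int = 3, base: int = 16):
        super().__init__()
        ch = [1] + [base * 2 ** i for i in range(depth)]      # e.g. 1, 16, 32, 64
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch[i], ch[i + 1], 3, stride=2, padding=1), nn.ReLU())
            for i in range(depth))
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch[i + 1] + ch[i], ch[i], 3, padding=1), nn.ReLU())
            for i in reversed(range(depth)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips, h = [], x
        for enc in self.encoders:
            skips.append(h)                                   # keep feature for the skip
            h = enc(h)
        for dec, skip in zip(self.decoders, reversed(skips)):
            h = F.interpolate(h, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            h = dec(torch.cat([h, skip], dim=1))              # copy + concatenate skip
        return h                                              # same shape as the input
```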
To make fuller use of the multi-scale context information in speech features, we designed Stacked Multi-scale Blocks. As shown in FIG. 2, an SMB (Stacked Multi-scale Block) contains an average pooling layer, ordinary 1 × 1 and 3 × 3 convolutions, and dilated convolutions with different dilation rates; to preserve the original information efficiently, the original features are concatenated with the multi-scale features.
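A sketch of one such block follows. The number of branch channels, the specific dilation rates (2 and 4), and the 1 × 1 fusion convolution after concatenation are assumptions for illustration; the patent only specifies the kinds of branches and that the original feature is concatenated with the multi-scale features:

```python
import torch
import torch.nn as nn

class SMB(nn.Module):
    """Parallel multi-scale branches (average pooling, 1x1 conv, 3x3 conv,
    and 3x3 dilated convs) concatenated with the original feature."""
    def __init__(self, channels: int, branch: int = 8):
        super().__init__()
        self.pool = nn.Sequential(nn.AvgPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(channels, branch, 1))
        self.conv1 = nn.Conv2d(channels, branch, 1)
        self.conv3 = nn.Conv2d(channels, branch, 3, padding=1)
        self.dil2 = nn.Conv2d(channels, branch, 3, padding=2, dilation=2)
        self.dil4 = nn.Conv2d(channels, branch, 3, padding=4, dilation=4)
        self.fuse = nn.Conv2d(channels + 5 * branch, channels, 1)   # assumed fusion conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x, self.pool(x), self.conv1(x), self.conv3(x), self.dil2(x), self.dil4(x)]
        return torch.relu(self.fuse(torch.cat(feats, dim=1)))
```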
Deep-learning-based speech enhancement methods usually adopt the mean squared error (MSE, Mean Squared Error) as the training loss function, but in speech enhancement the intelligibility and quality of the enhanced speech are what is usually evaluated to check model performance; this inconsistency between the loss function and the evaluation metrics cannot guarantee that an optimal model is obtained.
To compute the loss from the standpoint of amplitude values, we use the RMSE (Root Mean Square Error). STOI is used to assess speech intelligibility; its inputs are the clean speech x and the degraded speech x̂. It first removes the silent regions that do not contribute to speech intelligibility, then transforms the time-domain signals into the time-frequency domain with an STFT, dividing both signals into 50%-overlapping frames with a Hanning window. A 1/3-octave band analysis is performed, giving a total of 15 1/3-octave bands whose center frequencies range from 150 Hz to approximately 4.3 kHz. The short-time envelope x_{j,m} of the clean speech can be expressed as follows:
x_{j,m} = [X_j(m-L+1), X_j(m-L+2), ... X_j(m)]^T
where X is the 1/3-octave band representation obtained from x, M is the total number of frames of an utterance, m is the frame index, j is the 1/3-octave band index, and L corresponds to the analyzed speech length. The speech is then normalized and clipped: normalization compensates for global level differences, which should not affect intelligibility, and clipping bounds the STOI evaluation on severely degraded speech. The normalized and clipped temporal envelope of the degraded speech is denoted x̂_{j,m}.
Intelligibility is expressed as the correlation coefficient between the two temporal envelopes:
d_{j,m} = (x_{j,m} - μ(x_{j,m}))^T (x̂_{j,m} - μ(x̂_{j,m})) / ( ||x_{j,m} - μ(x_{j,m})||_2 · ||x̂_{j,m} - μ(x̂_{j,m})||_2 )
where ||·||_2 is the L2 norm and μ(·) denotes the mean vector of the corresponding segment. Averaging the intelligibility over all bands and frames gives the STOI metric of the degraded speech:
STOI = (1 / (15M)) Σ_j Σ_m d_{j,m}
Substituting the enhanced speech x̂ into the STOI formula gives the STOI term computed during training:
STOI = (1 / (15M)) Σ_j Σ_m d_{j,m}
where d_{j,m} now denotes the correlation coefficient between the temporal envelopes of the enhanced speech and the clean speech.
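As a rough illustration of how such a term can enter the training loss, the sketch below computes the mean envelope correlation over segments of L frames, assuming the 1/3-octave band envelopes of clean and enhanced speech have already been extracted as (bands × frames) tensors; silence removal and clipping are omitted for brevity, so this is a simplification of the full STOI computation:

```python
import torch

def stoi_term(env_clean: torch.Tensor, env_enh: torch.Tensor, seg_len: int = 30) -> torch.Tensor:
    """Negative mean envelope correlation: minimizing it pushes STOI up.
    env_*: (bands, frames) 1/3-octave band envelopes."""
    corrs = []
    for m in range(seg_len - 1, env_clean.shape[1]):
        x = env_clean[:, m - seg_len + 1:m + 1]                 # clean segments (bands, L)
        y = env_enh[:, m - seg_len + 1:m + 1]
        # normalize the degraded segment toward the clean segment's energy
        y = y * (x.norm(dim=1, keepdim=True) / (y.norm(dim=1, keepdim=True) + 1e-8))
        xc = x - x.mean(dim=1, keepdim=True)
        yc = y - y.mean(dim=1, keepdim=True)
        d = (xc * yc).sum(dim=1) / (xc.norm(dim=1) * yc.norm(dim=1) + 1e-8)
        corrs.append(d.mean())
    return -torch.stack(corrs).mean()
```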
SDR, on the other hand, is the ratio of the energy of the clean component x_target in the enhanced speech x̂ to the energy of the remaining components, where the clean component x_target is the projection of the enhanced speech x̂ onto the clean speech x:
x_target = ( ⟨x̂, x⟩ / ||x||^2 ) · x
SDR is defined as:
SDR = 10 · log10( ||x_target||^2 / ||x̂ - x_target||^2 )
Combining the two formulas above and performing an equivalent transformation of the SDR optimization target to simplify computation, maximizing SDR is equivalent to minimizing:
L_SDR = ||x̂||^2 / ⟨x, x̂⟩^2
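A one-function sketch of this SDR-based term follows; the quantity ||x̂||² / ⟨x, x̂⟩² is used directly as the value to minimize, with a small constant added for numerical stability (an implementation detail that is an assumption, not taken from the patent):

```python
import torch

def sdr_term(clean: torch.Tensor, enhanced: torch.Tensor) -> torch.Tensor:
    """SDR-based loss term for (batch, samples) waveforms: maximizing SDR is
    equivalent to minimizing ||x_hat||^2 / <x, x_hat>^2 for a fixed clean x."""
    dot = (clean * enhanced).sum(dim=-1)
    return (enhanced.pow(2).sum(dim=-1) / (dot.pow(2) + 1e-8)).mean()
```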
Finally, we combine these two terms with the RMSE to form the loss function:
L = L_RMSE + α · L_STOI + γ · L_SDR
where α and γ are the coefficients of the corresponding parts of the loss function, and the STOI term enters with a sign such that minimizing the loss increases STOI.
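Putting the pieces together, a sketch of the joint objective might look as follows; the weighting scheme (RMSE unweighted, α on the STOI term, γ on the SDR term) mirrors the description above, but the exact combination used in the patent is not reproduced here, so it should be read as an assumption:

```python
import torch

def joint_loss(clean, enhanced, env_clean, env_enh, alpha: float = 1.0, gamma: float = 1.0):
    """RMSE plus the STOI- and SDR-based terms defined in the sketches above."""
    rmse = torch.sqrt(torch.mean((clean - enhanced) ** 2))
    return rmse + alpha * stoi_term(env_clean, env_enh) + gamma * sdr_term(clean, enhanced)
```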
Test examples
The speech data used in the experiments come from the TIMIT dataset, the ESC-50 noise dataset is used for training, and the NoiseX-92 noise dataset is also used for testing in order to verify the generalization performance of the model presented herein.
The TIMIT dataset contains 6300 utterances in total, recorded as 10 sentences per speaker from 630 speakers, with a male-to-female ratio of 7:3. Since 7 of the sentences recorded by each speaker are repeated, only 1890 utterances whose sentences are all different were used in this experiment, in order to remove the influence of repeated sentences on model training and testing. About 80% of the utterances were used as the training set and the other 20% as test utterances, with the same male-to-female ratio as the overall TIMIT distribution. The ESC-50 dataset contains 2000 labeled environmental sound recordings in 5 main categories: animals; natural soundscapes and water sounds; human non-speech sounds; interior/domestic sounds; and exterior/urban sounds. All speech was resampled to 16 kHz and cut to a length of 2 seconds. The Adam optimizer was used for stochastic gradient descent (SGD) based optimization, with the learning rate set to a constant 1 × 10^-4.
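For mixing speech and noise at a target SNR (0 dB for training here), a sketch such as the following can be used; the helper name and the exact mixing convention are assumptions, not taken from the patent:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float = 0.0) -> np.ndarray:
    """Scale the noise so that the speech-to-noise power ratio equals snr_db,
    then return the noisy mixture (both 16 kHz clips of equal length)."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```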
For the baseline models, several typical encoder-decoder solutions are selected for comparison with the method proposed by the present invention, including spectral-mapping-based and end-to-end methods, and noisy speech is also used as a baseline: (a) noisy speech, (b) AET, (c) CED, (d) R-CED, (e) NoSMB-SE, (f) SMB-SE. AET is an end-to-end speech enhancement architecture; CED and R-CED are convolutional-neural-network time-frequency-domain speech enhancement methods; NoSMB-SE is the SMB-free version of our proposed basic framework, which simply connects low-level information to high levels; and SMB-SE adds 4 SMBs on top of NoSMB-SE.
All models were trained under 0 dB SNR conditions and evaluated at -15 dB, -10 dB, -5 dB, 0 dB and 5 dB signal-to-noise ratios; to evaluate the generalization performance of the proposed framework, we also tested it on the NoiseX-92 noise dataset.
TABLE I. Test results under seen-noise conditions (best results in bold).
TABLE II. Test results under unseen-noise conditions (best results in bold).
The invention provides an end-to-end speech enhancement framework using stacked multi-scale modules: the original time-domain waveform is encoded into a two-dimensional feature representation, a speech enhancement module then learns the mapping from noisy speech to clean speech, and finally the time-domain speech signal is synthesized by decoding. The proposed end-to-end framework effectively extracts the feature information of the time-domain signal, the SMB modules help the model mine more information, and the integration of STOI, SDR and RMSE effectively improves the overall enhancement performance of the model. The framework exhibits noise robustness under low-SNR conditions and good generalization in unknown noise environments.
The present invention is not limited to the above preferred embodiments, and any modifications, equivalent substitutions and improvements made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A method of end-to-end speech enhancement using stacked multi-scale modules, comprising the steps of:
s1: constructing a cascaded end-to-end speech enhancement framework and splicing the stacked multi-scale modules into the network structure;
s2: in the preprocessing stage, transforming the time-domain signal into a two-dimensional feature;
s3: enhancing the two-dimensional feature with a speech enhancement module;
s4: in the post-processing stage, transforming the enhanced feature representation into a one-dimensional time-domain signal by decoding and synthesis.
2. The speech enhancement method of claim 1, wherein: the cascaded end-to-end speech enhancement framework comprises speech time-domain signal preprocessing, a speech enhancement module, and target speech synthesis post-processing; the specific steps are as follows:
a. in the time-domain signal preprocessing stage, one-dimensional convolutions are applied to the input speech segment, and the result of each convolution kernel acting on the noisy speech y is stacked row by row to form a two-dimensional real-valued feature Y; inspired by the way convolutional neural networks process image pixel values, the two-dimensional feature is separated into an absolute-value feature and a sgn mask;
b. the absolute-value feature of the noisy speech y is input into the speech enhancement module for enhancement, yielding an estimate Â of the absolute-value feature; multiplying it by the sgn mask synthesizes the feature representation of the target speech:
Ŷ = Â ⊙ sgn(Y)
c. a transposed convolution transforms Ŷ into the time-domain signal x̂.
3. The speech enhancement method of claim 1, wherein: the multiscale module includes an average pooling layer, convolutions with convolution kernels of 1 x 1 and 3 x 3, and dilated convolutions of different dilation rates.
4. The speech enhancement method of claim 1, further comprising the step of: incorporating the speech-enhancement evaluation metrics STOI and SDR into the loss function by using a training strategy of multi-objective joint optimization.
5. The speech enhancement method of claim 4, wherein the specific steps of incorporating the STOI metric into the loss function include:
1) the STOI inputs are the clean speech x and the degraded speech x̂; silent regions that do not contribute to speech intelligibility are removed, the time-domain signals are then transformed into the time-frequency domain with an STFT, and both signals are divided into 50%-overlapping frames with a Hanning window;
2) a 1/3-octave band analysis is performed, dividing the spectrum into 15 1/3-octave bands whose center frequencies range from 150 Hz to approximately 4.3 kHz; the short-time envelope x_{j,m} of the clean speech is represented as follows:
x_{j,m} = [X_j(m-L+1), X_j(m-L+2), ... X_j(m)]^T
where X is the 1/3-octave band representation obtained from x, M is the total number of frames of an utterance, m is the frame index, j is the 1/3-octave band index, and L corresponds to the analyzed speech length;
3) the speech is normalized and clipped to obtain the temporal envelope representation x̂_{j,m} of the degraded speech; intelligibility is expressed as the correlation coefficient between the two temporal envelopes:
d_{j,m} = (x_{j,m} - μ(x_{j,m}))^T (x̂_{j,m} - μ(x̂_{j,m})) / ( ||x_{j,m} - μ(x_{j,m})||_2 · ||x̂_{j,m} - μ(x̂_{j,m})||_2 )
where ||·||_2 is the L2 norm and μ(·) denotes the mean vector of the corresponding segment;
4) averaging the intelligibility over all bands and frames gives the STOI metric of the degraded speech:
STOI = (1 / (15M)) Σ_j Σ_m d_{j,m}
5) substituting the enhanced speech x̂ into the STOI formula gives the STOI term computed during training:
STOI = (1 / (15M)) Σ_j Σ_m d_{j,m}
where d_{j,m} now denotes the correlation coefficient between the temporal envelopes of the enhanced speech and the clean speech.
6. The speech enhancement method of claim 4, wherein the specific steps of incorporating the SDR metric into the loss function include:
1) the SDR inputs are the clean speech x and the enhanced speech x̂; the SDR of the enhanced speech is computed as:
SDR = 10 · log10( ||x_target||^2 / ||x̂ - x_target||^2 )
where x_target = ( ⟨x̂, x⟩ / ||x||^2 ) · x is the projection of the enhanced speech onto the clean speech;
2) performing an equivalent transformation of the SDR optimization target to simplify computation gives:
L_SDR = ||x̂||^2 / ⟨x, x̂⟩^2
where the process of maximizing the evaluation metric SDR is equivalent to minimizing L_SDR.
7. The speech enhancement method of claim 4, wherein the specific steps of fusing the STOI and SDR evaluation metrics into the loss function include:
1) the conventional root mean square error is calculated as follows:
L_RMSE = sqrt( (1 / (M·N)) · Σ_{n=1}^{N} Σ_{m=1}^{M} (x_{n,m} - x̂_{n,m})^2 )
where M and N are the number of sampling points per utterance and the total number of utterances, respectively;
2) the root mean square error is combined with the STOI- and SDR-based evaluation-metric loss terms:
L = L_RMSE + α · L_STOI + γ · L_SDR
where α and γ are the coefficients of the corresponding parts of the loss function.
CN201911182689.3A 2019-09-25 2019-11-27 Speech enhancement method using stacked multi-scale modules Active CN110751957B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910913634 2019-09-25
CN2019109136349 2019-09-25

Publications (2)

Publication Number Publication Date
CN110751957A true CN110751957A (en) 2020-02-04
CN110751957B CN110751957B (en) 2020-10-27

Family

ID=69284766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911182689.3A Active CN110751957B (en) 2019-09-25 2019-11-27 Speech enhancement method using stacked multi-scale modules

Country Status (1)

Country Link
CN (1) CN110751957B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101617342A (en) * 2007-01-16 2009-12-30 汤姆科技成像系统有限公司 The figured method and system that is used for multidate information
US20190130903A1 (en) * 2017-10-27 2019-05-02 Baidu Usa Llc Systems and methods for robust speech recognition using generative adversarial networks
CN108491856A (en) * 2018-02-08 2018-09-04 西安电子科技大学 A kind of image scene classification method based on Analysis On Multi-scale Features convolutional neural networks
CN109034162A (en) * 2018-07-13 2018-12-18 南京邮电大学 A kind of image, semantic dividing method
CN109473120A (en) * 2018-11-14 2019-03-15 辽宁工程技术大学 A kind of abnormal sound signal recognition method based on convolutional neural networks
CN109741260A (en) * 2018-12-29 2019-05-10 天津大学 A kind of efficient super-resolution method based on depth back projection network
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks
CN110010144A (en) * 2019-04-24 2019-07-12 厦门亿联网络技术股份有限公司 Voice signals enhancement method and device
CN110136731A (en) * 2019-05-13 2019-08-16 天津大学 Empty cause and effect convolution generates the confrontation blind Enhancement Method of network end-to-end bone conduction voice
CN110246510A (en) * 2019-06-24 2019-09-17 电子科技大学 A kind of end-to-end speech Enhancement Method based on RefineNet

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
CEES H.TAAL: "An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech", 《IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》 *
HUA HENG: "The enhancement of depth estimation based on multi-scale convolution kernels", 《CONFERENCE ON OPTOELECTRONIC IMAGING AND MULTIMEDIA TECHNOLOGY V》 *
O. RONNEBERGER: "U-net: Convolutional networks for biomedical image segmentation", 《MEDICAL IMAGE COMPUTING AND COMPUTER-ASSISTED INTERVENTION》 *
SZU-WEI FU: "Raw waveform-based speech enhancement by fully convolutional networks", 《2017 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC) 》 *
廖轩: "Depth image enhancement based on multi-scale mutual-feature convolutional neural networks", China Masters' Theses Full-text Database *
朱锡祥: "Research on in-vehicle speech recognition technology based on one-dimensional convolutional neural networks", China Masters' Theses Full-text Database *
杨远飞: "Research on optimized convolutional neural networks for image recognition", China Masters' Theses Full-text Database *
范存航: "An end-to-end speech separation method based on convolutional neural networks", Journal of Signal Processing *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111524530A (en) * 2020-04-23 2020-08-11 广州清音智能科技有限公司 Voice noise reduction method based on expansion causal convolution
CN111583947A (en) * 2020-04-30 2020-08-25 厦门快商通科技股份有限公司 Voice enhancement method, device and equipment
US11495216B2 (en) 2020-09-09 2022-11-08 International Business Machines Corporation Speech recognition using data analysis and dilation of interlaced audio input
US11538464B2 (en) 2020-09-09 2022-12-27 International Business Machines Corporation . Speech recognition using data analysis and dilation of speech content from separated audio input
CN112862068A (en) * 2021-01-15 2021-05-28 复旦大学 Fault-tolerant architecture and method for complex convolutional neural network
CN113129918A (en) * 2021-04-15 2021-07-16 浙江大学 Voice dereverberation method combining beam forming and deep complex U-Net network
CN113129918B (en) * 2021-04-15 2022-05-03 浙江大学 Voice dereverberation method combining beam forming and deep complex U-Net network
CN113936680B (en) * 2021-10-08 2023-08-08 电子科技大学 Single-channel voice enhancement method based on multi-scale information perception convolutional neural network
CN113936680A (en) * 2021-10-08 2022-01-14 电子科技大学 Single-channel speech enhancement method based on multi-scale information perception convolutional neural network
CN115050379A (en) * 2022-04-24 2022-09-13 华侨大学 High-fidelity voice enhancement model based on FHGAN and application thereof
CN117174105A (en) * 2023-11-03 2023-12-05 深圳市龙芯威半导体科技有限公司 Speech noise reduction and dereverberation method based on improved deep convolutional network
CN117219107A (en) * 2023-11-08 2023-12-12 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of echo cancellation model
CN117219107B (en) * 2023-11-08 2024-01-30 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of echo cancellation model

Also Published As

Publication number Publication date
CN110751957B (en) 2020-10-27

Similar Documents

Publication Publication Date Title
CN110751957B (en) Speech enhancement method using stacked multi-scale modules
US10777215B2 (en) Method and system for enhancing a speech signal of a human speaker in a video using visual information
CN110246510B (en) End-to-end voice enhancement method based on RefineNet
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN105957537B (en) Speech denoising method and system based on L1/2 sparse-constraint convolutional non-negative matrix factorization
Su et al. Bandwidth extension is all you need
CN108520753A (en) Speech lie detection method based on convolutional bidirectional long short-term memory network
CN112992121B (en) Voice enhancement method based on attention residual error learning
Zhu et al. FLGCNN: A novel fully convolutional neural network for end-to-end monaural speech enhancement with utterance-based objective functions
Geng et al. End-to-end speech enhancement based on discrete cosine transform
CN114360567A (en) Single-channel voice enhancement method based on deep rewinding product network
CN113823308A (en) Method for denoising voice by using single voice sample with noise
Qian et al. Combining equalization and estimation for bandwidth extension of narrowband speech
Islam et al. Supervised single channel speech enhancement based on stationary wavelet transforms and non-negative matrix factorization with concatenated framing process and subband smooth ratio mask
CN114360571A (en) Reference-based speech enhancement method
Jannu et al. Shuffle attention u-Net for speech enhancement in time domain
CN113571074B (en) Voice enhancement method and device based on multi-band structure time domain audio frequency separation network
CN115881156A (en) Multi-scale-based multi-modal time domain voice separation method
Meutzner et al. A generative-discriminative hybrid approach to multi-channel noise reduction for robust automatic speech recognition
CN115472168A (en) Short-time voice voiceprint recognition method, system and equipment coupling BGCC and PWPE characteristics
Hussain et al. A Novel Speech Intelligibility Enhancement Model based on Canonical Correlation and Deep Learning
Zhao Evaluation of multimedia popular music teaching effect based on audio frame feature recognition technology
CN111968627A (en) Bone conduction speech enhancement method based on joint dictionary learning and sparse representation
Soni et al. Comparing front-end enhancement techniques and multiconditioned training for robust automatic speech recognition
Jeon et al. Lightweight U-Net Based Monaural Speech Source Separation for Edge Computing Device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant