CN112309411A - Phase-sensitive gated multi-scale void convolutional network speech enhancement method and system - Google Patents
- Publication number
- CN112309411A (application CN202011332442.8A)
- Authority
- CN
- China
- Prior art keywords
- scale
- real
- phase
- voice
- gated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention provides a phase-sensitive gated multi-scale dilated convolutional network speech enhancement method: a neural network model constructs a mapping between the complex spectra of speech signals, the real and imaginary spectra of noisy speech obtained by time-frequency analysis are mapped to enhanced real and imaginary spectra, and these are restored to an enhanced time-domain speech signal. The invention also provides a phase-sensitive gated multi-scale dilated convolutional network speech enhancement system. The beneficial effects of the invention are: the speech enhancement effect is improved, the enhanced speech retains good intelligibility, and speech distortion is largely avoided.
Description
Technical Field
The invention relates to a speech enhancement method, in particular to a phase-sensitive gated multi-scale dilated convolutional network speech enhancement method and system.
Background
Early hearing experiments showed that when the signal-to-noise ratio is above 6 dB, phase distortion has little influence on speech quality and intelligibility. Most current single-channel speech enhancement methods therefore perform noise reduction only in the amplitude domain of the speech signal and reconstruct the signal directly with the noisy phase. However, when a voice product faces a harsher acoustic scene, for example a signal-to-noise ratio below 0 dB, or noise that completely submerges the speech at certain moments, enhancing only the amplitude cannot guarantee good intelligibility of the enhanced speech, and distortion problems such as trembling and buzzing may even appear.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a phase-sensitive gated multi-scale dilated convolutional network speech enhancement method and system.
The invention provides a phase-sensitive gated multi-scale dilated convolutional network speech enhancement method in which a neural network model constructs a mapping between the complex spectra of speech signals: the real and imaginary spectra of noisy speech obtained by time-frequency analysis are mapped to enhanced real and imaginary spectra, which are then restored to an enhanced time-domain speech signal.
As a further improvement of the invention, the noisy speech signal is first framed and windowed, then short-time Fourier transformed to obtain its complex spectrum; the real and imaginary parts are separated and only the effective (non-redundant, one-sided) values are kept, giving two sets of input features: real-part features and imaginary-part features.
As a further improvement of the present invention, the two sets of input features are then fed into a gated multi-scale dilated convolutional network model.
As a further improvement of the invention, the processing flow of the gated multi-scale dilated convolutional network model comprises the following steps: a gated encoding module first performs a gated encoding operation to obtain a high-dimensional nonlinear feature representation; a multi-scale feature analysis module then performs temporal feature analysis on the encoded real-part and imaginary-part feature representations separately; and a gated decoding module performs gated decoding operations separately to obtain the enhanced real-imaginary spectrum.
As a further improvement of the invention, the enhanced real-imaginary spectrum is inverse-Fourier-transformed and overlap-added to finally obtain the enhanced speech signal.
As a further improvement of the present invention, the gated encoding module is formed by stacking at least two gated linear encoding units; each unit uses a 1 × 3 convolution kernel and performs a two-dimensional convolution with a stride of 1 × 2.
As a further improvement of the invention, the output of each gated linear encoding unit is passed through an exponential linear activation to perform a nonlinear transformation of the features.
As a further refinement of the present invention, the input to the multi-scale feature analysis module comprises two sets of features: (1) real or imaginary spectra of the original noisy speech; (2) real or imaginary features of the gated encoding module output.
As a further improvement of the present invention, the multi-scale feature analysis module is formed by stacking at least two multi-scale analysis units. Each multi-scale analysis unit concatenates two sets of feature tensors; before concatenation, the two tensors must be reshaped into three-dimensional tensors of shape [sentence number, sentence length, 322]. The concatenated feature tensor is then decomposed into 8 sub-bands: the first 7 have shape [sentence number, sentence length, 40] and the last has shape [sentence number, sentence length, 42]. The input of the current sub-band is concatenated with the convolution outputs of the adjacent sub-bands before the one-dimensional dilated convolution is applied, and each sub-band convolution is followed by an exponential linear activation. After several multi-scale analysis units, a 1024-dimensional fully connected layer expands the multi-scale-analyzed features, and the output feature tensor is reshaped into the 4-dimensional form [sentence number, sentence length, 4, 256]. The two reshaped feature tensors are then sent separately to the gated decoding module for decoding.
The invention also provides a phase-sensitive gated multi-scale dilated convolutional network speech enhancement system, comprising a readable storage medium in which execution instructions are stored; when executed by a processor, the instructions implement the method according to any one of the above.
The beneficial effects of the invention are: the scheme improves the speech enhancement effect, ensures that the enhanced speech retains good intelligibility, and largely avoids speech distortion.
Drawings
FIG. 1 is a block diagram of the processing flow of the phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of the present invention.
FIG. 2 is a diagram of the gated multi-scale dilated convolutional network structure of the method.
FIG. 3 is a block diagram of the gated linear encoding and decoding units of the method.
FIG. 4 is a diagram of the multi-scale analysis unit of the method.
Detailed Description
The invention is further described with reference to the following description and embodiments in conjunction with the accompanying drawings.
A phase-sensitive gated multi-scale dilated convolutional network speech enhancement method aims to use a neural network model to construct a mapping between the complex spectra of speech signals: the real and imaginary spectra of noisy speech, obtained by time-frequency analysis, are mapped to enhanced real and imaginary spectra, which are restored to an enhanced time-domain speech signal. The processing flow of the whole algorithm is shown in Fig. 1. The dashed-line part is the gated multi-scale dilated convolutional network designed by the invention, the core module of the whole algorithm; it performs noise reduction on the real and imaginary spectra of the noisy speech through three modules: gated encoding, multi-scale feature analysis and gated decoding.
As shown in Fig. 1, the noisy speech signal is first framed and windowed, then short-time Fourier transformed to obtain its complex spectrum; the real and imaginary parts are separated and only the effective (one-sided) values are kept, giving two sets of input features: real-part features and imaginary-part features. The two feature sets are then fed into the gated multi-scale dilated convolutional network model: a gated encoding operation first produces a high-dimensional nonlinear feature representation, the multi-scale feature analysis module then performs temporal feature analysis on the encoded real-part and imaginary-part representations separately, and separate decoding yields the enhanced real-imaginary spectrum. The modules of the gated multi-scale dilated convolutional network are detailed below.
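The time-frequency analysis step above can be sketched as follows. This is a minimal sketch, assuming the parameters stated later in the text (16 kHz sampling, 20 ms frames, 10 ms overlap); `scipy.signal.stft` stands in for whatever STFT implementation the authors actually used:

```python
import numpy as np
from scipy.signal import stft

# Parameters inferred from the description: 16 kHz sampling,
# 20 ms frames (320 samples), 10 ms (160-sample) overlap.
fs = 16000
frame_len = 320
hop = 160

rng = np.random.default_rng(0)
noisy = rng.standard_normal(fs)  # 1 s of stand-in "noisy speech"

# One-sided STFT: 320-point frames give 320 // 2 + 1 = 161 frequency bins,
# matching the per-frame feature length of 161 stated in the text.
_, _, Y = stft(noisy, fs=fs, nperseg=frame_len, noverlap=frame_len - hop)

X_real = Y.real.T  # (num_frames, 161): real-part input features
X_imag = Y.imag.T  # (num_frames, 161): imaginary-part input features
```

The one-sided spectrum is what the text calls the "effective value part": the remaining bins of a real signal's 320-point transform are redundant conjugates.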
The detailed structure of the gated multi-scale dilated convolutional network is shown in Fig. 2 and comprises three parts: gated encoding, multi-scale feature analysis and gated decoding. The real- and imaginary-part features of the input noisy speech, X_real(n, k) and X_imag(n, k), first enter the gated encoding part for feature transformation. The structure of the gated linear encoding unit is shown in Fig. 3(a). The tensor shape of the input real and imaginary features is [sentence number, sentence length, 161, 2]: with a 16 kHz sampling rate, the speech frames are 20 ms long with 10 ms overlap, so 161 in the third dimension is the feature length per frame of the real or imaginary part, and 2 in the fourth dimension indexes the real and imaginary parts. Five gated linear encoding units are stacked; each uses a 1 × 3 convolution kernel and performs a two-dimensional convolution with a stride of 1 × 2, with channel counts of 16, 32, 64, 128 and 256 respectively, so the output tensors of the 5 encoding units are, in turn: [sentence number, sentence length, 80, 16], [sentence number, sentence length, 39, 32], [sentence number, sentence length, 19, 64], [sentence number, sentence length, 9, 128] and [sentence number, sentence length, 4, 256]. To realize attention control among features, in each encoding unit a Sigmoid activation nonlinearly maps one convolution branch to probability values in [0, 1], which are then pointwise-multiplied onto the convolution output of the other branch in a gated attention manner.
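The gated attention mechanism just described can be sketched in numpy. This is illustrative only: plain matrix products stand in for the patent's 1 × 3 convolutions, and the projection sizes are our own choices, not the network's actual dimensions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_unit(x, w_lin, w_gate):
    """One branch is squashed to (0, 1) by a Sigmoid and pointwise-multiplied
    onto the parallel linear branch, i.e. gated attention between features."""
    gate = sigmoid(x @ w_gate)   # attention weights in (0, 1)
    return (x @ w_lin) * gate    # pointwise product with the other branch

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 161))       # 4 frames of 161-dim features
w_lin = rng.standard_normal((161, 80))  # illustrative projection sizes
w_gate = rng.standard_normal((161, 80))
y = gated_linear_unit(x, w_lin, w_gate)
```

Because the gate lies in (0, 1), it can only attenuate the linear branch, which is what lets it act as a soft attention mask over the features.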
In addition, the output of each gated linear encoding unit is passed through the exponential linear activation of formula (1) below for nonlinear feature transformation.
Here α is a parameter optimized during training; the exponential linear activation helps mitigate gradient vanishing during training and makes the model more robust to input noise.
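Given the description (a trainable parameter α and exponential behavior on the negative side), formula (1) is presumably the standard exponential linear unit (ELU):

```latex
f(x) =
\begin{cases}
x, & x > 0 \\
\alpha \left( e^{x} - 1 \right), & x \le 0
\end{cases}
\tag{1}
```

For negative inputs the output saturates smoothly at −α instead of cutting to zero, which is the property credited with easing gradient vanishing.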
Next, to fully exploit the context between speech frames, a multi-scale time-domain feature analysis method is adopted to analyze and synthesize feature information from past and current frames, capturing context that better supports estimating the current frame's features. The structure of the designed multi-scale analysis unit is shown in Fig. 4. Its input has two main parts: the real or imaginary spectrum of the original noisy speech, and the features output by the preceding module. These are concatenated; before concatenation, the two tensors must be reshaped into three-dimensional tensors of shape [sentence number, sentence length, 322]. The concatenated feature tensor is then decomposed into 8 sub-bands: the first 7 have shape [sentence number, sentence length, 40] and the last has shape [sentence number, sentence length, 42]. When convolving each sub-band, the input of the current sub-band is concatenated with the convolution outputs of its adjacent sub-bands before the one-dimensional dilated convolution is applied. Since 5 multi-scale analysis units are stacked, the dilation rates are gradually increased, to 1, 3, 5, 7 and 11 respectively, to better enlarge the receptive field of the convolutions. After each sub-band convolution, the exponential linear activation of equation (1) is applied. This sub-band concatenation scheme gives each convolutional layer a different receptive field range; along the decomposition direction the receptive field grows linearly, so the layers possess temporal feature analysis capabilities at different scales.
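A sketch of the sub-band decomposition and dilation schedule described above. The split widths and dilation rates come from the text; the kernel size of 3 used in the receptive-field arithmetic is our assumption:

```python
import numpy as np

# Sub-band decomposition as described: a 322-dim concatenated feature
# vector splits into 7 sub-bands of width 40 plus one of width 42.
widths = [40] * 7 + [42]                     # 7 * 40 + 42 = 322
features = np.arange(322.0)[None, None, :]   # [1 sentence, 1 frame, 322]
bounds = np.cumsum([0] + widths)
subbands = [features[..., bounds[i]:bounds[i + 1]] for i in range(8)]

# Dilation rates grow over the 5 stacked units: 1, 3, 5, 7 and 11.
# For stacked dilated convs of kernel size k (k = 3 is our assumption),
# the receptive field is 1 + sum((k - 1) * d) = 1 + 2 * 27 = 55 steps.
dilations = [1, 3, 5, 7, 11]
k = 3
receptive_field = 1 + sum((k - 1) * d for d in dilations)
```

The growing dilation rates are what make five thin layers cover a long temporal context without the cost of wide kernels.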
In addition, each multi-scale analysis unit should produce an intermediate estimate of the complex spectral features, which serves as input to the next multi-scale analysis unit. Therefore, a fully connected linear decoding layer is designed after the multi-scale convolutional layers; it linearly transforms the multi-scale features into an intermediate estimate of the real or imaginary part, whose tensor shape is [sentence number, sentence length, 161].
After the 5 multi-scale analysis units, a 1024-dimensional fully connected layer expands the multi-scale-analyzed features, and the output feature tensor is reshaped into the 4-dimensional form [sentence number, sentence length, 4, 256]. The two reshaped feature tensors are then sent separately to the gated linear decoding units for decoding. The operation of the linear decoding unit is shown in Fig. 3(b); unlike the encoding unit, it expands the feature tensor by a two-dimensional deconvolution (transposed convolution). Each decoding unit uses a 1 × 3 convolution kernel with a stride of 1 × 2, gradually widening each channel's features, while the channel count decreases through 128, 64, 32, 16 and 1, so the output tensors of the 5 decoding units are, in turn: [sentence number, sentence length, 9, 128], [sentence number, sentence length, 19, 64], [sentence number, sentence length, 39, 32], [sentence number, sentence length, 80, 16] and [sentence number, sentence length, 161, 1]. As with encoding, the output of each gated linear decoding unit undergoes the exponential linear activation of equation (1) for nonlinear feature transformation.
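The encoder's frequency-axis widths (161 → 80 → 39 → 19 → 9 → 4) are consistent with a kernel-3, stride-2 convolution without padding; the padding scheme is our assumption, but the arithmetic can be checked:

```python
# Width of a "valid" convolution along the frequency axis.  The stated
# encoder shapes 161 -> 80 -> 39 -> 19 -> 9 -> 4 follow from kernel 3,
# stride 2, no padding (the padding scheme is our assumption).
def conv_out(width, kernel=3, stride=2):
    return (width - kernel) // stride + 1

widths = [161]
for _ in range(5):
    widths.append(conv_out(widths[-1]))

# The decoder mirrors this with transposed convolutions; note that under
# these assumptions the 39 -> 80 step needs an output padding of 1,
# since (39 - 1) * 2 + 3 = 79.
```

This kind of shape bookkeeping is worth doing up front for any encoder-decoder pair, since odd widths (161, 39, 19, 9) do not invert cleanly under stride-2 deconvolution.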
After the neural network model is constructed, a large amount of data is needed to train it so that it acquires the ability to map to the clean real-imaginary spectrum. First, enough pairs of noisy-speech complex spectra and ideal (clean) speech complex spectra must be prepared as a training data set. We therefore select 4620 sentences from the TIMIT data set [1] as the clean training speech, and use 12 noise types from the NOISEX-92 noise library [2] (restaurant noise, 2 kinds of fighter-jet noise, 2 kinds of destroyer noise, factory noise, tank noise, Volvo car noise, high-frequency channel noise, white noise, Leopard combat-vehicle noise and machine-gun noise) as noise data to be randomly mixed with the clean speech. The mixing signal-to-noise ratio is uniformly distributed over [-5, 15] dB, yielding about 38 hours of noisy training data in total. To tune the model's parameters, a validation set is also required: 280 sentences are selected from the TIMIT test set as clean validation speech and uniformly mixed with the 12 training noises at signal-to-noise ratios from -5 dB to 15 dB. The loss function of the gated multi-scale dilated convolutional network during training is the mean squared error, computed as in equation (2), where n and k are the frame and frequency indices of the speech signal, X_real(n, k) and X_imag(n, k) are the ideal real and imaginary spectra, and X̂_real(n, k) and X̂_imag(n, k) are the real and imaginary spectra output by the neural network:

L = (1 / (N·K)) Σ_n Σ_k [ (X_real(n, k) − X̂_real(n, k))² + (X_imag(n, k) − X̂_imag(n, k))² ]   (2)
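The loss of equation (2) can be sketched directly in numpy over stand-in spectra (the array shapes are illustrative):

```python
import numpy as np

def complex_spectrum_mse(x_real, x_imag, est_real, est_imag):
    """Mean squared error summed over the real and imaginary spectra and
    averaged over frames n and frequency bins k, as in equation (2)."""
    return np.mean((x_real - est_real) ** 2 + (x_imag - est_imag) ** 2)

rng = np.random.default_rng(2)
ideal_r = rng.standard_normal((10, 161))  # stand-in ideal spectra
ideal_i = rng.standard_normal((10, 161))

loss_perfect = complex_spectrum_mse(ideal_r, ideal_i, ideal_r, ideal_i)
loss_noisy = complex_spectrum_mse(ideal_r, ideal_i, ideal_r + 0.1, ideal_i)
```

Summing the real and imaginary squared errors inside one loss is what makes the two spectra act as joint learning targets rather than independently trained outputs.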
during training, the overfitting problem of the model is reduced in a mode of 20% random neuron inactivation rate and batch normalization, backward propagation is carried out by using an Adam optimization algorithm, iteration is carried out for 50 times at a learning rate of 0.001, and then iteration is carried out for 10 times at a learning rate of 0.0001, so that a gated multi-scale cavity convolution network model with a mapped pure speech real-imaginary part spectrum can be obtained.
The following experiments verify the noise reduction effect of the proposed method. To evaluate the quality, intelligibility and distortion of the denoised speech, we adopt the PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility) and SDR (Signal-to-Distortion Ratio) indices. As shown in Table 1, all noise reduction results were measured on the test set; higher scores indicate better performance. The test set consists of another 320 sentences from the TIMIT test set, overlapping neither the training nor the validation set, mixed with the 12 trained noises and 3 untrained noises from NOISEX-92 (an untrained fighter-jet noise, an untrained factory noise, and pink noise) at five noise pollution levels: -5 dB, 0 dB, 5 dB, 10 dB and 15 dB. The results in Table 1 show that the proposed method not only achieves good noise reduction in trained noise scenarios but also generalizes well to untrained ones, demonstrating good model generalization. Even with transient noises such as factory noise and machine-gun fire, the method remains clearly effective: sharp background noise is almost inaudible and speech quality is well recovered. In low signal-to-noise-ratio environments the enhanced speech exhibits no humming or jitter. In addition, the latency of the designed method is below 30 ms, fully meeting the real-time requirements of most voice products.
TABLE 1 evaluation results of PESQ, STOI and SDR indexes under different noise environments
Unlike prior deep-neural-network noise reduction methods that enhance only in the amplitude domain, this method models the complex spectral information of the speech signal, i.e. the real-imaginary spectrum after the Fourier transform: it constructs a dilated convolutional neural network with a multi-scale encoding-decoding architecture and learns the mapping between noisy and clean signals in the complex domain, so that phase and amplitude information are optimized jointly. The main advantages of the algorithm are as follows:
(1) learning is carried out in a complex domain, enhancement of phase information is considered, and better speech intelligibility and speech quality can be realized in a low signal-to-noise ratio environment;
(2) real and imaginary part information of the complex spectrum is equivalent to two learning targets, and compared with a method for mapping a single magnitude spectrum, the multi-target model has better generalization performance;
(3) the modeling is carried out by utilizing a multi-scale convolution method, so that the context information in the voice can be captured more finely, and more voice details can be recovered;
(4) the designed model is a completely causal system, namely, the output of the model is only related to the information of the current frame and the past frame, and the time delay of the algorithm is reduced to the maximum extent.
References:
[1] J. S. Garofolo, "Getting started with the DARPA TIMIT CD-ROM: An acoustic-phonetic continuous speech database," NIST Tech Report, 1988.
[2] Andrew Varga, Herman J. M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, 1993.
The foregoing is a more detailed description of the invention with reference to specific preferred embodiments, and the invention is not to be considered limited to these details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered within the protection scope of the invention.
Claims (10)
1. A phase-sensitive gated multi-scale dilated convolutional network speech enhancement method, characterized by: constructing a mapping between complex spectra of speech signals with a neural network model, mapping the real-imaginary spectrum of noisy speech after time-frequency analysis to obtain an enhanced real-imaginary spectrum, and restoring the enhanced real-imaginary spectrum to an enhanced time-domain speech signal.
2. The phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of claim 1, characterized by: first framing and windowing the noisy speech signal, then applying a short-time Fourier transform to obtain its complex spectrum, separating the real and imaginary parts and keeping only the effective (non-redundant) values, giving two sets of input features: real-part features and imaginary-part features.
3. The phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of claim 2, characterized by: then feeding the two sets of input features into a gated multi-scale dilated convolutional network model.
4. The phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of claim 3, characterized by: the processing flow of the gated multi-scale dilated convolutional network model comprises: a gated encoding module first performs a gated encoding operation to obtain a high-dimensional nonlinear feature representation; a multi-scale feature analysis module then performs temporal feature analysis on the encoded real-part and imaginary-part feature representations separately; and a gated decoding module performs gated decoding operations separately to obtain the enhanced real-imaginary spectrum.
5. The phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of claim 4, characterized by: the enhanced real-imaginary spectrum is inverse-Fourier-transformed and overlap-added to finally obtain the enhanced speech signal.
6. The phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of claim 4, characterized by: the gated encoding module is formed by stacking at least two gated linear encoding units, each using a 1 × 3 convolution kernel and performing a two-dimensional convolution with a stride of 1 × 2.
7. The phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of claim 6, characterized by: the output of each gated linear encoding unit is passed through an exponential linear activation to perform a nonlinear transformation of the features.
8. The phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of claim 4, characterized by: the input to the multi-scale feature analysis module comprises two sets of features: (1) the real or imaginary spectrum of the original noisy speech; (2) the real or imaginary features output by the gated encoding module.
9. The phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of claim 8, characterized by: the multi-scale feature analysis module is formed by stacking at least two multi-scale analysis units; each multi-scale analysis unit concatenates two sets of feature tensors, which before concatenation must be reshaped into three-dimensional tensors of shape [sentence number, sentence length, 322]; the concatenated feature tensor is then decomposed into 8 sub-bands, the first 7 of shape [sentence number, sentence length, 40] and the last of shape [sentence number, sentence length, 42]; the input of the current sub-band is concatenated with the convolution outputs of the adjacent sub-bands before the one-dimensional dilated convolution is applied, and each sub-band convolution is followed by an exponential linear activation; after the stacked multi-scale analysis units, a 1024-dimensional fully connected layer expands the multi-scale-analyzed features and the output feature tensor is reshaped into the 4-dimensional form [sentence number, sentence length, 4, 256]; the two reshaped feature tensors are then sent separately to the gated decoding module for decoding.
10. A phase-sensitive gated multi-scale dilated convolutional network speech enhancement system, characterized by: comprising a readable storage medium storing execution instructions which, when executed by a processor, implement the method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011332442.8A CN112309411A (en) | 2020-11-24 | 2020-11-24 | Phase-sensitive gated multi-scale void convolutional network speech enhancement method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011332442.8A CN112309411A (en) | 2020-11-24 | 2020-11-24 | Phase-sensitive gated multi-scale void convolutional network speech enhancement method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112309411A true CN112309411A (en) | 2021-02-02 |
Family
ID=74335732
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011332442.8A Pending CN112309411A (en) | 2020-11-24 | 2020-11-24 | Phase-sensitive gated multi-scale void convolutional network speech enhancement method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112309411A (en) |
- 2020-11-24 CN CN202011332442.8A patent/CN112309411A/en active Pending
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113129873A (en) * | 2021-04-27 | 2021-07-16 | 思必驰科技股份有限公司 | Optimization method and system for stack type one-dimensional convolution network awakening acoustic model |
CN113707163A (en) * | 2021-08-31 | 2021-11-26 | 北京达佳互联信息技术有限公司 | Speech processing method and apparatus, and model training method and apparatus |
CN113707163B (en) * | 2021-08-31 | 2024-05-14 | 北京达佳互联信息技术有限公司 | Speech processing method and device and model training method and device |
CN114283829A (en) * | 2021-12-13 | 2022-04-05 | 电子科技大学 | Voice enhancement method based on dynamic gate control convolution cyclic network |
CN114283829B (en) * | 2021-12-13 | 2023-06-16 | 电子科技大学 | Voice enhancement method based on dynamic gating convolution circulation network |
CN114842863A (en) * | 2022-04-19 | 2022-08-02 | 电子科技大学 | Signal enhancement method based on multi-branch-dynamic merging network |
CN114842863B (en) * | 2022-04-19 | 2023-06-02 | 电子科技大学 | Signal enhancement method based on multi-branch-dynamic merging network |
CN115862581A (en) * | 2023-02-10 | 2023-03-28 | 杭州兆华电子股份有限公司 | Secondary elimination method and system for repeated pattern noise |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112309411A (en) | Phase-sensitive gated multi-scale void convolutional network speech enhancement method and system | |
CN109841226B (en) | Single-channel real-time noise reduction method based on convolution recurrent neural network | |
CN111971743B (en) | Systems, methods, and computer readable media for improved real-time audio processing | |
US20220004870A1 (en) | Speech recognition method and apparatus, and neural network training method and apparatus | |
CN112581973B (en) | Voice enhancement method and system | |
CN112735460A (en) | Beam forming method and system based on time-frequency masking value estimation | |
CN113936681A (en) | Voice enhancement method based on mask mapping and mixed hole convolution network | |
CN113096682B (en) | Real-time voice noise reduction method and device based on mask time domain decoder | |
JP2023546099A (en) | Audio generator, audio signal generation method, and audio generator learning method | |
Sun et al. | A model compression method with matrix product operators for speech enhancement | |
Guimarães et al. | Monaural speech enhancement through deep wave-U-net | |
Li et al. | A multi-objective learning speech enhancement algorithm based on IRM post-processing with joint estimation of SCNN and TCNN | |
Lim et al. | Harmonic and percussive source separation using a convolutional auto encoder | |
CN114818806A (en) | Gearbox fault diagnosis method based on wavelet packet and depth self-encoder | |
CN116486826A (en) | Voice enhancement method based on converged network | |
CN114067819A (en) | Speech enhancement method based on cross-layer similarity knowledge distillation | |
Ram et al. | Speech enhancement through improvised conditional generative adversarial networks | |
CN116391191A (en) | Generating neural network models for processing audio samples in a filter bank domain | |
CN117174105A (en) | Speech noise reduction and dereverberation method based on improved deep convolutional network | |
Raj et al. | Multilayered convolutional neural network-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients | |
CN116682444A (en) | Single-channel voice enhancement method based on waveform spectrum fusion network | |
Hao et al. | Optimizing the perceptual quality of time-domain speech enhancement with reinforcement learning | |
CN113571074B (en) | Voice enhancement method and device based on multi-band structure time domain audio frequency separation network | |
Parvathala et al. | Neural comb filtering using sliding window attention network for speech enhancement | |
Grzywalski et al. | Speech enhancement using U-nets with wide-context units |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||