CN112309411B - Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system - Google Patents
- Publication number
- CN112309411B (Application CN202011332442.8A)
- Authority
- CN
- China
- Prior art keywords
- gating
- voice
- real
- imaginary
- scale
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides a phase-sensitive gated multi-scale dilated convolution network speech enhancement method, which uses a neural network model to construct a mapping between the complex spectra of speech signals: the real and imaginary spectra of noisy speech, obtained by time-frequency analysis, are mapped to enhanced real and imaginary spectra, which are then restored to an enhanced time-domain speech signal. The invention also provides a phase-sensitive gated multi-scale dilated convolution network speech enhancement system. The beneficial effects of the invention are as follows: the speech-enhancement effect is improved, the enhanced speech is guaranteed to retain good intelligibility, and speech-distortion problems are largely avoided.
Description
Technical Field
The invention relates to a speech enhancement method, in particular to a phase-sensitive gated multi-scale dilated convolution network speech enhancement method and system.
Background
Early auditory experiments showed that when the signal-to-noise ratio is above 6 dB, phase distortion has little influence on speech quality and intelligibility. Consequently, most current single-channel speech enhancement methods perform noise reduction only in the magnitude domain of the speech signal and reconstruct the waveform directly with the noisy phase. However, when the acoustic scene faced by a voice product is harsher, for example when the signal-to-noise ratio is below 0 dB or the noise locally drowns out the speech, enhancing only the magnitude cannot guarantee that the enhanced speech remains intelligible, and distortion problems such as trembling or buzzing sounds may appear.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a phase-sensitive gated multi-scale dilated convolution network speech enhancement method and system.
The invention provides a phase-sensitive gated multi-scale dilated convolution network speech enhancement method, which uses a neural network model to construct a mapping between the complex spectra of speech signals: the real and imaginary spectra of noisy speech, obtained by time-frequency analysis, are mapped to enhanced real and imaginary spectra, which are then restored to an enhanced time-domain speech signal.
As a further improvement of the present invention, the noisy speech signal is first framed and windowed, then a short-time Fourier transform yields its complex spectrum; the real and imaginary parts are separated, and only the non-redundant half of the spectrum is kept, giving two groups of input features: real-part features and imaginary-part features.
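As a rough illustration, the framing, windowing, and transform steps can be sketched in NumPy as follows. The 16 kHz sampling rate and 20 ms frame / 10 ms hop come from the detailed description below; the Hann window is an assumption, since the text only says "windowing", and `stft_features` is a hypothetical helper name.

```python
import numpy as np

def stft_features(x, sr=16000, frame_ms=20, hop_ms=10):
    """Frame, window, and FFT a waveform; return real/imag feature groups."""
    frame = sr * frame_ms // 1000              # 320 samples per frame
    hop = sr * hop_ms // 1000                  # 160-sample hop (10 ms overlap)
    win = np.hanning(frame)                    # window type assumed, not stated
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop: i * hop + frame] * win
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)         # keep only the effective half: 161 bins
    return spec.real, spec.imag                # two groups, shape (n_frames, 161)
```

With a one-second 16 kHz signal this yields 99 frames of 161 real-part and 161 imaginary-part features, matching the per-frame feature length used by the network below.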
As a further improvement of the present invention, the two groups of input features are then fed into a gated multi-scale dilated convolution network model.
As a further improvement of the present invention, the processing flow of the gated multi-scale dilated convolution network model comprises: first, a gated encoding module performs a gated encoding operation to obtain a high-dimensional nonlinear feature representation; a multi-scale feature analysis module then performs temporal feature analysis separately on the encoded real-part and imaginary-part representations; finally, a gated decoding module performs gated decoding operations to obtain the enhanced real and imaginary spectra.
As a further improvement of the invention, the enhanced real and imaginary spectra are inverse-Fourier-transformed and overlap-added to finally obtain the enhanced speech signal.
As a further improvement of the invention, the gated encoding module is a stack of at least two gated linear encoding units, each of which applies a two-dimensional convolution with a 1×3 kernel and a 1×2 stride.
As a further improvement of the invention, the output of each gated linear encoding unit is passed through an exponential linear activation to apply a nonlinear transformation to the features.
As a further refinement of the invention, the input of the multi-scale feature analysis module comprises two groups of features: (1) the real or imaginary spectrum of the original noisy speech; (2) the real- or imaginary-part features output by the gated encoding module.
As a further improvement of the invention, the multi-scale feature analysis module is a stack of at least two multi-scale analysis units. Each unit concatenates the two groups of feature tensors; before concatenation, both tensors are reshaped into three-dimensional tensors of shape [number of sentences, sentence length, 322]. The concatenated feature tensor is then decomposed into 8 sub-bands: the first 7 have shape [number of sentences, sentence length, 40] and the last has shape [number of sentences, sentence length, 42]. The input of each sub-band is concatenated with the convolution output of the adjacent sub-band before a one-dimensional dilated convolution is applied, and each sub-band convolution is followed by an exponential linear activation. After several multi-scale analysis units, a 1024-dimensional fully connected layer expands the analyzed features, the output feature tensor is reshaped into a four-dimensional tensor of shape [number of sentences, sentence length, 4, 256], and the two reshaped feature tensors are sent separately to the gated decoding module for decoding.
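The sub-band decomposition described above partitions the 322-dimensional feature axis as 7 × 40 + 42 = 322. A minimal sketch (`split_subbands` is a hypothetical helper name):

```python
import numpy as np

def split_subbands(feat):
    """Split a [..., 322] feature tensor into the 8 sub-bands described
    in the text: seven of width 40 and a final one of width 42."""
    assert feat.shape[-1] == 322
    widths = [40] * 7 + [42]                      # 7*40 + 42 = 322
    edges = np.cumsum([0] + widths)
    return [feat[..., a:b] for a, b in zip(edges[:-1], edges[1:])]
```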
The invention also provides a phase-sensitive gated multi-scale dilated convolution network speech enhancement system, comprising a readable storage medium storing execution instructions which, when executed by a processor, implement any of the methods described above.
The beneficial effects of the invention are as follows: through this scheme, the speech-enhancement effect is improved, the enhanced speech is guaranteed to retain good intelligibility, and speech-distortion problems are largely avoided.
Drawings
FIG. 1 is a process flow diagram of the phase-sensitive gated multi-scale dilated convolution network speech enhancement method of the present invention.
FIG. 2 is a structure diagram of the gated multi-scale dilated convolution network of the phase-sensitive gated multi-scale dilated convolution network speech enhancement method of the present invention.
FIG. 3 is a block diagram of the gated linear encoding and decoding units of the phase-sensitive gated multi-scale dilated convolution network speech enhancement method of the present invention.
FIG. 4 is a block diagram of the multi-scale analysis unit of the phase-sensitive gated multi-scale dilated convolution network speech enhancement method of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and the detailed embodiments.
A phase-sensitive gated multi-scale dilated convolution network speech enhancement method uses a neural network model to construct a mapping between the complex spectra of speech signals: the real and imaginary spectra of noisy speech, obtained by time-frequency analysis, are mapped to enhanced real and imaginary spectra, which are then restored to an enhanced time-domain speech signal. The processing flow of the whole algorithm is shown in FIG. 1. The dashed part is the gated multi-scale dilated convolution network designed by the invention; it is the core module of the algorithm and denoises the real and imaginary spectra of the noisy speech through three modules: gated encoding, multi-scale feature analysis, and gated decoding.
As shown in FIG. 1, the noisy speech signal is first framed and windowed, then a short-time Fourier transform yields its complex spectrum; the real and imaginary parts are separated, and only the non-redundant half is kept, giving two groups of input features: real-part features and imaginary-part features. The two groups of features are then fed into the gated multi-scale dilated convolution network model: a gated encoding operation first produces a high-dimensional nonlinear feature representation, the multi-scale feature analysis module then performs temporal feature analysis on each encoded representation, and each is decoded to obtain the enhanced real and imaginary spectra. Each module of the gated multi-scale dilated convolution network is described in detail below.
The detailed structure of the gated multi-scale dilated convolution network is shown in FIG. 2; it consists of three parts: gated encoding, multi-scale feature analysis, and gated decoding. The real and imaginary features of the input noisy speech, X_real(n, k) and X_imag(n, k), first enter the gated encoding part for feature transformation. The structure of the gated linear encoding unit is shown in FIG. 3(a). The input real-imaginary feature tensor has shape [number of sentences, sentence length, 161, 2]: at a 16 kHz sampling rate with a 20 ms frame length and 10 ms frame overlap, 161 in the third dimension is the per-frame feature length of the real or imaginary part, and 2 in the fourth dimension indexes the two parts, real and imaginary. A total of 5 gated linear encoding units are stacked; each uses a 1×3 convolution kernel with a 1×2 stride, and the channel counts are 16, 32, 64, 128, and 256 respectively, so the output tensors of the 5 units are, in order: [number of sentences, sentence length, 80, 16], [number of sentences, sentence length, 39, 32], [number of sentences, sentence length, 19, 64], [number of sentences, sentence length, 9, 128], and [number of sentences, sentence length, 4, 256]. To realize attention control between features, a sigmoid activation squashes the convolution output of one branch of each encoding unit into a probability in [0, 1], which then point-multiplies the convolution output of the other branch in a gated-attention manner.
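The feature-axis lengths quoted above follow from the valid-convolution length rule for a 1×3 kernel with 1×2 stride, and the gating itself is a sigmoid-weighted point-wise product. A minimal sketch under those assumptions (`conv_out_len` and `gated_output` are hypothetical helper names):

```python
import numpy as np

def conv_out_len(n, k=3, s=2):
    # Valid convolution along the feature axis: kernel 1x3, stride 1x2.
    return (n - k) // s + 1

# Feature-axis lengths through the 5 stacked gated linear encoding units.
sizes = [161]
for _ in range(5):
    sizes.append(conv_out_len(sizes[-1]))
# sizes -> [161, 80, 39, 19, 9, 4], matching the tensor shapes above

def gated_output(conv_a, conv_b):
    """Gated attention: one branch is squashed into [0, 1] by a sigmoid
    and point-multiplies the other branch's convolution output."""
    return conv_a * (1.0 / (1.0 + np.exp(-conv_b)))
```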
In addition, the output of each gated linear encoding unit is passed through the exponential linear activation of equation (1) to apply a nonlinear transformation to the features:

f(x) = x for x > 0, and f(x) = α(e^x − 1) for x ≤ 0. (1)
Here α is a parameter optimized during training; the exponential linear activation helps alleviate vanishing gradients during training and makes the model more robust to input noise.
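A NumPy sketch of the exponential linear activation of equation (1); α is a plain scalar here, whereas in the network it is a trained parameter:

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential linear activation: identity for x > 0,
    alpha * (exp(x) - 1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```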
Next, to make full use of the contextual information in the speech signal, we use a multi-scale temporal feature analysis method to analyze and integrate the features of past and current frames and capture the context most useful for estimating the current frame. The structure of the designed multi-scale analysis unit is shown in FIG. 4. Its input has two main parts: the real or imaginary spectrum of the original noisy speech, and the features output by the previous module. These are concatenated; before concatenation, both feature tensors are reshaped into three-dimensional tensors of shape [number of sentences, sentence length, 322]. The concatenated tensor is then decomposed into 8 sub-bands: the first 7 have shape [number of sentences, sentence length, 40] and the last has shape [number of sentences, sentence length, 42]. When each sub-band is convolved, its input is concatenated with the output of the adjacent sub-band before a one-dimensional dilated convolution is applied. A total of 5 multi-scale analysis units are stacked, and to better enlarge the convolutional receptive field the dilation rates increase gradually as 1, 3, 5, 7, and 11. After each sub-band convolution, the exponential linear activation of equation (1) is applied. This sub-band splicing convolution scheme gives each convolution layer a different receptive-field range, growing linearly along the decomposition direction, so the layers have temporal feature analysis capability at different scales.
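The growth of the temporal receptive field across the stacked units can be sketched as follows. The dilation rates 1, 3, 5, 7, 11 come from the text; the kernel size 3 is an assumption, since it is not stated for the one-dimensional dilated layers, and `receptive_field` is a hypothetical helper name.

```python
def receptive_field(dilations, kernel=3):
    """Receptive field (in frames) of a stack of 1-D dilated convolutions:
    each layer with dilation d adds (kernel - 1) * d frames of context."""
    return 1 + (kernel - 1) * sum(dilations)

rf = receptive_field([1, 3, 5, 7, 11])   # 55 frames under the kernel-3 assumption
```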
We also want each multi-scale analysis unit to produce an intermediate estimate of the complex-spectrum features and pass it as input to the next unit. Therefore, after the multi-scale convolution layers we design a fully connected linear decoding layer that linearly transforms the multi-scale features into an intermediate estimate of the real or imaginary part, a tensor of shape [number of sentences, sentence length, 161].
After the 5 multi-scale analysis units, a 1024-dimensional fully connected layer expands the analyzed features, and the output feature tensor is reshaped into a four-dimensional tensor of shape [number of sentences, sentence length, 4, 256]. The two groups of reshaped feature tensors are then fed separately into the gated linear decoding units. The operation of a linear decoding unit is shown in FIG. 3(b). Unlike the encoding unit, it uses a two-dimensional deconvolution to expand the feature tensor: each decoding unit applies a 1×3 kernel with a 1×2 stride, gradually expanding the features of each channel, and the channel counts decrease as 128, 64, 32, 16, and 1, so the output tensors of the 5 decoding units are, in order: [number of sentences, sentence length, 9, 128], [number of sentences, sentence length, 19, 64], [number of sentences, sentence length, 39, 32], [number of sentences, sentence length, 80, 16], and [number of sentences, sentence length, 161, 1]. As with encoding, the output of each gated linear decoding unit is passed through the exponential linear activation of equation (1) to apply a nonlinear transformation.
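The decoder's feature-axis lengths follow the usual transposed-convolution length rule for a 1×3 kernel with 1×2 stride; note that reaching 80 from 39 exactly requires one unit of output padding at that stage, which is our inference rather than something the text states (`deconv_out_len` is a hypothetical helper name):

```python
def deconv_out_len(n, k=3, s=2, output_padding=0):
    """Length rule for the 1x3, stride-1x2 deconvolution used by the
    gated linear decoding units."""
    return (n - 1) * s + k + output_padding

sizes = [4]
for pad in (0, 0, 0, 1, 0):   # assumed: one stage needs padding to recover 80
    sizes.append(deconv_out_len(sizes[-1], output_padding=pad))
# sizes -> [4, 9, 19, 39, 80, 161], mirroring the encoder
```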
After the neural network model is constructed, it must be trained on a large amount of data so that it acquires the capability of mapping to the clean real-imaginary spectra. First, enough pairs of noisy and ideal speech complex spectra are prepared as the training set: we choose the 4620 sentences of the TIMIT dataset [1] as clean training speech, and use 12 kinds of noise from the NOISEX-92 [2] noise library (restaurant noise, 2 kinds of fighter-jet noise, 2 kinds of destroyer noise, factory noise, tank noise, Volvo car noise, high-frequency channel noise, white noise, Leopard military-vehicle noise, and machine-gun noise) as noise data, randomly mixed with the clean speech at signal-to-noise ratios in [−5, 15] dB, giving about 38 hours of noisy training data. To tune the model parameters a validation set is also needed: 280 sentences are selected from the TIMIT test set as clean validation speech and uniformly mixed with the 12 training noises at signal-to-noise ratios from −5 to 15 dB. The loss function for training the gated multi-scale dilated convolution network is the mean square error of equation (2), where n and k are the frame and frequency indices of the speech signal, X_real(n, k) and X_imag(n, k) are the ideal real-imaginary spectra, and X̂_real(n, k) and X̂_imag(n, k) are the real-imaginary spectra output by the network:

L = (1 / (N·K)) Σ_n Σ_k [ (X_real(n, k) − X̂_real(n, k))² + (X_imag(n, k) − X̂_imag(n, k))² ] (2)
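The mean-square-error loss of equation (2), computed jointly over the real and imaginary spectra, can be sketched as (`complex_mse` is a hypothetical helper name):

```python
import numpy as np

def complex_mse(x_real, x_imag, y_real, y_imag):
    """Mean square error over frames n and frequencies k, summed over
    the real-part and imaginary-part spectra."""
    return np.mean((x_real - y_real) ** 2 + (x_imag - y_imag) ** 2)
```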
During training, overfitting is reduced with a 20% random neuron dropout rate and batch normalization; back-propagation uses the Adam optimization algorithm, iterating 50 epochs at a learning rate of 0.001 and then 10 epochs at 0.0001, which yields a gated multi-scale dilated convolution network model that maps to the real-imaginary spectra of clean speech.
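The stepped learning-rate schedule above amounts to (a trivial sketch; `learning_rate` is a hypothetical helper name):

```python
def learning_rate(epoch):
    """50 epochs at 1e-3, then 10 further epochs at 1e-4."""
    return 1e-3 if epoch < 50 else 1e-4
```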
The following experiments verify the noise-reduction effect of the proposed method. To evaluate the quality, intelligibility, and distortion of the denoised speech, the PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and SDR (Signal-to-Distortion Ratio) metrics are used. As shown in Table 1, all results are measured on the test set, with higher values indicating better performance. The test set is a further 320 sentences selected from the TIMIT test set, disjoint from both the training and validation sets, mixed respectively with the 12 trained noises and 3 untrained noises from NOISEX-92 (an untrained fighter-jet noise, an untrained factory noise, and pink noise) at five noise-pollution levels: −5 dB, 0 dB, 5 dB, 10 dB, and 15 dB. The results in Table 1 show that the proposed method not only denoises well in trained noise scenes but also generalizes well to untrained ones. Even under transient noises such as factory and machine-gun noise the method remains clearly effective: abrupt background noise is barely audible and speech quality is well recovered. In some low signal-to-noise environments, the enhanced speech likewise shows no buzzing, trembling, or similar problems. Moreover, the delay of the designed method is below 30 ms, fully meeting the real-time requirements of most voice products.
TABLE 1 evaluation results of the PESQ, STOI and SDR metrics in different noise environments
Unlike prior deep-neural-network noise-reduction methods that enhance only in the magnitude domain, the proposed method models the complex-spectrum information of the speech signal, i.e., the real-imaginary spectra after the Fourier transform. It constructs a dilated convolutional neural network with a multi-scale encoder-decoder architecture and learns the mapping between noisy and clean signals in the complex domain, thereby jointly optimizing phase and magnitude information. The main advantages of the algorithm are as follows:
(1) Learning is carried out in the complex domain, so the enhancement of phase information is taken into account, and better speech intelligibility and quality can be achieved in low signal-to-noise environments;
(2) The real- and imaginary-part information of the complex spectrum amounts to two learning targets; compared with a method mapping the magnitude spectrum alone, the multi-target model generalizes better;
(3) Modeling with multi-scale convolution captures the contextual information in speech more finely and recovers more speech detail;
(4) The designed model is a fully causal system, that is, its output depends only on current- and past-frame information, minimizing the delay of the algorithm.
References:
[1] J. S. Garofolo, "Getting Started with the DARPA TIMIT CD-ROM: An Acoustic-Phonetic Continuous Speech Database," NIST Tech. Report, 1988.
[2] A. Varga and H. J. M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, 1993.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and the invention is not to be considered limited to these specific embodiments. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.
Claims (5)
1. A phase-sensitive gated multi-scale dilated convolution network speech enhancement method, characterized in that: a neural network model is used to construct a mapping between the complex spectra of speech signals, the real and imaginary spectra of noisy speech obtained by time-frequency analysis are mapped to enhanced real and imaginary spectra, and the enhanced real and imaginary spectra are restored to an enhanced time-domain speech signal; the noisy speech signal is first framed and windowed, then a short-time Fourier transform yields its complex spectrum, the real and imaginary parts are separated, and only the non-redundant half is kept, giving two groups of input features: real-part features and imaginary-part features; the two groups of input features are then fed into a gated multi-scale dilated convolution network model whose processing flow comprises: first, a gated encoding module performs a gated encoding operation to obtain a high-dimensional nonlinear feature representation; a multi-scale feature analysis module then performs temporal feature analysis separately on the encoded real-part and imaginary-part representations; a gated decoding module performs gated decoding operations to obtain the enhanced real and imaginary spectra; the input of the multi-scale feature analysis module comprises two groups of features: (1) the real or imaginary spectrum of the original noisy speech; (2) the real- or imaginary-part features output by the gated encoding module; the multi-scale feature analysis module is a stack of at least two multi-scale analysis units, each of which concatenates the two groups of feature tensors, both tensors being reshaped before concatenation into three-dimensional tensors of shape [number of sentences, sentence length, 322]; the concatenated feature tensor is then decomposed into 8 sub-bands, the first 7 of shape [number of sentences, sentence length, 40] and the last of shape [number of sentences, sentence length, 42]; the input of each sub-band is concatenated with the convolution output of its adjacent sub-band before a one-dimensional dilated convolution is applied, each sub-band convolution being followed by an exponential linear activation; after the multi-scale analysis units, a 1024-dimensional fully connected layer expands the analyzed features, the output feature tensor is reshaped into a four-dimensional tensor of shape [number of sentences, sentence length, 4, 256], and the two reshaped feature tensors are sent separately to the gated decoding module for decoding.
2. The phase-sensitive gated multi-scale dilated convolution network speech enhancement method of claim 1, characterized in that: the enhanced real and imaginary spectra are inverse-Fourier-transformed and overlap-added to finally obtain the enhanced speech signal.
3. The phase-sensitive gated multi-scale dilated convolution network speech enhancement method of claim 1, characterized in that: the gated encoding module is a stack of at least two gated linear encoding units, each of which applies a two-dimensional convolution with a 1×3 kernel and a 1×2 stride.
4. The phase-sensitive gated multi-scale dilated convolution network speech enhancement method of claim 3, characterized in that: the output of each gated linear encoding unit is passed through an exponential linear activation to apply a nonlinear transformation to the features.
5. A phase-sensitive gated multi-scale dilated convolution network speech enhancement system, characterized in that: it comprises a readable storage medium storing execution instructions which, when executed by a processor, implement the method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011332442.8A CN112309411B (en) | 2020-11-24 | 2020-11-24 | Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112309411A CN112309411A (en) | 2021-02-02 |
CN112309411B true CN112309411B (en) | 2024-06-11 |
Family
ID=74335732
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011332442.8A Active CN112309411B (en) | 2020-11-24 | 2020-11-24 | Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112309411B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113129873B (en) * | 2021-04-27 | 2022-07-08 | 思必驰科技股份有限公司 | Optimization method and system for stack type one-dimensional convolution network awakening acoustic model |
CN113707163B (en) * | 2021-08-31 | 2024-05-14 | 北京达佳互联信息技术有限公司 | Speech processing method and device and model training method and device |
CN114283829B (en) * | 2021-12-13 | 2023-06-16 | 电子科技大学 | Voice enhancement method based on dynamic gating convolution circulation network |
CN114842863B (en) * | 2022-04-19 | 2023-06-02 | 电子科技大学 | Signal enhancement method based on multi-branch-dynamic merging network |
CN115862581A (en) * | 2023-02-10 | 2023-03-28 | 杭州兆华电子股份有限公司 | Secondary elimination method and system for repeated pattern noise |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA3156908A1 (en) * | 2015-01-06 | 2016-07-14 | David Burton | Mobile wearable monitoring systems |
WO2019075267A1 (en) * | 2017-10-11 | 2019-04-18 | Google Llc | Self-gating activation neural network layers |
CN109935243A (en) * | 2019-02-25 | 2019-06-25 | 重庆大学 | Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model |
CN110136731A (en) * | 2019-05-13 | 2019-08-16 | 天津大学 | Empty cause and effect convolution generates the confrontation blind Enhancement Method of network end-to-end bone conduction voice |
CN110534123A (en) * | 2019-07-22 | 2019-12-03 | 中国科学院自动化研究所 | Sound enhancement method, device, storage medium, electronic equipment |
CN110674866A (en) * | 2019-09-23 | 2020-01-10 | 兰州理工大学 | Method for detecting X-ray breast lesion images by using transfer learning characteristic pyramid network |
CN111160040A (en) * | 2019-12-26 | 2020-05-15 | 西安交通大学 | Information reliability evaluation system and method based on multi-scale gating equilibrium interaction fusion network |
CN111401250A (en) * | 2020-03-17 | 2020-07-10 | 东北大学 | Chinese lip language identification method and device based on hybrid convolutional neural network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160111107A1 (en) * | 2014-10-21 | 2016-04-21 | Mitsubishi Electric Research Laboratories, Inc. | Method for Enhancing Noisy Speech using Features from an Automatic Speech Recognition System |
Non-Patent Citations (1)
Title |
---|
A deep convolutional neural network speech enhancement method incorporating phase estimation; Yuan Wenhao, Liang Chunyan, Xia Bin, Sun Wenzhu; Acta Electronica Sinica; 2018-10-15 (No. 10) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112309411B (en) | Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system | |
CN111971743B (en) | Systems, methods, and computer readable media for improved real-time audio processing | |
US11069344B2 (en) | Complex evolution recurrent neural networks | |
CN109841226B (en) | Single-channel real-time noise reduction method based on convolution recurrent neural network | |
EP3926623B1 (en) | Speech recognition method and apparatus, and neural network training method and apparatus | |
US20210295859A1 (en) | Enhanced multi-channel acoustic models | |
CN107680611B (en) | Single-channel sound separation method based on convolutional neural network | |
CN110060690B (en) | Many-to-many speaker conversion method based on STARGAN and ResNet | |
CN110739003B (en) | Voice enhancement method based on multi-head self-attention mechanism | |
WO2019227586A1 (en) | Voice model training method, speaker recognition method, apparatus, device and medium | |
EP4235646A2 (en) | Adaptive audio enhancement for multichannel speech recognition | |
CN110060657B (en) | SN-based many-to-many speaker conversion method | |
CN112331224A (en) | Lightweight time domain convolution network voice enhancement method and system | |
CN113936681B (en) | Speech enhancement method based on mask mapping and mixed cavity convolution network | |
CN111899757B (en) | Single-channel voice separation method and system for target speaker extraction | |
Wang et al. | Recurrent deep stacking networks for supervised speech separation | |
Mundodu Krishna et al. | Single channel speech separation based on empirical mode decomposition and Hilbert transform | |
JP2023546099A (en) | Audio generator, audio signal generation method, and audio generator learning method | |
Hasannezhad et al. | PACDNN: A phase-aware composite deep neural network for speech enhancement | |
Du et al. | A joint framework of denoising autoencoder and generative vocoder for monaural speech enhancement | |
CN108806725A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
Sheeja et al. | Speech dereverberation and source separation using DNN-WPE and LWPR-PCA | |
CN113571074B (en) | Voice enhancement method and device based on multi-band structure time domain audio frequency separation network | |
Azam et al. | Urdu spoken digits recognition using classified MFCC and backpropgation neural network | |
CN113707172A (en) | Single-channel voice separation method, system and computer equipment of sparse orthogonal network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||