CN115101085A - Multi-speaker time-domain voice separation method for enhancing external attention through convolution - Google Patents
- Publication number
- CN115101085A (application CN202210647059.4A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- voice
- module
- separation
- mask
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention relates to the technical field of speech processing, in particular to a multi-speaker time-domain speech separation method with convolution-enhanced external attention. The method comprises the following steps: S1, a convolution operation is performed on the mixed speech of multiple speakers by an encoder, converting it into a latent feature representation; S2, a speech mask is learned by a separator based on a convolution-enhanced external attention module; S3, the speech mask is multiplied by the latent feature representation output by the encoder, and the waveform is then reconstructed through a deconvolution operation of the decoder to obtain the separated speech. The method meets the requirements of a small model and high timeliness for speech separation while achieving a good separation effect through the advantage of sequence modeling; the enhanced external attention mechanism learns richer features and correlations while keeping its advantage of high separation speed; and its application in a dual-path structure better balances timeliness, model size and separation effect.
Description
Technical Field
The invention relates to the technical field of speech processing, in particular to a multi-speaker time-domain speech separation method with convolution-enhanced external attention.
Background
In practical applications, voice interaction often takes place while several people are speaking, and the voice of an interfering speaker can seriously hinder a machine from extracting speech information; speech separation technology separates the voices of multiple speakers so that the machine can effectively extract information for tasks such as speech recognition. Currently, single-channel speech separation based on deep learning mainly adopts the Time-domain Audio Separation Network (TasNet) structure. Compared with traditional time-frequency-domain methods, TasNet directly models the speech signal in the time domain with an encoder-decoder framework and performs separation on the non-negative encoder output, omitting the frequency decomposition step and reducing the separation problem to estimating a speech mask (Mask) on the encoder output, which is then synthesized back into a waveform by the decoder; it therefore has better performance and lower latency.
BLSTM-TasNet is based on an LSTM network, but deep LSTM networks have a high computational cost, which limits their applicability on low-resource, low-power platforms. Luo et al. (Luo Y, Mesgarani N. TasNet: Time-domain audio separation network for real-time, single-channel speech separation [C]. 2018) replace the LSTM with gated recurrent units (GRU), which can model long sequences and reduce gradient vanishing, but still have a large number of parameters. Luo et al. (Luo Y, Mesgarani N. Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019) propose the convolution-based Conv-TasNet, which computes the mask with a temporal convolutional network composed of one-dimensional dilated convolution blocks, allowing the network to model the long-term dependencies of the speech signal while keeping a small number of parameters and a fast separation speed. SSGAN performs speech separation by joint training based on a generative adversarial network, with both the generator and the discriminator built from fully convolutional neural networks, which can effectively extract and recover high-dimensional features of the time-domain waveform. However, one-dimensional convolutional neural networks (CNN) cannot model utterance-level sequences when the receptive field is smaller than the sequence length, while recurrent neural networks (RNN) cannot model longer sequences efficiently due to optimization difficulties. Dual-Path RNN (Luo Y, Chen Z, Yoshioka T. Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation [C]// ICASSP 2020 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 46-50, doi:10.1109/ICASSP40776.2020.9054266) therefore proposes a dual-path structure that divides the longer audio input into smaller blocks and iterates RNN operations within and between blocks in a deep model, alleviating the difficulty of RNN optimization and achieving better performance with smaller models. RNN-based separation models depend on the context only indirectly through the propagation of intermediate states, which limits separation performance. Chen et al. (Chen J, Mao Q, Liu D. Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation [J]. arXiv preprint arXiv:2007.13975, 2020) introduce an improved Transformer into the dual-path structure, adding an LSTM to the feed-forward layer so that the order information of the speech sequence can be learned without positional encoding, achieving direct context awareness of the speech sequence, but inference is very slow. As for performing speech separation with limited computing resources, the SuDoRM-RF network (Tzinis E, Wang Z, Smaragdis P. Sudo rm -rf: Efficient networks for universal audio source separation [C]// 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2020: 1-6) uses UNet-like one-dimensional convolution blocks that repeatedly downsample and upsample to enlarge the receptive field of the network, extracting information at multiple resolutions with a high inference speed; the multi-scale feature extraction combined with the enlarged receptive field makes SuDoRM-RF superior to other convolutional models, but it also has more parameters.
In view of the prior art, it remains challenging to achieve a considerable separation effect with a smaller model size and a higher separation speed, and to better meet and balance the requirements of a speech recognition front end on model size, timeliness and separation effect; the invention therefore provides the multi-speaker time-domain speech separation method with convolution-enhanced external attention.
Disclosure of Invention
The invention aims to provide a multi-speaker time-domain speech separation method with convolution-enhanced external attention.
In order to achieve the purpose, the invention adopts the following technical scheme:
a multi-speaker time-domain speech separation method with convolution to enhance external attention comprises the following steps:
s1, mixing voice of multiple speakers through a coder, performing convolution operation, and converting the voice into potential feature representation
Remembering the mixed speech of multiple speakers as x (t),wherein the content of the first and second substances,representing a real number domain; t is the voice length;
let the latent features be denoted as h,wherein, C E Is the number of encoder channels; l is a potential feature representation;
s2, learning by a separator based on a convolution enhanced external attention module to obtain a voice mask;
and S3, multiplying the voice mask by the potential feature representation output by the encoder, and reconstructing a waveform through deconvolution operation of a decoder to obtain the separated voice.
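The encoder-separator-decoder flow of steps S1 to S3 can be sketched as follows. This is an illustrative NumPy sketch, not the patented implementation: the random filter banks, the window length and stride, and the constant placeholder masks are hypothetical stand-ins for the learned encoder, separator and decoder.

```python
import numpy as np

def encode(x, basis, stride):
    """S1: strided 1-D convolution of waveform x with basis filters.
    basis: (C_E, win); returns latent representation h of shape (C_E, L)."""
    win = basis.shape[1]
    n_frames = (len(x) - win) // stride + 1
    frames = np.stack([x[i * stride: i * stride + win] for i in range(n_frames)], axis=1)
    return np.maximum(basis @ frames, 0.0)  # non-negative encoder output (ReLU)

def decode(h, basis, stride):
    """S3: transposed convolution, i.e. overlap-add of basis-weighted frames."""
    win = basis.shape[1]
    n_frames = h.shape[1]
    x = np.zeros((n_frames - 1) * stride + win)
    for i in range(n_frames):
        x[i * stride: i * stride + win] += basis.T @ h[:, i]
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal(8000)                     # 1 s of 8 kHz mixture (synthetic)
enc_basis = rng.standard_normal((64, 16)) * 0.1   # C_E = 64 filters, 2 ms window
dec_basis = rng.standard_normal((64, 16)) * 0.1
h = encode(x, enc_basis, stride=8)                # S1: latent feature representation
masks = [np.full_like(h, 0.5), np.full_like(h, 0.5)]  # S2: separator output (placeholder)
sources = [decode(m * h, dec_basis, stride=8) for m in masks]  # S3: masked decode
```

With identical placeholder masks the two decoded sources coincide; a trained separator would instead output one distinct mask per speaker.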
Further, S2 mainly includes the following steps:
Global normalization and convolution operation: the latent representation h is mapped to an intermediate representation h′ ∈ ℝ^(C×L), where C is the number of channels;
Splitting and stacking: the intermediate representation h′ is divided into S overlapping smaller blocks of length K, forming a three-dimensional tensor T ∈ ℝ^(C×K×S), where K is the block length and S is the number of overlapping blocks;
ExConformer transform: a transformation consisting of B ExConformer modules (ECBlocks) is applied iteratively to the intra-block dimension K and the inter-block dimension S of the tensor T; the output T_b of the intra-block processing becomes the input of the inter-block processing;
That is, the output of the (b−1)-th ECBlock is the input of the b-th block, b = 1, …, B, as follows:
T_b = ECBlock_intra(U_{b−1})
U_b = ECBlock_inter(T_b)
Dimension transformation: a two-dimensional convolution is applied to the output U_B of the B-th ECBlock to learn a mask for each source, producing a three-dimensional tensor Y ∈ ℝ^((I·C)×K×S):
Y = Conv2D(U_B)
where Conv2D denotes a two-dimensional convolution operation and I is the number of sources;
Aggregating multi-channel information: the tensor Y is converted via an overlap-add operation into an intermediate latent representation y_i of each source; a one-dimensional convolution and a PReLU are applied to y to aggregate information over the channels, giving the estimated speech mask m_i ∈ ℝ^(C_E×L) for the i-th source as follows:
m_i = PReLU(Conv1D(y_i))
further, the ExConformer module consists of a position convolution module, an external attention module, a convolution module and a feedforward neural network module, and residual connection is added between the modules;
if the input of the ith ExConformer module is defined as x i Then it outputs y i Expressed as follows:
x″ i =x′ i +Conv(x′ i )
y i =Layernorm(x″ i +FFN(x″ i ))。
further, the convolution module of the excelformer module and the activation function of the feedforward neural network module use Penalized _ tanh.
Further, the position convolution module is composed of a plurality of stacked one-dimensional convolutions with zero-padding, layer normalization and ReLU activation layers.
Further, the external attention module proceeds as follows:
The number of channels of the input feature F_in is adjusted by a one-dimensional convolution;
A key memory M_k is constructed using a linear layer to learn an attention map A between the query vectors:
A = Norm(F_in M_k^T)
Softmax and L1-norm normalization (L1_Norm) are applied to it;
A value memory M_v is constructed using a linear layer to generate the refined feature map, expressed as follows:
F_out = A M_v
A Dropout operation is applied to the output result.
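The external attention steps above can be sketched as follows. This is a minimal NumPy sketch: the memory size, the exact axes of the softmax/L1 double normalization, and the omitted Conv1D channel adjustment and Dropout are assumptions, not the patent's specification.

```python
import numpy as np

def external_attention(F_in, M_k, M_v):
    """External attention: queries attend over two small shared external
    memories, so the cost is linear in the sequence length N.
    F_in: (N, d) query features; M_k: (S, d) key memory; M_v: (S, d) value memory."""
    logits = F_in @ M_k.T                        # attention map A before normalization, (N, S)
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)            # softmax over the S memory slots
    A /= A.sum(axis=0, keepdims=True) + 1e-9     # L1 normalization (double-norm trick)
    return A @ M_v                               # refined feature map, (N, d)

rng = np.random.default_rng(1)
F_in = rng.standard_normal((100, 32))  # 100 frames, 32 channels
M_k = rng.standard_normal((16, 32))    # S = 16 memory slots
M_v = rng.standard_normal((16, 32))
F_out = external_attention(F_in, M_k, M_v)
```

Because M_k and M_v are shared across all inputs, they can implicitly capture correlations between different samples, which self-attention does not.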
The invention has at least the following beneficial effects:
1. Through the improved design of the Conformer, the method meets the requirements of a small speech separation model and high timeliness, while the advantage of sequence modeling yields a good separation effect.
2. By introducing the external attention mechanism into the speech separation task, the invention enhances the external attention mechanism to learn richer features and correlations while keeping its advantage of high separation speed.
3. The invention encodes the input with a convolutional neural network so that each frame contains context information, and the convolutional encoding makes the positional encoding trainable.
4. The application of the proposed convolution-enhanced external attention multi-speaker time-domain speech separation method in a dual-path structure better balances timeliness, model size and separation effect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram of the overall network framework of the present invention;
FIG. 2 is a block diagram of an ExConformer module according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, the present invention is a multi-talker time-domain speech separation method with convolution enhanced external attention, comprising the steps of:
1. The encoder performs a convolution operation on the multi-speaker mixed speech, converting it into a latent feature representation h;
2. A speech mask is learned by the separator based on the convolution-enhanced external attention module;
3. The speech mask is multiplied by the latent feature representation output by the encoder, and the waveform is reconstructed through a deconvolution operation of the decoder to obtain the separated speech.
In the above step 2, the following operations are performed:
First, the latent representation h is mapped into a new feature space by global normalization and a one-dimensional convolution, obtaining an intermediate representation h′ ∈ ℝ^(C×L). This changes the number of channels of h, deepens the network without changing the receptive field, and enhances the abstraction capability of the network's local modules.
h′ = Conv1D(GlobLN(h))
Global layer normalization GlobLN(·) defines two learnable parameters γ, β ∈ ℝ^(C×1); using global layer normalization in place of layer normalization significantly improves model convergence, since the gradient statistics are interdependent between different channels. For an input matrix F ∈ ℝ^(C×L), global normalization is defined as:
GlobLN(F) = γ ⊙ (F − E[F]) / √(Var[F] + ε) + β
where E[F] and Var[F] are the mean and variance of F computed jointly over the channel and time dimensions, and ε is a small constant for numerical stability.
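A minimal NumPy sketch of global layer normalization as described above; the value of ε is an assumption.

```python
import numpy as np

def global_layer_norm(h, gamma, beta, eps=1e-8):
    """Global layer normalization (gLN): normalize over BOTH the channel and
    time dimensions jointly, then rescale with learnable per-channel gamma/beta.
    h: (C, L); gamma, beta: (C, 1)."""
    mean = h.mean()   # statistics shared across all channels and frames
    var = h.var()
    return gamma * (h - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(2)
h = 3.0 * rng.standard_normal((8, 200)) + 1.5
out = global_layer_norm(h, gamma=np.ones((8, 1)), beta=np.zeros((8, 1)))
```

With γ = 1 and β = 0 the output has (near-)zero mean and unit variance over the whole (C, L) matrix, unlike per-frame layer normalization.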
the intermediate representation h' is then divided into S smaller blocks of overlapping length K, constituting a three-dimensional vectorThen, the transformation composed of B ExConformer modules (ECBlock) is applied iteratively to the intra-block dimension (K dimension) and the inter-block dimension (S dimension) of the three-dimensional vector T, and the output T of the intra-block processing b The output of the (B-1) th ECBlock will be the input of the B-th block, B being 1, …, B, which will be the input of the inter-block processing.
T b =ECBlock intra (U b-1 )
U b =ECBlock inter (T b )
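The splitting step and the dual-path intra-/inter-block iteration can be sketched as follows. The ECBlocks are stubbed as plain callables here, since the real modules are full neural networks; the block length K and hop size are illustrative values.

```python
import numpy as np

def segment(h, K, hop):
    """Split h of shape (C, L') into S overlapping blocks of length K,
    stacked into a tensor of shape (C, K, S)."""
    C, L = h.shape
    S = (L - K) // hop + 1
    return np.stack([h[:, s * hop: s * hop + K] for s in range(S)], axis=2)

def dual_path(T, blocks_intra, blocks_inter):
    """Apply B (intra, inter) ECBlock pairs: each intra block processes the
    K dimension, each inter block the S dimension (both stubbed here)."""
    U = T
    for ec_intra, ec_inter in zip(blocks_intra, blocks_inter):
        U = ec_inter(ec_intra(U))
    return U

rng = np.random.default_rng(3)
h_prime = rng.standard_normal((4, 100))   # (C, L')
T = segment(h_prime, K=20, hop=10)        # (C, K, S) with S = 9
identity = lambda t: t                    # stand-in for a real ECBlock
U_B = dual_path(T, [identity] * 3, [identity] * 3)
```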
The ExConformer module refers to FIG. 2.
A two-dimensional convolution is applied to the output U_B of the B-th ECBlock to learn a mask for each source, obtaining a three-dimensional tensor Y ∈ ℝ^((I·C)×K×S):
Y = Conv2D(U_B)
The tensor Y is then transformed via an overlap-add operation into an intermediate latent representation y_i of each source. A one-dimensional convolution and a PReLU are applied to y to aggregate information over the channels; for the i-th source, the estimated mask m_i ∈ ℝ^(C_E×L) is obtained as follows:
m_i = PReLU(Conv1D(y_i))
Finally, the encoded latent representation h is multiplied by the corresponding mask to obtain the estimated latent representation of each source:
z_i = h ⊙ m_i
where ⊙ denotes element-wise multiplication of two tensors of the same shape.
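The overlap-add operation that folds the blocks back into a sequence can be sketched as follows; the hop size is an assumption, and with hop = K (no overlap) the operation exactly inverts a non-overlapping split.

```python
import numpy as np

def overlap_add(T, hop):
    """Inverse of the segmentation step: fold blocks of shape (C, K, S)
    back to (C, L') by summing the overlapping regions."""
    C, K, S = T.shape
    L = (S - 1) * hop + K
    out = np.zeros((C, L))
    for s in range(S):
        out[:, s * hop: s * hop + K] += T[:, :, s]
    return out

rng = np.random.default_rng(4)
h_prime = rng.standard_normal((3, 40))
# Non-overlapping split (hop = K = 10), so overlap-add reconstructs exactly.
blocks = np.stack([h_prime[:, s * 10: s * 10 + 10] for s in range(4)], axis=2)
rebuilt = overlap_add(blocks, hop=10)
# The final masking step is then a plain element-wise product: z_i = h * m_i.
```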
The ExConformer module is the convolution-enhanced external attention module. The Conformer learns position-based local information with a convolution-enhanced self-attention mechanism and uses content-based global interaction; it performs well in sequence modeling tasks, but it does not consider relations between different samples, and the combination of self-attention and convolution gives the model a large number of parameters and slow inference, making it difficult to apply to the speech separation task. The external attention mechanism, in contrast, uses two shared learnable external memories, so features from other samples can be learned implicitly; linear complexity can be achieved by controlling the memory size, giving a fast inference speed; however, because of its simple structure, its learning ability is inferior to the self-attention mechanism. The invention therefore introduces external attention into the Conformer and models the relations between different speech segments with convolution-enhanced external attention, reducing the number of parameters while accelerating inference; Penalized_tanh, which appears more stable in sequence tasks, is used as the activation function. The resulting module is the convolution-enhanced external attention module (ExConformer). The ExConformer module comprises a position convolution module (PosConv Module), an external attention module (External Attention Module), a convolution module (Convolution Module) and a feed-forward neural network module (Feed Forward Module); similarly to the Conformer, residual connections are added between the modules. If the input of the i-th ExConformer block is defined as x_i, its output y_i is obtained by the following steps:
x̂_i = x_i + PosConv(x_i)
x′_i = x̂_i + ExAttention(x̂_i)
x″_i = x′_i + Conv(x′_i)
y_i = LayerNorm(x″_i + FFN(x″_i))
In the Conformer, two half-step feed-forward layers are used as in Macaron-Net, and sinusoidal positional encoding is used in the self-attention mechanism. Recent research on speech recognition finds that convolutional positional encoding performs better than sinusoidal positional encoding, so convolutional positional encoding is also adopted in the ExConformer, which makes the positional encoding trainable. Meanwhile, experiments showed that the gain from the two half-step feed-forward layers is not significant, so in the invention the first half-step feed-forward layer of the Conformer is replaced by the convolutional positional encoding module and the second by an ordinary feed-forward layer. Zero-padded convolutions are used to model the intra-block and inter-block dimensions separately in the separator, learning the positional context information faster and with fewer parameters.
For the experiments performed on the multi-speaker time-domain speech separation technique, the experimental procedure is described as follows:
1. data set and evaluation index
The following experiments mainly address the separation of the speech of two speakers, using Libri2Mix formed by mixing the public LibriSpeech dataset; the mixing follows the settings in the Libri2Mix dataset paper. The Libri2Mix training set mixes data from train-100 into 13,900 utterances; the validation set and the test set each contain 3,000 utterances; each utterance is mixed at a random signal-to-noise ratio between −5 dB and 5 dB; preprocessing downsamples the audio to 8 kHz. Compared with smaller datasets such as WSJ0 and TIMIT, LibriSpeech contains more speakers, with 251 speakers in the training set train-100 and 40 speakers each in the validation and test sets; experimental results on the Libri2Mix dataset mixed from LibriSpeech therefore generalize more reliably to new scenarios and indicate general modeling trends.
The evaluation index is the scale-invariant signal-to-noise ratio (SI-SNR), calculated as follows:
s_target = (⟨x̂, x⟩ / ‖x‖²) · x
e_noise = x̂ − s_target
SI-SNR = 10 log₁₀(‖s_target‖² / ‖e_noise‖²)
where x̂ and x are the estimated target speaker speech and the clean target speaker speech respectively, both zero-mean normalized before the calculation. In general, a larger SI-SNR indicates better speech separation quality.
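The SI-SNR evaluation can be sketched directly as follows (NumPy; the ε added for numerical stability is an assumption).

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB. Both signals are zero-mean normalized,
    then the estimate is decomposed into a component along the reference
    (s_target) and a residual (e_noise)."""
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (est @ ref) / (ref @ ref + eps) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps) + eps)

rng = np.random.default_rng(5)
x = rng.standard_normal(8000)
perfect = si_snr(x, x)                       # near-perfect reconstruction
scaled = si_snr(2.0 * x, x)                  # scale-invariance: also near-perfect
noisy = si_snr(x + rng.standard_normal(8000), x)  # roughly 0 dB input SNR
```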
2. Ablation experiment
Based on the TasNet separation network, we tested the Conformer and the convolution-enhanced external attention module, ExConformer, as the separator module, verifying the effect of the external attention mechanism on speech separation. Here the Conformer uses convolution-enhanced self-attention and two half-step feed-forward layers (FFN); ExConformer−PosConv denotes the structure without convolutional positional encoding; ExConformer−FFN denotes the structure without a feed-forward layer, using only the convolution and external attention modules; ExConformer (Swish) uses the Conformer's Swish activation function; ExConformer replaces Swish with Penalized_tanh and is the model proposed herein. The experimental results are shown in Table 1; all experiments were run on the same server, and the separation speed is measured as the number of speech segments of the test set processed per second. We believe that separation speed, rather than training speed, better reflects the timeliness of a model in practice.
As can be seen from the experimental results in Table 1, the Conformer has more parameters and the slowest separation speed, while the ExConformer with convolution-enhanced external attention achieves a better separation effect with fewer parameters and twice the separation speed. Meanwhile, the results in rows 2 and 3 of the table show that the convolutional positional encoding and the feed-forward layer are necessary: both parts bring a significant improvement to the ExConformer. Replacing the Swish function with Penalized_tanh also improves the separation effect.
Table 1 comparative experiments with different separator modules
3. Comparative experiment
We compare the experimental effect of the present invention on the Libri2Mix dataset with some existing time-domain speech separation methods, reproducing part of the code on our own device to test the separation speed of these models. The BLSTM-TasNet, Conv-TasNet and SuDoRM-RF++ models do not segment the latent representation of the mixed speech into smaller overlapping blocks stacked into three-dimensional tensors for intra- and inter-block operations, and therefore have faster separation speeds but also more parameters; DPRNN, DPTNet and the present invention all use overlapping segmentation with intra- and inter-block iterative operations, which reduces the number of parameters compared with the former models at the cost of separation speed. As can be seen from the table, DPTNet has the best separation effect and few parameters, but the slowest separation speed. The present invention achieves a separation effect comparable to DPTNet with the smallest model size and twice its separation speed.
Table 2 comparison of different methods on Libri2Mix dataset
In summary, the following effects are achieved:
1. The Conformer is improved. The Conformer performs well in sequence modeling tasks, but for a speech separation task serving as a speech recognition front-end technology, its large number of parameters and long inference time make it difficult to meet the requirements on model size and timeliness; the Conformer is improved here to meet the requirements of a small model and high timeliness for speech separation, while the advantage of sequence modeling yields a better separation effect.
2. An external attention mechanism is introduced into the speech separation task. The external attention mechanism replaces the self-attention mechanism with two cascaded linear layers and normalization layers; unlike self-attention, it implicitly takes features from other samples into account through shared weight units, and linear complexity can be achieved by controlling the memory size, giving a high separation speed. However, the simpler external attention mechanism cannot learn features and correlations on a par with the self-attention mechanism; the invention combines it with the Conformer, whose convolution module enhances the external attention mechanism to learn richer features and correlations while keeping the advantage of high separation speed.
3. Convolutional encoding is used. The input is encoded with a convolutional neural network, after which each frame contains context information, and the convolutional encoding makes the positional encoding trainable.
4. The application of the newly proposed convolution-enhanced external attention multi-speaker time-domain speech separation method in a dual-path structure can well balance timeliness, model size and separation effect.
The foregoing shows and describes the general principles, principal features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (6)
1. A method for separating time domain voice of multiple speakers with convolution enhancing external attention is characterized by comprising the following steps:
s1, mixing voice of multiple speakers through a coder, performing convolution operation, and converting the voice into potential feature representation
Remembering the mixed speech of multiple speakers as x (t),wherein, the first and the second end of the pipe are connected with each other,is a real number domain; t is the voice length;
let the latent features be denoted as h,wherein, C E Is the number of encoder channels; l is the length of the potential feature representation;
s2, learning by a separator based on a convolution enhanced external attention module to obtain a voice mask;
and S3, multiplying the speech mask with the latent feature representation output by the encoder, and reconstructing the waveform through the decoder's deconvolution (transposed convolution) operation to obtain the separated speech.
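Steps S1 to S3 can be sketched end to end in NumPy. This is purely illustrative: `encode`, `decode`, the filter count, window, and stride are assumed values, and the mask here is random rather than learned by the separator of S2.

```python
import numpy as np

def encode(x, basis, stride):
    """S1 sketch: strided 1-D convolution mapping waveform -> latent h."""
    win = basis.shape[1]
    frames = [x[i:i + win] for i in range(0, len(x) - win + 1, stride)]
    return basis @ np.stack(frames, axis=1)          # shape (C_E, L)

def decode(h, basis, stride):
    """S3 sketch: transposed convolution (overlap-add) latent -> waveform."""
    win = basis.shape[1]
    C, L = h.shape
    out = np.zeros((L - 1) * stride + win)
    for l in range(L):
        out[l * stride : l * stride + win] += basis.T @ h[:, l]
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal(160)            # mixture waveform, T = 160
basis = rng.standard_normal((32, 16))   # assumed C_E = 32 filters, window 16
h = encode(x, basis, stride=8)          # latent representation
mask = 1 / (1 + np.exp(-rng.standard_normal(h.shape)))  # stand-in speech mask
y = decode(mask * h, basis, stride=8)   # separated waveform estimate
```

Masking in the latent domain and decoding back to a waveform is the standard time-domain separation pattern the claim describes.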
2. The method for convolution-enhanced external attention time-domain speech separation of multiple speakers according to claim 1, wherein S2 mainly includes the steps of:
global normalization and convolution operations: mapping the latent representation h to an intermediate representation h' ∈ ℝ^(C×L), where C is the number of channels;
splitting and stacking: dividing the intermediate representation h' into S overlapping blocks of length K and stacking them into a three-dimensional tensor T ∈ ℝ^(C×K×S), where K is the block length and S is the number of overlapping blocks;
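The splitting-and-stacking step can be sketched as follows. The 50% overlap (hop of K // 2) and zero-padding of the tail are assumptions; the claim only states that the blocks overlap.

```python
import numpy as np

def split_blocks(h, K):
    """Split a (C, L) representation into S overlapping blocks of length K
    (assumed hop K // 2), zero-padding the tail -> tensor (C, K, S)."""
    C, L = h.shape
    hop = K // 2
    S = int(np.ceil(max(L - K, 0) / hop)) + 1
    padded = np.zeros((C, (S - 1) * hop + K))
    padded[:, :L] = h
    return np.stack([padded[:, s * hop : s * hop + K] for s in range(S)], axis=2)

h = np.arange(2 * 20, dtype=float).reshape(2, 20)  # toy (C=2, L=20) features
T = split_blocks(h, K=8)                            # -> (2, 8, 4)
```

Turning the long sequence into short blocks is what lets the subsequent intra-block and inter-block processing keep each attention pass short.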
ExConformer transformation: iteratively applying B ExConformer modules (ECBlocks), alternating between the intra-block dimension K and the inter-block dimension S of the three-dimensional tensor T; the output T_b of intra-block processing serves as the input for inter-block processing;
that is, the output U_{b-1} of the (b-1)-th ECBlock is the input of the b-th, b = 1, …, B, as follows:
T_b = ECBlock_intra(U_{b-1})
U_b = ECBlock_inter(T_b)
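The alternation above can be made concrete with a small data-flow sketch. The `ec_block` stand-in is ours: the real block is the ExConformer of claim 3, and a residual mean-removal is used here only to show how the two equations chain.

```python
import numpy as np

def ec_block(t, axis):
    """Stand-in for one ECBlock: a residual step along one dimension."""
    return t + (t - t.mean(axis=axis, keepdims=True))

def dual_path(t, B=3):
    """T_b = ECBlock_intra(U_{b-1}); U_b = ECBlock_inter(T_b), with U_0 = T."""
    u = t
    for _ in range(B):
        t_b = ec_block(u, axis=1)   # intra-block: along chunk length K
        u = ec_block(t_b, axis=2)   # inter-block: along chunk index S
    return u

t_in = np.random.default_rng(3).standard_normal((4, 8, 5))  # (C, K, S)
u_out = dual_path(t_in, B=3)
```

Intra-block passes model short-range structure within each chunk; inter-block passes, operating across chunk indices, recover the long-range dependencies.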
dimension transformation: applying a two-dimensional convolution to the output U_B of the B-th ECBlock to learn a mask for each source, yielding a three-dimensional tensor Y:
Y = Conv2D(U_B)
aggregating multi-channel information: converting the three-dimensional tensor Y into an intermediate latent representation y for each source via an overlap-add operation, then applying a one-dimensional convolution and PReLU to y to aggregate information across channels; for the i-th source, its estimated speech mask mask_i is as follows:
mask_i = PReLU(Conv1D(y))
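The overlap-add merge and PReLU can be sketched as follows. The half-overlap hop, the averaging of overlapped regions, and the fixed PReLU slope are assumptions for illustration; the toy subtraction stands in for the learned Conv1D.

```python
import numpy as np

def overlap_add(blocks, L):
    """Merge (C, K, S) overlapping blocks (assumed hop K // 2) back into a
    (C, L) representation; overlapped samples are averaged here."""
    C, K, S = blocks.shape
    hop = K // 2
    out = np.zeros((C, (S - 1) * hop + K))
    count = np.zeros_like(out)
    for s in range(S):
        out[:, s * hop : s * hop + K] += blocks[:, :, s]
        count[:, s * hop : s * hop + K] += 1.0
    return (out / count)[:, :L]

def prelu(x, a=0.25):
    # PReLU with a fixed (normally learned) negative slope
    return np.where(x >= 0, x, a * x)

y_i = overlap_add(np.ones((2, 8, 4)), 20)  # toy blocks for one source
mask_i = prelu(y_i - 2.0)                  # stand-in for PReLU(Conv1D(y))
```

Overlap-add is the exact inverse of the earlier splitting step, restoring the (C, L) shape the decoder-side mask multiplication expects.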
3. the method of claim 2, wherein the ExConformer module comprises a position convolution module, an external attention module, a convolution module, and a feedforward neural network module, and a residual connection is added between each module;
if the input of the i-th ExConformer module is denoted as x_i, its output y_i is expressed as follows:
x''_i = x'_i + Conv(x'_i)
y_i = Layernorm(x''_i + FFN(x''_i)).
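The data flow of one ExConformer module can be sketched as below. Only the last two residual steps are written out explicitly in the text; treating the position convolution and external attention as preceding residual steps that produce x' is our reading of claim 3, and the sub-modules here are trivial stand-ins.

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def exconformer(x, sub):
    """One ExConformer module as a chain of residual sub-modules."""
    x1 = x + sub["pos_conv"](x)            # assumed residual step
    x2 = x1 + sub["ext_attn"](x1)          # assumed residual step -> x'
    x3 = x2 + sub["conv"](x2)              # x'' = x' + Conv(x')
    return layernorm(x3 + sub["ffn"](x3))  # y = Layernorm(x'' + FFN(x''))

subs = {k: (lambda z: 0.5 * z) for k in ("pos_conv", "ext_attn", "conv", "ffn")}
y = exconformer(np.random.default_rng(2).standard_normal((6, 8)), subs)
```

The residual connections let each sub-module learn only a correction to its input, which is the standard Conformer-style design the module name suggests.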
4. The method of claim 3, wherein the activation functions of the convolution module and the feedforward neural network module of the ExConformer module use Penalized_tanh.
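Penalized_tanh can be sketched as follows. The penalty coefficient a = 0.25 is the value commonly used in the literature; the patent does not state it, so it is an assumption here.

```python
import numpy as np

def penalized_tanh(x, a=0.25):
    """Penalized tanh: tanh(x) for x > 0, a * tanh(x) otherwise."""
    t = np.tanh(x)
    return np.where(x > 0, t, a * t)

vals = penalized_tanh(np.array([-1.0, 0.0, 1.0]))
```

Attenuating the negative branch keeps tanh's bounded output while reducing saturation for negative activations.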
5. The convolution-enhanced external-attention multi-speaker time-domain speech separation method according to claim 3, wherein the position convolution module is composed of a plurality of stacked zero-padded one-dimensional convolution, layer normalization and ReLU activation layers.
6. The method of claim 3, wherein the steps of the external attention module are as follows:
adjusting the number of channels of the input features by a one-dimensional convolution;
construction of M Using one Linear layer k A memory to learn an attention map a between query vectors;
softmax and L1 Norm normalization (L1_ Norm) were performed thereon;
construction of M Using one Linear layer v The memory generates a refined feature map, represented as follows:
F out =AM v
and performing Dropout operation on the output result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210647059.4A CN115101085A (en) | 2022-06-09 | 2022-06-09 | Multi-speaker time-domain voice separation method for enhancing external attention through convolution |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115101085A true CN115101085A (en) | 2022-09-23 |
Family
ID=83289286
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210647059.4A Pending CN115101085A (en) | 2022-06-09 | 2022-06-09 | Multi-speaker time-domain voice separation method for enhancing external attention through convolution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115101085A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024087128A1 (en) * | 2022-10-24 | 2024-05-02 | 大连理工大学 | Multi-scale hybrid attention mechanism modeling method for predicting remaining useful life of aero engine |
CN116805061A (en) * | 2023-05-10 | 2023-09-26 | 杭州水务数智科技股份有限公司 | Leakage event judging method based on optical fiber sensing |
CN116805061B (en) * | 2023-05-10 | 2024-04-12 | 杭州水务数智科技股份有限公司 | Leakage event judging method based on optical fiber sensing |
CN116403599A (en) * | 2023-06-07 | 2023-07-07 | 中国海洋大学 | Efficient voice separation method and model building method thereof |
CN116403599B (en) * | 2023-06-07 | 2023-08-15 | 中国海洋大学 | Efficient voice separation method and model building method thereof |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7337953B2 (en) | Speech recognition method and device, neural network training method and device, and computer program | |
CN115101085A (en) | Multi-speaker time-domain voice separation method for enhancing external attention through convolution | |
JP2022529641A (en) | Speech processing methods, devices, electronic devices and computer programs | |
CN110600047A (en) | Perceptual STARGAN-based many-to-many speaker conversion method | |
CN110060657B (en) | SN-based many-to-many speaker conversion method | |
CN114141238A (en) | Voice enhancement method fusing Transformer and U-net network | |
CN111429893A (en) | Many-to-many speaker conversion method based on Transitive STARGAN | |
CN113823264A (en) | Speech recognition method, speech recognition device, computer-readable storage medium and computer equipment | |
Chen et al. | Distilled binary neural network for monaural speech separation | |
CN113539232B (en) | Voice synthesis method based on lesson-admiring voice data set | |
CN114373451A (en) | End-to-end Chinese speech recognition method | |
CN113823273A (en) | Audio signal processing method, audio signal processing device, electronic equipment and storage medium | |
CN114141237A (en) | Speech recognition method, speech recognition device, computer equipment and storage medium | |
CN115602165A (en) | Digital staff intelligent system based on financial system | |
Wei et al. | EdgeCRNN: an edge-computing oriented model of acoustic feature enhancement for keyword spotting | |
CN114067819B (en) | Speech enhancement method based on cross-layer similarity knowledge distillation | |
Luo et al. | Tiny-sepformer: A tiny time-domain transformer network for speech separation | |
Guo et al. | Phonetic posteriorgrams based many-to-many singing voice conversion via adversarial training | |
CN116863920B (en) | Voice recognition method, device, equipment and medium based on double-flow self-supervision network | |
US20220207321A1 (en) | Convolution-Augmented Transformer Models | |
CN116391191A (en) | Generating neural network models for processing audio samples in a filter bank domain | |
CN112257464A (en) | Machine translation decoding acceleration method based on small intelligent mobile device | |
Li et al. | A fast convolutional self-attention based speech dereverberation method for robust speech recognition | |
CN117980915A (en) | Contrast learning and masking modeling for end-to-end self-supervised pre-training | |
CN115376484A (en) | Lightweight end-to-end speech synthesis system construction method based on multi-frame prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||