CN113782045B - Single-channel voice separation method for multi-scale time delay sampling - Google Patents
- Publication number
- CN113782045B (application CN202111006251.7A)
- Authority
- CN
- China
- Prior art keywords
- time delay
- delay sampling
- neural network
- output
- voice
- Prior art date: 2021-08-30
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02087—Noise filtering the noise being separate speech, e.g. cocktail party
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides a single-channel voice separation method based on multi-scale time delay sampling. Mixed voice features of a plurality of speakers are extracted, segmented, and spliced into a 3D tensor. The 3D tensor is modeled iteratively by a first bidirectional recurrent neural network to capture local and global feature information. Using multi-scale time delay sampling modeling, the output of the first bidirectional recurrent neural network is delay-sampled at different scales, and a second bidirectional recurrent neural network captures the feature information at each scale. The feature information is overlap-added and mapped into masks of the clean voices of the plurality of speakers; the masks are dot-multiplied with the mixed voice features to obtain the clean voice features of the plurality of speakers, and each speaker's clean voice features are reconstructed into a separated voice signal. Through multi-scale time delay sampling modeling, the invention compensates for the information loss of segmental modeling and greatly improves voice separation performance.
Description
Technical Field
The invention belongs to the field of voice separation, and particularly relates to a single-channel voice separation method based on multi-scale time delay sampling.
Background
In recent years, with the rise of deep learning, the wave of artificial intelligence it has driven has changed every aspect of daily life. Voice interaction is an indispensable part of this: a clear voice signal lets smart devices execute the corresponding commands more reliably, greatly improving their degree of intelligence. In real acoustic scenes, however, the voice of the speaker of interest is often interfered with by other speakers, which is the classic cocktail party problem. The human auditory system can easily pick out the voice content of a target speaker. Voice separation therefore imitates the human auditory system to separate the clean voice of one or all speakers from a mixture of speakers, removing interference such as background noise and reverberation and improving the clarity and intelligibility of the voice.
Most current mainstream voice separation methods adopt end-to-end time-domain separation, in which the model learns latent common representations of the mixed voice waveform in a data-driven way to realize separation. This kind of separation places extremely high demands on framing during voice preprocessing: research shows that the more frames a segment of mixed voice is divided into within the same duration, the better the final separation effect. However, too many frames make direct modeling infeasible. To solve this problem, Yi Luo et al. proposed a dual-path segmental modeling method in "Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation", which effectively realizes sample-level modeling of long voice sequences and greatly improves voice separation performance. However, when capturing global information, segmental modeling places frames that are adjacent within a modeled sub-sequence far apart in the original sequence, so they have little correlation with each other, and many frames are never captured at all, causing serious information loss.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a single-channel voice separation method based on multi-scale time delay sampling; by modeling with multi-scale time delay sampling, the information loss of segmental modeling is compensated and the separation performance is further improved.
A single-channel voice separation method using multi-scale time delay sampling comprises the following steps:
S1, extracting mixed voice features from the mixed voice signals of a plurality of speakers;
S2, segmenting the mixed voice features and splicing the segments into a 3D tensor;
S3, modeling the 3D tensor iteratively with a first bidirectional recurrent neural network, capturing local feature information and global feature information;
S4, adopting multi-scale time delay sampling modeling: performing time delay sampling of different scales on the output of the first bidirectional recurrent neural network, and capturing the feature information at the different scales with a second bidirectional recurrent neural network;
S5, overlap-adding the feature information from S4, mapping the overlap-added result into masks of the clean voices of the plurality of speakers, and dot-multiplying the masks with the mixed voice features to obtain the clean voice representations of the plurality of speakers;
S6, reconstructing each speaker's clean voice representation into a separated voice signal.
Further, the output of the first bidirectional recurrent neural network is subjected to time delay sampling at a single scale as follows:
the output of the first bidirectional recurrent neural network is sequence-reorganized and re-segmented; the re-segmented and spliced sequence is delay-sampled along the block-length dimension $2^\lambda K$, and each delay-sampled sequence $u_i$ is spliced along the block dimension $R'$ to obtain the reorganized 3D tensor $h_\lambda$, where $2^\lambda$ is the time delay sampling rate, $\lambda = 0, \ldots, B-1$ is the sampling index, $B$ is the number of stacked multi-scale time delay sampling blocks, and $K$ is the block-length dimension before re-segmentation.
Further, the second bidirectional recurrent neural network captures the feature information at a single scale as follows:
along the block-length dimension of $h_\lambda$, the second bidirectional recurrent neural network captures the interrelations between the delay-sampled sequences, and the output is then dimension-transformed through a fully connected layer to give FC; FC is layer-normalized and then residual-connected with the output of the bidirectional recurrent neural network.
Further, with different time delay sampling rates, the single-scale time delay sampling of the output of the first bidirectional recurrent neural network and the capture of the single-scale feature information by the second bidirectional recurrent neural network are repeated, so that the feature information at different scales is integrated.
Further, the re-segmented and spliced sequence is delay-sampled along the block-length dimension $2^\lambda K$ using the formula:

$$u_i = u'[:, i::2^\lambda, :], \quad i = 0, \ldots, 2^\lambda - 1$$

where $u'[:, i::2^\lambda, :]$ denotes a Python-style sequence slice of $u'$ and $i$ is the index number.
Still further, along the block-length dimension of $h_\lambda$, the second bidirectional recurrent neural network captures the interrelations between the delay-sampled sequences, and the output is dimension-transformed through a fully connected layer to give FC, according to:

$$U = [\mathrm{BiLSTM}(h_\lambda[:, :, m]),\ m = 1, \ldots, 2^\lambda R']$$
$$FC = [W\,U[:, :, m] + b,\ m = 1, \ldots, 2^\lambda R']$$

where $U$ is the output of the bidirectional recurrent neural network, $H$ is the output dimension of its hidden layer, $h_\lambda[:, :, m]$ is the sequence selected by index $m$, $2^\lambda R'$ is the block dimension, and $W$ and $b$ are the weight and bias of the fully connected layer, respectively.
Further, FC is layer-normalized according to:

$$\mathrm{Layernorm}(FC) = z \odot \frac{FC - \mu(FC)}{\sqrt{\sigma(FC) + \epsilon}} + r$$

where $\mu(FC)$ and $\sigma(FC)$ are the mean and variance of the fully connected layer output, $z$ and $r$ are normalization factors, $\epsilon$ is a very small positive number, Layernorm denotes layer normalization, $i$, $j$ and $k$ index the dimensions $N$, $K$ and $2^\lambda R'$ respectively, and $N$ is the dimension of the extracted mixed voice features.
Further, the residual connection is given by:

$$\mathrm{Output} = h_\lambda + \mathrm{Layernorm}(FC)$$
The beneficial effects of the invention are as follows: by adopting multi-scale time delay sampling modeling, performing time delay sampling of different scales on the output of the first bidirectional recurrent neural network, and capturing the feature information at different scales with a second bidirectional recurrent neural network, the invention effectively integrates the interrelations between sequences at different scales, compensates for the information loss of the bidirectional recurrent neural network in segmental modeling, greatly improves voice separation performance, effectively reduces voice distortion, and improves voice intelligibility.
Drawings
Fig. 1 is a flowchart of a single-channel voice separation method based on multi-scale time delay sampling according to the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention briefly described above will be rendered by reference to the appended drawings.
As shown in fig. 1, the single-channel voice separation method based on multi-scale time delay sampling of the invention specifically comprises the following steps:
step one, an encoder is adopted to encode a plurality of input speaker mixed voice signals, and corresponding mixed voice characteristics are extracted
For a given mixed speech signalA one-dimensional convolutional neural network is adopted as an encoder to extract high-order mixed voice characteristics ++>Wherein->For the real number set, L is the length of the mixed voice signal, N is the dimension of the extracted mixed voice feature, and T is the frame number of the mixed voice feature; the convolution kernel size of the one-dimensional convolution neural network is W, the moving step of the convolution window is W/2, and a ReLU function is added after the convolution neural network for nonlinear transformation, and the calculation formula is as follows:
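To make the encoder concrete, here is a minimal PyTorch sketch of the one-dimensional convolutional encoder described above; the kernel size W = 16 and feature dimension N = 64 are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """1-D convolutional encoder: kernel size W, hop W/2, followed by ReLU."""
    def __init__(self, W=16, N=64):
        super().__init__()
        self.conv = nn.Conv1d(1, N, kernel_size=W, stride=W // 2, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):                # x: (batch, 1, L) mixed waveform
        return self.relu(self.conv(x))   # z: (batch, N, T) mixed voice features
```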
step two, segmenting the mixed voice characteristics output by the encoder, and then splicing the segmented mixed voice characteristics into a 3D tensor
Dividing the mixed voice characteristic z output by the encoder by taking P as a unit, dividing each divided small block into R blocks with the length of K, overlapping each small block by 50%, and then splicing all the small blocks into a 3D tensor
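A minimal sketch of this segmentation, assuming PyTorch, an even block length K, and T >= K; `unfold` produces the 50%-overlapped blocks with hop P = K/2.

```python
import torch
import torch.nn.functional as F

def segment(z, K):
    """Split encoder features z (batch, N, T) into R blocks of length K with
    50% overlap (hop P = K/2) and stack them into a tensor (batch, N, K, R)."""
    batch, N, T = z.shape
    hop = K // 2                                  # P = K/2 gives 50% overlap
    pad = (hop - (T - K) % hop) % hop             # pad so the last block is full
    z = F.pad(z, (0, pad))
    v = z.unfold(dimension=2, size=K, step=hop)   # (batch, N, R, K)
    return v.permute(0, 1, 3, 2).contiguous()     # (batch, N, K, R)
```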
Step three, the 3D tensor v is modeled iteratively with a bidirectional recurrent neural network, capturing local and global feature information.

B bidirectional recurrent neural network (BiLSTM) modules are stacked to model the input 3D tensor; the stack alternates between odd-numbered and even-numbered BiLSTM modules. The odd-numbered modules $B_{2i-1}$ ($i = 1, \ldots, B/2$) model the $R$ sequences along the block-length dimension $K$, capturing the local feature information of the mixed voice feature sequence; the even-numbered modules $B_{2i}$ model the $K$ sequences along the block dimension $R$, capturing the global feature information of the mixed voice feature sequence. The calculation formulas are:

$$U_R = [\mathrm{BiLSTM}(v[:, :, i]),\ i = 1, \ldots, R] \tag{2}$$
$$U_K = [\mathrm{BiLSTM}(U_R[:, i, :]),\ i = 1, \ldots, K] \tag{3}$$

where $U_R$ and $U_K$ are the outputs of the odd-numbered and even-numbered BiLSTM modules, respectively. By iteratively alternating the odd and even modules, their outputs are effectively integrated for the subsequent separation.
The bidirectional recurrent neural network lets the multi-scale time delay sampling model (MTDS) attend not only to past context information but also to future context information, which is very beneficial for the voice separation task. The calculation formulas of the bidirectional recurrent neural network are:

$$h_u = \sigma(W_u[a^{<t-1>}; x^{<t>}] + b_u) \tag{4}$$
$$h_f = \sigma(W_f[a^{<t-1>}; x^{<t>}] + b_f) \tag{5}$$
$$h_o = \sigma(W_o[a^{<t-1>}; x^{<t>}] + b_o) \tag{6}$$
$$\tilde{c}^{<t>} = \tanh(W_c[a^{<t-1>}; x^{<t>}] + b_c) \tag{7}$$
$$c^{<t>} = h_u \odot \tilde{c}^{<t>} + h_f \odot c^{<t-1>} \tag{8}$$
$$a^{<t>} = h_o \odot \tanh(c^{<t>}) \tag{9}$$

where $h_u$, $h_f$, $h_o$ are the outputs of the LSTM update gate, forget gate and output gate, respectively; $W_u$, $W_f$, $W_o$, $W_c$ and $b_u$, $b_f$, $b_o$, $b_c$ are the weights and biases of the bidirectional recurrent neural network; $x^{<t>}$ and $a^{<t>}$ are the input and output of the bidirectional recurrent neural network at the current time step; $c^{<t>}$ is the corresponding memory cell; $\sigma$ is the sigmoid function and tanh is the hyperbolic tangent activation function.
Step four, multi-scale time delay sampling modeling is adopted to compensate for the information loss of the bidirectional recurrent neural network and further improve separation performance.

Although the bidirectional recurrent neural network is very effective at capturing the information of the mixed voice feature sequence, the even-numbered BiLSTM modules are applied to frames that are discontinuous in the original feature sequence: adjacent frames in each modeled sequence lie K/2 positions apart. Such a large interval means the sequential relation between these frames is very weak and hard for the bidirectional recurrent neural network to capture, and many frames of the whole mixed voice feature sequence are never captured at all, so information is seriously lost. To remedy this defect, the mixed voice feature sequence is re-modeled with multi-scale time delay sampling, capturing feature information at different scales and improving separation performance. The specific method is as follows:
(1) The output $u$ of the bidirectional recurrent neural network is sequence-reorganized and re-segmented. The segmentation strategy of step two is still adopted, except that the length of each segment obtained after segmentation is enlarged from $K$ to $2^\lambda K$.
(2) The re-segmented and spliced sequence $u'$ is delay-sampled along the block-length dimension $2^\lambda K$:

$$u_i = u'[:, i::2^\lambda, :], \quad i = 0, \ldots, 2^\lambda - 1 \tag{10}$$

where $\lambda = 0, \ldots, B-1$ is the sampling index and $2^\lambda$ is the time delay sampling rate; in particular, $\lambda = 0$ means the original segmentation strategy of the bidirectional recurrent neural network is kept without any time delay sampling. $B$ is the number of stacked multi-scale time delay sampling blocks, and $R'$ is defined as the number of segments after re-segmentation. $u'[:, i::2^\lambda, :]$ is a Python-style sequence slice of $u'$: slicing along the second dimension partitions each long block of length $2^\lambda K$ into $2^\lambda$ interleaved subsequences of length $K$, and the subsequence with index $i$ is taken from each block and spliced to obtain the new delay-sampled sequence $u_i \in \mathbb{R}^{N \times K \times R'}$.
(3) Each delay-sampled sequence $u_i$ is spliced along the block dimension $R'$:

$$h_\lambda = [u_i,\ i = 0, \ldots, 2^\lambda - 1] \tag{11}$$

where $h_\lambda \in \mathbb{R}^{N \times K \times 2^\lambda R'}$ is the 3D tensor reorganized by time delay sampling at rate $2^\lambda$. Although it has the same shape and size as the input 3D tensor $v$ of the bidirectional recurrent neural network, the internal sequence order of the two differs greatly.
(4) Along the block-length dimension of $h_\lambda$, a bidirectional recurrent neural network captures the interrelations between the delay-sampled sequences, and the output is then dimension-transformed through a fully connected layer:

$$U = [\mathrm{BiLSTM}(h_\lambda[:, :, m]),\ m = 1, \ldots, 2^\lambda R'] \tag{12}$$
$$FC = [W\,U[:, :, m] + b,\ m = 1, \ldots, 2^\lambda R'] \tag{13}$$

where $U \in \mathbb{R}^{H \times K \times 2^\lambda R'}$ is the output of the bidirectional recurrent neural network, $H$ is the output dimension of its hidden layer, $h_\lambda[:, :, m]$ is the sequence selected by index $m$, $W \in \mathbb{R}^{N \times H}$ and $b \in \mathbb{R}^{N}$ are the weight and bias of the fully connected layer, respectively, and $FC \in \mathbb{R}^{N \times K \times 2^\lambda R'}$ is the output after the dimension transformation of the fully connected layer.
Through this modeling, the whole multi-scale time delay sampling model enlarges its receptive field and additionally captures mixed voice feature sequence information at different scales: at a low time delay rate it obtains basic information such as phonemes and tones across the whole sequence, while at a high time delay rate it obtains information such as speech content and speaker identity.
(5) The result FC of step (4) is layer-normalized and residual-connected with the block input $h_\lambda$, which helps the multi-scale time delay sampling model converge and prevents gradient vanishing or explosion during training. The layer normalization and residual calculation formulas are:

$$\mu(FC) = \frac{1}{N K 2^\lambda R'} \sum_{i,j,k} FC_{i,j,k} \tag{14}$$
$$\sigma(FC) = \frac{1}{N K 2^\lambda R'} \sum_{i,j,k} \left(FC_{i,j,k} - \mu(FC)\right)^2 \tag{15}$$
$$\mathrm{Layernorm}(FC) = z \odot \frac{FC - \mu(FC)}{\sqrt{\sigma(FC) + \epsilon}} + r \tag{16}$$
$$\mathrm{Output} = h_\lambda + \mathrm{Layernorm}(FC) \tag{17}$$

where $\mu(FC)$ and $\sigma(FC)$ are the mean and variance of the fully connected layer output, $z$ and $r$ are normalization factors, $\epsilon$ is a minimal positive number preventing the denominator from being 0, Layernorm denotes layer normalization, and $i$, $j$, $k$ index the dimensions $N$, $K$ and $2^\lambda R'$, respectively.
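Steps (4) and (5) together can be sketched as follows; nn.GroupNorm with a single group plays the role of the global layer normalization of equations (14)-(16), and N = 64, H = 128 and the class name are illustrative.

```python
import torch
import torch.nn as nn

class MTDSRefine(nn.Module):
    """Second BiLSTM along each delay-sampled block, fully connected dimension
    transform, layer normalization, residual connection (eqs. (12)-(17))."""
    def __init__(self, N=64, H=128):
        super().__init__()
        self.bilstm = nn.LSTM(N, H, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * H, N)
        self.norm = nn.GroupNorm(1, N, eps=1e-8)   # global layer norm over (N, K, R)

    def forward(self, h):                          # h_lambda: (batch, N, K, R)
        b, N, K, R = h.shape
        x = h.permute(0, 3, 2, 1).reshape(b * R, K, N)   # one sequence per block m
        U, _ = self.bilstm(x)                      # eq. (12)
        FC = self.fc(U)                            # eq. (13)
        FC = FC.reshape(b, R, K, N).permute(0, 3, 2, 1)  # back to (batch, N, K, R)
        return h + self.norm(FC)                   # eqs. (14)-(17): norm + residual
```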
(6) Steps (1)-(5) are repeated, stacking B multi-scale time delay sampling modules in total, each with a different time delay sampling rate: 1, 2, 4, ..., $2^{B-1}$, i.e. the rate grows exponentially. Through this stacking, the multi-scale time delay sampling model captures phoneme-level sequence features at low time delay sampling rates and, as the rate grows, focuses further on semantic and speaker-level feature information, effectively integrating the feature information at different scales, enlarging the receptive field, and fully capturing the interrelations between sequences.
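Composing the sketches above, the stacking of step (6) amounts to a loop over exponentially growing rates; B = 4 is an illustrative choice, and `delay_sample()` and `MTDSRefine` are reused from the earlier sketches.

```python
B = 4                                   # number of stacked modules (illustrative)
blocks = [MTDSRefine() for _ in range(B)]
h = v                                   # output of the first dual-path stage, (batch, N, K, R)
for lam in range(B):                    # rates 1, 2, 4, ..., 2**(B-1)
    h = blocks[lam](delay_sample(h, lam))
```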
Step five, the Output of the multi-scale time delay sampling modeling is overlap-added (the overlap-added length equals the length of the mixed voice feature sequence extracted by the encoder) and fed into a two-dimensional convolutional neural network, which maps the overlap-added result into masks of the clean voices of the plurality of speakers; the masks are dot-multiplied with the output of the original encoder (the extracted mixed voice feature sequence) to obtain the clean voice representations of the plurality of speakers.
Step six, a one-dimensional deconvolution (transposed convolution) neural network is adopted as the decoder to recover the masked clean voice representations of the plurality of speakers into time-domain voice waveform signals, realizing voice separation.
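Steps five and six might look as follows; the ReLU mask nonlinearity and the sizes (C = 2 speakers, N = 64, W = 16) are assumptions, since the text only specifies a two-dimensional convolution for the masks and a one-dimensional deconvolution for the decoder.

```python
import torch
import torch.nn as nn

def overlap_add(v, hop):
    """Fold 50%-overlapped blocks (batch, N, K, R) back to a sequence (batch, N, T)."""
    b, N, K, R = v.shape
    T = (R - 1) * hop + K
    out = v.new_zeros(b, N, T)
    for r in range(R):
        out[:, :, r * hop : r * hop + K] += v[:, :, :, r]
    return out

C, N, W = 2, 64, 16                                   # speakers, feature dim, kernel (assumed)
mask_conv = nn.Conv2d(1, C, kernel_size=1)            # 2-D conv mapping features to C masks
decoder = nn.ConvTranspose1d(N, 1, W, stride=W // 2, bias=False)  # 1-D deconvolution decoder

def separate(output_3d, z):
    """output_3d: (batch, N, K, R) Output of the MTDS stage; z: (batch, N, T) encoder features."""
    seq = overlap_add(output_3d, hop=output_3d.shape[2] // 2)     # (batch, N, T)
    masks = torch.relu(mask_conv(seq.unsqueeze(1)))               # (batch, C, N, T), ReLU assumed
    feats = masks * z.unsqueeze(1)                                # dot-multiply with encoder output
    b, C_, N_, T = feats.shape
    return decoder(feats.reshape(b * C_, N_, T))                  # (batch*C, 1, L) waveforms
```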
The multi-scale time delay sampling model in this embodiment is trained with the scale-invariant signal-to-noise ratio (SI-SNR) as the loss function, calculated as follows:

$$s_{target} = \frac{\langle \hat{x}, x \rangle\, x}{\lVert x \rVert^2}, \qquad e_{noise} = \hat{x} - s_{target}$$
$$\mathrm{SI\text{-}SNR} = 10 \log_{10} \frac{\lVert s_{target} \rVert^2}{\lVert e_{noise} \rVert^2}$$

where $s_{target}$ and $e_{noise}$ are intermediate variables, and $\hat{x}$ and $x$ denote the separated voice and the clean voice, respectively.
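A standard SI-SNR implementation consistent with the variables named above (a sketch; the zero-mean step is the usual convention and is assumed here):

```python
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB; est/ref: (..., L) separated and clean waveforms."""
    est = est - est.mean(dim=-1, keepdim=True)       # zero-mean (usual convention)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # s_target: projection of the estimate onto the clean reference
    s_target = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)

# Training minimizes the negative SI-SNR:  loss = -si_snr(separated, clean).mean()
```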
Compared with existing voice separation algorithms, the single-channel voice separation method with multi-scale time delay sampling fully exploits the latent correlations in the mixed voice feature sequence, effectively improves voice separation performance, reduces the distortion of the separated voice, and improves its intelligibility, which gives it good reference value for both theoretical research and practical application.
Although embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives, and variations may be made in the above embodiments by those skilled in the art without departing from the spirit and principles of the invention.
Claims (7)
1. A single-channel voice separation method using multi-scale time delay sampling, characterized by comprising the following steps:
S1, extracting mixed voice features from the mixed voice signals of a plurality of speakers;
S2, segmenting the mixed voice features and splicing the segments into a 3D tensor;
S3, modeling the 3D tensor iteratively with a first bidirectional recurrent neural network, capturing local feature information and global feature information;
S4, adopting multi-scale time delay sampling modeling: performing time delay sampling of different scales on the output of the first bidirectional recurrent neural network, and capturing the feature information at the different scales with a second bidirectional recurrent neural network;
the output of the first bidirectional recurrent neural network is subjected to time delay sampling at a single scale as follows: the output of the first bidirectional recurrent neural network is sequence-reorganized and re-segmented, the re-segmented and spliced sequence is delay-sampled along the block-length dimension $2^\lambda K$, and each delay-sampled sequence $u_i$ is spliced along the block dimension $R'$ to obtain the reorganized 3D tensor $h_\lambda$, where $2^\lambda$ is the time delay sampling rate, $\lambda = 0, \ldots, B-1$ is the sampling index, $B$ is the number of stacked multi-scale time delay sampling blocks, and $K$ is the block-length dimension before re-segmentation;
S5, overlap-adding the feature information from S4, mapping the overlap-added result into masks of the clean voices of the plurality of speakers, and dot-multiplying the masks with the mixed voice features to obtain the clean voice representations of the plurality of speakers;
S6, reconstructing each speaker's clean voice representation into a separated voice signal.
2. The single-channel voice separation method of multi-scale time delay sampling according to claim 1, wherein the second bidirectional recurrent neural network captures the feature information at a single scale as follows:
along the block-length dimension of $h_\lambda$, the second bidirectional recurrent neural network captures the interrelations between the delay-sampled sequences, and the output is then dimension-transformed through a fully connected layer to give FC; FC is layer-normalized and then residual-connected with the output of the bidirectional recurrent neural network.
3. The single-channel voice separation method of multi-scale time delay sampling according to claim 2, wherein, with different time delay sampling rates, the single-scale time delay sampling of the output of the first bidirectional recurrent neural network and the capture of the single-scale feature information by the second bidirectional recurrent neural network are repeated, so that the feature information at different scales is integrated.
4. The method of claim 1, wherein the re-segmented and spliced sequence is delay-sampled along the block-length dimension $2^\lambda K$ using the formula:

$$u_i = u'[:, i::2^\lambda, :], \quad i = 0, \ldots, 2^\lambda - 1$$

where $u'[:, i::2^\lambda, :]$ denotes a Python-style sequence slice of $u'$ and $i$ is the index number.
5. The method of claim 2, wherein the second bidirectional recurrent neural network captures the interrelations between the delay-sampled sequences along $h_\lambda$, and the output is dimension-transformed through a fully connected layer to give FC, according to:

$$U = [\mathrm{BiLSTM}(h_\lambda[:, :, m]),\ m = 1, \ldots, 2^\lambda R']$$
$$FC = [W\,U[:, :, m] + b,\ m = 1, \ldots, 2^\lambda R']$$

where $U$ is the output of the bidirectional recurrent neural network, $H$ is the output dimension of its hidden layer, $h_\lambda[:, :, m]$ is the sequence selected by index $m$, $2^\lambda R'$ is the block dimension, and $W$ and $b$ are the weight and bias of the fully connected layer, respectively.
6. The method of claim 2, wherein FC is layer-normalized according to:

$$\mathrm{Layernorm}(FC) = z \odot \frac{FC - \mu(FC)}{\sqrt{\sigma(FC) + \epsilon}} + r$$

where $\mu(FC)$ and $\sigma(FC)$ are the mean and variance of the fully connected layer output, $z$ and $r$ are normalization factors, $\epsilon$ is a very small positive number, Layernorm denotes layer normalization, $i$, $j$ and $k$ index the dimensions $N$, $K$ and $2^\lambda R'$ respectively, and $N$ is the dimension of the extracted mixed voice features.
7. The single-channel voice separation method of multi-scale time delay sampling according to claim 6, wherein the residual connection is given by:

$$\mathrm{Output} = h_\lambda + \mathrm{Layernorm}(FC)$$
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202111006251.7A | 2021-08-30 | 2021-08-30 | Single-channel voice separation method for multi-scale time delay sampling

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202111006251.7A | 2021-08-30 | 2021-08-30 | Single-channel voice separation method for multi-scale time delay sampling
Publications (2)

Publication Number | Publication Date
---|---
CN113782045A | 2021-12-10
CN113782045B | 2024-01-05
Family
ID=78840162

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202111006251.7A (CN113782045B, active) | Single-channel voice separation method for multi-scale time delay sampling | 2021-08-30 | 2021-08-30

Country Status (1)

Country | Link
---|---
CN (1) | CN113782045B (en)
Citations (6)

Publication number | Priority date | Publication date | Title
---|---|---|---
CN110459240A | 2019-08-12 | 2019-11-15 | Multi-speaker speech separation method based on convolutional neural networks and deep clustering *
CN111243579A | 2020-01-19 | 2020-06-05 | Time-domain single-channel multi-speaker voice recognition method and system *
CN111429938A | 2020-03-06 | 2020-07-17 | Single-channel voice separation method and device and electronic equipment *
CN112071325A | 2020-09-04 | 2020-12-11 | Many-to-many voice conversion method based on dual voiceprint feature vectors and sequence-to-sequence modeling *
CN112289304A | 2019-07-24 | 2021-01-29 | Multi-speaker voice synthesis method based on a variational autoencoder *
CN113053407A | 2021-02-06 | 2021-06-29 | Single-channel voice separation method and system for multiple speakers *
Non-Patent Citations (1)

Title
---
Gao Lijian; Mao Qirong. Environment-assisted multi-task hybrid sound event detection method (环境辅助的多任务混合声音事件检测方法). Computer Science (计算机科学), No. 01, full text. *
Also Published As

Publication number | Publication date
---|---
CN113782045A | 2021-12-10
Similar Documents

Publication | Publication Date | Title
---|---|---
CN111429938B | | Single-channel voice separation method and device and electronic equipment
WO2020042707A1 | | Convolutional recurrent neural network-based single-channel real-time noise reduction method
CN110246510B | | End-to-end voice enhancement method based on RefineNet
CN110751957B | | Speech enhancement method using stacked multi-scale modules
CN113470671B | | Audio-visual voice enhancement method and system fully utilizing vision and voice connection
CN109890043B | | Wireless signal noise reduction method based on a generative adversarial network
CN108847244A | | Voiceprint recognition method and system based on MFCC and an improved BP neural network
CN111899757B | | Single-channel voice separation method and system for target speaker extraction
CN112581979A | | Speech emotion recognition method based on spectrograms
CN113488060B | | Voiceprint recognition method and system based on a variational information bottleneck
Zhang et al. | | BirdSoundsDenoising: deep visual audio denoising for bird sounds
CN115602152B | | Voice enhancement method based on a multi-stage attention network
CN110176250B | | Robust acoustic scene recognition method based on local learning
CN113643723A | | Voice emotion recognition method based on attention CNN Bi-GRU fusing visual information
CN111785262B | | Speaker age and gender classification method based on residual networks and fused features
CN113191178A | | Underwater acoustic target identification method based on auditory perception feature deep learning
CN112183582A | | Multi-feature fusion underwater target identification method
CN113763965A | | Speaker identification method fusing multiple attention features
EP4211686A1 | | Machine learning for microphone style transfer
CN116467416A | | Multi-modal dialogue emotion recognition method and system based on graph neural networks
Jin et al. | | Speech separation and emotion recognition for multi-speaker scenarios
CN113889099A | | Voice recognition method and system
CN116403594B | | Speech enhancement method and device based on a noise update factor
CN117711442A | | Infant crying classification method based on a CNN-GRU fusion model
CN111916060B | | Deep learning voice endpoint detection method and system based on spectral subtraction
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |