CN113782045A - Single-channel voice separation method for multi-scale time delay sampling


Info

Publication number
CN113782045A
Authority
CN
China
Prior art keywords: time delay; delay sampling; neural network; voice; output
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111006251.7A
Other languages
Chinese (zh)
Other versions
CN113782045B (en)
Inventor
毛启容
钱双庆
陈静静
贾洪杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2021-08-30
Filing date: 2021-08-30
Publication date: 2021-12-10
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN202111006251.7A
Publication of CN113782045A
Application granted
Publication of CN113782045B
Legal status: Active


Classifications

    • G - PHYSICS
      • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
            • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
              • G10L21/0272 - Voice signal separating
              • G10L21/0208 - Noise filtering
                • G10L2021/02087 - Noise filtering the noise being separate speech, e.g. cocktail party
          • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/03 - characterised by the type of extracted parameters
            • G10L25/27 - characterised by the analysis technique
              • G10L25/30 - characterised by the analysis technique using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
          • Y02T10/00 - Road transport of goods or passengers
            • Y02T10/10 - Internal combustion engine [ICE] based vehicles
              • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Stereophonic System (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention provides a single-channel voice separation method based on multi-scale time delay sampling. Mixed voice features are extracted from the mixed voice signal of multiple speakers, segmented, and spliced into a 3D tensor. The 3D tensor is modeled iteratively by a first bidirectional recurrent neural network, which captures local and global feature information. Multi-scale time delay sampling modeling is then applied: the output of the first bidirectional recurrent neural network is time-delay sampled at different scales, and a second bidirectional recurrent neural network captures the feature information at each scale. The feature information is overlap-added and mapped into masks of the clean voices of the multiple speakers; the masks are point-multiplied with the mixed voice features to obtain the clean voice features of the speakers, and each person's clean voice features are reconstructed into a separated voice signal. Through multi-scale time delay sampling modeling, the invention compensates for the information loss of segmented modeling and greatly improves voice separation performance.

Description

Single-channel voice separation method for multi-scale time delay sampling
Technical Field
The invention belongs to the field of voice separation, and particularly relates to a single-channel voice separation method for multi-scale time delay sampling.
Background
In recent years, with the rise of deep learning, the resulting wave of artificial intelligence has changed every aspect of daily life. Voice interaction is an indispensable part of this: clear speech allows smart appliances to execute the corresponding commands more reliably, greatly improving their degree of intelligence. In real acoustic scenes, however, the speech of the speaker of interest is usually interfered with by other speakers, which is the classic cocktail party problem. The human auditory system can easily pick out the voice content of a target speaker, so voice separation imitates it: the clean speech of one or all speakers is separated from the speakers' mixed speech, removing interference such as background noise and reverberation and improving the clarity and intelligibility of the speech.
At present, most mainstream voice separation methods adopt end-to-end time-domain separation, in which a model learns a latent common representation of the mixed speech waveform in a data-driven manner to realize separation. Such methods place extremely high demands on framing during voice preprocessing: research shows that the more frames a segment of mixed speech is divided into over the same duration, the better the final separation effect. Too many frames, however, make modeling infeasible. To address this, Yi Luo et al. proposed a dual-path segmented modeling method in "Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation", effectively realizing sample-level modeling of long speech sequences and greatly improving separation performance. However, when capturing global information, this segmented modeling method places frames that are adjacent in the inter-segment sequence far apart in the original sequence; there is little correlation between such adjacent frames, many inter-frame relationships go uncaptured, and this seriously causes information loss.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a single-channel voice separation method with multi-scale time delay sampling, which compensates for the information loss of segmented modeling through multi-scale time delay sampling modeling and further improves separation performance.
A single-channel voice separation method with multi-scale time delay sampling comprises the following steps:
S1, extracting mixed voice features from the mixed voice signal of multiple speakers;
S2, segmenting the mixed voice features and splicing them into a 3D tensor;
S3, iteratively modeling the 3D tensor with a first bidirectional recurrent neural network, capturing local and global feature information;
S4, using multi-scale time delay sampling modeling, performing time delay sampling at different scales on the output of the first bidirectional recurrent neural network, and capturing the feature information at each scale with a second bidirectional recurrent neural network;
S5, overlap-adding the feature information from S4, mapping the overlap-added result into masks of the clean voices of the multiple speakers, and point-multiplying the masks with the mixed voice features to obtain the clean voice features of the speakers;
S6, reconstructing each person's clean voice features into a separated voice signal.
Further, single-scale time delay sampling is performed on the output of the first bidirectional recurrent neural network as follows:
the output of the bidirectional recurrent neural network is sequence-recombined and re-segmented, the re-segmented sequences are spliced and time-delay sampled along the block length dimension 2^λK, and each time-delay-sampled sequence u_i is spliced along the block dimension R′ to obtain the recombined 3D tensor h_λ after time delay sampling, where 2^λ is the delay sampling rate, λ = 0, …, B−1 is the sampling index, B is the number of stacked multi-scale delay sampling blocks, and K is the block length dimension before re-segmentation.
Furthermore, the second bidirectional recurrent neural network captures the feature information at a single scale as follows:
along h_λ, the second bidirectional recurrent neural network captures the interrelation between the time-delay-sampled sequences; the output result is then dimension-transformed through a fully connected layer to produce FC; FC is layer-normalized and then residual-connected with the output of the bidirectional recurrent neural network.
Further, different delay sampling rates are adopted: single-scale time delay sampling is repeatedly performed on the output of the first bidirectional recurrent neural network, the second bidirectional recurrent neural network captures the feature information at each single scale, and the feature information at the different scales is integrated.
Further, the spliced re-segmented sequences are time-delay sampled along the block length dimension 2^λK using the formula:

u_i = u′[:, i::2^λ, :], i = 0, …, 2^λ − 1

where u′[:, i::2^λ, :] denotes Python-style sequence slicing of u′ and i is the index number.
Further, along the second dimension of h_λ, the second bidirectional recurrent neural network captures the interrelation between the time-delay-sampled sequences, and the output result is dimension-transformed through a fully connected layer to produce FC, using the formulas:

U = [BiLSTM(h_λ[:, :, m]), m = 1, …, 2^λR′]
FC = [W·U[:, :, m] + b, m = 1, …, 2^λR′]

where U is the output of the bidirectional recurrent neural network, H is the output dimension of the hidden layer, h_λ[:, :, m] is the sequence determined by the index m, 2^λR′ is the block dimension, and W and b are respectively the weight and bias of the fully connected layer.
Further, FC is layer-normalized using the formulas:

μ(FC) = (1 / (N·K·2^λR′)) Σ_i Σ_j Σ_k FC[i, j, k]
σ(FC) = (1 / (N·K·2^λR′)) Σ_i Σ_j Σ_k (FC[i, j, k] − μ(FC))²
Layernorm(FC) = z ⊙ (FC − μ(FC)) / √(σ(FC) + ε) + r

where μ(FC) and σ(FC) respectively denote the mean and variance of the fully connected layer's output, z and r are normalization factors, ε is a very small positive number, Layernorm denotes layer normalization, i, j, k respectively index the N, K and 2^λR′ dimensions, and N is the dimension of the extracted mixed voice features.
Further, the specific formula for the residual connection is as follows:
Output=hλ+Layernorm(FC)。
the invention has the beneficial effects that: according to the invention, multi-scale time delay sampling modeling is adopted, time delay sampling of different scales is carried out on the output of the first bidirectional circulating neural network, and the second bidirectional circulating neural network is adopted to capture characteristic information under different scales, so that the interrelation among sequences under different scales is effectively integrated, the information loss of the bidirectional circulating neural network during sectional modeling is compensated, the voice separation performance is greatly improved, the voice distortion rate is effectively reduced, and the intelligibility of separated voice is also improved.
Drawings
Fig. 1 is a flow chart of a single-channel speech separation method based on multi-scale time delay sampling according to the invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
As shown in fig. 1, the single-channel speech separation method based on multi-scale delay sampling of the present invention specifically includes the following steps:
Step one, encode the input mixed voice signals of multiple speakers with an encoder and extract the corresponding mixed voice features.

For a given mixed speech signal x ∈ R^(1×L), a one-dimensional convolutional neural network is used as the encoder to extract high-order mixed speech features z ∈ R^(N×T), where R denotes the set of real numbers, L is the length of the mixed voice signal, N is the dimension of the extracted mixed voice features, and T is the number of frames of the mixed voice features. The convolution kernel size of the one-dimensional convolutional neural network is W and the stride of the convolution window is W/2; a ReLU function is appended after the convolution for nonlinear transformation. The calculation formula is:

z = ReLU(Conv1D(x))    (1)
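As a concrete illustration, a minimal PyTorch sketch of such an encoder follows; the values N = 64 and W = 16 are illustrative assumptions, not values fixed by the invention:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """1-D convolutional encoder: waveform -> mixed speech features (sketch)."""
    def __init__(self, n_features=64, kernel_size=16):  # N and W are assumed values
        super().__init__()
        # kernel size W, stride W/2, as described in step one
        self.conv = nn.Conv1d(1, n_features, kernel_size,
                              stride=kernel_size // 2, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, 1, L) mixed waveform -> z: (batch, N, T)
        return self.relu(self.conv(x))

# usage: z = Encoder()(torch.randn(1, 1, 32000))  # e.g. 2 s of 16 kHz audio
```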
Step two, segment the mixed voice features output by the encoder and splice them into a 3D tensor.

The mixed voice features z output by the encoder are divided in units of P; each divided block has length K, there are R blocks in total, and adjacent blocks overlap by 50%. All blocks are then spliced into a 3D tensor v ∈ R^(N×K×R).
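A minimal sketch of this segmentation, assuming the division unit P is the hop size K/2 implied by the 50% overlap, and ignoring edge padding for brevity:

```python
import torch

def segment(z, K):
    """Split features z of shape (N, T) into R overlapping blocks of length K
    (50% overlap, hop P = K/2) and stack them into a 3D tensor v of shape (N, K, R)."""
    N, T = z.shape
    P = K // 2                                  # hop size for 50% overlap
    blocks = [z[:, r * P : r * P + K] for r in range((T - K) // P + 1)]
    return torch.stack(blocks, dim=-1)          # v: (N, K, R)
```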
Step three, modeling the 3D tensor v by adopting a bidirectional cyclic neural network in an iterative manner, and correspondingly capturing local characteristic information and global characteristic information
B bidirectional recurrent neural network (BilsTM) modules (the bidirectional recurrent neural network modules comprise odd number BilsTM modules and even number BilsTM modules, and the odd number BilsTM modules and the even number BilsTM modules are alternately stacked) are stacked, and the input 3D tensor is modeled, wherein the odd number BilsTM module B2i-1(i 1.. b/2) sequence modeling along the R number of block length dimensions K, capturing local features of the mixed speech feature sequenceInformation; even number of BilSTM modules B2iModeling along the dimension R of the blocks with the number of K, and capturing global feature information of the mixed voice feature sequence, wherein the calculation formula is as follows:
UR=[BiLSTM(v[:,:,i]),i=1,...,R] (2)
UK=[BiLSTM(UR[:,i,:]),i=1,...,K] (3)
in the formula of UR
Figure BDA0003237184760000043
The outputs of the odd number BilSTM module and the even number BilSTM module are respectively, and the speaker information is effectively integrated through the iterative alternate execution of the odd number BilSTM module and the even number BilSTM module so as to facilitate the subsequent separation.
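One odd/even pair of modules might be sketched as follows; the hidden size and the linear projections back to N channels are assumptions for illustration:

```python
import torch
import torch.nn as nn

class DualBiLSTM(nn.Module):
    """One odd/even pair: BiLSTM along K (local), then along R (global) (sketch)."""
    def __init__(self, n_features=64, hidden=128):  # assumed sizes
        super().__init__()
        self.intra = nn.LSTM(n_features, hidden, bidirectional=True, batch_first=True)
        self.inter = nn.LSTM(n_features, hidden, bidirectional=True, batch_first=True)
        self.proj_intra = nn.Linear(2 * hidden, n_features)
        self.proj_inter = nn.Linear(2 * hidden, n_features)

    def forward(self, v):
        # v: (batch, N, K, R)
        b, n, k, r = v.shape
        # odd-numbered module: model each block along its length K
        x = v.permute(0, 3, 2, 1).reshape(b * r, k, n)      # (b*R, K, N)
        x = self.proj_intra(self.intra(x)[0])
        x = x.reshape(b, r, k, n).permute(0, 3, 2, 1)       # back to (batch, N, K, R)
        # even-numbered module: model across blocks along R
        y = x.permute(0, 2, 3, 1).reshape(b * k, r, n)      # (b*K, R, N)
        y = self.proj_inter(self.inter(y)[0])
        return y.reshape(b, k, r, n).permute(0, 3, 1, 2)    # (batch, N, K, R)
```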
The bidirectional recurrent neural network enables the multi-scale time delay sampling model (MTDS) to attend not only to past context information but also to future context information, which is very beneficial for the voice separation task. The calculation formulas of the bidirectional recurrent neural network are:
h_u = tanh(W_u[a^<t−1>; x^<t>] + b_u)    (4)
h_f = tanh(W_f[a^<t−1>; x^<t>] + b_f)    (5)
h_o = tanh(W_o[a^<t−1>; x^<t>] + b_o)    (6)
c̃^<t> = tanh(W_c[a^<t−1>; x^<t>] + b_c)    (7)
c^<t> = h_u ⊙ c̃^<t> + h_f ⊙ c^<t−1>    (8)
a^<t> = h_o ⊙ tanh(c^<t>)    (9)

where h_u, h_f, h_o are the outputs of the LSTM update gate, forget gate and output gate; W_u, W_f, W_o, W_c and b_u, b_f, b_o, b_c are the weights and biases in the bidirectional recurrent neural network; x^<t> and a^<t> denote the input and output of the bidirectional recurrent neural network at the current time; c̃^<t> is the candidate memory and c^<t> is the memory cell in the corresponding bidirectional recurrent neural network; tanh is the activation function.
Step four, adopting multi-scale time delay sampling modeling to make up for information loss in the bidirectional cyclic neural network so as to further improve the separation performance
Although the bidirectional recurrent neural network is very effective for capturing information of the mixed speech feature sequence, when an even number BilSTM module executes modeling processing, the bidirectional recurrent neural network is applied to K discontinuous frames, the interval between adjacent frames is R//2, the overlarge interval obviously causes few sequence relations between the frames, the bidirectional recurrent neural network is difficult to capture effective information, and a plurality of relationships between the frames are not captured in the whole mixed speech feature sequence, thereby seriously causing the loss of the information. In order to make up for the defect, multi-scale time delay sampling is adopted to re-model the mixed voice feature sequence, and then feature information under different scales is captured, so that the separation performance is improved. The method comprises the following specific steps:
(1) The output u ∈ R^(N×K×R) of the bidirectional recurrent neural network is sequence-recombined and re-segmented, still adopting the segmentation strategy used in the bidirectional recurrent neural network, with the difference that the length of each segment obtained after segmentation is expanded from K to 2^λK.

(2) The re-segmented sequences are spliced into u′ ∈ R^(N×2^λK×R′), and time delay sampling is performed along the block length dimension 2^λK. The calculation formula is:

u_i = u′[:, i::2^λ, :], i = 0, …, 2^λ − 1    (10)

where λ = 0, …, B−1 is the sampling index and 2^λ is the time delay sampling rate; in particular, λ = 0 means the original segmentation strategy of the bidirectional recurrent neural network is retained and no time delay sampling is added. B denotes the number of stacked multi-scale time delay sampling blocks, and R′ = R/2^λ is defined as the number of segments after re-segmentation. u′[:, i::2^λ, :] denotes Python-style sequence slicing of u′: along its second dimension, the sequence is divided into 2^λ interleaved fragments of length K each, and the fragment with index i is taken from each block and spliced, giving the new time-delay-sampled sequence u_i ∈ R^(N×K×R′).
(3) Each time-delay-sampled sequence u_i is spliced along the block dimension R′:

h_λ = [u_0, u_1, …, u_(2^λ−1)]    (11)

where h_λ ∈ R^(N×K×2^λR′) is the recombined 3D tensor after time delay sampling at delay rate 2^λ; although it does not differ from the input 3D tensor v of the bidirectional recurrent neural network in shape and size, its internal sequence order is greatly changed.
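To make steps (1)-(3) concrete, here is a minimal sketch under one plausible reading of the re-segmentation (groups of 2^λ consecutive blocks are concatenated; R is assumed divisible by 2^λ, otherwise padding would be needed):

```python
import torch

def delay_sample(u, lam):
    """Re-segment u (N, K, R) into blocks of length 2**lam * K, take every
    2**lam-th frame (Python slicing i::2**lam), and re-splice along the block
    dimension, giving h_lam of the same shape (N, K, R) but reordered (sketch)."""
    N, K, R = u.shape
    rate = 2 ** lam                          # time delay sampling rate
    R_prime = R // rate                      # number of segments after re-segmentation
    # (1)-(2): merge groups of `rate` consecutive blocks -> u' of shape (N, rate*K, R')
    u_prime = (u.reshape(N, K, R_prime, rate)
                .permute(0, 3, 1, 2)
                .reshape(N, rate * K, R_prime))
    # (2): u_i = u'[:, i::rate, :] for i = 0..rate-1, each of shape (N, K, R')
    u_slices = [u_prime[:, i::rate, :] for i in range(rate)]
    # (3): splice along the block dimension R' -> h_lam: (N, K, rate * R')
    return torch.cat(u_slices, dim=-1)
```

For λ = 0 this reduces to the identity, matching the remark after formula (10).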
(4) Along the third dimension of h_λ, a bidirectional recurrent neural network is used to capture the interrelation between the time-delay-sampled sequences, and the output result is then dimension-transformed through a fully connected layer. The specific calculation formulas are:

U = [BiLSTM(h_λ[:, :, m]), m = 1, …, 2^λR′]    (12)
FC = [W·U[:, :, m] + b, m = 1, …, 2^λR′]    (13)

where U is the output of the bidirectional recurrent neural network, H is the output dimension of the hidden layer, h_λ[:, :, m] is the sequence determined by the index m, W and b are respectively the weight and bias of the fully connected layer, and FC is the output after dimension transformation by the fully connected layer.
Through this modeling approach, the whole multi-scale time delay sampling model can expand its receptive field and additionally capture mixed voice feature sequence information at different scales: for example, basic information such as phonemes and tones across the whole sequence is obtained at a low delay rate, while information such as speech content and speaker identity is obtained at a high delay rate.
(5) The result FC of step (4) is layer-normalized and then residual-connected with the initial input (the output of the bidirectional recurrent neural network) to help the multi-scale time delay sampling model converge and to prevent gradient vanishing or gradient explosion during training. The layer normalization and residual calculation formulas are:
μ(FC) = (1 / (N·K·2^λR′)) Σ_i Σ_j Σ_k FC[i, j, k]    (14)
σ(FC) = (1 / (N·K·2^λR′)) Σ_i Σ_j Σ_k (FC[i, j, k] − μ(FC))²    (15)
Layernorm(FC) = z ⊙ (FC − μ(FC)) / √(σ(FC) + ε) + r    (16)
Output = h_λ + Layernorm(FC)    (17)

where μ(FC) and σ(FC) denote the mean and variance of the fully connected layer's output, z and r are normalization factors, ε is a very small positive number preventing the denominator from being 0, Layernorm denotes layer normalization, and i, j, k respectively index the N, K and 2^λR′ dimensions.
(6) Steps (1) to (5) are repeated, stacking B multi-scale time delay sampling modules, where each module adopts a different time delay sampling rate taken from 1, 2, 4, …, 2^(B−1); the delay sampling rate thus increases exponentially. Through this stacking, the multi-scale time delay sampling model captures phoneme-level sequence features at low delay sampling rates and, as the delay sampling rate grows, pays more attention to semantic information or feature information between speakers, thereby effectively integrating feature information at different scales, expanding the receptive field, and fully capturing the interrelations between sequences.
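Folding the delay sampling above together with the BiLSTM of formula (12), the fully connected layer of formula (13), and the layer normalization and residual of formulas (14)-(17) gives one possible module-level sketch; the class name MTDSBlock, the hidden sizes, and the use of GroupNorm(1, ·) as a stand-in for the global layer normalization are illustrative assumptions, and the exact interleaving with the step-three modules is not fixed here:

```python
import torch
import torch.nn as nn

class MTDSBlock(nn.Module):
    """One multi-scale time delay sampling block (steps (1)-(5)), sketched
    with assumed sizes; `lam` fixes the delay sampling rate 2**lam."""
    def __init__(self, lam, n_features=64, hidden=128):
        super().__init__()
        self.rate = 2 ** lam
        self.rnn = nn.LSTM(n_features, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_features)
        self.norm = nn.GroupNorm(1, n_features, eps=1e-8)  # layer norm over (N, K, R)

    def forward(self, u):
        b, n, K, R = u.shape                # (batch, N, K, R); assume R % rate == 0
        Rp = R // self.rate
        # steps (1)-(3): re-segment, slice i::rate, re-splice -> h_lam (reordered)
        up = (u.reshape(b, n, K, Rp, self.rate).permute(0, 1, 4, 2, 3)
               .reshape(b, n, self.rate * K, Rp))
        h = torch.cat([up[:, :, i::self.rate, :] for i in range(self.rate)], dim=-1)
        # step (4): BiLSTM along the block length for each of the 2**lam * R' blocks
        x = h.permute(0, 3, 2, 1).reshape(b * R, K, n)
        x = self.fc(self.rnn(x)[0]).reshape(b, R, K, n).permute(0, 3, 2, 1)
        # step (5): layer normalization and residual connection, as in formula (17)
        return h + self.norm(x)

# step (6): stack B modules with exponentially increasing rates 1, 2, 4, ..., 2**(B-1)
B = 4
mtds = nn.Sequential(*[MTDSBlock(lam) for lam in range(B)])
out = mtds(torch.randn(1, 64, 100, 64))     # R = 64 is divisible by 2**(B-1) = 8
```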
Step five, overlap-add the Output after multi-scale time delay sampling modeling (the overlap-add length is the same as the length of the mixed voice feature sequence extracted by the encoder), then feed the result into a two-dimensional convolutional neural network that maps it into masks of the clean voices of the multiple speakers; point-multiply the masks with the original encoder output (the extracted mixed voice feature sequence) to obtain the clean voice features of the speakers.
Step six, use a one-dimensional deconvolutional (transposed convolution) neural network as the decoder to restore the masked clean voice representations of the multiple speakers into time-domain voice waveform signals, realizing voice separation.
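A minimal sketch of steps five and six, with the number of speakers C, the feature dimension N, and the ReLU mask nonlinearity as assumptions (the overlap-add preceding this is omitted for brevity):

```python
import torch
import torch.nn as nn

class MaskAndDecode(nn.Module):
    """Steps five and six (sketch): map separator output to per-speaker masks,
    apply them to the encoder features, and decode back to waveforms."""
    def __init__(self, n_features=64, kernel_size=16, n_speakers=2):  # assumed values
        super().__init__()
        self.mask_conv = nn.Conv2d(1, n_speakers, kernel_size=1)   # 2-D mask estimator
        self.decoder = nn.ConvTranspose1d(n_features, 1, kernel_size,
                                          stride=kernel_size // 2, bias=False)

    def forward(self, sep_out, z):
        # sep_out: (batch, N, T) overlap-added separator output
        # z:       (batch, N, T) original encoder features
        masks = torch.relu(self.mask_conv(sep_out.unsqueeze(1)))   # (batch, C, N, T)
        clean_feats = masks * z.unsqueeze(1)                        # point multiplication
        b, C, N, T = clean_feats.shape
        wav = self.decoder(clean_feats.reshape(b * C, N, T))        # (b*C, 1, L)
        return wav.reshape(b, C, -1)                                # separated waveforms
```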
The multi-scale time delay sampling model in this embodiment is trained using the scale-invariant signal-to-noise ratio (SI-SNR) as the loss function. The calculation formulas are as follows:
s_target = (⟨x̂, x⟩ · x) / ‖x‖²    (18)
e_noise = x̂ − s_target    (19)
SI-SNR = 10·log₁₀(‖s_target‖² / ‖e_noise‖²)    (20)

where s_target and e_noise are intermediate variables, and x̂ and x denote the separated speech and the clean speech, respectively.
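A minimal sketch of this loss, negating SI-SNR so that gradient descent maximizes it; the zero-mean normalization of both signals is a common convention assumed here rather than spelled out in formulas (18)-(20):

```python
import torch

def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR between estimated and reference waveforms,
    following formulas (18)-(20); inputs have shape (..., L)."""
    est = est - est.mean(dim=-1, keepdim=True)   # assumed zero-mean normalization
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # s_target = <est, ref> * ref / ||ref||^2
    dot = torch.sum(est * ref, dim=-1, keepdim=True)
    s_target = dot * ref / (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps)
    e_noise = est - s_target
    si_snr = 10 * torch.log10(
        torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps) + eps)
    return -si_snr.mean()
```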
Compared with existing voice separation algorithms, the multi-scale time delay sampling single-channel voice separation method adopted in this embodiment fully mines the latent correlations within the mixed voice feature sequence, effectively improves voice separation performance, reduces the distortion rate of the separated voice, and improves its intelligibility; it has good reference value for both theoretical research and practical application.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention.

Claims (8)

1. A single-channel voice separation method with multi-scale time delay sampling, characterized by comprising the following steps:
S1, extracting mixed voice features from the mixed voice signal of multiple speakers;
S2, segmenting the mixed voice features and splicing them into a 3D tensor;
S3, iteratively modeling the 3D tensor with a first bidirectional recurrent neural network, capturing local and global feature information;
S4, using multi-scale time delay sampling modeling, performing time delay sampling at different scales on the output of the first bidirectional recurrent neural network, and capturing the feature information at each scale with a second bidirectional recurrent neural network;
S5, overlap-adding the feature information from S4, mapping the overlap-added result into masks of the clean voices of the multiple speakers, and point-multiplying the masks with the mixed voice features to obtain the clean voice features of the speakers;
S6, reconstructing each person's clean voice features into a separated voice signal.
2. The single-channel voice separation method with multi-scale time delay sampling according to claim 1, characterized in that single-scale time delay sampling is performed on the output of the first bidirectional recurrent neural network as follows:
the output of the bidirectional recurrent neural network is sequence-recombined and re-segmented, the re-segmented sequences are spliced and time-delay sampled along the block length dimension 2^λK, and each time-delay-sampled sequence u_i is spliced along the block dimension R′ to obtain the recombined 3D tensor h_λ after time delay sampling, where 2^λ is the delay sampling rate, λ = 0, …, B−1 is the sampling index, B is the number of stacked multi-scale delay sampling blocks, and K is the block length dimension before re-segmentation.
3. The single-channel voice separation method with multi-scale time delay sampling according to claim 2, characterized in that the second bidirectional recurrent neural network captures the feature information at a single scale as follows:
along h_λ, the second bidirectional recurrent neural network captures the interrelation between the time-delay-sampled sequences; the output result is then dimension-transformed through a fully connected layer to produce FC; FC is layer-normalized and then residual-connected with the output of the bidirectional recurrent neural network.
4. The method according to claim 3, characterized in that different delay sampling rates are adopted: single-scale time delay sampling is repeatedly performed on the output of the first bidirectional recurrent neural network, the second bidirectional recurrent neural network captures the feature information at each single scale, and the feature information at the different scales is integrated.
5. The single-channel voice separation method with multi-scale time delay sampling according to claim 2, characterized in that the spliced re-segmented sequences are time-delay sampled along the block length dimension 2^λK using the formula:

u_i = u′[:, i::2^λ, :], i = 0, …, 2^λ − 1

where u′[:, i::2^λ, :] denotes Python-style sequence slicing of u′ and i is the index number.
6. The method according to claim 3, characterized in that along the second dimension of h_λ, the second bidirectional recurrent neural network captures the interrelation between the time-delay-sampled sequences, and the output result is dimension-transformed through a fully connected layer to produce FC, using the formulas:

U = [BiLSTM(h_λ[:, :, m]), m = 1, …, 2^λR′]
FC = [W·U[:, :, m] + b, m = 1, …, 2^λR′]

where U is the output of the bidirectional recurrent neural network, H is the output dimension of the hidden layer, h_λ[:, :, m] is the sequence determined by the index m, 2^λR′ is the block dimension, and W and b are respectively the weight and bias of the fully connected layer.
7. The single-channel voice separation method with multi-scale delay sampling according to claim 3, characterized in that FC is layer-normalized using the formulas:

μ(FC) = (1 / (N·K·2^λR′)) Σ_i Σ_j Σ_k FC[i, j, k]
σ(FC) = (1 / (N·K·2^λR′)) Σ_i Σ_j Σ_k (FC[i, j, k] − μ(FC))²
Layernorm(FC) = z ⊙ (FC − μ(FC)) / √(σ(FC) + ε) + r

where μ(FC) and σ(FC) respectively denote the mean and variance of the fully connected layer's output, z and r are normalization factors, ε is a very small positive number, Layernorm denotes layer normalization, i, j, k respectively index the N, K and 2^λR′ dimensions, and N is the dimension of the extracted mixed voice features.
8. The method according to claim 7, characterized in that the residual connection uses the formula:
Output=hλ+Layernorm(FC)。
CN202111006251.7A 2021-08-30 2021-08-30 Single-channel voice separation method for multi-scale time delay sampling Active CN113782045B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111006251.7A CN113782045B (en) 2021-08-30 2021-08-30 Single-channel voice separation method for multi-scale time delay sampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111006251.7A CN113782045B (en) 2021-08-30 2021-08-30 Single-channel voice separation method for multi-scale time delay sampling

Publications (2)

Publication Number Publication Date
CN113782045A true CN113782045A (en) 2021-12-10
CN113782045B CN113782045B (en) 2024-01-05

Family

ID=78840162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111006251.7A Active CN113782045B (en) 2021-08-30 2021-08-30 Single-channel voice separation method for multi-scale time delay sampling

Country Status (1)

Country Link
CN (1) CN113782045B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112289304A (en) * 2019-07-24 2021-01-29 中国科学院声学研究所 Multi-speaker voice synthesis method based on variational self-encoder
CN110459240A (en) * 2019-08-12 2019-11-15 新疆大学 The more speaker's speech separating methods clustered based on convolutional neural networks and depth
CN111243579A (en) * 2020-01-19 2020-06-05 清华大学 Time domain single-channel multi-speaker voice recognition method and system
CN111429938A (en) * 2020-03-06 2020-07-17 江苏大学 Single-channel voice separation method and device and electronic equipment
CN112071325A (en) * 2020-09-04 2020-12-11 中山大学 Many-to-many voice conversion method based on double-voiceprint feature vector and sequence-to-sequence modeling
CN113053407A (en) * 2021-02-06 2021-06-29 南京蕴智科技有限公司 Single-channel voice separation method and system for multiple speakers

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
高利剑 (Gao Lijian); 毛启容 (Mao Qirong): "环境辅助的多任务混合声音事件检测方法" (Environment-assisted multi-task mixed sound event detection method), 计算机科学 (Computer Science), no. 01

Also Published As

Publication number Publication date
CN113782045B (en) 2024-01-05

Similar Documents

Publication Publication Date Title
CN109841226B (en) Single-channel real-time noise reduction method based on convolution recurrent neural network
WO2021043015A1 (en) Speech recognition method and apparatus, and neural network training method and apparatus
CN111429938B (en) Single-channel voice separation method and device and electronic equipment
Chang et al. Temporal modeling using dilated convolution and gating for voice-activity-detection
CN110751044B (en) Urban noise identification method based on deep network migration characteristics and augmented self-coding
CN110751957B (en) Speech enhancement method using stacked multi-scale modules
KR100908121B1 (en) Speech feature vector conversion method and apparatus
CN106782511A (en) Amendment linear depth autoencoder network audio recognition method
CN108847244A (en) Voiceprint recognition method and system based on MFCC and improved BP neural network
CN113053407B (en) Single-channel voice separation method and system for multiple speakers
KR102294638B1 (en) Combined learning method and apparatus using deepening neural network based feature enhancement and modified loss function for speaker recognition robust to noisy environments
CN110111803A (en) Based on the transfer learning sound enhancement method from attention multicore Largest Mean difference
KR101807961B1 (en) Method and apparatus for processing speech signal based on lstm and dnn
CN113470671B (en) Audio-visual voice enhancement method and system fully utilizing vision and voice connection
Shi et al. Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-Parallel Convolutional Modules for End-to-End Monaural Speech Separation.
Zhang et al. Birdsoundsdenoising: Deep visual audio denoising for bird sounds
CN109785852A (en) A kind of method and system enhancing speaker's voice
CN115602152B (en) Voice enhancement method based on multi-stage attention network
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN110176250B (en) Robust acoustic scene recognition method based on local learning
CN111986679A (en) Speaker confirmation method, system and storage medium for responding to complex acoustic environment
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Shi et al. End-to-End Monaural Speech Separation with Multi-Scale Dynamic Weighted Gated Dilated Convolutional Pyramid Network.
CN112183582A (en) Multi-feature fusion underwater target identification method
CN114333773A (en) Industrial scene abnormal sound detection and identification method based on self-encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant