CN113113041A - Voice separation method based on time-frequency cross-domain feature selection - Google Patents
- Publication number
- CN113113041A (application number CN202110471865.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L19/18—Vocoders using multiple modes
- G10L21/0208—Noise filtering
- G10L21/0224—Noise filtering characterised by the method used for estimating noise; Processing in the time domain
- G10L21/0232—Noise filtering characterised by the method used for estimating noise; Processing in the frequency domain
- G10L2021/02087—Noise filtering the noise being separate speech, e.g. cocktail party
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention discloses a voice separation method based on time-frequency cross-domain feature selection, belonging to the field of single-channel speech separation. The method comprises the following steps: encoding the single-channel speech with a one-dimensional convolutional neural network and with a short-time Fourier transform, respectively; fusing the feature maps obtained by the two encodings; computing a mask for each speaker with a separation network from the fused feature map; applying each mask to the fused feature map to obtain a separated feature map for each speaker; and recovering each speaker's speech from that speaker's separated feature map. By performing feature selection during separation, the time-domain and frequency-domain features complement one another, so the characteristic signals of speech can be captured accurately in a multi-speaker noise environment. This addresses the under-utilization of time-domain and time-frequency-domain features in the field, as well as the previously poor separation in non-stationary noise environments.
Description
Technical Field
The invention belongs to the field of single-channel speech separation, and in particular relates to a voice separation method based on time-frequency cross-domain feature selection.
Background
Speech separation technology is a branch of the natural language processing field that addresses the problem of recovering usable speech information in a multi-speaker noise environment. The goal of speech separation is to separate the target speech from background interference.
With the development of deep learning, many new neural-network-based algorithms have emerged; Deep Clustering (DC) and Permutation Invariant Training (PIT) surpassed traditional methods. Building on deep clustering and permutation invariant training, the Deep Attractor Network (DANet) achieved unprecedented success by using an attractor mechanism to estimate a mask for each speech source. Unlike separation networks that take the magnitude spectrum as feature input, the Time-domain Audio Separation Network (TasNet) and the fully convolutional Conv-TasNet take the time-domain signal as network input and are currently among the most prominent models. The core idea of TasNet is to capture features of the time-domain signal with a one-dimensional convolution instead of a fixed transform such as the short-time Fourier transform, which is not optimal for the separation task; an encoder well suited to capturing time-domain features is obtained by training the network end to end.
Building on these algorithms that use time-domain signal features directly, researchers have proposed combined approaches that embed and cluster time-domain and frequency-domain features. In the encoding stage, the features of the two domains (the convolution-extracted time-domain features and the Fourier-transform magnitude spectrum) are computed in parallel and concatenated along the channel dimension. A separation network maps the features to a high-dimensional embedding space, and a mask is then generated for each speech source by an attractor mechanism. In the decoder, the processed speech signals are obtained by applying a transposed convolution and an inverse Fourier transform to the two masked feature maps of the respective domains. Experiments show that concatenating time-domain and frequency-domain features outperforms using time-domain features alone.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a voice separation method based on time-frequency cross-domain feature selection. The method is implemented with a voice separation network based on time-frequency cross-domain feature selection, which mainly comprises a feature encoder, a voice separator and a decoder, the feature encoder being a voice time-frequency-domain cross-domain feature encoder.
The method comprises the following steps:
Step 1, sampling the voices of n speakers with a single recording device to obtain a single-channel speech signal containing the n speakers' voices, and encoding the single-channel speech with a one-dimensional convolutional neural network and with a short-time Fourier transform, respectively, in the voice time-frequency-domain cross-domain feature encoder, obtaining a feature map encoded with the one-dimensional convolutional neural network and a feature map encoded with the short-time Fourier transform, where n is a positive integer greater than 1;
Step 2, performing cross-domain feature fusion of the feature map encoded with the one-dimensional convolutional neural network and the feature map encoded with the short-time Fourier transform, using a time-frequency cross-domain feature selection method, to obtain a cross-domain fused feature map;
Step 3, computing, with the voice separator, a mask for each speaker in the single-channel speech from the cross-domain fused feature map, and applying the mask to the cross-domain fused feature map to obtain a separated feature map for each speaker in the single-channel speech;
Step 4, reconstructing the speech signal from each speaker's separated feature map with a one-dimensional transposed convolutional neural network in the decoder, finally obtaining the speech of each speaker in the single-channel speech.
Further, the specific process of encoding the single-channel speech with the one-dimensional convolutional neural network and the short-time Fourier transform in the voice time-frequency-domain cross-domain feature encoder in step 1 comprises the following steps:
Step 1-1: applying one-dimensional convolutional neural network 1 in the encoder to the single-channel speech to obtain a time-domain feature map, which serves as the feature map encoded with the one-dimensional convolutional neural network;
Step 1-2: applying the short-time Fourier transform in the encoder to the single-channel speech to obtain a magnitude spectrum, which serves as the time-frequency-domain feature map;
Step 1-3: linearly transforming the time-frequency-domain feature map to the same feature dimension as the time-domain feature map with fully connected network 1 in the encoder, giving a transformed time-frequency-domain feature map; then applying a nonlinear transformation to it with one-dimensional convolutional neural network 2 in the encoder, giving the nonlinearly transformed time-frequency-domain feature map, which serves as the feature map encoded with the short-time Fourier transform.
The numbers of input and output channels of one-dimensional convolutional neural network 1 and one-dimensional convolutional neural network 2 differ.
Further, the specific process of cross-domain feature fusion in step 2 comprises the following steps:
Step 2-1: adding corresponding elements of the nonlinearly transformed time-frequency-domain feature map and the time-domain feature map to obtain a summed feature map;
Step 2-2: from the summed feature map, averaging each feature channel along the time dimension by global pooling to obtain a global feature descriptor, the number of channels of the global feature descriptor being the same as that of the summed feature map;
Step 2-3: compressing the global feature descriptor of step 2-2 with fully connected network 2 in the voice time-frequency-domain cross-domain feature encoder, reducing its feature dimension to obtain a compressed feature descriptor;
Step 2-4: expanding the compressed feature descriptor with fully connected network 3 and fully connected network 4 in the encoder, restoring the feature dimension of the global feature descriptor, to obtain a time-domain feature descriptor and a time-frequency-domain feature descriptor, respectively; fully connected networks 3 and 4 have the same number of parameters but different parameter values;
Step 2-5: multiplying the time-domain feature descriptor into the time-domain feature map element-wise, and multiplying the time-frequency-domain feature descriptor into the nonlinearly transformed time-frequency-domain feature map element-wise, completing the cross-domain feature selection and yielding two feature maps after cross-domain feature selection; finally, adding the two feature maps element-wise, completing the cross-domain feature fusion and obtaining the cross-domain fused feature map.
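The selection-and-fusion procedure of steps 2-1 to 2-5 can be sketched as follows. This is an illustrative NumPy sketch only: the channel count C, frame count T, compressed dimension m, random weights, and the softmax form of the two selection vectors are assumptions, not the patent's exact parameters.

```python
# Minimal sketch of cross-domain feature selection and fusion.
import numpy as np

rng = np.random.default_rng(1)
C, T, m = 8, 50, 4
F_conv = rng.standard_normal((C, T))       # time-domain feature map
F_spec = rng.standard_normal((C, T))       # nonlinearly transformed T-F map

U = F_conv + F_spec                        # step 2-1: element-wise sum
g = U.mean(axis=1)                         # step 2-2: global pooling -> (C,)
W = rng.standard_normal((m, C)) * 0.1
z = 1.0 / (1.0 + np.exp(-(W @ g)))         # step 2-3: compress + sigmoid
A = rng.standard_normal((C, m))            # step 2-4: two expansion layers
B = rng.standard_normal((C, m))            #  (same size, different values)
ea, eb = np.exp(A @ z), np.exp(B @ z)
a, b = ea / (ea + eb), eb / (ea + eb)      # per-channel selection weights
H = a[:, None] * F_conv + b[:, None] * F_spec   # step 2-5: select and fuse
```

Because a and b are normalized against each other per channel, each channel of H is a convex combination of the corresponding time-domain and time-frequency-domain channels.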
Further, step 3 specifically comprises the following steps:
Step 3-1: further extracting features from the cross-domain fused feature map with the stacked convolutional neural network in the voice separator to obtain a transformed cross-domain fused feature map;
Step 3-2: raising the dimension of the transformed cross-domain fused feature map with fully connected network 5 in the voice separator, transforming it into a tensor with three dimensions: time, feature and embedding;
Step 3-3: computing the mask of each speaker in the single-channel speech with an attractor mechanism, based on the tensor with time, feature and embedding dimensions obtained by the transformation;
Step 3-4: multiplying the mask of each speaker element-wise with the cross-domain fused feature map obtained in step 2-5 to obtain the separated feature map of each speaker in the single-channel speech.
Further, the speech signal reconstruction in step 4 specifically comprises: converting each speaker's separated feature map into the corresponding speaker's speech signal with the one-dimensional transposed convolutional neural network in the decoder, thereby completing the voice separation.
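Steps 3-4 and 4 together amount to masking the fused map and mapping each masked map back to a waveform. The sketch below illustrates this with NumPy; the random "masks" (normalized so the speakers' masks sum to one per element, an assumption of this sketch) and the random decoder kernels stand in for quantities the network would learn.

```python
# Sketch: per-speaker masking of the fused feature map, then waveform
# reconstruction by a transposed 1-D convolution (overlap-add).
import numpy as np

rng = np.random.default_rng(2)
n, C, T, K, stride = 2, 16, 40, 8, 4
H = rng.standard_normal((C, T))                      # fused feature map
logits = rng.standard_normal((n, C, T))
M = np.exp(logits) / np.exp(logits).sum(axis=0)      # masks sum to 1 per element

def transposed_conv1d(S, kernels, stride):
    """Decoder: each frame projects back to K samples, overlap-added."""
    _, T = S.shape
    out = np.zeros(stride * (T - 1) + kernels.shape[1])
    for t in range(T):
        out[t*stride : t*stride + kernels.shape[1]] += S[:, t] @ kernels
    return out

kernels = rng.standard_normal((C, K))                # assumed decoder basis
speech = [transposed_conv1d(M[i] * H, kernels, stride) for i in range(n)]
```

Each element of `speech` is one speaker's reconstructed waveform; its length follows the usual transposed-convolution formula stride*(T-1)+K.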
The fully connected networks 1-5 are all different in structure and parameter values.
The invention performs feature selection during voice separation: through cross-domain mixing, the time-domain and frequency-domain features complement each other, so the characteristic signals of speech can be captured accurately in a multi-speaker noise environment. This overcomes the under-utilization of time-domain and time-frequency-domain features in the field and provides a novel feature encoder that effectively extracts the useful information in speech. High-dimensional features of the speech signal are extracted by the stacked convolutional neural network, speaker-specific signal clusters are built by deep clustering with the attractor sub-network, and the signals of non-corresponding speakers are filtered out adaptively, improving the robustness of the model and overcoming the previously poor separation in non-stationary noise environments.
Drawings
Fig. 1 is a schematic diagram of an overall structure of a voice separation network based on time-frequency cross-domain feature selection.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings.
The invention provides a voice separation method based on time-frequency cross-domain feature selection, mainly directed at the voice separation problem in a multi-speaker environment, which can effectively mix time-domain and time-frequency-domain features.
As shown in fig. 1, the present invention is implemented by using a voice separation network based on time-frequency cross-domain feature selection, and the voice separation network based on time-frequency cross-domain feature selection mainly includes three parts, namely, a feature encoder, a voice separator, and a decoder.
The feature encoder is a voice time-frequency-domain cross-domain feature encoder. It extracts a time-domain feature map and a time-frequency-domain feature map from the single-channel speech containing multiple speakers' voices, through one-dimensional convolutional neural network 1 and the short-time Fourier transform respectively; the time-domain feature map serves as the feature map encoded with the one-dimensional convolutional neural network. Fully connected network 1 in the encoder linearly transforms the time-frequency-domain feature map to the same feature dimension as the time-domain feature map, giving a transformed time-frequency-domain feature map; one-dimensional convolutional neural network 2 then applies a nonlinear transformation to it, and the resulting nonlinearly transformed time-frequency-domain feature map serves as the feature map encoded with the short-time Fourier transform. Corresponding elements of the nonlinearly transformed time-frequency-domain feature map and the time-domain feature map are added to obtain a summed feature map. The summed feature map passes through global pooling and three fully connected networks (fully connected network 2 for compression, fully connected networks 3 and 4 for restoration) to give a time-domain feature descriptor and a time-frequency-domain feature descriptor. The time-domain feature descriptor is multiplied element-wise into the time-domain feature map, and the time-frequency-domain feature descriptor is multiplied element-wise into the nonlinearly transformed time-frequency-domain feature map, yielding two feature maps after cross-domain feature selection; finally these two feature maps are added element-wise to obtain the cross-domain fused feature map.
Given a mixed speech signal x containing n speakers (i.e., a single-channel speech signal containing n speakers' voices), with n a positive integer greater than 1, the time-frequency-domain feature map is obtained from x by the short-time Fourier transform, while the time-domain feature map is obtained from x by the one-dimensional convolution of one-dimensional convolutional neural network 1:
F_spec = S(x),  F_conv = F(x)
where S(·) denotes the short-time Fourier transform, F(·) denotes the one-dimensional convolution, F_spec is the time-frequency-domain feature map obtained with the short-time Fourier transform, and F_conv is the time-domain feature map obtained with the one-dimensional convolutional neural network encoding.
F_spec then undergoes a linear transformation through fully connected network 1, which maps the time-frequency-domain feature map to the same feature dimension as the time-domain feature map, giving the transformed time-frequency-domain feature map. One-dimensional convolutional neural network 2, with convolution kernel size 3, then applies a nonlinear transformation that encodes the transformed map into the same latent representation space as the time-domain features, giving the nonlinearly transformed time-frequency-domain feature map, denoted F̂_spec.
The number of input and output channels of the one-dimensional convolutional neural network 1 is different from that of the one-dimensional convolutional neural network 2.
The nonlinearly transformed time-frequency-domain feature map F̂_spec and the time-domain feature map F_conv are added element-wise to obtain the summed feature map U:
U = F̂_spec ⊕ F_conv
where ⊕ denotes the element-wise addition operation.
Global pooling of the summed feature map yields the global feature descriptor g ∈ R^C, where C is the total number of feature channels (the feature dimension) and T is the length of the time dimension; g is used to compute the time-domain and time-frequency-domain feature descriptors. With U_t denoting the summed feature map at time t:
g = (1/T) Σ_{t=1}^{T} U_t
The global feature descriptor is compressed through fully connected network 2 to obtain the compressed feature descriptor z ∈ R^m, where m is the compressed feature dimension; the compressed feature descriptor guides the feature selection. With N denoting the operation of fully connected network 2, δ a sigmoid activation function, W the weight matrix of fully connected network 2, and g the global feature descriptor:
z = δ(N(Wg))
then, two full-connection layers (a full-connection network 3 and a full-connection network 4) are used for carrying out feature dimension reduction on the compressed feature descriptors to respectively obtain time domain feature descriptorsAnd time-frequency domain feature descriptor
ajA characteristic selection value b representing that the time domain characteristic descriptor a is positioned in the jth channel in the time domain characteristic diagramjRepresenting the time-frequency domain feature descriptor b after the nonlinear transformationTo middleFeature selection values for j channels, j ═ 1,2, …, C:
wherein(where C denotes the total number of eigen-channels, i.e. eigen-dimensions) are the weight matrices for fully-connected network 3 and fully-connected network 4, respectively, AjAnd BjRespectively representing the weights of the two weight matrixes in the jth row, and e represents a natural base number.
The cross-domain fused feature map (i.e., the cross-domain selected feature map) H is computed as:
H = a ⊙ F_conv ⊕ b ⊙ F̂_spec
where ⊙ denotes the element-wise multiplication operation (each descriptor weighting the channels of its feature map) and ⊕ denotes element-wise addition.
The voice separator further extracts features from the cross-domain fused feature map with a stacked convolutional neural network to obtain the transformed cross-domain fused feature map. Fully connected network 5 then maps this to the high-dimensional cross-domain fused features (a tensor with the three dimensions time, feature and embedding). Based on these features, an attractor mechanism produces the mask of each speaker in the single-channel speech, and each speaker's mask is multiplied element-wise with the cross-domain fused feature map to obtain that speaker's separated feature map.
Further feature extraction on the cross-domain fused feature map H is performed by a stacked convolutional neural network formed by stacking 8 to 32 one-dimensional convolutional neural networks, giving the transformed cross-domain fused feature map, which fully connected network 5 maps to a high-dimensional space:
V = W_emb T_s(H)
where V ∈ R^{T×C×D}, C is the feature dimension, T is the length of the time dimension, and D is the embedding dimension, with D = 20; W_emb is the weight of fully connected network 5 and T_s(·) is the operation of the stacked convolutional neural network.
Then a mask for each speaker in the single-channel speech is obtained with the attractor mechanism. With M_i denoting the mask of the i-th speaker produced by the voice separator, i = 1, 2, …, n, each speaker's mask is multiplied element-wise with the cross-domain fused feature map H to obtain H_i, the separated feature map of the i-th speaker:
H_i = M_i ⊙ H
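The attractor step can be sketched as follows. This NumPy sketch is illustrative only: the attractors are random stand-ins for the cluster centers that deep clustering would learn, the sizes T, C, n are arbitrary, and using a dot-product similarity followed by a softmax is an assumption of the sketch (D = 20 follows the text).

```python
# Sketch of attractor-based mask estimation: each (time, feature) bin has a
# D-dim embedding; similarity to one attractor per speaker gives soft masks.
import numpy as np

rng = np.random.default_rng(3)
T, C, D, n = 30, 16, 20, 2                 # D = 20 embedding dims, as in the text
V = rng.standard_normal((T * C, D))        # one D-dim embedding per (t, c) bin
attractors = rng.standard_normal((n, D))   # stand-ins for learned attractors

scores = V @ attractors.T                  # (T*C, n) similarity to each attractor
M = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
M = M.T.reshape(n, T, C)                   # per-speaker soft masks over the map
```

At every (t, c) bin the n masks sum to one, so each bin of the fused map is softly assigned across speakers.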
the decoder uses a one-dimensional transposition convolution neural network to reconstruct a voice signal, and converts the separation characteristic diagram of each speaker in the single-channel voice into the voice signal of the corresponding speaker, thereby completing voice separation.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (6)
1. A voice separation method based on time-frequency cross-domain feature selection, characterized in that the method is implemented with a voice separation network based on time-frequency cross-domain feature selection, the network mainly comprising a feature encoder, a voice separator and a decoder, the feature encoder being a voice time-frequency-domain cross-domain feature encoder; the method comprises the following steps:
Step 1: sampling the voices of n speakers with a single recording device to obtain a single-channel speech signal containing the n speakers' voices, and encoding the single-channel speech with a one-dimensional convolutional neural network and with a short-time Fourier transform, respectively, in the voice time-frequency-domain cross-domain feature encoder, obtaining a feature map encoded with the one-dimensional convolutional neural network and a feature map encoded with the short-time Fourier transform, where n is a positive integer greater than 1;
Step 2: performing cross-domain feature fusion of the feature map encoded with the one-dimensional convolutional neural network and the feature map encoded with the short-time Fourier transform, using a time-frequency cross-domain feature selection method, to obtain a cross-domain fused feature map;
Step 3: computing, with the voice separator, a mask for each speaker in the single-channel speech from the cross-domain fused feature map, and applying the mask to the cross-domain fused feature map to obtain a separated feature map for each speaker in the single-channel speech;
Step 4: reconstructing the speech signal from each speaker's separated feature map with a one-dimensional transposed convolutional neural network in the decoder, finally obtaining the speech of each speaker in the single-channel speech.
2. The voice separation method based on time-frequency cross-domain feature selection according to claim 1, characterized in that the process of separately encoding the mono voice with the one-dimensional convolutional neural network and the short-time Fourier transform in step 1 comprises the following steps:
step 1-1: applying one-dimensional convolutional neural network 1 in the voice time-frequency cross-domain feature encoder to the mono voice to obtain a time-domain feature map, which serves as the feature map encoded by the one-dimensional convolutional neural network;
step 1-2: applying the short-time Fourier transform in the voice time-frequency cross-domain feature encoder to the mono voice to obtain a magnitude spectrum, which serves as a time-frequency-domain feature map;
step 1-3: linearly transforming the time-frequency-domain feature map to the same feature dimension as the time-domain feature map with fully-connected network 1 in the voice time-frequency cross-domain feature encoder to obtain a transformed time-frequency-domain feature map; applying a nonlinear transformation to the transformed time-frequency-domain feature map with one-dimensional convolutional neural network 2 in the encoder to obtain a nonlinearly transformed time-frequency-domain feature map, which serves as the feature map encoded by the short-time Fourier transform;
the numbers of input and output channels of one-dimensional convolutional neural network 1 differ from those of one-dimensional convolutional neural network 2.
3. The voice separation method based on time-frequency cross-domain feature selection according to claim 2, characterized in that the cross-domain feature fusion in step 2 comprises the following steps:
step 2-1: adding the nonlinearly transformed time-frequency-domain feature map and the time-domain feature map element-wise to obtain a sum feature map;
step 2-2: applying global average pooling to the sum feature map along the time dimension of each feature channel to obtain a global feature descriptor, the number of channels of the global feature descriptor being the same as that of the sum feature map;
step 2-3: compressing the global feature descriptor obtained in step 2-2 with fully-connected network 2 in the voice time-frequency cross-domain feature encoder, reducing its feature dimension to obtain a compressed feature descriptor;
step 2-4: expanding the compressed feature descriptor separately with fully-connected network 3 and fully-connected network 4 in the voice time-frequency cross-domain feature encoder, restoring the feature dimension of the global feature descriptor, to obtain a time-domain feature descriptor and a time-frequency-domain feature descriptor respectively; fully-connected networks 3 and 4 have the same number of parameters but different parameter values;
step 2-5: multiplying the time-domain feature descriptor onto the time-domain feature map element-wise, and multiplying the time-frequency-domain feature descriptor onto the nonlinearly transformed time-frequency-domain feature map element-wise, completing the cross-domain feature selection and yielding two feature maps after cross-domain feature selection; finally adding the two feature maps element-wise, completing the cross-domain feature fusion and yielding the cross-domain fused feature map.
4. The voice separation method based on time-frequency cross-domain feature selection according to claim 3, characterized in that step 3 specifically comprises the following steps:
step 3-1: further extracting features from the cross-domain fused feature map with a stacked convolutional neural network in the voice separator to obtain a transformed cross-domain fused feature map;
step 3-2: raising the dimensionality of the transformed cross-domain fused feature map with fully-connected network 5 in the voice separator, transforming it into a tensor with the three dimensions of time, feature and embedding;
step 3-3: computing the mask of each speaker in the mono voice with an attractor mechanism based on the tensor with the three dimensions of time, feature and embedding obtained by the transformation;
step 3-4: multiplying the mask of each speaker element-wise with the cross-domain fused feature map obtained in step 2-5 to obtain the separated feature map of each speaker in the mono voice.
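The attractor mechanism of step 3-3 (in the spirit of deep attractor networks) can be sketched as follows. All sizes are hypothetical, the embedding tensor is random, and the speaker-dominance assignment is a random stand-in for the ideal labels available during training; at inference an attractor network would estimate the attractors instead:

```python
import numpy as np

rng = np.random.default_rng(3)
T, C, E, n = 50, 32, 20, 2               # time, feature, embedding dims, speakers

fused = np.abs(rng.standard_normal((T, C)))    # cross-domain fused map (step 2-5)
emb = rng.standard_normal((T, C, E))           # step 3-2 tensor: (time, feature, embedding)

# Step 3-3: each speaker's attractor is the mean embedding of the units that
# speaker dominates; dominance is a random assignment here, for illustration.
assign = rng.integers(0, n, size=(T, C))       # which speaker "owns" each unit
attractors = np.stack([emb[assign == s].mean(axis=0) for s in range(n)])  # (n, E)

# Masks: softmax over speakers of embedding-attractor similarity.
sim = np.einsum('tce,se->stc', emb, attractors)    # (n, T, C)
masks = np.exp(sim) / np.exp(sim).sum(axis=0)

# Step 3-4: apply each speaker's mask to the fused map element-wise.
separated = masks * fused                           # (n, T, C)

assert np.allclose(masks.sum(axis=0), 1.0)
```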
5. The voice separation method based on time-frequency cross-domain feature selection according to claim 4, characterized in that the voice signal reconstruction in step 4 specifically comprises:
converting the separated feature map of each speaker in the mono voice into the voice signal of the corresponding speaker with the one-dimensional transposed convolutional neural network in the decoder, thereby completing the voice separation.
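The decoder of claim 5 can be sketched as explicit overlap-add, which is the operation a one-dimensional transposed convolution performs; kernel length, hop and channel count are hypothetical, and a random matrix stands in for the learned decoder kernels:

```python
import numpy as np

rng = np.random.default_rng(4)
T, C, L, hop = 100, 64, 16, 8            # hypothetical sizes

sep = np.abs(rng.standard_normal((T, C)))          # one speaker's separated map
basis = rng.standard_normal((C, L)) / np.sqrt(C)   # stand-in transposed-conv kernels

# Transposed 1-D convolution: project each feature vector back to a length-L
# waveform segment, then overlap-add the segments at the encoder hop size.
segments = sep @ basis                              # (T, L)
y = np.zeros((T - 1) * hop + L)
for t, seg in enumerate(segments):
    y[t * hop : t * hop + L] += seg

assert y.shape[0] == (T - 1) * hop + L              # reconstructed waveform length
```

Running this once per speaker's separated feature map yields one waveform per speaker, completing the separation.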
6. The method according to claim 5, characterized in that fully-connected networks 1 to 5 differ in structure and parameter values.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110471865.6A CN113113041B (en) | 2021-04-29 | 2021-04-29 | Voice separation method based on time-frequency cross-domain feature selection |
NL2029780A NL2029780B1 (en) | 2021-04-29 | 2021-11-17 | Speech separation method based on time-frequency cross-domain feature selection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113113041A true CN113113041A (en) | 2021-07-13 |
CN113113041B CN113113041B (en) | 2022-10-11 |
Family
ID=76720916
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110471865.6A Expired - Fee Related CN113113041B (en) | 2021-04-29 | 2021-04-29 | Voice separation method based on time-frequency cross-domain feature selection |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113113041B (en) |
NL (1) | NL2029780B1 (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070021958A1 (en) * | 2005-07-22 | 2007-01-25 | Erik Visser | Robust separation of speech signals in a noisy environment |
CN107305774A (en) * | 2016-04-22 | 2017-10-31 | 腾讯科技(深圳)有限公司 | Speech detection method and device |
US20190066713A1 (en) * | 2016-06-14 | 2019-02-28 | The Trustees Of Columbia University In The City Of New York | Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments |
US20190139563A1 (en) * | 2017-11-06 | 2019-05-09 | Microsoft Technology Licensing, Llc | Multi-channel speech separation |
CN110459240A (en) * | 2019-08-12 | 2019-11-15 | 新疆大学 | The more speaker's speech separating methods clustered based on convolutional neural networks and depth |
CN110619887A (en) * | 2019-09-25 | 2019-12-27 | 电子科技大学 | Multi-speaker voice separation method based on convolutional neural network |
CN110808061A (en) * | 2019-11-11 | 2020-02-18 | 广州国音智能科技有限公司 | Voice separation method and device, mobile terminal and computer readable storage medium |
CN110970053A (en) * | 2019-12-04 | 2020-04-07 | 西北工业大学深圳研究院 | Multichannel speaker-independent voice separation method based on deep clustering |
CN111883166A (en) * | 2020-07-17 | 2020-11-03 | 北京百度网讯科技有限公司 | Voice signal processing method, device, equipment and storage medium |
CN112242149A (en) * | 2020-12-03 | 2021-01-19 | 北京声智科技有限公司 | Audio data processing method and device, earphone and computer readable storage medium |
CN112259120A (en) * | 2020-10-19 | 2021-01-22 | 成都明杰科技有限公司 | Single-channel human voice and background voice separation method based on convolution cyclic neural network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210012767A1 (en) * | 2020-09-25 | 2021-01-14 | Intel Corporation | Real-time dynamic noise reduction using convolutional networks |
- 2021-04-29 CN CN202110471865.6A patent/CN113113041B/en not_active Expired - Fee Related
- 2021-11-17 NL NL2029780A patent/NL2029780B1/en active
Non-Patent Citations (3)
Title |
---|
LAN T: ""Deep Attractor with Convolutional Network for Monaural Speech Separation"", 《2020 11TH IEEE ANNUAL UBIQUITOUS COMPUTING, ELECTRONICS & MOBILE COMMUNICATION CONFERENCE (UEMCON)》 * |
LI M: ""Multi-layer Attention Mechanism Based Speech Separation Model"", 《2019 IEEE 19TH INTERNATIONAL CONFERENCE ON COMMUNICATION TECHNOLOGY (ICCT)》 * |
蓝天: ""采用上下文相关的注意力机制及循环神经网络的语音增强方法"", 《声学学报》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113539292A (en) * | 2021-07-28 | 2021-10-22 | 联想(北京)有限公司 | Voice separation method and device |
CN113555031A (en) * | 2021-07-30 | 2021-10-26 | 北京达佳互联信息技术有限公司 | Training method and device of voice enhancement model and voice enhancement method and device |
CN113555031B (en) * | 2021-07-30 | 2024-02-23 | 北京达佳互联信息技术有限公司 | Training method and device of voice enhancement model, and voice enhancement method and device |
Also Published As
Publication number | Publication date |
---|---|
NL2029780B1 (en) | 2023-03-14 |
CN113113041B (en) | 2022-10-11 |
NL2029780A (en) | 2022-11-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109671442B (en) | Many-to-many speaker conversion method based on STARGAN and x vectors | |
CN111243620A (en) | Voice separation model training method and device, storage medium and computer equipment | |
CN113113041B (en) | Voice separation method based on time-frequency cross-domain feature selection | |
CN112071330B (en) | Audio data processing method and device and computer readable storage medium | |
CN105872855A (en) | Labeling method and device for video files | |
CN112989107B (en) | Audio classification and separation method and device, electronic equipment and storage medium | |
CN110675891A (en) | Voice separation method and module based on multilayer attention mechanism | |
CN116092501B (en) | Speech enhancement method, speech recognition method, speaker recognition method and speaker recognition system | |
CN112633175A (en) | Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment | |
Shi et al. | End-to-End Monaural Speech Separation with Multi-Scale Dynamic Weighted Gated Dilated Convolutional Pyramid Network. | |
CN115602165A (en) | Digital staff intelligent system based on financial system | |
CN115101085A (en) | Multi-speaker time-domain voice separation method for enhancing external attention through convolution | |
CN112382308A (en) | Zero-order voice conversion system and method based on deep learning and simple acoustic features | |
CN114141237A (en) | Speech recognition method, speech recognition device, computer equipment and storage medium | |
CN106875944A (en) | A kind of system of Voice command home intelligent terminal | |
Routray et al. | Deep-sound field analysis for upscaling ambisonic signals | |
CN113593588A (en) | Multi-singer singing voice synthesis method and system based on generation countermeasure network | |
CN116612779A (en) | Single-channel voice separation method based on deep learning | |
CN116682463A (en) | Multi-mode emotion recognition method and system | |
CN116959468A (en) | Voice enhancement method, system and equipment based on DCCTN network model | |
CN115881156A (en) | Multi-scale-based multi-modal time domain voice separation method | |
CN115910091A (en) | Method and device for separating generated voice by introducing fundamental frequency clues | |
CN113488069A (en) | Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network | |
Xu et al. | Speaker-Aware Monaural Speech Separation. | |
Wu et al. | Stacked sparse autoencoder for audio object coding |
Legal Events
Code | Title | Description
---|---|---
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20221011