CN114495973A - Special person voice separation method based on double-path self-attention mechanism - Google Patents
- Publication number: CN114495973A
- Application number: CN202210088494.8A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L21/0272 — Speech enhancement; voice signal separating
- G10L21/0208 — Speech enhancement; noise filtering
- G10L25/24 — Speech or voice analysis; the extracted parameters being the cepstrum
- G10L25/30 — Speech or voice analysis; characterised by the analysis technique using neural networks
Abstract
The invention discloses a speaker-specific voice separation method based on a dual-path self-attention mechanism, comprising the following steps: acquiring a registration corpus and a mixed corpus; extracting a Mel spectrum from the registration corpus and inputting the extracted Mel spectrum to a pre-trained speaker encoder to obtain identity features; processing the mixed corpus with a pre-trained speech encoder to obtain speech features; fusing the identity features and the speech features to obtain fusion features; processing the fusion features with a pre-trained signal-to-noise ratio (SNR) estimation module to obtain an SNR estimate; and passing the fusion features and the SNR estimate through a pre-trained speech separator and a speech decoder in turn to obtain the clean speech signal of the target speaker. With the invention, the target speaker's voice can be extracted quickly and accurately from a mixed corpus containing noise and multi-speaker interference. The method can be widely applied in the field of speech separation.
Description
Technical Field
The invention relates to the field of voice separation, in particular to a method for separating voices of a specific person based on a double-path self-attention mechanism.
Background
Speaker-specific separation extracts the voice of a target speaker from a mixed corpus containing noise and multi-speaker interference, given a reference utterance of the target speaker. Existing deep-learning approaches fall into two categories: frequency-domain and time-domain. Time-frequency-domain methods neglect the importance of phase information in reconstructing the speech signal, and feeding the network a pre-extracted magnitude spectrum limits, to some extent, what the separation network can learn from the raw speech signal. In time-domain methods, the signal sequence produced by the encoder is often significantly longer than the sequence obtained from a conventional short-time Fourier transform (STFT), which makes the network's modeling and learning more difficult. In addition, for both families of methods, separation quality degrades when training and test conditions are mismatched; in particular, a mismatch between the signal-to-noise ratio levels of test and training data (in the separation task, the ratio of the target speaker's energy to that of the remaining interfering speakers, hereinafter SNR) is a major factor affecting separation performance.
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide a speaker-specific voice separation method based on a dual-path self-attention mechanism that can quickly and accurately extract the target speaker's voice from a mixed corpus containing noise and multi-speaker interference.
The first technical scheme adopted by the invention is as follows: a specific person voice separation method based on a double-path self-attention mechanism comprises the following steps:
acquiring a registered corpus and a mixed corpus;
extracting Mel spectrum from the registered corpus, and inputting the extracted Mel spectrum to a pre-trained speaker encoder to obtain identity characteristics;
processing the mixed corpus based on a pre-trained voice coder to obtain voice characteristics;
fusing the identity characteristic and the voice characteristic to obtain a fused characteristic;
processing the fusion characteristics by a signal-to-noise ratio estimation module based on pre-training to obtain a signal-to-noise ratio estimation value;
and sequentially passing the fusion characteristics and the signal-to-noise ratio estimation value through a pre-trained voice separator and a voice decoder to obtain a clean voice signal of the target speaker.
Further, the step of extracting mel-frequency spectrum from the registered corpus and inputting the mel-frequency spectrum to the pre-trained speaker encoder to obtain the identity characteristics specifically comprises:
sequentially performing framing processing, pre-emphasis processing and windowing processing on the registered corpus to obtain a windowed signal;
performing short-time Fourier transform on the windowed signal to obtain a linear frequency spectrum;
converting the linear frequency spectrum into a Mel nonlinear frequency spectrum to obtain a Mel spectrum;
the speaker coder based on pre-training processes the Mel spectrum to obtain identity characteristics.
Further, the pre-trained speaker encoder comprises a front-end feature extraction network, an encoding layer and a fully connected layer.
Further, the pre-training-based speaker coder processes the mel spectrum to obtain the identity, which specifically comprises:
extracting features learned for the speaker verification task from the Mel spectrum via the front-end feature extraction network to obtain front-end features;
converting the front-end features into coding vectors based on the coding layer;
and processing the coding vector based on the full connection layer to obtain the speaker identity characteristic with fixed dimensionality.
Further, the pre-training-based speech coder processes the mixed corpus to obtain speech features, which specifically includes:
converting the mixed corpus into a voice signal sequence based on a pre-trained voice coder;
and cutting and recombining the voice signal sequence into a three-dimensional feature block to obtain voice features.
Further, the pre-trained SNR estimation module comprises a dilated convolution layer, an LSTM layer and a fully connected layer, and processes the fusion features to obtain an SNR estimate, specifically comprising:
extracting features from the fusion features based on the dilated convolution;
mining timing information between the features based on the LSTM layer;
obtaining a per-frame SNR estimate based on the fully connected layer;
and averaging over the time dimension to obtain the SNR estimate of the speech segment.
Further, the step of sequentially passing the fusion features and the snr estimation values through a pre-trained speech separator and a speech decoder to obtain a clean speech signal of the target speaker specifically includes:
splicing the fusion characteristics and the signal-to-noise ratio estimation value on the characteristic dimension to obtain splicing characteristics;
separating the splicing characteristics by a pre-trained double-path self-attention mechanism voice separator to obtain a separated three-dimensional characteristic module;
and recombining and splicing the separated three-dimensional characteristic modules based on a voice decoder to recover and obtain a clean voice signal of the target speaker.
Further, the training step of the speaker coder specifically comprises:
constructing a first training set;
sampling, framing, pre-emphasis, windowing, Fourier transform and Mel filtering are carried out on data of the first training set to obtain a Mel frequency spectrum for training;
and training the speaker encoder with Cross-Entropy Loss as the loss function, using the training Mel spectra and the ground-truth labels of the first training set, to obtain the pre-trained speaker encoder.
The method has the beneficial effects that: the invention adds a signal-to-noise ratio estimation module in the network, extracts the signal-to-noise ratios of the target voice and the interference voice, and uses the estimated signal-to-noise ratio as one of the inputs of the separation network, so that the separation network pays attention to the signal-to-noise ratio level of the current voice fragment to improve the separation performance of the network under different signal-to-noise ratio scenes.
Drawings
FIG. 1 is a flow chart of the steps of a method for separating speaker-specific speech based on a two-path self-attention mechanism according to the present invention;
FIG. 2 is a block diagram of a method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a Mel-spectrum extraction process according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
Referring to fig. 1 and 2, the invention provides a method for separating speeches of a specific person based on a double-path self-attention mechanism, which comprises the following steps:
s1, acquiring a registered corpus and a mixed corpus;
s2, extracting the Mel spectrum of the registration corpus and inputting the extracted Mel spectrum into a pre-trained speaker encoder to obtain identity features;
specifically, the speaker encoder is used for extracting the coding vector which can represent the identity characteristic of the target speaker from the registered audio of the target speaker.
S2.1, sequentially performing framing processing, pre-emphasis processing and windowing processing on the registered corpus to obtain a windowed signal;
specifically, a speech signal exhibits short-time stationarity and can be regarded as quasi-stationary within 10-30 ms, so it is customarily framed in the time domain; for a smooth transition between frames, adjacent frames usually share an overlapping portion, called the frame shift. Here the sampling rate of the speech signal defaults to 16 kHz, with a 25 ms frame length and a 10 ms frame shift. Owing to the characteristics of the human vocal apparatus, the high-frequency part of a speech signal is suppressed, and high-frequency energy is strongly attenuated during transmission; to compensate for this high-frequency deficit, the high-frequency part of each frame is pre-emphasized. To mitigate the Gibbs effect, the pre-emphasized signal is windowed with a Hamming window.
S2.2, performing short-time Fourier transform on the windowed signal to obtain a linear frequency spectrum;
specifically, the windowed signal is subjected to short-time fourier transform, and the signal is converted from a time domain to a frequency domain to obtain an amplitude spectrum. The number of points of the fourier transform is set to 512.
S2.3, converting the linear frequency spectrum into a Mel nonlinear frequency spectrum to obtain a Mel spectrum;
specifically, the human ear perceives sound frequency logarithmically; the linear spectrogram produced by the short-time Fourier transform is spaced uniformly in frequency, whereas the Mel spectrum conforms to the auditory characteristics of the human ear. The energy spectrum is multiplied by a bank of triangular band-pass filters spaced uniformly on the Mel scale, i.e. the spacing of the filters' center frequencies narrows as the filter index decreases and widens as it increases, and the log energy output by each filter is taken. This converts the linear spectrum into the Mel nonlinear spectrum and yields the required Mel spectrum; the extraction process is shown in FIG. 3.
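As a non-authoritative illustration, the extraction pipeline of steps S2.1-S2.3 (pre-emphasis, framing, Hamming windowing, STFT and Mel filtering) can be sketched in NumPy. The parameter values (16 kHz sampling, 25 ms frames, 10 ms shift, 512-point FFT, 64 Mel filters, 0.97 pre-emphasis) follow the embodiment; the function itself and its padding/edge handling are hypothetical:

```python
import numpy as np

def mel_spectrogram(signal, sr=16000, frame_len_ms=25, hop_ms=10,
                    n_fft=512, n_mels=64, preemph=0.97):
    """Illustrative sketch of the Mel-spectrum pipeline (assumed parameters)."""
    # Pre-emphasis: boost the attenuated high-frequency components
    sig = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    frame_len = sr * frame_len_ms // 1000      # 400 samples at 16 kHz
    hop = sr * hop_ms // 1000                  # 160 samples
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)  # framing + Hamming window
    # Magnitude spectrum via short-time Fourier transform (512 points)
    mag = np.abs(np.fft.rfft(frames, n_fft))   # (n_frames, 257)
    # Triangular filterbank with centres equally spaced on the Mel scale
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(mag ** 2 @ fbank.T + 1e-10)  # log Mel energies
```

One second of 16 kHz audio gives 98 frames of 64 log Mel energies under these settings.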
And S2.4, processing the Mel spectrum based on the pre-trained speaker encoder to obtain the identity characteristics.
S2.4.1, extracting features learned for the speaker verification task from the Mel spectrum via the front-end feature extraction network to obtain front-end features;
in particular, the front-end feature extraction network is used to learn features from the Mel-spectrum of the speech signal that are suitable for the speaker verification task. Where ResNet-18 is used as a module for front-end feature learning.
S2.4.2, converting the front-end characteristics into coding vectors based on the coding layer;
specifically, the coding layer is used for converting the features containing the time sequence relation output by the front-end feature extractor into a time sequence-independent fixed-length coding vector. Meanwhile, dimension conversion between the feature extraction layer and the classifier is achieved, and the overfitting problem of a deep network is reduced.
S2.4.3, processing the coding vector based on the full connection layer to obtain the speaker identity characteristic with fixed dimension.
In addition, a classifier is included for classifying the speaker identity features; each of its output nodes corresponds to one speaker in the training data. This part exists only during training; once training is finished it is removed, and the output of the preceding fully connected layer is fed into the subsequent separation network as the speaker's identity features.
S3, processing the mixed corpus based on the pre-trained speech coder to obtain speech characteristics;
s3.1, converting the mixed corpus into a voice signal sequence based on a pre-training voice coder;
and S3.2, cutting and recombining the voice signal sequence into a three-dimensional feature block to obtain voice features.
Specifically, the speech encoder encodes the mixed speech, emulating an STFT to transform the speech signal domain and obtain a feature sequence of length L with feature dimension H; this is implemented with a one-dimensional convolutional network. Because the convolved speech signal sequence is usually too long, it is cut and recombined into three-dimensional feature blocks: the length-L, dimension-H feature sequence is cut into M short sequences of length N, with adjacent short sequences overlapping by P, yielding a three-dimensional block of dimension N × M × H, where N = 2P.
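A minimal sketch of this cut-and-recombine step, assuming 50% overlap (N = 2P) and zero-padding so the last chunk is full; the function name and the padding policy are illustrative, not the patent's exact implementation:

```python
import numpy as np

def segment(seq, N):
    """Cut a (L, H) feature sequence into M chunks of length N with
    50% overlap P = N // 2, giving an (N, M, H) block (assumes N even)."""
    P = N // 2
    L, H = seq.shape
    # Zero-pad so the last chunk is complete (illustrative policy)
    pad = (-(L - N) % P) if L > N else N - L
    seq = np.concatenate([seq, np.zeros((pad, H))], axis=0)
    starts = np.arange(0, seq.shape[0] - N + 1, P)
    # Stack the M chunks along a new middle axis -> (N, M, H)
    return np.stack([seq[s:s + N] for s in starts], axis=1)
```

For example, a sequence of length L = 10 with N = 4 (P = 2) yields M = 4 overlapping chunks starting at positions 0, 2, 4, 6.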
S4, fusing the identity characteristic and the voice characteristic to obtain a fused characteristic;
specifically, this step fuses the speech features produced by the speech encoder with the identity features produced by the speaker encoder. The speech features and identity features are concatenated along the feature dimension and then passed through a fully connected layer to obtain the fusion features. The fusion features are input both to the speech separator and to the SNR estimation module.
S5, processing the fusion characteristics by a signal-to-noise ratio estimation module based on pre-training to obtain a signal-to-noise ratio estimation value;
s5.1, extracting the characteristics of the fusion characteristics based on the cavity convolution;
s5.2, excavating time sequence information among the features based on the LSTM layer;
s5.3, obtaining a signal-to-noise ratio estimation value of each frame based on the full connection layer;
s5.4, averaging in the time dimension to obtain the signal-to-noise ratio estimation value of the voice segment.
Specifically, the SNR estimation module estimates the SNR of the corpus. It is implemented with three layers of two-dimensional dilated convolution, one LSTM layer and one fully connected layer: the dilated convolutions perform feature extraction, the LSTM mines timing information between features, the fully connected layer produces a per-frame SNR estimate, and finally averaging (smoothing) over the time dimension yields the SNR of the whole segment. In the training phase the module is trained jointly with the rest of the model in a multi-task fashion; in the test phase the module, together with the preceding speaker encoder, speech encoder and feature fusion module, extracts an SNR estimate for the current speech segment from the mixed corpus, which is used in place of the Ground-Truth SNR of the training phase as one of the speech separator's inputs.
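For illustration only, a toy 1-D version of dilated convolution (atrous convolution, sometimes machine-translated as "cavity" or "hole" convolution), the building block the SNR estimation module uses for feature extraction: the taps are spaced `dilation` samples apart, enlarging the receptive field without adding parameters. This is a hypothetical sketch, not the module's actual three-layer 2-D implementation:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Valid-mode 1-D dilated convolution (correlation form for clarity)."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # receptive field of one output sample
    return np.array([
        sum(kernel[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - span + 1)
    ])
```

With a 3-tap kernel and dilation 2, each output already sees a 5-sample window; stacking such layers grows the receptive field quickly, which is why the module can summarize a whole segment's SNR.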
And S6, sequentially passing the fusion characteristics and the signal-to-noise ratio estimation value through a pre-trained voice separator and a voice decoder to obtain a clean voice signal of the target speaker.
S6.1, splicing the fusion characteristics and the signal-to-noise ratio estimation value on the characteristic dimension to obtain splicing characteristics;
s6.2, separating the splicing features by a pre-training-based double-path self-attention mechanism voice separator to obtain a separated three-dimensional feature module;
specifically, the speech separator separates out the three-dimensional features corresponding to the target speaker and is implemented as a stack of B dual-path self-attention modules. The fusion features are concatenated with the SNR of the current speech segment along the feature dimension, i.e. the SNR value is repeated N × M times and then concatenated onto the fusion features. Each dual-path module consists of self-attention along two paths, within each block and between blocks, and all modules share the same parameter settings. The self-attention mechanism lets the network exploit correlations within its own sequence: a query matrix Q, a key matrix K and a value matrix V are generated from the input feature vectors; Q is matrix-multiplied with K to obtain a score matrix R of correlations between each time step and the others; the values of R are normalized to between 0 and 1 by a Softmax function; and R is matrix-multiplied with V to obtain the self-attention output. Self-attention is realized as multi-head attention: h parallel self-attention modules run side by side and their results are concatenated to form the final output. Intra-block self-attention, taken along the time dimension, amounts to computing local attention over each short sequence, while inter-block self-attention amounts to computing global attention over the whole long sequence; combining local and global attention achieves effective modeling of long-sequence information.
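The Q/K/V computation described above can be sketched as follows. This toy single-head and multi-head version adds the conventional 1/√d scaling before the Softmax (the description does not say whether the embodiment scales), and all weight matrices are illustrative placeholders:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One head: R = softmax(Q K^T / sqrt(d)), output = R V.  X is (T, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    R = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # scores normalized to [0, 1]
    return R @ V

def multi_head(X, heads):
    """h parallel heads; their outputs are concatenated as described."""
    return np.concatenate([self_attention(X, *w) for w in heads], axis=-1)
```

In the dual-path arrangement, the same operation would be applied once along the intra-block axis (local attention over each length-N chunk) and once along the inter-block axis (global attention across the M chunks).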
In the training stage, the SNR of the speech segment uses the Ground-Truth SNR, i.e. the SNR value used when generating the training mixtures; in the test stage, the SNR value computed by the SNR estimation module is used.
And S6.3, recombining and splicing the three-dimensional feature modules based on a speech decoder, and recovering the clean speech signal of the target speaker.
Specifically, the speech decoder restores the three-dimensional feature module produced by the speech separator into the clean audio of the target speaker, implemented with a one-dimensional deconvolution network. The three-dimensional module is recombined and spliced by reversing the cutting-and-splicing process of the speech encoder, giving a feature sequence of the same length as the speech encoder's output, which is then fed into the one-dimensional deconvolution network to recover the clean speech signal of the target speaker.
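The decoder's recombine-and-splice step is the inverse of the encoder's segmentation; a hedged NumPy sketch, assuming 50% overlap and simple averaging over overlapped regions (the text does not specify the exact merge rule):

```python
import numpy as np

def overlap_add(chunks, L):
    """Merge an (N, M, H) block back into an (L, H) sequence by
    overlap-add with hop P = N // 2, averaging overlapped samples."""
    N, M, H = chunks.shape
    P = N // 2
    total = P * (M - 1) + N
    out = np.zeros((total, H))
    norm = np.zeros((total, 1))
    for m in range(M):
        out[m * P:m * P + N] += chunks[:, m, :]
        norm[m * P:m * P + N] += 1.0
    return (out / norm)[:L]  # average overlaps, trim any padding
```

Applied to chunks cut from a sequence with 50% overlap, this recovers the original sequence exactly, which is the property the decoder relies on before the final deconvolution.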
As a further preferred embodiment of the method, the training step of the speaker coder specifically includes:
constructing a first training set DatasetA;
sampling, framing, pre-emphasizing, windowing, Fourier transforming and Mel filtering are carried out on data of a first training set DatasetA to obtain a Mel frequency spectrum for training;
specifically, the frame length is set to 25 ms, the frame shift to 10 ms, and the pre-emphasis coefficient to 0.97; the window function is a Hamming window, the number of Fourier transform points is set to 512, and the number of Mel filters is set to 64;
and training the speaker encoder with Cross-Entropy Loss as the loss function, using the training Mel spectra and the ground-truth labels of the first training set, to obtain the pre-trained speaker encoder.
As a preferred embodiment of the method, the training step of the speech separator specifically includes:
constructing a second training set DatasetB;
randomly selecting two speakers A and B from the second training set DatasetB each time, with A as the target speaker: one audio clip is selected from A's corpus as the registration corpus, from which the trained speaker encoder extracts the identity feature vector E, and another audio clip is randomly selected from A's remaining corpus as the target speech signal S1 to be recovered; B acts as the interfering speaker: an audio clip is randomly selected from B's corpus as the interfering speech S2 and mixed with S1 at a signal-to-noise ratio drawn from -5 to 10 dB to serve as training data for separation (all speech segments used for training are cut to 3 s), and the SNR used for mixing is also input to the model;
the speech separator is trained with the SI-SNR (scale-inverse signal-to-noise ratio) as a loss function.
In addition, the training steps of other modules are the same.
SI-SNR and L1 Loss are used as the loss functions of the separation module and the SNR estimation module respectively, and their sum serves as the loss function of the whole model for joint optimization of the two modules. Adam is used as the optimizer with an initial learning rate of 0.001; when the validation loss has not decreased for more than 10 epochs, the learning rate is adjusted. After 50 iterations, training is complete and the model is saved;
a person-specific speech separation system based on a two-path self-attention mechanism, comprising:
the data acquisition module is used for acquiring the registration corpus and the mixed corpus;
the identity feature extraction module is used for performing Mel spectrum extraction on the registration corpus and inputting the extracted Mel spectrum into a pre-trained speaker encoder to obtain identity features;
the voice feature extraction module is used for processing the mixed corpus based on a pre-trained voice coder to obtain voice features;
the fusion module is used for fusing the identity characteristic and the voice characteristic to obtain a fusion characteristic;
the signal-to-noise ratio estimation module is used for processing the fusion features, based on pre-training, to obtain a signal-to-noise ratio estimate;
and the separation module is used for sequentially passing the fusion characteristics and the signal-to-noise ratio estimation value through a pre-trained voice separator and a voice decoder to obtain a clean voice signal of the target speaker.
Further as a preferred embodiment of the present system, the present system further comprises:
and the training module is used for training the speaker encoder, the voice encoder, the signal-to-noise ratio estimation module, the voice separator and the voice decoder.
The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.
A specific person voice separation device based on a double-path self-attention mechanism comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method for speaker-specific speech separation based on a dual-path self-attention mechanism as described above.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
A storage medium having stored therein instructions executable by a processor, the storage medium comprising: the processor-executable instructions, when executed by the processor, are for implementing a two-path self-attention mechanism based method for human-specific speech separation as described above.
The contents of the above method embodiments are all applicable to this storage medium embodiment; the functions specifically implemented by this storage medium embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are likewise the same.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (8)
1. A specific person voice separation method based on a double-path self-attention mechanism is characterized by comprising the following steps:
acquiring a registered corpus and a mixed corpus;
extracting Mel spectrum from the registered corpus, and inputting the extracted Mel spectrum to a pre-trained speaker encoder to obtain identity characteristics;
processing the mixed corpus based on a pre-trained voice coder to obtain voice characteristics;
fusing the identity characteristic and the voice characteristic to obtain a fused characteristic;
processing the fusion characteristics based on a pre-trained signal-to-noise ratio estimation module to obtain a signal-to-noise ratio estimation value;
and sequentially passing the fusion characteristics and the signal-to-noise ratio estimation value through a pre-trained voice separator and a voice decoder to obtain a clean voice signal of the target speaker.
2. The method as claimed in claim 1, wherein the step of performing Mel spectrum extraction on the registered corpus and inputting the extracted Mel spectrum to a pre-trained speaker encoder to obtain the identity characteristics specifically comprises:
sequentially performing framing processing, pre-emphasis processing and windowing processing on the registered corpus to obtain a windowed signal;
performing short-time Fourier transform on the windowed signal to obtain a linear frequency spectrum;
converting the linear frequency spectrum into a Mel nonlinear frequency spectrum to obtain a Mel spectrum;
and processing the Mel spectrum based on the pre-trained speaker encoder to obtain the identity characteristics.
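The extraction chain in claim 2 (framing, pre-emphasis, windowing, short-time Fourier transform, Mel filterbank) can be sketched in NumPy as follows. The frame length, hop, FFT size, filter count, and the 0.97 pre-emphasis coefficient are illustrative defaults, not values fixed by the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrum(x, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=40):
    # Framing: slice the waveform into overlapping frames.
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx]
    # Pre-emphasis within each frame, then Hamming windowing.
    frames = np.concatenate([frames[:, :1], frames[:, 1:] - 0.97 * frames[:, :-1]], axis=1)
    frames = frames * np.hamming(frame_len)
    # Short-time Fourier transform -> linear power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular Mel filterbank converts the linear spectrum to the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return np.log(power @ fbank.T + 1e-10)  # shape (n_frames, n_mels)
```

The log-Mel matrix, one row per frame, is what the pre-trained speaker encoder would consume.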
3. The method of claim 2, wherein the pre-trained speaker encoder comprises a front-end feature extraction network, an encoding layer, and a fully connected layer.
4. The method as claimed in claim 3, wherein the step of processing the Mel spectrum by the pre-trained speaker encoder to obtain the identity characteristics specifically comprises:
extracting, from the Mel spectrum, features learned for the speaker verification task based on the front-end feature extraction network to obtain front-end features;
converting the front-end features into encoding vectors based on the encoding layer;
and processing the encoding vectors based on the fully connected layer to obtain a speaker identity characteristic of fixed dimension.
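The point of claim 4 is that utterances of any length map to an identity vector of one fixed dimension. A minimal NumPy sketch, where mean-pooling over time stands in for the encoding layer and the matrix `W` and bias `b` are a hypothetical fully connected projection (the patent does not specify these internals):

```python
import numpy as np

def speaker_embedding(frame_feats, W, b):
    """frame_feats: (T, D) front-end features for one utterance.
    Mean-pooling over time stands in for the encoding layer; the fully
    connected projection (W, b) yields a fixed-dimension, unit-norm
    identity vector regardless of utterance length T."""
    pooled = frame_feats.mean(axis=0)        # (D,)
    emb = W @ pooled + b                     # (E,) fixed dimension
    return emb / (np.linalg.norm(emb) + 1e-8)

# Utterances of different lengths map to the same fixed dimension.
rng = np.random.default_rng(0)
W, b = rng.standard_normal((128, 40)), np.zeros(128)
e_short = speaker_embedding(rng.standard_normal((50, 40)), W, b)
e_long = speaker_embedding(rng.standard_normal((200, 40)), W, b)
```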
5. The method according to claim 4, wherein the step of processing the mixed corpus based on the pre-trained voice coder to obtain the voice characteristics specifically comprises:
converting the mixed corpus into a voice signal sequence based on a pre-trained voice coder;
and cutting and recombining the voice signal sequence into a three-dimensional feature block to obtain voice features.
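The cutting-and-recombining step in claim 5 matches the standard dual-path segmentation: an (N, T) feature sequence is chunked with overlap and stacked into a three-dimensional block. A sketch with illustrative chunk size and 50% hop (the patent does not fix these values):

```python
import numpy as np

def segment(seq, chunk=100, hop=50):
    """Cut an (N, T) feature sequence into overlapping chunks and stack
    them into a 3-D block of shape (N, chunk, n_chunks)."""
    N, T = seq.shape
    n_chunks = int(np.ceil(max(T - chunk, 0) / hop)) + 1
    pad = (n_chunks - 1) * hop + chunk - T   # zero-pad so chunks tile T exactly
    seq = np.pad(seq, ((0, 0), (0, pad)))
    block = np.stack([seq[:, i * hop:i * hop + chunk] for i in range(n_chunks)],
                     axis=-1)
    return block, pad

block, pad = segment(np.arange(64 * 230, dtype=float).reshape(64, 230))
```

The dual-path separator then alternates attention along the intra-chunk and inter-chunk axes of this block.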
6. The method according to claim 5, wherein the pre-trained signal-to-noise ratio estimation module comprises a dilated convolution layer, an LSTM layer, and a fully connected layer, and the step of processing the fusion characteristics by the pre-trained signal-to-noise ratio estimation module to obtain the signal-to-noise ratio estimation value specifically comprises:
extracting features from the fusion characteristics based on the dilated convolution layer;
mining timing information between features based on the LSTM layer;
obtaining a signal-to-noise ratio estimation value for each frame based on the fully connected layer;
and averaging the values in the time dimension to obtain the estimated value of the signal-to-noise ratio of the voice segment.
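The last two steps of claim 6 (per-frame SNR values, then a time-dimension average per segment) can be illustrated with ground-truth per-frame SNRs computed from aligned clean and noise signals, e.g. as training targets for the estimation module; frame length and hop here are illustrative:

```python
import numpy as np

def frame_snrs_db(clean, noise, frame_len=400, hop=160):
    """Per-frame SNR in dB from aligned clean speech and noise signals."""
    n_frames = 1 + (len(clean) - frame_len) // hop
    snrs = np.empty(n_frames)
    for i in range(n_frames):
        s = clean[i * hop:i * hop + frame_len]
        n = noise[i * hop:i * hop + frame_len]
        snrs[i] = 10.0 * np.log10((np.sum(s ** 2) + 1e-10) /
                                  (np.sum(n ** 2) + 1e-10))
    return snrs

def segment_snr_db(snrs):
    # Average over the time dimension -> one estimate per speech segment.
    return float(np.mean(snrs))
```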
7. The method according to claim 6, wherein the step of passing the fusion features and the snr estimate through a pre-trained speech separator and a speech decoder in sequence to obtain a clean speech signal of the target speaker comprises:
splicing the fusion characteristics and the signal-to-noise ratio estimation value on the characteristic dimension to obtain splicing characteristics;
separating the splicing characteristics by a pre-trained double-path self-attention mechanism voice separator to obtain separated three-dimensional feature blocks;
and recombining and splicing the separated three-dimensional feature blocks based on the voice decoder to recover a clean voice signal of the target speaker.
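The decoder's recombining-and-splicing step in claim 7 can be sketched as an overlap-add that inverts the chunking, averaging the overlapping regions and trimming any padding added when the sequence was cut; all dimensions here are illustrative:

```python
import numpy as np

def overlap_add(block, pad, hop=50):
    """Recombine a (N, chunk, n_chunks) block into an (N, T) sequence by
    overlap-add, averaging overlapping regions, then trim `pad` samples
    of zero-padding introduced by the chunking step."""
    N, chunk, n_chunks = block.shape
    T = (n_chunks - 1) * hop + chunk
    out = np.zeros((N, T))
    count = np.zeros(T)
    for i in range(n_chunks):
        out[:, i * hop:i * hop + chunk] += block[:, :, i]
        count[i * hop:i * hop + chunk] += 1.0
    out /= count  # average where chunks overlap
    return out[:, :T - pad] if pad else out
```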
8. The method according to claim 7, wherein the step of training the speaker encoder comprises:
constructing a first training set;
performing sampling, framing, pre-emphasis, windowing, Fourier transform, and Mel filtering on the data of the first training set to obtain Mel spectra for training;
and training the speaker encoder with Cross Entropy Loss as the loss function according to the training Mel spectra and the real labels in the first training set to obtain the pre-trained speaker encoder.
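Taking the loss of claim 8 to be standard softmax cross entropy (the usual choice for training a speaker encoder as a speaker classifier), a minimal NumPy sketch of the loss itself:

```python
import numpy as np

def cross_entropy_loss(logits, labels):
    """Softmax cross entropy over speaker classes for a batch of
    logits (B, C) and integer speaker labels (B,)."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))
```

For uniform logits over C classes the loss equals log C, the familiar chance-level baseline.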
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210088494.8A CN114495973A (en) | 2022-01-25 | 2022-01-25 | Special person voice separation method based on double-path self-attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210088494.8A CN114495973A (en) | 2022-01-25 | 2022-01-25 | Special person voice separation method based on double-path self-attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114495973A true CN114495973A (en) | 2022-05-13 |
Family
ID=81474440
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210088494.8A Pending CN114495973A (en) | 2022-01-25 | 2022-01-25 | Special person voice separation method based on double-path self-attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114495973A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115116448A (en) * | 2022-08-29 | 2022-09-27 | 四川启睿克科技有限公司 | Voice extraction method, neural network model training method, device and storage medium |
WO2024082928A1 (en) * | 2022-10-21 | 2024-04-25 | 腾讯科技(深圳)有限公司 | Voice processing method and apparatus, and device and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108182949A (en) | A kind of highway anomalous audio event category method based on depth conversion feature | |
CN108899051B (en) | Speech emotion recognition model and recognition method based on joint feature representation | |
CN108447495B (en) | Deep learning voice enhancement method based on comprehensive feature set | |
Xu et al. | Time-domain speaker extraction network | |
CN105023580B (en) | Unsupervised noise estimation based on separable depth automatic coding and sound enhancement method | |
CN114495973A (en) | Special person voice separation method based on double-path self-attention mechanism | |
CN111292762A (en) | Single-channel voice separation method based on deep learning | |
CN109256118B (en) | End-to-end Chinese dialect identification system and method based on generative auditory model | |
CN108922513A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
KR101807961B1 (en) | Method and apparatus for processing speech signal based on lstm and dnn | |
CN111785285A (en) | Voiceprint recognition method for home multi-feature parameter fusion | |
CN105895082A (en) | Acoustic model training method and device as well as speech recognition method and device | |
Strauss et al. | A flow-based neural network for time domain speech enhancement | |
CN111128211B (en) | Voice separation method and device | |
CN112259080B (en) | Speech recognition method based on neural network model | |
CN104217730B (en) | A kind of artificial speech bandwidth expanding method and device based on K SVD | |
CN108806725A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
Li et al. | Sams-net: A sliced attention-based neural network for music source separation | |
CN109036470A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
CN111798875A (en) | VAD implementation method based on three-value quantization compression | |
Lim et al. | Harmonic and percussive source separation using a convolutional auto encoder | |
CN114613387A (en) | Voice separation method and device, electronic equipment and storage medium | |
CN113571095B (en) | Speech emotion recognition method and system based on nested deep neural network | |
CN104240717A (en) | Voice enhancement method based on combination of sparse code and ideal binary system mask | |
CN114360571A (en) | Reference-based speech enhancement method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||