CN114495973A - Special person voice separation method based on double-path self-attention mechanism - Google Patents
- Publication number: CN114495973A
- Application number: CN202210088494.8A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L21/0272 — Speech enhancement; voice signal separating
- G10L21/0208 — Speech enhancement; noise filtering
- G10L25/24 — Speech or voice analysis; the extracted parameters being the cepstrum
- G10L25/30 — Speech or voice analysis; characterised by the analysis technique using neural networks
Abstract
The invention discloses a speaker-specific voice separation method based on a dual-path self-attention mechanism, comprising the following steps: acquiring a registration corpus and a mixed corpus; extracting a Mel spectrum from the registration corpus and inputting the extracted Mel spectrum to a pre-trained speaker encoder to obtain identity features; processing the mixed corpus with a pre-trained speech encoder to obtain speech features; fusing the identity features and the speech features to obtain fusion features; processing the fusion features with a pre-trained signal-to-noise ratio (SNR) estimation module to obtain an SNR estimate; and passing the fusion features and the SNR estimate through a pre-trained speech separator and a speech decoder in turn to obtain the clean speech signal of the target speaker. With the invention, the target speaker's voice can be extracted quickly and accurately from a mixed corpus containing noise and multi-speaker interference. The method can be widely applied in the field of speech separation.
Description
Technical Field
The invention relates to the field of voice separation, in particular to a method for separating voices of a specific person based on a double-path self-attention mechanism.
Background
Speaker-specific separation extracts the voice of a target speaker from a mixed corpus containing noise and multi-speaker interference, given a reference utterance of the target speaker. Existing deep-learning approaches fall into two categories: frequency-domain and time-domain. Time-frequency-domain methods neglect the importance of phase information in reconstructing the speech signal, and feeding the network a pre-extracted magnitude spectrum limits, to some extent, what the separation network can learn from the raw speech signal. In time-domain methods, the signal sequence produced by the encoder is often significantly longer than the sequence obtained from a conventional short-time Fourier transform (STFT), which makes the network's modeling and learning more difficult. In addition, for both families of methods, separation quality degrades when training and test conditions are mismatched; in particular, a mismatch between the signal-to-noise ratio levels of test and training data (in the separation task, the ratio of the target speaker's energy to that of the remaining interfering speakers, hereinafter SNR) is a major factor affecting separation performance.
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide a speaker-specific voice separation method based on a dual-path self-attention mechanism that can quickly and accurately extract the target speaker's voice from a mixed corpus containing noise and multi-speaker interference.
The first technical scheme adopted by the invention is as follows: a specific person voice separation method based on a double-path self-attention mechanism comprises the following steps:
acquiring a registered corpus and a mixed corpus;
extracting Mel spectrum from the registered corpus, and inputting the extracted Mel spectrum to a pre-trained speaker encoder to obtain identity characteristics;
processing the mixed corpus based on a pre-trained voice coder to obtain voice characteristics;
fusing the identity characteristic and the voice characteristic to obtain a fused characteristic;
processing the fusion characteristics by a signal-to-noise ratio estimation module based on pre-training to obtain a signal-to-noise ratio estimation value;
and sequentially passing the fusion characteristics and the signal-to-noise ratio estimation value through a pre-trained voice separator and a voice decoder to obtain a clean voice signal of the target speaker.
Further, the step of extracting mel-frequency spectrum from the registered corpus and inputting the mel-frequency spectrum to the pre-trained speaker encoder to obtain the identity characteristics specifically comprises:
sequentially performing framing processing, pre-emphasis processing and windowing processing on the registered corpus to obtain a windowed signal;
performing short-time Fourier transform on the windowed signal to obtain a linear frequency spectrum;
converting the linear frequency spectrum into a Mel nonlinear frequency spectrum to obtain a Mel spectrum;
the speaker coder based on pre-training processes the Mel spectrum to obtain identity characteristics.
Further, the pre-trained speaker encoder comprises a front-end feature extraction network, an encoding layer and a fully connected layer.
Further, the pre-training-based speaker coder processes the mel spectrum to obtain the identity, which specifically comprises:
extracting features learned for the speaker verification task from the Mel spectrum via the front-end feature extraction network to obtain front-end features;
converting the front-end features into coding vectors based on the coding layer;
and processing the coding vector based on the full connection layer to obtain the speaker identity characteristic with fixed dimensionality.
Further, the pre-training-based speech coder processes the mixed corpus to obtain speech features, which specifically includes:
converting the mixed corpus into a voice signal sequence based on a pre-trained voice coder;
and cutting and recombining the voice signal sequence into a three-dimensional feature block to obtain voice features.
Further, the pre-trained SNR estimation module comprises a dilated convolution layer, an LSTM layer and a fully connected layer, and processes the fusion features to obtain an SNR estimate, specifically comprising:
extracting features from the fusion features based on the dilated convolution;
mining timing information between the features based on the LSTM layer;
obtaining a per-frame SNR estimate based on the fully connected layer;
and averaging over the time dimension to obtain the SNR estimate of the speech segment.
Further, the step of sequentially passing the fusion features and the snr estimation values through a pre-trained speech separator and a speech decoder to obtain a clean speech signal of the target speaker specifically includes:
splicing the fusion characteristics and the signal-to-noise ratio estimation value on the characteristic dimension to obtain splicing characteristics;
separating the splicing characteristics by a pre-trained double-path self-attention mechanism voice separator to obtain a separated three-dimensional characteristic module;
and recombining and splicing the separated three-dimensional characteristic modules based on a voice decoder to recover and obtain a clean voice signal of the target speaker.
Further, the training step of the speaker coder specifically comprises:
constructing a first training set;
sampling, framing, pre-emphasis, windowing, Fourier transform and Mel filtering are carried out on data of the first training set to obtain a Mel frequency spectrum for training;
and training the speaker encoder with Cross-Entropy Loss as the loss function, using the training Mel spectra and the ground-truth labels of the first training set, to obtain the pre-trained speaker encoder.
The method has the beneficial effects that: the invention adds a signal-to-noise ratio estimation module in the network, extracts the signal-to-noise ratios of the target voice and the interference voice, and uses the estimated signal-to-noise ratio as one of the inputs of the separation network, so that the separation network pays attention to the signal-to-noise ratio level of the current voice fragment to improve the separation performance of the network under different signal-to-noise ratio scenes.
Drawings
FIG. 1 is a flow chart of the steps of a method for separating speaker-specific speech based on a two-path self-attention mechanism according to the present invention;
FIG. 2 is a block diagram of a method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a Mel-spectrum extraction process according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
Referring to fig. 1 and 2, the invention provides a method for separating speeches of a specific person based on a double-path self-attention mechanism, which comprises the following steps:
s1, acquiring a registered corpus and a mixed corpus;
s2, extracting the Mel spectrum of the registration corpus and inputting the extracted Mel spectrum into a pre-trained speaker encoder to obtain identity features;
specifically, the speaker encoder is used for extracting the coding vector which can represent the identity characteristic of the target speaker from the registered audio of the target speaker.
S2.1, sequentially performing framing processing, pre-emphasis processing and windowing processing on the registered corpus to obtain a windowed signal;
specifically, a speech signal exhibits short-time stationarity and can be regarded as quasi-stationary within 10-30 ms, so it is customarily framed in the time domain; for a smooth transition between frames, adjacent frames usually share an overlapping portion, called the frame shift. Here the sampling rate of the speech signal defaults to 16 kHz, with a 25 ms frame length and a 10 ms frame shift. Owing to the characteristics of the human vocal apparatus, the high-frequency part of a speech signal is suppressed, and high-frequency energy is strongly attenuated during transmission; to compensate for this high-frequency deficit, the high-frequency part of each frame is pre-emphasized. To mitigate the Gibbs effect, the pre-emphasized signal is windowed with a Hamming window.
S2.2, performing short-time Fourier transform on the windowed signal to obtain a linear frequency spectrum;
specifically, the windowed signal is subjected to short-time fourier transform, and the signal is converted from a time domain to a frequency domain to obtain an amplitude spectrum. The number of points of the fourier transform is set to 512.
S2.3, converting the linear frequency spectrum into a Mel nonlinear frequency spectrum to obtain a Mel spectrum;
specifically, the human ear perceives sound frequency logarithmically; the linear spectrogram produced by the short-time Fourier transform is spaced uniformly in frequency, whereas the Mel spectrum conforms to the auditory characteristics of the human ear. The energy spectrum is multiplied by a bank of triangular band-pass filters spaced uniformly on the Mel scale, i.e. the spacing of the filters' center frequencies narrows as the filter index decreases and widens as it increases, and the log energy output by each filter is taken. This converts the linear spectrum into the Mel nonlinear spectrum and yields the required Mel spectrum; the extraction process is shown in FIG. 3.
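As a non-authoritative illustration, the extraction pipeline of steps S2.1-S2.3 (pre-emphasis, framing, Hamming windowing, STFT and Mel filtering) can be sketched in NumPy. The parameter values (16 kHz sampling, 25 ms frames, 10 ms shift, 512-point FFT, 64 Mel filters, 0.97 pre-emphasis) follow the embodiment; the function itself and its padding/edge handling are hypothetical:

```python
import numpy as np

def mel_spectrogram(signal, sr=16000, frame_len_ms=25, hop_ms=10,
                    n_fft=512, n_mels=64, preemph=0.97):
    """Illustrative sketch of the Mel-spectrum pipeline (assumed parameters)."""
    # Pre-emphasis: boost the attenuated high-frequency components
    sig = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    frame_len = sr * frame_len_ms // 1000      # 400 samples at 16 kHz
    hop = sr * hop_ms // 1000                  # 160 samples
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)  # framing + Hamming window
    # Magnitude spectrum via short-time Fourier transform (512 points)
    mag = np.abs(np.fft.rfft(frames, n_fft))   # (n_frames, 257)
    # Triangular filterbank with centres equally spaced on the Mel scale
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(mag ** 2 @ fbank.T + 1e-10)  # log Mel energies
```

One second of 16 kHz audio gives 98 frames of 64 log Mel energies under these settings.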
And S2.4, processing the Mel spectrum based on the pre-trained speaker encoder to obtain the identity characteristics.
S2.4.1, extracting features learned for the speaker verification task from the Mel spectrum via the front-end feature extraction network to obtain front-end features;
in particular, the front-end feature extraction network is used to learn features from the Mel-spectrum of the speech signal that are suitable for the speaker verification task. Where ResNet-18 is used as a module for front-end feature learning.
S2.4.2, converting the front-end characteristics into coding vectors based on the coding layer;
specifically, the coding layer is used for converting the features containing the time sequence relation output by the front-end feature extractor into a time sequence-independent fixed-length coding vector. Meanwhile, dimension conversion between the feature extraction layer and the classifier is achieved, and the overfitting problem of a deep network is reduced.
S2.4.3, processing the coding vector based on the full connection layer to obtain the speaker identity characteristic with fixed dimension.
In addition, a classifier is included for classifying the speaker identity features; each of its output nodes corresponds to one speaker in the training data. This part exists only during training; once training is finished it is removed, and the output of the preceding fully connected layer is fed into the subsequent separation network as the speaker's identity features.
S3, processing the mixed corpus based on the pre-trained speech coder to obtain speech characteristics;
s3.1, converting the mixed corpus into a voice signal sequence based on a pre-training voice coder;
and S3.2, cutting and recombining the voice signal sequence into a three-dimensional feature block to obtain voice features.
Specifically, the speech encoder encodes the mixed speech, emulating an STFT to transform the speech signal domain and obtain a feature sequence of length L with feature dimension H; this is implemented with a one-dimensional convolutional network. Because the convolved speech signal sequence is usually too long, it is cut and recombined into three-dimensional feature blocks: the length-L, dimension-H feature sequence is cut into M short sequences of length N, with adjacent short sequences overlapping by P, yielding a three-dimensional block of dimension N × M × H, where N = 2P.
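A minimal sketch of this cut-and-recombine step, assuming 50% overlap (N = 2P) and zero-padding so the last chunk is full; the function name and the padding policy are illustrative, not the patent's exact implementation:

```python
import numpy as np

def segment(seq, N):
    """Cut a (L, H) feature sequence into M chunks of length N with
    50% overlap P = N // 2, giving an (N, M, H) block (assumes N even)."""
    P = N // 2
    L, H = seq.shape
    # Zero-pad so the last chunk is complete (illustrative policy)
    pad = (-(L - N) % P) if L > N else N - L
    seq = np.concatenate([seq, np.zeros((pad, H))], axis=0)
    starts = np.arange(0, seq.shape[0] - N + 1, P)
    # Stack the M chunks along a new middle axis -> (N, M, H)
    return np.stack([seq[s:s + N] for s in starts], axis=1)
```

For example, a sequence of length L = 10 with N = 4 (P = 2) yields M = 4 overlapping chunks starting at positions 0, 2, 4, 6.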
S4, fusing the identity characteristic and the voice characteristic to obtain a fused characteristic;
specifically, this step fuses the speech features produced by the speech encoder with the identity features produced by the speaker encoder. The speech features and identity features are concatenated along the feature dimension and then passed through a fully connected layer to obtain the fusion features. The fusion features are input both to the speech separator and to the SNR estimation module.
S5, processing the fusion characteristics by a signal-to-noise ratio estimation module based on pre-training to obtain a signal-to-noise ratio estimation value;
s5.1, extracting the characteristics of the fusion characteristics based on the cavity convolution;
s5.2, excavating time sequence information among the features based on the LSTM layer;
s5.3, obtaining a signal-to-noise ratio estimation value of each frame based on the full connection layer;
s5.4, averaging in the time dimension to obtain the signal-to-noise ratio estimation value of the voice segment.
Specifically, the SNR estimation module estimates the SNR of the corpus. It is implemented with three layers of two-dimensional dilated convolution, one LSTM layer and one fully connected layer: the dilated convolutions perform feature extraction, the LSTM mines timing information between features, the fully connected layer produces a per-frame SNR estimate, and finally averaging (smoothing) over the time dimension yields the SNR of the whole segment. In the training phase the module is trained jointly with the rest of the model in a multi-task fashion; in the test phase the module, together with the preceding speaker encoder, speech encoder and feature fusion module, extracts an SNR estimate for the current speech segment from the mixed corpus, which is used in place of the Ground-Truth SNR of the training phase as one of the speech separator's inputs.
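For illustration only, a toy 1-D version of dilated convolution (atrous convolution, sometimes machine-translated as "cavity" or "hole" convolution), the building block the SNR estimation module uses for feature extraction: the taps are spaced `dilation` samples apart, enlarging the receptive field without adding parameters. This is a hypothetical sketch, not the module's actual three-layer 2-D implementation:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Valid-mode 1-D dilated convolution (correlation form for clarity)."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # receptive field of one output sample
    return np.array([
        sum(kernel[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - span + 1)
    ])
```

With a 3-tap kernel and dilation 2, each output already sees a 5-sample window; stacking such layers grows the receptive field quickly, which is why the module can summarize a whole segment's SNR.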
And S6, sequentially passing the fusion characteristics and the signal-to-noise ratio estimation value through a pre-trained voice separator and a voice decoder to obtain a clean voice signal of the target speaker.
S6.1, splicing the fusion characteristics and the signal-to-noise ratio estimation value on the characteristic dimension to obtain splicing characteristics;
s6.2, separating the splicing features by a pre-training-based double-path self-attention mechanism voice separator to obtain a separated three-dimensional feature module;
specifically, the speech separator separates out the three-dimensional features corresponding to the target speaker and is implemented as a stack of B dual-path self-attention modules. The fusion features are concatenated with the SNR of the current speech segment along the feature dimension, i.e. the SNR value is repeated N × M times and then concatenated onto the fusion features. Each dual-path module consists of self-attention along two paths, within each block and between blocks, and all modules share the same parameter settings. The self-attention mechanism lets the network exploit correlations within its own sequence: a query matrix Q, a key matrix K and a value matrix V are generated from the input feature vectors; Q is matrix-multiplied with K to obtain a score matrix R of correlations between each time step and the others; the values of R are normalized to between 0 and 1 by a Softmax function; and R is matrix-multiplied with V to obtain the self-attention output. Self-attention is realized as multi-head attention: h parallel self-attention modules run side by side and their results are concatenated to form the final output. Intra-block self-attention, taken along the time dimension, amounts to computing local attention over each short sequence, while inter-block self-attention amounts to computing global attention over the whole long sequence; combining local and global attention achieves effective modeling of long-sequence information.
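The Q/K/V computation described above can be sketched as follows. This toy single-head and multi-head version adds the conventional 1/√d scaling before the Softmax (the description does not say whether the embodiment scales), and all weight matrices are illustrative placeholders:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One head: R = softmax(Q K^T / sqrt(d)), output = R V.  X is (T, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    R = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # scores normalized to [0, 1]
    return R @ V

def multi_head(X, heads):
    """h parallel heads; their outputs are concatenated as described."""
    return np.concatenate([self_attention(X, *w) for w in heads], axis=-1)
```

In the dual-path arrangement, the same operation would be applied once along the intra-block axis (local attention over each length-N chunk) and once along the inter-block axis (global attention across the M chunks).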
In the training stage, the SNR of the speech segment uses the Ground-Truth SNR, i.e. the SNR value used when generating the training mixtures; in the test stage, the SNR value computed by the SNR estimation module is used.
And S6.3, recombining and splicing the three-dimensional feature modules based on a speech decoder, and recovering the clean speech signal of the target speaker.
Specifically, the speech decoder restores the three-dimensional feature module produced by the speech separator into the clean audio of the target speaker, implemented with a one-dimensional deconvolution network. The three-dimensional module is recombined and spliced by reversing the cutting-and-splicing process of the speech encoder, giving a feature sequence of the same length as the speech encoder's output, which is then fed into the one-dimensional deconvolution network to recover the clean speech signal of the target speaker.
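The decoder's recombine-and-splice step is the inverse of the encoder's segmentation; a hedged NumPy sketch, assuming 50% overlap and simple averaging over overlapped regions (the text does not specify the exact merge rule):

```python
import numpy as np

def overlap_add(chunks, L):
    """Merge an (N, M, H) block back into an (L, H) sequence by
    overlap-add with hop P = N // 2, averaging overlapped samples."""
    N, M, H = chunks.shape
    P = N // 2
    total = P * (M - 1) + N
    out = np.zeros((total, H))
    norm = np.zeros((total, 1))
    for m in range(M):
        out[m * P:m * P + N] += chunks[:, m, :]
        norm[m * P:m * P + N] += 1.0
    return (out / norm)[:L]  # average overlaps, trim any padding
```

Applied to chunks cut from a sequence with 50% overlap, this recovers the original sequence exactly, which is the property the decoder relies on before the final deconvolution.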
As a further preferred embodiment of the method, the training step of the speaker coder specifically includes:
constructing a first training set DatasetA;
sampling, framing, pre-emphasizing, windowing, Fourier transforming and Mel filtering are carried out on data of a first training set DatasetA to obtain a Mel frequency spectrum for training;
specifically, the frame length is set to 25 ms, the frame shift to 10 ms, and the pre-emphasis coefficient to 0.97; the window function is a Hamming window, the number of Fourier transform points is set to 512, and the number of Mel filters is set to 64;
and training the speaker encoder with Cross-Entropy Loss as the loss function, using the training Mel spectra and the ground-truth labels of the first training set, to obtain the pre-trained speaker encoder.
As a preferred embodiment of the method, the training step of the speech separator specifically includes:
constructing a second training set DatasetB;
randomly selecting two speakers A and B from the second training set DatasetB each time, with A as the target speaker: one audio clip is selected from A's corpus as the registration corpus, from which the trained speaker encoder extracts the identity feature vector E, and another audio clip is randomly selected from A's remaining corpus as the target speech signal S1 to be recovered; B acts as the interfering speaker: an audio clip is randomly selected from B's corpus as the interfering speech S2 and mixed with S1 at a signal-to-noise ratio drawn from -5 to 10 dB to serve as training data for separation (all speech segments used for training are cut to 3 s), and the SNR used for mixing is also input to the model;
the speech separator is trained with the SI-SNR (scale-inverse signal-to-noise ratio) as a loss function.
In addition, the training steps of other modules are the same.
SI-SNR and L1 Loss are used as the loss functions of the separation module and the SNR estimation module respectively, and their sum serves as the loss function of the whole model for joint optimization of the two modules. Adam is used as the optimizer with an initial learning rate of 0.001; when the validation loss has not decreased for more than 10 epochs, the learning rate is adjusted. After 50 iterations, training is complete and the model is saved;
a person-specific speech separation system based on a two-path self-attention mechanism, comprising:
the data acquisition module is used for acquiring the registration corpus and the mixed corpus;
the identity feature extraction module is used for performing Mel spectrum extraction on the registration corpus and inputting the extracted Mel spectrum into a pre-trained speaker encoder to obtain identity features;
the voice feature extraction module is used for processing the mixed corpus based on a pre-trained voice coder to obtain voice features;
the fusion module is used for fusing the identity characteristic and the voice characteristic to obtain a fusion characteristic;
the signal-to-noise ratio estimation module is used for processing the fusion features, based on pre-training, to obtain a signal-to-noise ratio estimate;
and the separation module is used for sequentially passing the fusion characteristics and the signal-to-noise ratio estimation value through a pre-trained voice separator and a voice decoder to obtain a clean voice signal of the target speaker.
Further as a preferred embodiment of the present system, the present system further comprises:
and the training module is used for training the speaker encoder, the voice encoder, the signal-to-noise ratio estimation module, the voice separator and the voice decoder.
The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.
A specific person voice separation device based on a double-path self-attention mechanism comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method for speaker-specific speech separation based on a dual-path self-attention mechanism as described above.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
A storage medium having stored therein instructions executable by a processor, the storage medium comprising: the processor-executable instructions, when executed by the processor, are for implementing a two-path self-attention mechanism based method for human-specific speech separation as described above.
The contents of the above method embodiments are all applicable to this storage medium embodiment; the functions specifically implemented by this storage medium embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are likewise the same.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (8)
1. A specific person voice separation method based on a double-path self-attention mechanism is characterized by comprising the following steps:
acquiring a registered corpus and a mixed corpus;
extracting Mel spectrum from the registered corpus, and inputting the extracted Mel spectrum to a pre-trained speaker encoder to obtain identity characteristics;
processing the mixed corpus based on a pre-trained voice coder to obtain voice characteristics;
fusing the identity characteristic and the voice characteristic to obtain a fused characteristic;
processing the fusion characteristics based on a pre-trained signal-to-noise ratio estimation module to obtain a signal-to-noise ratio estimation value;
and sequentially passing the fusion characteristics and the signal-to-noise ratio estimation value through a pre-trained voice separator and a voice decoder to obtain a clean voice signal of the target speaker.
2. The method as claimed in claim 1, wherein the step of performing Mel spectrum extraction on the registered corpus and inputting the extracted Mel spectrum to a pre-trained speaker encoder to obtain the identity characteristics specifically comprises:
sequentially performing framing processing, pre-emphasis processing and windowing processing on the registered corpus to obtain a windowed signal;
performing short-time Fourier transform on the windowed signal to obtain a linear frequency spectrum;
converting the linear frequency spectrum into a Mel nonlinear frequency spectrum to obtain a Mel spectrum;
and processing the Mel spectrum based on the pre-trained speaker encoder to obtain the identity characteristics.
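The extraction chain in claim 2 (framing, pre-emphasis, windowing, short-time Fourier transform, Mel filterbank) can be sketched in NumPy as follows. The frame length, hop, FFT size, filter count, and the 0.97 pre-emphasis coefficient are illustrative defaults, not values fixed by the patent.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrum(x, sr=16000, frame_len=400, hop=160, n_fft=512, n_mels=40):
    # Framing: slice the waveform into overlapping frames.
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx]
    # Pre-emphasis within each frame, then Hamming windowing.
    frames = np.concatenate([frames[:, :1], frames[:, 1:] - 0.97 * frames[:, :-1]], axis=1)
    frames = frames * np.hamming(frame_len)
    # Short-time Fourier transform -> linear power spectrum.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular Mel filterbank converts the linear spectrum to the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return np.log(power @ fbank.T + 1e-10)  # shape (n_frames, n_mels)
```

The log-Mel matrix, one row per frame, is what the pre-trained speaker encoder would consume.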
3. The method of claim 2, wherein the pre-trained speaker encoder comprises a front-end feature extraction network, an encoding layer, and a fully connected layer.
4. The method as claimed in claim 3, wherein the step of processing the Mel spectrum by the pre-trained speaker encoder to obtain the identity characteristics specifically comprises:
extracting, from the Mel spectrum, features learned for the speaker verification task based on the front-end feature extraction network to obtain front-end features;
converting the front-end features into encoding vectors based on the encoding layer;
and processing the encoding vectors based on the fully connected layer to obtain a speaker identity characteristic of fixed dimension.
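The point of claim 4 is that utterances of any length map to an identity vector of one fixed dimension. A minimal NumPy sketch, where mean-pooling over time stands in for the encoding layer and the matrix `W` and bias `b` are a hypothetical fully connected projection (the patent does not specify these internals):

```python
import numpy as np

def speaker_embedding(frame_feats, W, b):
    """frame_feats: (T, D) front-end features for one utterance.
    Mean-pooling over time stands in for the encoding layer; the fully
    connected projection (W, b) yields a fixed-dimension, unit-norm
    identity vector regardless of utterance length T."""
    pooled = frame_feats.mean(axis=0)        # (D,)
    emb = W @ pooled + b                     # (E,) fixed dimension
    return emb / (np.linalg.norm(emb) + 1e-8)

# Utterances of different lengths map to the same fixed dimension.
rng = np.random.default_rng(0)
W, b = rng.standard_normal((128, 40)), np.zeros(128)
e_short = speaker_embedding(rng.standard_normal((50, 40)), W, b)
e_long = speaker_embedding(rng.standard_normal((200, 40)), W, b)
```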
5. The method according to claim 4, wherein the step of processing the mixed corpus based on the pre-trained voice coder to obtain the voice characteristics specifically comprises:
converting the mixed corpus into a voice signal sequence based on a pre-trained voice coder;
and cutting and recombining the voice signal sequence into a three-dimensional feature block to obtain voice features.
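The cutting-and-recombining step in claim 5 matches the standard dual-path segmentation: an (N, T) feature sequence is chunked with overlap and stacked into a three-dimensional block. A sketch with illustrative chunk size and 50% hop (the patent does not fix these values):

```python
import numpy as np

def segment(seq, chunk=100, hop=50):
    """Cut an (N, T) feature sequence into overlapping chunks and stack
    them into a 3-D block of shape (N, chunk, n_chunks)."""
    N, T = seq.shape
    n_chunks = int(np.ceil(max(T - chunk, 0) / hop)) + 1
    pad = (n_chunks - 1) * hop + chunk - T   # zero-pad so chunks tile T exactly
    seq = np.pad(seq, ((0, 0), (0, pad)))
    block = np.stack([seq[:, i * hop:i * hop + chunk] for i in range(n_chunks)],
                     axis=-1)
    return block, pad

block, pad = segment(np.arange(64 * 230, dtype=float).reshape(64, 230))
```

The dual-path separator then alternates attention along the intra-chunk and inter-chunk axes of this block.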
6. The method according to claim 5, wherein the pre-trained signal-to-noise ratio estimation module comprises a dilated convolution layer, an LSTM layer, and a fully connected layer, and the step of processing the fusion characteristics by the pre-trained signal-to-noise ratio estimation module to obtain the signal-to-noise ratio estimation value specifically comprises:
extracting features from the fusion characteristics based on the dilated convolution layer;
mining timing information between features based on the LSTM layer;
obtaining a signal-to-noise ratio estimation value for each frame based on the fully connected layer;
and averaging the values in the time dimension to obtain the estimated value of the signal-to-noise ratio of the voice segment.
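The last two steps of claim 6 (per-frame SNR values, then a time-dimension average per segment) can be illustrated with ground-truth per-frame SNRs computed from aligned clean and noise signals, e.g. as training targets for the estimation module; frame length and hop here are illustrative:

```python
import numpy as np

def frame_snrs_db(clean, noise, frame_len=400, hop=160):
    """Per-frame SNR in dB from aligned clean speech and noise signals."""
    n_frames = 1 + (len(clean) - frame_len) // hop
    snrs = np.empty(n_frames)
    for i in range(n_frames):
        s = clean[i * hop:i * hop + frame_len]
        n = noise[i * hop:i * hop + frame_len]
        snrs[i] = 10.0 * np.log10((np.sum(s ** 2) + 1e-10) /
                                  (np.sum(n ** 2) + 1e-10))
    return snrs

def segment_snr_db(snrs):
    # Average over the time dimension -> one estimate per speech segment.
    return float(np.mean(snrs))
```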
7. The method according to claim 6, wherein the step of passing the fusion features and the snr estimate through a pre-trained speech separator and a speech decoder in sequence to obtain a clean speech signal of the target speaker comprises:
splicing the fusion characteristics and the signal-to-noise ratio estimation value on the characteristic dimension to obtain splicing characteristics;
separating the splicing characteristics by a pre-trained double-path self-attention mechanism voice separator to obtain separated three-dimensional feature blocks;
and recombining and splicing the separated three-dimensional feature blocks based on the voice decoder to recover a clean voice signal of the target speaker.
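The decoder's recombining-and-splicing step in claim 7 can be sketched as an overlap-add that inverts the chunking, averaging the overlapping regions and trimming any padding added when the sequence was cut; all dimensions here are illustrative:

```python
import numpy as np

def overlap_add(block, pad, hop=50):
    """Recombine a (N, chunk, n_chunks) block into an (N, T) sequence by
    overlap-add, averaging overlapping regions, then trim `pad` samples
    of zero-padding introduced by the chunking step."""
    N, chunk, n_chunks = block.shape
    T = (n_chunks - 1) * hop + chunk
    out = np.zeros((N, T))
    count = np.zeros(T)
    for i in range(n_chunks):
        out[:, i * hop:i * hop + chunk] += block[:, :, i]
        count[i * hop:i * hop + chunk] += 1.0
    out /= count  # average where chunks overlap
    return out[:, :T - pad] if pad else out
```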
8. The method according to claim 7, wherein the step of training the speaker encoder comprises:
constructing a first training set;
performing sampling, framing, pre-emphasis, windowing, Fourier transform, and Mel filtering on the data of the first training set to obtain Mel spectra for training;
and training the speaker encoder with Cross Entropy Loss as the loss function according to the training Mel spectra and the real labels in the first training set to obtain the pre-trained speaker encoder.
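Taking the loss of claim 8 to be standard softmax cross entropy (the usual choice for training a speaker encoder as a speaker classifier), a minimal NumPy sketch of the loss itself:

```python
import numpy as np

def cross_entropy_loss(logits, labels):
    """Softmax cross entropy over speaker classes for a batch of
    logits (B, C) and integer speaker labels (B,)."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))
```

For uniform logits over C classes the loss equals log C, the familiar chance-level baseline.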
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210088494.8A CN114495973A (en) | 2022-01-25 | 2022-01-25 | Special person voice separation method based on double-path self-attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210088494.8A CN114495973A (en) | 2022-01-25 | 2022-01-25 | Special person voice separation method based on double-path self-attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114495973A true CN114495973A (en) | 2022-05-13 |
Family
ID=81474440
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210088494.8A Pending CN114495973A (en) | 2022-01-25 | 2022-01-25 | Special person voice separation method based on double-path self-attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114495973A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115116448A (en) * | 2022-08-29 | 2022-09-27 | 四川启睿克科技有限公司 | Voice extraction method, neural network model training method, device and storage medium |
WO2024082928A1 (en) * | 2022-10-21 | 2024-04-25 | 腾讯科技(深圳)有限公司 | Voice processing method and apparatus, and device and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108182949A (en) | A kind of highway anomalous audio event category method based on depth conversion feature | |
CN108899051B (en) | Speech emotion recognition model and recognition method based on joint feature representation | |
CN108447495B (en) | Deep learning voice enhancement method based on comprehensive feature set | |
Xu et al. | Time-domain speaker extraction network | |
CN105023580B (en) | Unsupervised noise estimation based on separable depth automatic coding and sound enhancement method | |
CN114495973A (en) | Special person voice separation method based on double-path self-attention mechanism | |
CN111292762A (en) | Single-channel voice separation method based on deep learning | |
CN109256118B (en) | End-to-end Chinese dialect identification system and method based on generative auditory model | |
CN108922513A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
KR101807961B1 (en) | Method and apparatus for processing speech signal based on lstm and dnn | |
CN111785285A (en) | Voiceprint recognition method for home multi-feature parameter fusion | |
CN105895082A (en) | Acoustic model training method and device as well as speech recognition method and device | |
Strauss et al. | A flow-based neural network for time domain speech enhancement | |
CN111128211B (en) | Voice separation method and device | |
CN112259080B (en) | Speech recognition method based on neural network model | |
CN104217730B (en) | A kind of artificial speech bandwidth expanding method and device based on K SVD | |
CN108806725A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
Li et al. | Sams-net: A sliced attention-based neural network for music source separation | |
CN109036470A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
CN111798875A (en) | VAD implementation method based on three-value quantization compression | |
Lim et al. | Harmonic and percussive source separation using a convolutional auto encoder | |
CN114613387A (en) | Voice separation method and device, electronic equipment and storage medium | |
CN113571095B (en) | Speech emotion recognition method and system based on nested deep neural network | |
CN104240717A (en) | Voice enhancement method based on combination of sparse code and ideal binary system mask | |
CN114360571A (en) | Reference-based speech enhancement method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||