CN114495973A - Speaker-specific speech separation method based on a dual-path self-attention mechanism - Google Patents

Speaker-specific speech separation method based on a dual-path self-attention mechanism

Info

Publication number
CN114495973A
Authority
CN
China
Prior art keywords
voice, signal, speaker, trained, corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210088494.8A
Other languages
Chinese (zh)
Inventor
张东
暴媛媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202210088494.8A
Publication of CN114495973A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272: Voice signal separating
    • G10L 21/0208: Noise filtering
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a speaker-specific speech separation method based on a dual-path self-attention mechanism, comprising the following steps: acquiring a registration corpus and a mixed corpus; extracting a Mel spectrum from the registration corpus and inputting it to a pre-trained speaker encoder to obtain an identity feature; processing the mixed corpus with a pre-trained speech encoder to obtain a speech feature; fusing the identity feature and the speech feature to obtain a fused feature; processing the fused feature with a pre-trained signal-to-noise ratio (SNR) estimation module to obtain an SNR estimate; and passing the fused feature and the SNR estimate through a pre-trained speech separator and a speech decoder in sequence to obtain a clean speech signal of the target speaker. With the invention, the voice of the target speaker can be quickly and accurately extracted from a mixed corpus containing noise and multi-speaker interference. The speaker-specific speech separation method based on a dual-path self-attention mechanism can be widely applied in the field of speech separation.

Description

Speaker-specific speech separation method based on a dual-path self-attention mechanism
Technical Field
The invention relates to the field of speech separation, and in particular to a speaker-specific speech separation method based on a dual-path self-attention mechanism.
Background
Speaker-specific separation extracts the voice of a target speaker from a mixed corpus containing noise and multi-speaker interference, given a reference utterance of the target speaker. Current deep-learning-based speaker-specific separation algorithms fall into two categories, frequency-domain and time-domain. Frequency-domain (time-frequency) methods neglect the importance of phase information during speech signal reconstruction, and relying on a pre-extracted magnitude spectrum limits, to some extent, the separation network's feature learning on the raw speech signal. In time-domain methods, the signal sequence produced by the encoder is often significantly longer than the sequence obtained from a conventional short-time Fourier transform (STFT), which makes the network's modeling and learning difficult. In addition, for both time-frequency and time-domain methods, separation performance degrades when training and test conditions are not fully matched; in particular, a mismatch between the signal-to-noise ratio levels of the test data and the training data (in the separation task, the ratio of the energy of the target speaker's voice to that of the interfering speakers, hereinafter referred to as SNR) is a major factor affecting separation quality.
Disclosure of Invention
In order to solve the above technical problem, an object of the present invention is to provide a speaker-specific speech separation method based on a dual-path self-attention mechanism, which can quickly and accurately extract the voice of the target speaker from a mixed corpus containing noise and multi-speaker interference.
The first technical solution adopted by the invention is as follows: a speaker-specific speech separation method based on a dual-path self-attention mechanism, comprising the following steps:
acquiring a registration corpus and a mixed corpus;
extracting a Mel spectrum from the registration corpus and inputting it to a pre-trained speaker encoder to obtain an identity feature;
processing the mixed corpus with a pre-trained speech encoder to obtain a speech feature;
fusing the identity feature and the speech feature to obtain a fused feature;
processing the fused feature with a pre-trained signal-to-noise ratio (SNR) estimation module to obtain an SNR estimate;
and passing the fused feature and the SNR estimate through a pre-trained speech separator and a speech decoder in sequence to obtain a clean speech signal of the target speaker.
Further, the step of extracting a Mel spectrum from the registration corpus and inputting it to the pre-trained speaker encoder to obtain the identity feature specifically comprises:
sequentially performing framing, pre-emphasis and windowing on the registration corpus to obtain a windowed signal;
performing a short-time Fourier transform on the windowed signal to obtain a linear spectrum;
converting the linear spectrum into a Mel nonlinear spectrum to obtain the Mel spectrum;
and processing the Mel spectrum with the pre-trained speaker encoder to obtain the identity feature.
Further, the pre-trained speaker encoder comprises a front-end feature extraction network, an encoding layer and a fully connected layer.
Further, the step of processing the Mel spectrum with the pre-trained speaker encoder to obtain the identity feature specifically comprises:
extracting features suited to the speaker verification task from the Mel spectrum with the front-end feature extraction network to obtain front-end features;
converting the front-end features into an encoding vector with the encoding layer;
and processing the encoding vector with the fully connected layer to obtain a fixed-dimension speaker identity feature.
Further, the step of processing the mixed corpus with the pre-trained speech encoder to obtain the speech feature specifically comprises:
converting the mixed corpus into a speech signal sequence with the pre-trained speech encoder;
and cutting and recombining the speech signal sequence into a three-dimensional feature block to obtain the speech feature.
Further, the pre-trained SNR estimation module comprises a dilated convolution layer, an LSTM layer and a fully connected layer, and the step of processing the fused feature with the pre-trained SNR estimation module to obtain the SNR estimate specifically comprises:
extracting features from the fused feature with the dilated convolution;
mining temporal information between the features with the LSTM layer;
obtaining a per-frame SNR estimate with the fully connected layer;
and averaging over the time dimension to obtain the SNR estimate of the speech segment.
Further, the step of passing the fused feature and the SNR estimate through the pre-trained speech separator and the speech decoder in sequence to obtain the clean speech signal of the target speaker specifically comprises:
concatenating the fused feature and the SNR estimate along the feature dimension to obtain a concatenated feature;
separating the concatenated feature with the pre-trained dual-path self-attention speech separator to obtain a separated three-dimensional feature block;
and recombining and splicing the separated three-dimensional feature block with the speech decoder to recover the clean speech signal of the target speaker.
Further, the training step of the speaker encoder specifically comprises:
constructing a first training set;
sampling, framing, pre-emphasizing, windowing, Fourier transforming and Mel filtering the data of the first training set to obtain training Mel spectra;
and training the speaker encoder with the Cross-Entropy Loss as the loss function, using the training Mel spectra and the ground-truth labels of the first training set, to obtain the pre-trained speaker encoder.
The method has the following beneficial effects: the invention adds an SNR estimation module to the network, extracts the SNR between the target speech and the interfering speech, and uses the estimated SNR as one of the inputs to the separation network, so that the separation network is aware of the SNR level of the current speech segment, improving separation performance under different SNR scenarios.
Drawings
FIG. 1 is a flow chart of the steps of the speaker-specific speech separation method based on a dual-path self-attention mechanism according to the present invention;
FIG. 2 is a block diagram of the method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating the Mel-spectrum extraction process according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
Referring to FIG. 1 and FIG. 2, the invention provides a speaker-specific speech separation method based on a dual-path self-attention mechanism, comprising the following steps:
S1, acquiring a registration corpus and a mixed corpus;
S2, extracting a Mel spectrum from the registration corpus and inputting the extracted Mel spectrum to a pre-trained speaker encoder to obtain an identity feature;
Specifically, the speaker encoder extracts, from the registration audio of the target speaker, an encoding vector that represents the identity of the target speaker.
S2.1, sequentially performing framing, pre-emphasis and windowing on the registration corpus to obtain a windowed signal;
specifically, the speech signal has short-time stationarity, and can be regarded as quasi-static within 10-30 ms, so the speech signal is usually framed in the time domain, and for smooth transition between frames, there is usually an overlapping portion between frames, called frame shift. Here, the speech signal is default to 16k, and 25ms is used as the frame length, and 10ms is used as the frame shift. Due to the characteristics of a human voice production structure, the high-frequency part in a human voice signal can be restrained, in the process of sound transmission, the high-frequency energy is greatly attenuated, and in order to make up for the defects of the high-frequency signal, the high-frequency part of each frame of signal is pre-emphasized. To mitigate the gibbs effect, the signal after pre-emphasis is windowed using a hamming window.
S2.2, performing a short-time Fourier transform on the windowed signal to obtain a linear spectrum;
Specifically, the windowed signal is transformed by a short-time Fourier transform from the time domain to the frequency domain to obtain a magnitude spectrum. The number of Fourier transform points is set to 512.
S2.3, converting the linear spectrum into a Mel nonlinear spectrum to obtain the Mel spectrum;
Specifically, the human ear perceives sound frequency logarithmically, whereas the linear spectrogram produced by the short-time Fourier transform is spaced uniformly in frequency; the Mel spectrum matches the auditory characteristics of the human ear. The energy spectrum is multiplied by a bank of triangular band-pass filters distributed uniformly on the Mel scale, i.e. the spacing between the filters' center frequencies narrows as the filter index decreases and widens as it increases. Taking the log energy output by each filter converts the linear spectrum into the Mel nonlinear spectrum and yields the required Mel spectrum; the Mel-spectrum extraction process is shown in FIG. 3.
S2.4, processing the Mel spectrum with the pre-trained speaker encoder to obtain the identity feature.
S2.4.1, extracting features suited to the speaker verification task from the Mel spectrum with the front-end feature extraction network to obtain front-end features;
Specifically, the front-end feature extraction network learns features from the Mel spectrum of the speech signal that are suitable for the speaker verification task. Here, ResNet-18 is used as the front-end feature learning module.
S2.4.2, converting the front-end features into an encoding vector with the encoding layer;
Specifically, the encoding layer converts the time-dependent features output by the front-end feature extractor into a time-independent fixed-length encoding vector. It also performs the dimension conversion between the feature extraction layer and the classifier and reduces the overfitting problem of a deep network.
S2.4.3, processing the encoding vector with the fully connected layer to obtain a fixed-dimension speaker identity feature.
In addition, a classifier is included to classify the speaker identity feature; each output node corresponds to one speaker in the training data. This part exists only during training; once training is finished it is removed, and the output of the preceding fully connected layer is fed into the subsequent separation network as the speaker identity feature.
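A hedged PyTorch sketch of the speaker encoder structure described above follows: a ResNet-18 front end (with its first convolution adapted to a single-channel Mel spectrogram), global pooling acting as the encoding layer, a fully connected layer producing a fixed-dimension identity feature, and a classifier head kept only during training. The embedding dimension (256) and the number of training speakers are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn
import torchvision

class SpeakerEncoder(nn.Module):
    def __init__(self, emb_dim=256, n_speakers=1000):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        # Accept a single-channel Mel spectrogram instead of a 3-channel image
        resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.frontend = nn.Sequential(*list(resnet.children())[:-1])  # drop the ImageNet classifier
        self.embedding = nn.Linear(512, emb_dim)          # fixed-dimension identity feature
        self.classifier = nn.Linear(emb_dim, n_speakers)  # used during training only

    def forward(self, mel):                       # mel: (batch, n_mels, n_frames)
        x = self.frontend(mel.unsqueeze(1))       # add channel axis -> (batch, 512, 1, 1)
        emb = self.embedding(x.flatten(1))        # (batch, emb_dim) speaker identity feature
        logits = self.classifier(emb)             # speaker posteriors for training
        return emb, logits
```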
S3, processing the mixed corpus with the pre-trained speech encoder to obtain the speech feature;
S3.1, converting the mixed corpus into a speech signal sequence with the pre-trained speech encoder;
S3.2, cutting and recombining the speech signal sequence into a three-dimensional feature block to obtain the speech feature.
Specifically, the speech encoder encodes the mixed speech, playing the role of an STFT, to convert the speech signal into a feature sequence of length L and feature dimension H; the process is implemented with a one-dimensional convolutional network. The convolved speech signal sequence is usually too long, so it is cut and recombined into a three-dimensional feature block: the feature sequence of length L and feature dimension H is cut into M short sequences of length N, adjacent short sequences overlapping by P, giving a three-dimensional block of dimension N x M x H, where N = 2P, i.e. adjacent short sequences overlap by half their length.
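Below is a minimal sketch of the speech encoder and the cut-and-recombine step: a one-dimensional convolution plays the role of an STFT, and the resulting length-L sequence with feature dimension H is segmented into M overlapping chunks of length N with hop P = N/2, giving the three-dimensional block. The kernel size, stride and H = 64 are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    def __init__(self, feat_dim=64, kernel_size=16, stride=8):
        super().__init__()
        self.conv = nn.Conv1d(1, feat_dim, kernel_size, stride=stride, bias=False)

    def forward(self, wav):                              # wav: (batch, samples)
        return torch.relu(self.conv(wav.unsqueeze(1)))   # (batch, H, L)

def segment(features, chunk_len=100):
    """Cut a (batch, H, L) sequence into M overlapping chunks of length N=chunk_len
    with overlap P=N//2, producing a (batch, H, N, M) block."""
    hop = chunk_len // 2                          # overlap P, with N = 2P
    batch, H, L = features.shape
    pad = (hop - L % hop) % hop + hop
    features = F.pad(features, (hop, pad))        # pad so every sample falls in two chunks
    chunks = features.unfold(-1, chunk_len, hop)  # (batch, H, M, N)
    return chunks.permute(0, 1, 3, 2)             # (batch, H, N, M)
```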
S4, fusing the identity feature and the speech feature to obtain a fused feature;
Specifically, this step fuses the speech feature produced by the speech encoder with the identity feature produced by the speaker encoder. The speech feature and the identity feature are concatenated along the feature dimension and then passed through a fully connected layer to obtain the fused feature. The fused feature is fed to both the speech separator and the SNR estimation module.
S5, processing the fused feature with the pre-trained SNR estimation module to obtain an SNR estimate;
S5.1, extracting features from the fused feature with the dilated convolutions;
S5.2, mining temporal information between the features with the LSTM layer;
S5.3, obtaining a per-frame SNR estimate with the fully connected layer;
S5.4, averaging over the time dimension to obtain the SNR estimate of the speech segment.
Specifically, the SNR estimation module estimates the SNR of the corpus. It is implemented with three layers of two-dimensional dilated convolution, one LSTM layer and one fully connected layer. The dilated convolutions perform feature extraction, the LSTM mines temporal information between the features, the fully connected layer produces a per-frame SNR estimate, and averaging (smoothing) over the time dimension finally gives the SNR of the whole segment. In the training phase, this module is trained jointly with the rest of the model in a multi-task fashion; in the testing phase, this module, together with the preceding speaker encoder, speech encoder and feature fusion module, extracts the estimated SNR of the current speech segment from the mixed corpus, which replaces the ground-truth SNR used during training as one of the inputs to the speech separator.
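The following is a hedged sketch of the SNR estimation module: three two-dimensional dilated convolutions, one LSTM layer mining temporal information, one fully connected layer giving a per-frame SNR, and an average over time giving the segment-level estimate. The channel counts, dilation rates, intra-chunk pooling and hidden size are illustrative assumptions; the patent fixes only the layer types.

```python
import torch
import torch.nn as nn

class SNREstimator(nn.Module):
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(feat_dim, 32, kernel_size=3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=4, dilation=4), nn.ReLU(),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, fused):                    # fused: (batch, H, N, M)
        x = self.convs(fused)                    # (batch, 32, N, M)
        x = x.mean(dim=2).transpose(1, 2)        # pool the intra-chunk axis -> (batch, M, 32)
        x, _ = self.lstm(x)                      # mine temporal information between frames
        snr_per_frame = self.fc(x).squeeze(-1)   # (batch, M) per-frame SNR estimate
        return snr_per_frame.mean(dim=1)         # segment-level SNR estimate, shape (batch,)
```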
S6, passing the fused feature and the SNR estimate through the pre-trained speech separator and the speech decoder in sequence to obtain the clean speech signal of the target speaker.
S6.1, concatenating the fused feature and the SNR estimate along the feature dimension to obtain a concatenated feature;
S6.2, separating the concatenated feature with the pre-trained dual-path self-attention speech separator to obtain a separated three-dimensional feature block;
Specifically, the speech separator separates out the three-dimensional feature corresponding to the target speaker and is implemented by stacking B dual-path self-attention modules. The fused feature is concatenated with the SNR of the current speech segment along the feature dimension, i.e. the SNR value is repeated N x M times and then concatenated onto the fused feature. Each dual-path self-attention module consists of self-attention along two paths, within each chunk and across chunks, and all modules share the same parameter settings. The self-attention mechanism lets the network exploit the correlations within its own sequence: a query matrix Q, a key matrix K and a value matrix V are generated from the input feature vectors; Q and K are multiplied to obtain a correlation score matrix R between each time step and the other time steps in the sequence; R is normalized to the range 0-1 with a Softmax function; and R is multiplied with V to obtain the self-attention output. The self-attention is implemented as multi-head attention, consisting of h parallel self-attention modules whose outputs are concatenated to give the final output. Intra-chunk self-attention treats the positions within each short sequence as the time axis and amounts to local attention over each chunk, while inter-chunk self-attention treats the chunk index as the time axis and amounts to global attention over the whole long sequence; combining local and global attention gives effective modeling of long sequences. In the training phase, the SNR of the speech segment is the ground-truth SNR, i.e. the SNR value used when generating the training mixtures; in the testing phase, the SNR value computed by the SNR estimation module is used instead.
S6.3, recombining and splicing the three-dimensional feature block with the speech decoder to recover the clean speech signal of the target speaker.
Specifically, the speech decoder restores the three-dimensional feature block produced by the speech separator into the clean audio of the target speaker and is implemented with a one-dimensional transposed convolution (deconvolution) network. The three-dimensional block is recombined and spliced by reversing the cutting-and-splicing process of the speech encoder, yielding a feature sequence of the same length as the speech encoder output; this sequence is then fed into the one-dimensional transposed convolution network to recover the clean speech signal of the target speaker.
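Below is a hedged sketch of the decoder side: the separated three-dimensional block is recombined by overlap-add, reversing the segmentation sketched above (assuming 50% overlap, N = 2P), and a one-dimensional transposed convolution recovers the waveform. The kernel size and stride mirror the encoder sketch and are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def overlap_add(chunks, orig_len=None):
    """Inverse of segment(): (batch, H, N, M) -> (batch, H, L).
    Overlapping halves are summed, as in the usual dual-path overlap-add."""
    b, H, N, M = chunks.shape
    hop = N // 2
    length = (M - 1) * hop + N                     # padded length produced by segment()
    x = F.fold(chunks.reshape(b, H * N, M),        # fold expects (B, C*N, M)
               output_size=(1, length), kernel_size=(1, N), stride=(1, hop))
    x = x.squeeze(2)[..., hop:]                    # drop the front padding
    return x[..., :orig_len] if orig_len is not None else x

class SpeechDecoder(nn.Module):
    def __init__(self, feat_dim=64, kernel_size=16, stride=8):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(feat_dim, 1, kernel_size, stride=stride, bias=False)

    def forward(self, separated):                  # separated: (batch, H, N, M)
        seq = overlap_add(separated)               # (batch, H, L)
        return self.deconv(seq).squeeze(1)         # estimated clean waveform (batch, samples)
```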
As a further preferred embodiment of the method, the training step of the speaker encoder specifically comprises:
constructing a first training set DatasetA;
sampling, framing, pre-emphasizing, windowing, Fourier transforming and Mel filtering the data of the first training set DatasetA to obtain training Mel spectra;
Specifically, the frame length is set to 25 ms, the frame shift to 10 ms, the pre-emphasis coefficient to 0.97, the window function is a Hamming window, the number of Fourier transform points is 512, and the number of Mel filters is 64;
and training the speaker encoder with the Cross-Entropy Loss as the loss function, using the training Mel spectra and the ground-truth labels of the first training set, to obtain the pre-trained speaker encoder.
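A brief sketch of this training step, reusing the SpeakerEncoder sketch above: the classifier logits are scored with the cross-entropy loss against the speaker labels of DatasetA. All names are illustrative.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
speaker_encoder = SpeakerEncoder(emb_dim=256, n_speakers=1000)   # from the sketch above
optimizer = torch.optim.Adam(speaker_encoder.parameters())

# mel_batch: (batch, 64, frames) training Mel spectra; labels: (batch,) speaker indices
# emb, logits = speaker_encoder(mel_batch)
# loss = criterion(logits, labels)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```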
As a preferred embodiment of the method, the training step of the speech separator specifically comprises:
constructing a second training set DatasetB;
each time, randomly selecting two speakers A and B from the second training set DatasetB, with A as the target speaker: one utterance is selected from A's corpus as the registration corpus, from which the trained speaker encoder extracts the identity feature vector E, and another utterance is randomly selected from A's remaining corpus as the target speech signal S1 to be recovered; B acts as the interfering speaker, and one utterance is randomly selected from B's corpus as the interfering speech S2, which is mixed with S1 at a signal-to-noise ratio randomly chosen in the range -5 to 10 dB to form the training data for separation (all speech segments used for training are cut to 3 s); the SNR used for mixing is also fed to the model (see the mixing sketch after this list);
and training the speech separator with the SI-SNR (scale-invariant signal-to-noise ratio) as the loss function.
In addition, the training steps of the other modules are the same.
The SI-SNR and the L1 loss are used as the loss functions of the separation module and the SNR estimation module, respectively, and their sum is used as the loss function of the whole model so that the two modules are optimized jointly. Adam is used as the optimizer with an initial learning rate of 0.001; when the validation loss has not decreased for more than 10 epochs, the learning rate is reduced. After 50 training epochs, training is complete and the model is saved.
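A hedged sketch of the joint objective and optimizer setup: negative SI-SNR for the separation output, L1 loss for the SNR estimate, their sum as the overall loss, Adam with an initial learning rate of 0.001, and a scheduler that lowers the rate when the validation loss stops improving for 10 epochs. The training loop itself is only outlined in comments.

```python
import torch

def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR between estimated and reference waveforms."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    si_snr = 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -si_snr.mean()

def joint_loss(est_wav, ref_wav, est_snr, true_snr):
    # Sum of separation loss and SNR-estimation loss, optimized jointly
    return si_snr_loss(est_wav, ref_wav) + torch.nn.functional.l1_loss(est_snr, true_snr)

# model = ...  # the full network assembled from the modules sketched above
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10)
# for epoch in range(50):
#     ...  # forward pass, joint_loss(...), backward, optimizer.step(); scheduler.step(val_loss)
```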
a person-specific speech separation system based on a two-path self-attention mechanism, comprising:
the data acquisition module is used for acquiring and acquiring the registration corpus and the mixed corpus;
the identity characteristic extraction module is used for carrying out Mel spectrum extraction on the registered corpus and inputting the extracted corpus into a pre-trained speaker encoder to obtain identity characteristics;
the voice feature extraction module is used for processing the mixed corpus based on a pre-trained voice coder to obtain voice features;
the fusion module is used for fusing the identity characteristic and the voice characteristic to obtain a fusion characteristic;
the signal-to-noise ratio estimation module processes the fusion characteristics based on a pre-trained signal-to-noise ratio estimation module to obtain a signal-to-noise ratio estimation value;
and the separation module is used for sequentially passing the fusion characteristics and the signal-to-noise ratio estimation value through a pre-trained voice separator and a voice decoder to obtain a clean voice signal of the target speaker.
Further as a preferred embodiment of the present system, the present system further comprises:
and the training module is used for training the speaker encoder, the voice encoder, the signal-to-noise ratio estimation module, the voice separator and the voice decoder.
The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.
A speaker-specific speech separation device based on a dual-path self-attention mechanism, comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the speaker-specific speech separation method based on a dual-path self-attention mechanism described above.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
A storage medium storing processor-executable instructions which, when executed by a processor, implement the speaker-specific speech separation method based on a dual-path self-attention mechanism described above.
The contents in the above method embodiments are all applicable to the present storage medium embodiment, the functions specifically implemented by the present storage medium embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present storage medium embodiment are also the same as those achieved by the above method embodiments.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A speaker-specific speech separation method based on a dual-path self-attention mechanism, characterized by comprising the following steps:
acquiring a registration corpus and a mixed corpus;
extracting a Mel spectrum from the registration corpus and inputting it to a pre-trained speaker encoder to obtain an identity feature;
processing the mixed corpus with a pre-trained speech encoder to obtain a speech feature;
fusing the identity feature and the speech feature to obtain a fused feature;
processing the fused feature with a pre-trained signal-to-noise ratio (SNR) estimation module to obtain an SNR estimate;
and passing the fused feature and the SNR estimate through a pre-trained speech separator and a speech decoder in sequence to obtain a clean speech signal of the target speaker.
2. The method as claimed in claim 1, wherein the step of extracting the Mel spectrum from the registration corpus and inputting it to the pre-trained speaker encoder to obtain the identity feature comprises:
sequentially performing framing, pre-emphasis and windowing on the registration corpus to obtain a windowed signal;
performing a short-time Fourier transform on the windowed signal to obtain a linear spectrum;
converting the linear spectrum into a Mel nonlinear spectrum to obtain the Mel spectrum;
and processing the Mel spectrum with the pre-trained speaker encoder to obtain the identity feature.
3. The method as claimed in claim 2, wherein the pre-trained speaker encoder comprises a front-end feature extraction network, an encoding layer and a fully connected layer.
4. The method as claimed in claim 3, wherein the step of processing the Mel spectrum with the pre-trained speaker encoder to obtain the identity feature comprises:
extracting features suited to the speaker verification task from the Mel spectrum with the front-end feature extraction network to obtain front-end features;
converting the front-end features into an encoding vector with the encoding layer;
and processing the encoding vector with the fully connected layer to obtain a fixed-dimension speaker identity feature.
5. The method as claimed in claim 4, wherein the step of processing the mixed corpus with the pre-trained speech encoder to obtain the speech feature comprises:
converting the mixed corpus into a speech signal sequence with the pre-trained speech encoder;
and cutting and recombining the speech signal sequence into a three-dimensional feature block to obtain the speech feature.
6. The method as claimed in claim 5, wherein the pre-trained SNR estimation module comprises a dilated convolution layer, an LSTM layer and a fully connected layer, and the step of processing the fused feature with the pre-trained SNR estimation module to obtain the SNR estimate comprises:
extracting features from the fused feature with the dilated convolution;
mining temporal information between the features with the LSTM layer;
obtaining a per-frame SNR estimate with the fully connected layer;
and averaging over the time dimension to obtain the SNR estimate of the speech segment.
7. The method as claimed in claim 6, wherein the step of passing the fused feature and the SNR estimate through the pre-trained speech separator and the speech decoder in sequence to obtain the clean speech signal of the target speaker comprises:
concatenating the fused feature and the SNR estimate along the feature dimension to obtain a concatenated feature;
separating the concatenated feature with the pre-trained dual-path self-attention speech separator to obtain a separated three-dimensional feature block;
and recombining and splicing the separated three-dimensional feature block with the speech decoder to recover the clean speech signal of the target speaker.
8. The method as claimed in claim 7, wherein the training step of the speaker encoder comprises:
constructing a first training set;
sampling, framing, pre-emphasizing, windowing, Fourier transforming and Mel filtering the data of the first training set to obtain training Mel spectra;
and training the speaker encoder with the Cross-Entropy Loss as the loss function, using the training Mel spectra and the ground-truth labels of the first training set, to obtain the pre-trained speaker encoder.
CN202210088494.8A 2022-01-25 2022-01-25 Speaker-specific speech separation method based on a dual-path self-attention mechanism Pending CN114495973A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210088494.8A CN114495973A (en) 2022-01-25 2022-01-25 Speaker-specific speech separation method based on a dual-path self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210088494.8A CN114495973A (en) 2022-01-25 2022-01-25 Speaker-specific speech separation method based on a dual-path self-attention mechanism

Publications (1)

Publication Number Publication Date
CN114495973A 2022-05-13

Family

ID=81474440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210088494.8A Pending CN114495973A (en) 2022-01-25 2022-01-25 Speaker-specific speech separation method based on a dual-path self-attention mechanism

Country Status (1)

Country Link
CN (1) CN114495973A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115116448A (en) * 2022-08-29 2022-09-27 四川启睿克科技有限公司 Voice extraction method, neural network model training method, device and storage medium
WO2024082928A1 (en) * 2022-10-21 2024-04-25 腾讯科技(深圳)有限公司 Voice processing method and apparatus, and device and medium


Similar Documents

Publication Publication Date Title
CN108182949A (en) A kind of highway anomalous audio event category method based on depth conversion feature
CN108899051B (en) Speech emotion recognition model and recognition method based on joint feature representation
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
Xu et al. Time-domain speaker extraction network
CN105023580B (en) Unsupervised noise estimation based on separable depth automatic coding and sound enhancement method
CN114495973A (en) Special person voice separation method based on double-path self-attention mechanism
CN111292762A (en) Single-channel voice separation method based on deep learning
CN109256118B (en) End-to-end Chinese dialect identification system and method based on generative auditory model
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
KR101807961B1 (en) Method and apparatus for processing speech signal based on lstm and dnn
CN111785285A (en) Voiceprint recognition method for home multi-feature parameter fusion
CN105895082A (en) Acoustic model training method and device as well as speech recognition method and device
Strauss et al. A flow-based neural network for time domain speech enhancement
CN111128211B (en) Voice separation method and device
CN112259080B (en) Speech recognition method based on neural network model
CN104217730B (en) A kind of artificial speech bandwidth expanding method and device based on K SVD
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
Li et al. Sams-net: A sliced attention-based neural network for music source separation
CN109036470A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN111798875A (en) VAD implementation method based on three-value quantization compression
Lim et al. Harmonic and percussive source separation using a convolutional auto encoder
CN114613387A (en) Voice separation method and device, electronic equipment and storage medium
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
CN104240717A (en) Voice enhancement method based on combination of sparse code and ideal binary system mask
CN114360571A (en) Reference-based speech enhancement method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination