CN113782045B - Single-channel voice separation method for multi-scale time delay sampling - Google Patents
- Publication number
- CN113782045B (application CN202111006251.7A)
- Authority
- CN
- China
- Prior art keywords
- time delay
- delay sampling
- neural network
- output
- voice
- Prior art date: 2021-08-30
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02087—Noise filtering the noise being separate speech, e.g. cocktail party
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides a single-channel voice separation method based on multi-scale time delay sampling. Mixed voice features of a plurality of speakers are extracted, segmented, and spliced into a 3D tensor. The 3D tensor is modeled iteratively by a first bidirectional recurrent neural network to capture local and global feature information. Using multi-scale time delay sampling modeling, the output of the first bidirectional recurrent neural network is delay-sampled at different scales, and a second bidirectional recurrent neural network captures the feature information at each scale. The feature information is overlap-added and mapped into masks of the clean voices of the plurality of speakers; the masks are dot-multiplied with the mixed voice features to obtain the clean voice features of the plurality of speakers, and each speaker's clean voice features are reconstructed into a separated voice signal. Through multi-scale time delay sampling modeling, the invention compensates for the information loss of segmental modeling and greatly improves voice separation performance.
Description
Technical Field
The invention belongs to the field of voice separation, and particularly relates to a single-channel voice separation method based on multi-scale time delay sampling.
Background
In recent years, with the rise of deep learning, the wave of artificial intelligence it has driven has changed every aspect of daily life. Voice interaction is an indispensable part of this: a clear voice signal lets smart devices execute the corresponding commands more reliably, greatly improving their degree of intelligence. In real acoustic scenes, however, the voice of the speaker of interest is often interfered with by other speakers, which is the classic cocktail party problem. The human auditory system can easily pick out the voice content of a target speaker. Voice separation therefore imitates the human auditory system to separate the clean voice of one or all speakers from a mixture of speakers, removing interference such as background noise and reverberation and improving the clarity and intelligibility of the voice.
Most current mainstream voice separation methods adopt end-to-end time-domain separation, in which the model learns latent common representations of the mixed voice waveform in a data-driven way to realize separation. This kind of separation places extremely high demands on framing during voice preprocessing: research shows that the more frames a segment of mixed voice is divided into within the same duration, the better the final separation effect. However, too many frames make direct modeling infeasible. To solve this problem, Yi Luo et al. proposed a dual-path segmental modeling method in "Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation", which effectively realizes sample-level modeling of long voice sequences and greatly improves voice separation performance. However, when capturing global information, segmental modeling places frames that are adjacent within a modeled sub-sequence far apart in the original sequence, so they have little correlation with each other, and many frames are never captured at all, causing serious information loss.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a single-channel voice separation method based on multi-scale time delay sampling; by modeling with multi-scale time delay sampling, the information loss of segmental modeling is compensated and the separation performance is further improved.
A single-channel voice separation method using multi-scale time delay sampling comprises the following steps:
S1, extracting mixed voice features from the mixed voice signals of a plurality of speakers;
S2, segmenting the mixed voice features and splicing the segments into a 3D tensor;
S3, modeling the 3D tensor iteratively with a first bidirectional recurrent neural network, capturing local feature information and global feature information;
S4, adopting multi-scale time delay sampling modeling: performing time delay sampling of different scales on the output of the first bidirectional recurrent neural network, and capturing the feature information at the different scales with a second bidirectional recurrent neural network;
S5, overlap-adding the feature information from S4, mapping the overlap-added result into masks of the clean voices of the plurality of speakers, and dot-multiplying the masks with the mixed voice features to obtain the clean voice representations of the plurality of speakers;
S6, reconstructing each speaker's clean voice representation into a separated voice signal.
Further, the output of the first bidirectional recurrent neural network is subjected to time delay sampling at a single scale as follows:
the output of the first bidirectional recurrent neural network is sequence-reorganized and re-segmented; the re-segmented and spliced sequence is delay-sampled along the block-length dimension $2^\lambda K$, and each delay-sampled sequence $u_i$ is spliced along the block dimension $R'$ to obtain the reorganized 3D tensor $h_\lambda$, where $2^\lambda$ is the time delay sampling rate, $\lambda = 0, \ldots, B-1$ is the sampling index, $B$ is the number of stacked multi-scale time delay sampling blocks, and $K$ is the block-length dimension before re-segmentation.
Further, the second bidirectional recurrent neural network captures the feature information at a single scale as follows:
along the block-length dimension of $h_\lambda$, the second bidirectional recurrent neural network captures the interrelations between the delay-sampled sequences, and the output is then dimension-transformed through a fully connected layer to give FC; FC is layer-normalized and then residual-connected with the output of the bidirectional recurrent neural network.
Further, with different time delay sampling rates, the single-scale time delay sampling of the output of the first bidirectional recurrent neural network and the capture of the single-scale feature information by the second bidirectional recurrent neural network are repeated, so that the feature information at different scales is integrated.
Further, the re-segmented and spliced sequence is delay-sampled along the block-length dimension $2^\lambda K$ using the formula:

$$u_i = u'[:, i::2^\lambda, :], \quad i = 0, \ldots, 2^\lambda - 1$$

where $u'[:, i::2^\lambda, :]$ denotes a Python-style sequence slice of $u'$ and $i$ is the index number.
Still further, along the block-length dimension of $h_\lambda$, the second bidirectional recurrent neural network captures the interrelations between the delay-sampled sequences, and the output is dimension-transformed through a fully connected layer to give FC, according to:

$$U = [\mathrm{BiLSTM}(h_\lambda[:, :, m]),\ m = 1, \ldots, 2^\lambda R']$$
$$FC = [W\,U[:, :, m] + b,\ m = 1, \ldots, 2^\lambda R']$$

where $U$ is the output of the bidirectional recurrent neural network, $H$ is the output dimension of its hidden layer, $h_\lambda[:, :, m]$ is the sequence selected by index $m$, $2^\lambda R'$ is the block dimension, and $W$ and $b$ are the weight and bias of the fully connected layer, respectively.
Further, FC is layer-normalized according to:

$$\mathrm{Layernorm}(FC) = z \odot \frac{FC - \mu(FC)}{\sqrt{\sigma(FC) + \epsilon}} + r$$

where $\mu(FC)$ and $\sigma(FC)$ are the mean and variance of the fully connected layer output, $z$ and $r$ are normalization factors, $\epsilon$ is a very small positive number, Layernorm denotes layer normalization, $i$, $j$ and $k$ index the dimensions $N$, $K$ and $2^\lambda R'$ respectively, and $N$ is the dimension of the extracted mixed voice features.
Further, the residual connection is given by:

$$\mathrm{Output} = h_\lambda + \mathrm{Layernorm}(FC)$$
The beneficial effects of the invention are as follows: by adopting multi-scale time delay sampling modeling, performing time delay sampling of different scales on the output of the first bidirectional recurrent neural network, and capturing the feature information at different scales with a second bidirectional recurrent neural network, the invention effectively integrates the interrelations between sequences at different scales, compensates for the information loss of the bidirectional recurrent neural network in segmental modeling, greatly improves voice separation performance, effectively reduces voice distortion, and improves voice intelligibility.
Drawings
Fig. 1 is a flowchart of a single-channel voice separation method based on multi-scale time delay sampling according to the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention briefly described above will be rendered by reference to the appended drawings.
As shown in fig. 1, the single-channel voice separation method based on multi-scale time delay sampling of the invention specifically comprises the following steps:
step one, an encoder is adopted to encode a plurality of input speaker mixed voice signals, and corresponding mixed voice characteristics are extracted
For a given mixed speech signalA one-dimensional convolutional neural network is adopted as an encoder to extract high-order mixed voice characteristics ++>Wherein->For the real number set, L is the length of the mixed voice signal, N is the dimension of the extracted mixed voice feature, and T is the frame number of the mixed voice feature; the convolution kernel size of the one-dimensional convolution neural network is W, the moving step of the convolution window is W/2, and a ReLU function is added after the convolution neural network for nonlinear transformation, and the calculation formula is as follows:
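To make the encoder concrete, here is a minimal PyTorch sketch of the one-dimensional convolutional encoder described above; the kernel size W = 16 and feature dimension N = 64 are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """1-D convolutional encoder: kernel size W, hop W/2, followed by ReLU."""
    def __init__(self, W=16, N=64):
        super().__init__()
        self.conv = nn.Conv1d(1, N, kernel_size=W, stride=W // 2, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):                # x: (batch, 1, L) mixed waveform
        return self.relu(self.conv(x))   # z: (batch, N, T) mixed voice features
```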
step two, segmenting the mixed voice characteristics output by the encoder, and then splicing the segmented mixed voice characteristics into a 3D tensor
Dividing the mixed voice characteristic z output by the encoder by taking P as a unit, dividing each divided small block into R blocks with the length of K, overlapping each small block by 50%, and then splicing all the small blocks into a 3D tensor
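A minimal sketch of this segmentation, assuming PyTorch, an even block length K, and T >= K; `unfold` produces the 50%-overlapped blocks with hop P = K/2.

```python
import torch
import torch.nn.functional as F

def segment(z, K):
    """Split encoder features z (batch, N, T) into R blocks of length K with
    50% overlap (hop P = K/2) and stack them into a tensor (batch, N, K, R)."""
    batch, N, T = z.shape
    hop = K // 2                                  # P = K/2 gives 50% overlap
    pad = (hop - (T - K) % hop) % hop             # pad so the last block is full
    z = F.pad(z, (0, pad))
    v = z.unfold(dimension=2, size=K, step=hop)   # (batch, N, R, K)
    return v.permute(0, 1, 3, 2).contiguous()     # (batch, N, K, R)
```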
Step three, the 3D tensor v is modeled iteratively with a bidirectional recurrent neural network, capturing local and global feature information.

B bidirectional recurrent neural network (BiLSTM) modules are stacked to model the input 3D tensor; the stack alternates between odd-numbered and even-numbered BiLSTM modules. The odd-numbered modules $B_{2i-1}$ ($i = 1, \ldots, B/2$) model the $R$ sequences along the block-length dimension $K$, capturing the local feature information of the mixed voice feature sequence; the even-numbered modules $B_{2i}$ model the $K$ sequences along the block dimension $R$, capturing the global feature information of the mixed voice feature sequence. The calculation formulas are:

$$U_R = [\mathrm{BiLSTM}(v[:, :, i]),\ i = 1, \ldots, R] \tag{2}$$
$$U_K = [\mathrm{BiLSTM}(U_R[:, i, :]),\ i = 1, \ldots, K] \tag{3}$$

where $U_R$ and $U_K$ are the outputs of the odd-numbered and even-numbered BiLSTM modules, respectively. By iteratively alternating the odd and even modules, their outputs are effectively integrated for the subsequent separation.
The bidirectional recurrent neural network lets the multi-scale time delay sampling model (MTDS) attend not only to past context information but also to future context information, which is very beneficial for the voice separation task. The calculation formulas of the bidirectional recurrent neural network are:

$$h_u = \sigma(W_u[a^{<t-1>}; x^{<t>}] + b_u) \tag{4}$$
$$h_f = \sigma(W_f[a^{<t-1>}; x^{<t>}] + b_f) \tag{5}$$
$$h_o = \sigma(W_o[a^{<t-1>}; x^{<t>}] + b_o) \tag{6}$$
$$\tilde{c}^{<t>} = \tanh(W_c[a^{<t-1>}; x^{<t>}] + b_c) \tag{7}$$
$$c^{<t>} = h_u \odot \tilde{c}^{<t>} + h_f \odot c^{<t-1>} \tag{8}$$
$$a^{<t>} = h_o \odot \tanh(c^{<t>}) \tag{9}$$

where $h_u$, $h_f$, $h_o$ are the outputs of the LSTM update gate, forget gate and output gate, respectively; $W_u$, $W_f$, $W_o$, $W_c$ and $b_u$, $b_f$, $b_o$, $b_c$ are the weights and biases of the bidirectional recurrent neural network; $x^{<t>}$ and $a^{<t>}$ are the input and output of the bidirectional recurrent neural network at the current time step; $c^{<t>}$ is the corresponding memory cell; $\sigma$ is the sigmoid function and tanh is the hyperbolic tangent activation function.
Step four, multi-scale time delay sampling modeling is adopted to compensate for the information loss of the bidirectional recurrent neural network and further improve separation performance.

Although the bidirectional recurrent neural network is very effective at capturing the information of the mixed voice feature sequence, the even-numbered BiLSTM modules are applied to frames that are discontinuous in the original feature sequence: adjacent frames in each modeled sequence lie K/2 positions apart. Such a large interval means the sequential relation between these frames is very weak and hard for the bidirectional recurrent neural network to capture, and many frames of the whole mixed voice feature sequence are never captured at all, so information is seriously lost. To remedy this defect, the mixed voice feature sequence is re-modeled with multi-scale time delay sampling, capturing feature information at different scales and improving separation performance. The specific method is as follows:
(1) The output $u$ of the bidirectional recurrent neural network is sequence-reorganized and re-segmented. The segmentation strategy of step two is still adopted, except that the length of each segment obtained after segmentation is enlarged from $K$ to $2^\lambda K$.
(2) The re-segmented and spliced sequence $u'$ is delay-sampled along the block-length dimension $2^\lambda K$:

$$u_i = u'[:, i::2^\lambda, :], \quad i = 0, \ldots, 2^\lambda - 1 \tag{10}$$

where $\lambda = 0, \ldots, B-1$ is the sampling index and $2^\lambda$ is the time delay sampling rate; in particular, $\lambda = 0$ means the original segmentation strategy of the bidirectional recurrent neural network is kept without any time delay sampling. $B$ is the number of stacked multi-scale time delay sampling blocks, and $R'$ is defined as the number of segments after re-segmentation. $u'[:, i::2^\lambda, :]$ is a Python-style sequence slice of $u'$: slicing along the second dimension partitions each long block of length $2^\lambda K$ into $2^\lambda$ interleaved subsequences of length $K$, and the subsequence with index $i$ is taken from each block and spliced to obtain the new delay-sampled sequence $u_i \in \mathbb{R}^{N \times K \times R'}$.
(3) Each delay-sampled sequence $u_i$ is spliced along the block dimension $R'$:

$$h_\lambda = [u_i,\ i = 0, \ldots, 2^\lambda - 1] \tag{11}$$

where $h_\lambda \in \mathbb{R}^{N \times K \times 2^\lambda R'}$ is the 3D tensor reorganized by time delay sampling at rate $2^\lambda$. Although it has the same shape and size as the input 3D tensor $v$ of the bidirectional recurrent neural network, the internal sequence order of the two differs greatly.
(4) Along the block-length dimension of $h_\lambda$, a bidirectional recurrent neural network captures the interrelations between the delay-sampled sequences, and the output is then dimension-transformed through a fully connected layer:

$$U = [\mathrm{BiLSTM}(h_\lambda[:, :, m]),\ m = 1, \ldots, 2^\lambda R'] \tag{12}$$
$$FC = [W\,U[:, :, m] + b,\ m = 1, \ldots, 2^\lambda R'] \tag{13}$$

where $U \in \mathbb{R}^{H \times K \times 2^\lambda R'}$ is the output of the bidirectional recurrent neural network, $H$ is the output dimension of its hidden layer, $h_\lambda[:, :, m]$ is the sequence selected by index $m$, $W \in \mathbb{R}^{N \times H}$ and $b \in \mathbb{R}^{N}$ are the weight and bias of the fully connected layer, respectively, and $FC \in \mathbb{R}^{N \times K \times 2^\lambda R'}$ is the output after the dimension transformation of the fully connected layer.
Through this modeling, the whole multi-scale time delay sampling model enlarges its receptive field and additionally captures mixed voice feature sequence information at different scales: at a low time delay rate it obtains basic information such as phonemes and tones across the whole sequence, while at a high time delay rate it obtains information such as speech content and speaker identity.
(5) The result FC of step (4) is layer-normalized and residual-connected with the block input $h_\lambda$, which helps the multi-scale time delay sampling model converge and prevents gradient vanishing or explosion during training. The layer normalization and residual calculation formulas are:

$$\mu(FC) = \frac{1}{N K 2^\lambda R'} \sum_{i,j,k} FC_{i,j,k} \tag{14}$$
$$\sigma(FC) = \frac{1}{N K 2^\lambda R'} \sum_{i,j,k} \left(FC_{i,j,k} - \mu(FC)\right)^2 \tag{15}$$
$$\mathrm{Layernorm}(FC) = z \odot \frac{FC - \mu(FC)}{\sqrt{\sigma(FC) + \epsilon}} + r \tag{16}$$
$$\mathrm{Output} = h_\lambda + \mathrm{Layernorm}(FC) \tag{17}$$

where $\mu(FC)$ and $\sigma(FC)$ are the mean and variance of the fully connected layer output, $z$ and $r$ are normalization factors, $\epsilon$ is a minimal positive number preventing the denominator from being 0, Layernorm denotes layer normalization, and $i$, $j$, $k$ index the dimensions $N$, $K$ and $2^\lambda R'$, respectively.
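Steps (4) and (5) together can be sketched as follows; nn.GroupNorm with a single group plays the role of the global layer normalization of equations (14)-(16), and N = 64, H = 128 and the class name are illustrative.

```python
import torch
import torch.nn as nn

class MTDSRefine(nn.Module):
    """Second BiLSTM along each delay-sampled block, fully connected dimension
    transform, layer normalization, residual connection (eqs. (12)-(17))."""
    def __init__(self, N=64, H=128):
        super().__init__()
        self.bilstm = nn.LSTM(N, H, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * H, N)
        self.norm = nn.GroupNorm(1, N, eps=1e-8)   # global layer norm over (N, K, R)

    def forward(self, h):                          # h_lambda: (batch, N, K, R)
        b, N, K, R = h.shape
        x = h.permute(0, 3, 2, 1).reshape(b * R, K, N)   # one sequence per block m
        U, _ = self.bilstm(x)                      # eq. (12)
        FC = self.fc(U)                            # eq. (13)
        FC = FC.reshape(b, R, K, N).permute(0, 3, 2, 1)  # back to (batch, N, K, R)
        return h + self.norm(FC)                   # eqs. (14)-(17): norm + residual
```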
(6) Steps (1)-(5) are repeated, stacking B multi-scale time delay sampling modules in total, each with a different time delay sampling rate: 1, 2, 4, ..., $2^{B-1}$, i.e. the rate grows exponentially. Through this stacking, the multi-scale time delay sampling model captures phoneme-level sequence features at low time delay sampling rates and, as the rate grows, focuses further on semantic and speaker-level feature information, effectively integrating the feature information at different scales, enlarging the receptive field, and fully capturing the interrelations between sequences.
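Composing the sketches above, the stacking of step (6) amounts to a loop over exponentially growing rates; B = 4 is an illustrative choice, and `delay_sample()` and `MTDSRefine` are reused from the earlier sketches.

```python
B = 4                                   # number of stacked modules (illustrative)
blocks = [MTDSRefine() for _ in range(B)]
h = v                                   # output of the first dual-path stage, (batch, N, K, R)
for lam in range(B):                    # rates 1, 2, 4, ..., 2**(B-1)
    h = blocks[lam](delay_sample(h, lam))
```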
Step five, the Output of the multi-scale time delay sampling modeling is overlap-added (the overlap-added length equals the length of the mixed voice feature sequence extracted by the encoder) and fed into a two-dimensional convolutional neural network, which maps the overlap-added result into masks of the clean voices of the plurality of speakers; the masks are dot-multiplied with the output of the original encoder (the extracted mixed voice feature sequence) to obtain the clean voice representations of the plurality of speakers.
Step six, a one-dimensional deconvolution (transposed convolution) neural network is adopted as the decoder to recover the masked clean voice representations of the plurality of speakers into time-domain voice waveform signals, realizing voice separation.
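Steps five and six might look as follows; the ReLU mask nonlinearity and the sizes (C = 2 speakers, N = 64, W = 16) are assumptions, since the text only specifies a two-dimensional convolution for the masks and a one-dimensional deconvolution for the decoder.

```python
import torch
import torch.nn as nn

def overlap_add(v, hop):
    """Fold 50%-overlapped blocks (batch, N, K, R) back to a sequence (batch, N, T)."""
    b, N, K, R = v.shape
    T = (R - 1) * hop + K
    out = v.new_zeros(b, N, T)
    for r in range(R):
        out[:, :, r * hop : r * hop + K] += v[:, :, :, r]
    return out

C, N, W = 2, 64, 16                                   # speakers, feature dim, kernel (assumed)
mask_conv = nn.Conv2d(1, C, kernel_size=1)            # 2-D conv mapping features to C masks
decoder = nn.ConvTranspose1d(N, 1, W, stride=W // 2, bias=False)  # 1-D deconvolution decoder

def separate(output_3d, z):
    """output_3d: (batch, N, K, R) Output of the MTDS stage; z: (batch, N, T) encoder features."""
    seq = overlap_add(output_3d, hop=output_3d.shape[2] // 2)     # (batch, N, T)
    masks = torch.relu(mask_conv(seq.unsqueeze(1)))               # (batch, C, N, T), ReLU assumed
    feats = masks * z.unsqueeze(1)                                # dot-multiply with encoder output
    b, C_, N_, T = feats.shape
    return decoder(feats.reshape(b * C_, N_, T))                  # (batch*C, 1, L) waveforms
```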
The multi-scale time delay sampling model in this embodiment is trained with the scale-invariant signal-to-noise ratio (SI-SNR) as the loss function, calculated as follows:

$$s_{target} = \frac{\langle \hat{x}, x \rangle\, x}{\lVert x \rVert^2}, \qquad e_{noise} = \hat{x} - s_{target}$$
$$\mathrm{SI\text{-}SNR} = 10 \log_{10} \frac{\lVert s_{target} \rVert^2}{\lVert e_{noise} \rVert^2}$$

where $s_{target}$ and $e_{noise}$ are intermediate variables, and $\hat{x}$ and $x$ denote the separated voice and the clean voice, respectively.
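A standard SI-SNR implementation consistent with the variables named above (a sketch; the zero-mean step is the usual convention and is assumed here):

```python
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB; est/ref: (..., L) separated and clean waveforms."""
    est = est - est.mean(dim=-1, keepdim=True)       # zero-mean (usual convention)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # s_target: projection of the estimate onto the clean reference
    s_target = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)

# Training minimizes the negative SI-SNR:  loss = -si_snr(separated, clean).mean()
```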
Compared with existing voice separation algorithms, the single-channel voice separation method with multi-scale time delay sampling fully exploits the latent correlations in the mixed voice feature sequence, effectively improves voice separation performance, reduces the distortion of the separated voice, and improves its intelligibility, which gives it good reference value for both theoretical research and practical application.
Although embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives, and variations may be made in the above embodiments by those skilled in the art without departing from the spirit and principles of the invention.
Claims (7)
1. A single-channel voice separation method using multi-scale time delay sampling, characterized by comprising the following steps:
S1, extracting mixed voice features from the mixed voice signals of a plurality of speakers;
S2, segmenting the mixed voice features and splicing the segments into a 3D tensor;
S3, modeling the 3D tensor iteratively with a first bidirectional recurrent neural network, capturing local feature information and global feature information;
S4, adopting multi-scale time delay sampling modeling: performing time delay sampling of different scales on the output of the first bidirectional recurrent neural network, and capturing the feature information at the different scales with a second bidirectional recurrent neural network;
the output of the first bidirectional recurrent neural network is subjected to time delay sampling at a single scale as follows: the output of the first bidirectional recurrent neural network is sequence-reorganized and re-segmented, the re-segmented and spliced sequence is delay-sampled along the block-length dimension $2^\lambda K$, and each delay-sampled sequence $u_i$ is spliced along the block dimension $R'$ to obtain the reorganized 3D tensor $h_\lambda$, where $2^\lambda$ is the time delay sampling rate, $\lambda = 0, \ldots, B-1$ is the sampling index, $B$ is the number of stacked multi-scale time delay sampling blocks, and $K$ is the block-length dimension before re-segmentation;
S5, overlap-adding the feature information from S4, mapping the overlap-added result into masks of the clean voices of the plurality of speakers, and dot-multiplying the masks with the mixed voice features to obtain the clean voice representations of the plurality of speakers;
S6, reconstructing each speaker's clean voice representation into a separated voice signal.
2. The single-channel voice separation method of multi-scale time delay sampling according to claim 1, wherein the second bidirectional recurrent neural network captures the feature information at a single scale as follows:
along the block-length dimension of $h_\lambda$, the second bidirectional recurrent neural network captures the interrelations between the delay-sampled sequences, and the output is then dimension-transformed through a fully connected layer to give FC; FC is layer-normalized and then residual-connected with the output of the bidirectional recurrent neural network.
3. The single-channel voice separation method of multi-scale time delay sampling according to claim 2, wherein, with different time delay sampling rates, the single-scale time delay sampling of the output of the first bidirectional recurrent neural network and the capture of the single-scale feature information by the second bidirectional recurrent neural network are repeated, so that the feature information at different scales is integrated.
4. The method of claim 1, wherein the re-segmented and spliced sequence is delay-sampled along the block-length dimension $2^\lambda K$ using the formula:

$$u_i = u'[:, i::2^\lambda, :], \quad i = 0, \ldots, 2^\lambda - 1$$

where $u'[:, i::2^\lambda, :]$ denotes a Python-style sequence slice of $u'$ and $i$ is the index number.
5. The method of claim 2, wherein the second bidirectional recurrent neural network captures the interrelations between the delay-sampled sequences along $h_\lambda$, and the output is dimension-transformed through a fully connected layer to give FC, according to:

$$U = [\mathrm{BiLSTM}(h_\lambda[:, :, m]),\ m = 1, \ldots, 2^\lambda R']$$
$$FC = [W\,U[:, :, m] + b,\ m = 1, \ldots, 2^\lambda R']$$

where $U$ is the output of the bidirectional recurrent neural network, $H$ is the output dimension of its hidden layer, $h_\lambda[:, :, m]$ is the sequence selected by index $m$, $2^\lambda R'$ is the block dimension, and $W$ and $b$ are the weight and bias of the fully connected layer, respectively.
6. The method of claim 2, wherein FC is layer-normalized according to:

$$\mathrm{Layernorm}(FC) = z \odot \frac{FC - \mu(FC)}{\sqrt{\sigma(FC) + \epsilon}} + r$$

where $\mu(FC)$ and $\sigma(FC)$ are the mean and variance of the fully connected layer output, $z$ and $r$ are normalization factors, $\epsilon$ is a very small positive number, Layernorm denotes layer normalization, $i$, $j$ and $k$ index the dimensions $N$, $K$ and $2^\lambda R'$ respectively, and $N$ is the dimension of the extracted mixed voice features.
7. The single-channel voice separation method of multi-scale time delay sampling according to claim 6, wherein the residual connection is given by:

$$\mathrm{Output} = h_\lambda + \mathrm{Layernorm}(FC)$$
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202111006251.7A | 2021-08-30 | 2021-08-30 | Single-channel voice separation method for multi-scale time delay sampling

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202111006251.7A | 2021-08-30 | 2021-08-30 | Single-channel voice separation method for multi-scale time delay sampling
Publications (2)

Publication Number | Publication Date
---|---
CN113782045A | 2021-12-10
CN113782045B | 2024-01-05
Family
ID=78840162

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202111006251.7A (CN113782045B, active) | Single-channel voice separation method for multi-scale time delay sampling | 2021-08-30 | 2021-08-30

Country Status (1)

Country | Link
---|---
CN (1) | CN113782045B (en)
Citations (6)

Publication number | Priority date | Publication date | Title
---|---|---|---
CN110459240A | 2019-08-12 | 2019-11-15 | Multi-speaker speech separation method based on convolutional neural networks and deep clustering *
CN111243579A | 2020-01-19 | 2020-06-05 | Time-domain single-channel multi-speaker voice recognition method and system *
CN111429938A | 2020-03-06 | 2020-07-17 | Single-channel voice separation method and device and electronic equipment *
CN112071325A | 2020-09-04 | 2020-12-11 | Many-to-many voice conversion method based on dual voiceprint feature vectors and sequence-to-sequence modeling *
CN112289304A | 2019-07-24 | 2021-01-29 | Multi-speaker voice synthesis method based on a variational autoencoder *
CN113053407A | 2021-02-06 | 2021-06-29 | Single-channel voice separation method and system for multiple speakers *
Non-Patent Citations (1)

Title
---
Gao Lijian; Mao Qirong. Environment-assisted multi-task hybrid sound event detection method (环境辅助的多任务混合声音事件检测方法). Computer Science (计算机科学), No. 01, full text. *
Also Published As

Publication number | Publication date
---|---
CN113782045A | 2021-12-10
Similar Documents

Publication | Publication Date | Title
---|---|---
CN111429938B | | Single-channel voice separation method and device and electronic equipment
WO2020042707A1 | | Convolutional recurrent neural network-based single-channel real-time noise reduction method
CN110246510B | | End-to-end voice enhancement method based on RefineNet
CN110751957B | | Speech enhancement method using stacked multi-scale modules
CN113470671B | | Audio-visual voice enhancement method and system fully utilizing vision and voice connection
CN109890043B | | Wireless signal noise reduction method based on a generative adversarial network
CN108847244A | | Voiceprint recognition method and system based on MFCC and an improved BP neural network
CN111899757B | | Single-channel voice separation method and system for target speaker extraction
CN112581979A | | Speech emotion recognition method based on spectrograms
CN113488060B | | Voiceprint recognition method and system based on a variational information bottleneck
Zhang et al. | | BirdSoundsDenoising: deep visual audio denoising for bird sounds
CN115602152B | | Voice enhancement method based on a multi-stage attention network
CN110176250B | | Robust acoustic scene recognition method based on local learning
CN113643723A | | Voice emotion recognition method based on attention CNN Bi-GRU fusing visual information
CN111785262B | | Speaker age and gender classification method based on residual networks and fused features
CN113191178A | | Underwater acoustic target identification method based on auditory perception feature deep learning
CN112183582A | | Multi-feature fusion underwater target identification method
CN113763965A | | Speaker identification method fusing multiple attention features
EP4211686A1 | | Machine learning for microphone style transfer
CN116467416A | | Multi-modal dialogue emotion recognition method and system based on graph neural networks
Jin et al. | | Speech separation and emotion recognition for multi-speaker scenarios
CN113889099A | | Voice recognition method and system
CN116403594B | | Speech enhancement method and device based on a noise update factor
CN117711442A | | Infant crying classification method based on a CNN-GRU fusion model
CN111916060B | | Deep learning voice endpoint detection method and system based on spectral subtraction
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |