CN113782045B - Single-channel voice separation method for multi-scale time delay sampling

Single-channel voice separation method for multi-scale time delay sampling

Info

Publication number
CN113782045B
Authority
CN
China
Prior art keywords
time delay
delay sampling
neural network
output
voice
Prior art date
Legal status
Active
Application number
CN202111006251.7A
Other languages
Chinese (zh)
Other versions
CN113782045A (en)
Inventor
毛启容
钱双庆
陈静静
贾洪杰
Current Assignee
Jiangsu University
Original Assignee
Jiangsu University
Priority date
Filing date
Publication date
Application filed by Jiangsu University
Priority to CN202111006251.7A
Publication of CN113782045A
Application granted
Publication of CN113782045B


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering, the noise being separate speech, e.g. cocktail party
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Stereophonic System (AREA)

Abstract

The invention provides a single-channel voice separation method based on multi-scale time delay sampling. Mixed voice features are extracted from the mixed speech of a plurality of speakers, segmented, and spliced into a 3D tensor. A first bidirectional recurrent neural network models the 3D tensor iteratively, capturing local feature information and global feature information. Multi-scale time delay sampling modeling then subjects the output of the first bidirectional recurrent neural network to time delay sampling at different scales, and a second bidirectional recurrent neural network captures the feature information at each scale. The feature information is overlap-added and mapped into masks of the pure voices of the plurality of speakers; the masks are dot-multiplied with the mixed voice features to obtain the pure voice features of the plurality of speakers, and each speaker's pure voice features are reconstructed into a separated voice signal. Through multi-scale time delay sampling modeling, the invention compensates for the information loss of segmented modeling and greatly improves voice separation performance.

Description

Single-channel voice separation method for multi-scale time delay sampling
Technical Field
The invention belongs to the field of voice separation, and particularly relates to a single-channel voice separation method for multi-scale time delay sampling.
Background
In recent years, the rise of deep learning and the wave of artificial intelligence it has triggered have been changing every aspect of people's lives. Voice interaction is an indispensable part of this change: a clear voice signal enables a smart device to execute the corresponding command more reliably, greatly improving its degree of intelligence. In a real acoustic scene, however, the voice of the speaker of interest is often interfered with by other speakers, which is the classic cocktail party problem. The human auditory system can easily pick out the voice content of a target speaker; voice separation therefore imitates the human auditory system to separate the pure voice of one or all speakers from mixed speech, removing interference such as background noise and reverberation and improving the clarity and intelligibility of the voice.
Most current mainstream voice separation methods adopt end-to-end time-domain separation, in which the model learns latent common representations in the mixed voice waveform in a data-driven manner to achieve separation. Such methods place extremely high demands on frame division during speech preprocessing: research shows that the more frames a segment of mixed speech is divided into within the same duration, the better the final separation effect. However, too many frames make direct modeling infeasible. To solve this problem, Yi Luo et al. proposed a dual-path segmented modeling method in "Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation", which effectively realizes sample-level modeling of long speech sequences and greatly improves voice separation performance. However, when capturing global information, segmented modeling causes frames that are adjacent within a processed sequence to be far apart in the original sequence, so that little correlation exists between them, and many frames are never captured at all, resulting in serious information loss.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a single-channel voice separation method with multi-scale time delay sampling: by modeling with multi-scale time delay sampling, the information loss of segmented modeling is compensated and the separation performance is further improved.
A single-channel voice separation method for multi-scale time delay sampling comprises the following steps:
S1, extracting mixed voice features from the mixed voice signals of a plurality of speakers;
S2, segmenting the mixed voice features and splicing the segmented mixed voice features into a 3D tensor;
S3, modeling the 3D tensor iteratively with a first bidirectional recurrent neural network, capturing local feature information and global feature information;
S4, adopting multi-scale time delay sampling modeling, performing time delay sampling at different scales on the output of the first bidirectional recurrent neural network, and capturing the feature information at the different scales with a second bidirectional recurrent neural network;
S5, overlap-adding the feature information of S4, mapping the overlap-added result into masks of the pure voices of the plurality of speakers, and dot-multiplying the masks with the mixed voice features to obtain the pure voice representations of the plurality of speakers;
S6, reconstructing each speaker's pure voice representation into a separated voice signal.
Further, the output of the first bidirectional recurrent neural network is subjected to single-scale time delay sampling, and the specific process is as follows:
The output of the bidirectional recurrent neural network is subjected to sequence recombination and re-segmentation; the re-segmented and spliced sequence is subjected to time delay sampling along the block length dimension 2^λ·K, and each time-delay-sampled sequence u_i is spliced along the block dimension R′ to obtain the recombined 3D tensor h_λ after time delay sampling, where 2^λ is the time delay sampling rate, λ = 0, ..., B−1 is the sampling index, B denotes the number of stacked multi-scale time delay sampling blocks, and K denotes the block length dimension before re-segmentation.
Further, the second bidirectional recurrent neural network is adopted to capture the feature information at a single scale, and the specific process is as follows:
Along h_λ, the second bidirectional recurrent neural network captures the interrelation between the time-delay-sampled sequences, and the output result is then dimension-transformed through a fully connected layer to produce the output FC; FC is layer-normalized and then residual-connected with the output of the bidirectional recurrent neural network.
Further, with different time delay sampling rates, the single-scale time delay sampling of the output of the first bidirectional recurrent neural network and the capture of the single-scale feature information by the second bidirectional recurrent neural network are repeated, so that the feature information at different scales is integrated.
Further, the re-segmented and spliced sequence is subjected to time delay sampling along the block length dimension 2^λ·K using the following formula:

u_i = [u′[:, i::2^λ, :], i = 0, ..., 2^λ − 1]

where u′[:, i::2^λ, :] denotes applying Python sequence slicing to u′, and i denotes the index number.
Further, along h_λ, the second bidirectional recurrent neural network captures the interrelation between the time-delay-sampled sequences, and the output result is dimension-transformed through a fully connected layer to produce the output FC according to the following formulas:

U = [BiLSTM(h_λ[:, :, m]), m = 1, ..., 2^λ·R′]

FC = [W·U[:, :, m] + b, m = 1, ..., 2^λ·R′]

where U is the output of the bidirectional recurrent neural network, H is the output dimension of the hidden layer, h_λ[:, :, m] is the sequence determined by index m, 2^λ·R′ denotes the block dimension, and W and b are the weight and bias of the fully connected layer, respectively.
Further, FC is layer-normalized according to the following formulas:

Layernorm(FC) = z ⊙ (FC − μ(FC)) / √(σ(FC) + ε) + r

μ(FC) = (1 / (N·K·2^λ·R′)) Σ_{i,j,k} FC[i, j, k]

σ(FC) = (1 / (N·K·2^λ·R′)) Σ_{i,j,k} (FC[i, j, k] − μ(FC))²

where μ(FC) and σ(FC) respectively denote the mean and variance of the output of the fully connected layer, z and r are normalization factors, ε is a very small positive number, Layernorm denotes layer normalization, i, j and k respectively index the N, K and 2^λ·R′ dimensions, and N is the dimension of the extracted mixed voice features.
Further, the specific formula for the residual connection is:
Output = h_λ + Layernorm(FC).
The beneficial effects of the invention are as follows: by adopting multi-scale time delay sampling modeling, performing time delay sampling at different scales on the output of the first bidirectional recurrent neural network, and capturing the feature information at the different scales with the second bidirectional recurrent neural network, the invention effectively integrates the interrelations between sequences at different scales, compensates for the information loss of the bidirectional recurrent neural network in segmented modeling, greatly improves voice separation performance, effectively reduces voice distortion, and improves voice intelligibility.
Drawings
Fig. 1 is a flowchart of a single-channel voice separation method based on multi-scale time delay sampling according to the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention briefly described above will be rendered by reference to the appended drawings.
As shown in fig. 1, the single-channel voice separation method based on multi-scale time delay sampling of the invention specifically comprises the following steps:
step one, an encoder is adopted to encode a plurality of input speaker mixed voice signals, and corresponding mixed voice characteristics are extracted
For a given mixed speech signal x ∈ ℝ^{1×L}, a one-dimensional convolutional neural network is adopted as the encoder to extract high-order mixed voice features z ∈ ℝ^{N×T}, where ℝ denotes the set of real numbers, L is the length of the mixed voice signal, N is the dimension of the extracted mixed voice features, and T is the number of frames of the mixed voice features. The convolution kernel size of the one-dimensional convolutional neural network is W, the stride of the convolution window is W/2, and a ReLU function is appended after the convolution for nonlinear transformation. The calculation formula is as follows:

z = ReLU(Conv1D(x))    (1)
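For illustration only (the patent contains no code), the encoder step can be sketched in PyTorch as follows; the kernel size W = 16 and feature dimension N = 64 are assumed example values, since the patent only fixes the stride at W/2:

    import torch
    import torch.nn as nn

    W_KERNEL, N_FEAT = 16, 64              # assumed example sizes

    class Encoder(nn.Module):
        """One-dimensional convolutional encoder: z = ReLU(Conv1D(x))."""
        def __init__(self, kernel_size=W_KERNEL, n_features=N_FEAT):
            super().__init__()
            self.conv = nn.Conv1d(1, n_features, kernel_size,
                                  stride=kernel_size // 2, bias=False)

        def forward(self, x):                  # x: (batch, 1, L)
            return torch.relu(self.conv(x))    # z: (batch, N, T)

    x = torch.randn(1, 1, 16000)               # e.g. one second at 16 kHz
    z = Encoder()(x)                           # z.shape == (1, 64, 1999)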
step two, segmenting the mixed voice characteristics output by the encoder, and then splicing the segmented mixed voice characteristics into a 3D tensor
The mixed voice feature z output by the encoder is divided with P as the hop unit into R blocks of length K, neighbouring blocks overlapping by 50% (i.e. P = K/2); all blocks are then spliced into a 3D tensor v ∈ ℝ^{N×K×R}.
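A minimal sketch of this segmentation, assuming an unfold-based implementation (the patent does not prescribe one):

    import torch

    def segment(z, K):
        """Split z (N, T) into R blocks of length K with 50% overlap and
        stack them into a 3D tensor v of shape (N, K, R)."""
        N, T = z.shape
        P = K // 2                                  # hop size P = K/2
        pad = (P - (T - K) % P) % P                 # zero-pad so blocks fit exactly
        z = torch.nn.functional.pad(z, (0, pad))
        v = z.unfold(dimension=1, size=K, step=P)   # (N, R, K)
        return v.permute(0, 2, 1)                   # (N, K, R)

    v = segment(torch.randn(64, 1999), K=100)       # v.shape == (64, 100, R)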
Step three, modeling the 3D tensor v iteratively by adopting a bidirectional cyclic neural network, and correspondingly capturing local characteristic information and global characteristic information
B bidirectional recurrent neural network (BiLSTM) modules are stacked to model the input 3D tensor, odd-numbered and even-numbered BiLSTM modules being stacked alternately. The odd-numbered BiLSTM modules B_{2i−1} (i = 1, ..., B/2) perform sequence modeling along the block length dimension K within each of the R blocks, capturing the local feature information of the mixed voice feature sequence; the even-numbered BiLSTM modules B_{2i} model along the block dimension R at each of the K positions, capturing the global feature information of the mixed voice feature sequence. The calculation formulas are as follows:

U_R = [BiLSTM(v[:, :, i]), i = 1, ..., R]    (2)

U_K = [BiLSTM(U_R[:, i, :]), i = 1, ..., K]    (3)

where U_R and U_K are the outputs of the odd- and even-numbered BiLSTM modules, respectively. Through the iterative, alternating execution of the odd and even BiLSTM modules, their outputs are effectively integrated for subsequent separation.
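An illustrative PyTorch reading of one pass of equations (2)-(3), i.e. one odd plus one even BiLSTM module; the hidden size and the linear maps that project the BiLSTM outputs back to N dimensions are assumptions, since the patent introduces its fully connected mapping explicitly only in step four:

    import torch
    import torch.nn as nn

    class DualPathBlock(nn.Module):
        """One odd (intra-block, along K) plus one even (inter-block, along R)
        BiLSTM pass over a tensor v of shape (batch, N, K, R)."""
        def __init__(self, n_features=64, hidden=128):
            super().__init__()
            self.intra = nn.LSTM(n_features, hidden, bidirectional=True, batch_first=True)
            self.inter = nn.LSTM(n_features, hidden, bidirectional=True, batch_first=True)
            self.intra_fc = nn.Linear(2 * hidden, n_features)   # assumed projection
            self.inter_fc = nn.Linear(2 * hidden, n_features)   # assumed projection

        def forward(self, v):                          # v: (B, N, K, R)
            B, N, K, R = v.shape
            # Odd module: one length-K sequence per block (eq. 2).
            u = v.permute(0, 3, 2, 1).reshape(B * R, K, N)
            u = self.intra_fc(self.intra(u)[0])
            v = u.reshape(B, R, K, N).permute(0, 3, 2, 1)
            # Even module: one length-R sequence per position (eq. 3).
            u = v.permute(0, 2, 3, 1).reshape(B * K, R, N)
            u = self.inter_fc(self.inter(u)[0])
            return u.reshape(B, K, R, N).permute(0, 3, 1, 2)    # (B, N, K, R)

    out = DualPathBlock()(torch.randn(1, 64, 100, 40))          # (1, 64, 100, 40)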
Adopting a bidirectional recurrent neural network enables the multi-scale time delay sampling model (MTDS) to attend not only to past context information but also to future context information, which is very beneficial for the voice separation task. The calculation formulas of the bidirectional recurrent neural network are as follows:
h_u = tanh(W_u[a^{<t−1>}; x^{<t>}] + b_u)    (4)

h_f = tanh(W_f[a^{<t−1>}; x^{<t>}] + b_f)    (5)

h_o = tanh(W_o[a^{<t−1>}; x^{<t>}] + b_o)    (6)

c̃^{<t>} = tanh(W_c[a^{<t−1>}; x^{<t>}] + b_c)    (7)

c^{<t>} = h_u ⊙ c̃^{<t>} + h_f ⊙ c^{<t−1>}    (8)

a^{<t>} = h_o ⊙ tanh(c^{<t>})    (9)

where h_u, h_f and h_o are the outputs of the LSTM update gate, forget gate and output gate, respectively; W_u, W_f, W_o, W_c and b_u, b_f, b_o, b_c are the weights and biases in the bidirectional recurrent neural network; x^{<t>} and a^{<t>} denote the input and output of the bidirectional recurrent neural network at the current moment, c^{<t>} is the corresponding memory cell, and tanh is the activation function.
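A toy NumPy rendering of a single step of equations (4)-(9), with random parameters; the gates here use tanh as written in the patent (library LSTMs normally use sigmoid gates), and equations (7)-(9) follow the standard LSTM form reconstructed above:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid = 4, 3
    # One weight matrix per gate, acting on the concatenation [a_prev; x_t].
    Wu, Wf, Wo, Wc = (rng.standard_normal((n_hid, n_hid + n_in)) for _ in range(4))
    bu, bf, bo, bc = (np.zeros(n_hid) for _ in range(4))

    def lstm_step(x_t, a_prev, c_prev):
        s = np.concatenate([a_prev, x_t])
        h_u, h_f, h_o = (np.tanh(W @ s + b)            # eqs. (4)-(6)
                         for W, b in ((Wu, bu), (Wf, bf), (Wo, bo)))
        c_tilde = np.tanh(Wc @ s + bc)                 # eq. (7)
        c_t = h_u * c_tilde + h_f * c_prev             # eq. (8)
        return h_o * np.tanh(c_t), c_t                 # eq. (9)

    a, c = np.zeros(n_hid), np.zeros(n_hid)
    for x in rng.standard_normal((5, n_in)):           # a 5-frame toy sequence
        a, c = lstm_step(x, a, c)
    print(a)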
Step four, multi-scale time delay sampling modeling is adopted to compensate for the information loss in the bidirectional recurrent neural network so as to further improve separation performance
Although the bidirectional recurrent neural network is very effective at capturing the information of the mixed voice feature sequence, when an even-numbered BiLSTM module performs its modeling, the bidirectional recurrent neural network is applied to sequences of discontinuous frames whose adjacent elements are K/2 frames apart in the original sequence. Such a large interval leaves very little sequential relation between adjacent frames, making it difficult for the bidirectional recurrent neural network to capture effective information, and many frames of the whole mixed voice feature sequence are never captured at all, so information is seriously lost. To remedy this defect, the mixed voice feature sequence is remodeled with multi-scale time delay sampling, so that feature information at different scales is captured and the separation performance is improved. The specific method is as follows:
(1) The output u ∈ ℝ^{N×K×R} of the bidirectional recurrent neural network is subjected to sequence recombination and re-segmentation, still following the segmentation strategy used for the bidirectional recurrent neural network, except that the length of each segment obtained after segmentation is enlarged from K to 2^λ·K.
(2) The re-segmented and spliced sequence u′ ∈ ℝ^{N×2^λ·K×R′} is subjected to time delay sampling along the block length dimension 2^λ·K. The calculation formula is as follows:

u_i = [u′[:, i::2^λ, :], i = 0, ..., 2^λ − 1]    (10)

where λ = 0, ..., B−1 is the sampling index and 2^λ is the time delay sampling rate; in particular, λ = 0 means that the original segmentation strategy of the bidirectional recurrent neural network is kept and no time delay sampling is added. B denotes the number of stacked multi-scale time delay sampling blocks, and R′ = R/2^λ is defined as the number of segments after re-segmentation. u′[:, i::2^λ, :] applies Python sequence slicing to u′: u′ is sliced along its second dimension with step 2^λ, each slice has length K, and there are 2^λ slices in total; taking the sequence with index i from each slice and splicing yields the new time-delay-sampled sequence u_i ∈ ℝ^{N×K×R′}.
(3) Each time-delay-sampled sequence u_i is spliced along the block dimension R′:

h_λ = [u_i, i = 0, ..., 2^λ − 1]    (11)

where h_λ ∈ ℝ^{N×K×2^λ·R′} is the 3D tensor recombined after time delay sampling at rate 2^λ; although it does not differ from the input 3D tensor v of the bidirectional recurrent neural network in shape and size, the internal sequence order of the two differs greatly.
(4) Along h_λ, a bidirectional recurrent neural network is used to capture the interrelation between the time-delay-sampled sequences, and the output result is then dimension-transformed through a fully connected layer. The specific calculation formulas are as follows:

U = [BiLSTM(h_λ[:, :, m]), m = 1, ..., 2^λ·R′]    (12)

FC = [W·U[:, :, m] + b, m = 1, ..., 2^λ·R′]    (13)

where U ∈ ℝ^{H×K×2^λ·R′} is the output of the bidirectional recurrent neural network and H is the output dimension of its hidden layer; h_λ[:, :, m] is the sequence determined by index m; W ∈ ℝ^{N×H} and b ∈ ℝ^{N} are the weight and bias of the fully connected layer, respectively, and FC ∈ ℝ^{N×K×2^λ·R′} is the output after the dimension transformation of the fully connected layer.
Through this modeling approach, the whole multi-scale time delay sampling model can expand its receptive field and additionally capture mixed voice feature sequence information at different scales: for example, at a low time delay rate it obtains basic information such as phonemes and tones in the whole sequence, while at a high time delay rate it obtains information such as speech content and speaker identity in the sequence.
(5) The result FC of step (4) is layer-normalized and then residual-connected with the recombined input h_λ, which facilitates the convergence of the multi-scale time delay sampling model and prevents gradient vanishing or gradient explosion during training. The calculation formulas of the layer normalization and the residual connection are as follows:

Layernorm(FC) = z ⊙ (FC − μ(FC)) / √(σ(FC) + ε) + r    (14)

μ(FC) = (1 / (N·K·2^λ·R′)) Σ_{i,j,k} FC[i, j, k]    (15)

σ(FC) = (1 / (N·K·2^λ·R′)) Σ_{i,j,k} (FC[i, j, k] − μ(FC))²    (16)

Output = h_λ + Layernorm(FC)    (17)

where μ(FC) and σ(FC) respectively denote the mean and variance of the output of the fully connected layer, z and r are normalization factors, ε is a very small positive number preventing the denominator from being 0, Layernorm denotes layer normalization, and i, j and k respectively index the N, K and 2^λ·R′ dimensions.
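Steps (4)-(5) for one scale, sketched in PyTorch; the hidden size and the scalar form of the normalization factors z and r are assumptions, and the global layer normalization follows the reconstructed equations (14)-(16):

    import torch
    import torch.nn as nn

    class ScaleBlock(nn.Module):
        """BiLSTM over the time-delay-sampled tensor h_lambda, fully connected
        projection (eqs. 12-13), global layer norm and residual (eqs. 14-17)."""
        def __init__(self, n_features=64, hidden=128, eps=1e-8):
            super().__init__()
            self.rnn = nn.LSTM(n_features, hidden, bidirectional=True, batch_first=True)
            self.fc = nn.Linear(2 * hidden, n_features)
            self.z = nn.Parameter(torch.ones(1))    # normalization factors z, r
            self.r = nn.Parameter(torch.zeros(1))
            self.eps = eps

        def forward(self, h):                       # h: (B, N, K, S), S = 2**lam * R'
            B, N, K, S = h.shape
            u = h.permute(0, 3, 2, 1).reshape(B * S, K, N)    # one sequence per index m
            fc = self.fc(self.rnn(u)[0]).reshape(B, S, K, N).permute(0, 3, 2, 1)
            mu = fc.mean(dim=(1, 2, 3), keepdim=True)                     # eq. (15)
            var = fc.var(dim=(1, 2, 3), unbiased=False, keepdim=True)     # eq. (16)
            ln = self.z * (fc - mu) / torch.sqrt(var + self.eps) + self.r  # eq. (14)
            return h + ln                                                  # eq. (17)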
(6) Steps (1)-(5) are repeated, stacking B multi-scale time delay sampling modules in total, each module employing a different time delay sampling rate: 1, 2, 4, ..., 2^{B−1}, i.e. the time delay sampling rate increases exponentially. Through this stacking of modules, the multi-scale time delay sampling model captures phoneme-level sequence features at low time delay sampling rates, and as the time delay sampling rate expands it focuses further on feature information across semantics or speakers; the feature information at different scales is thereby effectively integrated, the receptive field is expanded, and the interrelation between sequences is fully captured.
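The stacking in step (6) might be wired as below, reusing the ScaleBlock sketch above; whether the reordering is inverted between scales is not spelled out in the patent, so the delay_restore step is an assumption:

    import torch

    def delay_resample(h, lam):
        """Torch analogue of eqs. (10)-(11): regroup blocks into length
        2**lam * K, then take the 2**lam strided slices and re-stack them."""
        B, N, K, R = h.shape
        step = 2 ** lam
        hp = (h.permute(0, 1, 3, 2).reshape(B, N, R // step, step * K)
               .permute(0, 1, 3, 2))                       # (B, N, step*K, R')
        return torch.cat([hp[:, :, i::step, :] for i in range(step)], dim=3)

    def delay_restore(h, lam):
        """Assumed inverse reordering applied between scales."""
        B, N, K, S = h.shape
        step = 2 ** lam
        Rp = S // step
        hp = h.new_empty(B, N, step * K, Rp)
        for i in range(step):
            hp[:, :, i::step, :] = h[:, :, :, i * Rp:(i + 1) * Rp]
        return hp.permute(0, 1, 3, 2).reshape(B, N, S, K).permute(0, 1, 3, 2)

    blocks = [ScaleBlock() for _ in range(4)]   # B = 4 modules, rates 1, 2, 4, 8 (example)

    def mtds(h):                                # h: (batch, N, K, R), R divisible by 8
        for lam, block in enumerate(blocks):
            h = delay_restore(block(delay_resample(h, lam)), lam)
        return h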
Step five, the Output after multi-scale time delay sampling modeling is overlap-added (the overlap-added result has the same length as the mixed voice feature sequence extracted by the encoder) and then fed into a two-dimensional convolutional neural network, which maps the overlap-added result into the masks of the pure voices of the plurality of speakers; the masks are dot-multiplied with the output of the original encoder (the extracted mixed voice feature sequence) to obtain the pure voice representations of the plurality of speakers.
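An illustrative overlap-add and masking step; the 1×1 two-dimensional convolution producing one mask per speaker is an assumed minimal form of the mask head, and C = 2 speakers plus the ReLU mask nonlinearity are example choices:

    import torch
    import torch.nn as nn

    def overlap_add(v, T):
        """Inverse of the segmentation: v (N, K, R) back to an (N, T) sequence."""
        N, K, R = v.shape
        P = K // 2
        out = torch.zeros(N, (R - 1) * P + K)
        for r in range(R):                      # sum the 50%-overlapping blocks
            out[:, r * P:r * P + K] += v[:, :, r]
        return out[:, :T]

    C = 2
    mask_conv = nn.Conv2d(1, C, kernel_size=1)  # assumed minimal mask head

    def separate(output_3d, z):                 # output_3d: (N, K, R); z: (N, T)
        seq = overlap_add(output_3d, z.shape[1])          # same length as z
        masks = torch.relu(mask_conv(seq[None, None]))    # (1, C, N, T)
        return [m * z for m in masks[0]]        # per-speaker masked features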
Step six, a one-dimensional transposed convolution (deconvolution) neural network is adopted as the decoder to recover the masked pure voice representations of the plurality of speakers into time-domain voice waveform signals, realizing voice separation.
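A matching decoder sketch, using a one-dimensional transposed convolution that mirrors the assumed encoder sizes:

    import torch
    import torch.nn as nn

    decoder = nn.ConvTranspose1d(64, 1, kernel_size=16, stride=8, bias=False)

    feat = torch.randn(1, 64, 1999)             # one speaker's masked features
    wav = decoder(feat)                         # (1, 1, 16000): separated waveform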
The multi-scale time delay sampling model in this embodiment is trained with the scale-invariant signal-to-noise ratio (SI-SNR) as the loss function. Its calculation formulas are as follows:

s_target = (⟨x̂, x⟩ / ‖x‖²) · x

e_noise = x̂ − s_target

SI-SNR = 10·log₁₀(‖s_target‖² / ‖e_noise‖²)

where s_target and e_noise are intermediate variables, and x̂ and x denote the separated speech and the clean speech, respectively.
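For reference, the SI-SNR computation in NumPy; the zero-mean preprocessing is an assumption commonly paired with SI-SNR rather than something stated in the patent, and training would minimize the negative of this value:

    import numpy as np

    def si_snr(est, ref, eps=1e-8):
        """Scale-invariant SNR between separated speech est and clean speech ref."""
        est, ref = est - est.mean(), ref - ref.mean()   # assumed zero-mean step
        s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
        e_noise = est - s_target
        return 10 * np.log10(np.dot(s_target, s_target)
                             / (np.dot(e_noise, e_noise) + eps))

    ref = np.sin(np.linspace(0, 100, 16000))
    est = ref + 0.1 * np.random.randn(16000)
    print(si_snr(est, ref))                     # about 17 dB for this noise level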
Compared with existing voice separation algorithms, the single-channel voice separation method with multi-scale time delay sampling fully exploits the latent correlations in the mixed voice feature sequence, effectively improves voice separation performance, reduces the distortion of the separated voice and improves its intelligibility, and has good reference value for both theoretical research and practical application.
Although embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those skilled in the art without departing from the spirit and principles of the invention.

Claims (7)

1. A single-channel voice separation method for multi-scale time delay sampling, characterized by comprising the following steps:
S1, extracting mixed voice features from the mixed voice signals of a plurality of speakers;
S2, segmenting the mixed voice features and splicing the segmented mixed voice features into a 3D tensor;
S3, modeling the 3D tensor iteratively with a first bidirectional recurrent neural network, capturing local feature information and global feature information;
S4, adopting multi-scale time delay sampling modeling, performing time delay sampling at different scales on the output of the first bidirectional recurrent neural network, and capturing the feature information at the different scales with a second bidirectional recurrent neural network;
wherein the output of the first bidirectional recurrent neural network is subjected to single-scale time delay sampling as follows: the output of the bidirectional recurrent neural network is subjected to sequence recombination and re-segmentation; the re-segmented and spliced sequence is subjected to time delay sampling along the block length dimension 2^λ·K, and each time-delay-sampled sequence u_i is spliced along the block dimension R′ to obtain the recombined 3D tensor h_λ after time delay sampling, where 2^λ is the time delay sampling rate, λ = 0, ..., B−1 is the sampling index, B denotes the number of stacked multi-scale time delay sampling blocks, and K denotes the block length dimension before re-segmentation;
S5, overlap-adding the feature information of S4, mapping the overlap-added result into masks of the pure voices of the plurality of speakers, and dot-multiplying the masks with the mixed voice features to obtain the pure voice representations of the plurality of speakers;
S6, reconstructing each speaker's pure voice representation into a separated voice signal.
2. The single-channel voice separation method of multi-scale time delay sampling according to claim 1, wherein the second bidirectional recurrent neural network is adopted to capture the feature information at a single scale, and the specific process is as follows:
along h_λ, the second bidirectional recurrent neural network captures the interrelation between the time-delay-sampled sequences, and the output result is then dimension-transformed through a fully connected layer to produce the output FC; FC is layer-normalized and then residual-connected with the output of the bidirectional recurrent neural network.
3. The single-channel voice separation method of multi-scale time delay sampling according to claim 2, wherein, with different time delay sampling rates, the single-scale time delay sampling of the output of the first bidirectional recurrent neural network and the capture of the single-scale feature information by the second bidirectional recurrent neural network are repeated, so that the feature information at different scales is integrated.
4. The single-channel voice separation method of multi-scale time delay sampling according to claim 1, wherein the re-segmented and spliced sequence is subjected to time delay sampling along the block length dimension 2^λ·K using the following formula:

u_i = [u′[:, i::2^λ, :], i = 0, ..., 2^λ − 1]

where u′[:, i::2^λ, :] denotes applying Python sequence slicing to u′, and i denotes the index number.
5. The single-channel voice separation method of multi-scale time delay sampling according to claim 2, wherein, along h_λ, the second bidirectional recurrent neural network captures the interrelation between the time-delay-sampled sequences, and the output result is dimension-transformed through a fully connected layer to produce the output FC according to the following formulas:

U = [BiLSTM(h_λ[:, :, m]), m = 1, ..., 2^λ·R′]

FC = [W·U[:, :, m] + b, m = 1, ..., 2^λ·R′]

where U is the output of the bidirectional recurrent neural network, H is the output dimension of the hidden layer, h_λ[:, :, m] is the sequence determined by index m, 2^λ·R′ denotes the block dimension, and W and b are the weight and bias of the fully connected layer, respectively.
6. The single-channel voice separation method of multi-scale time delay sampling according to claim 2, wherein FC is layer-normalized according to the following formulas:

Layernorm(FC) = z ⊙ (FC − μ(FC)) / √(σ(FC) + ε) + r

μ(FC) = (1 / (N·K·2^λ·R′)) Σ_{i,j,k} FC[i, j, k]

σ(FC) = (1 / (N·K·2^λ·R′)) Σ_{i,j,k} (FC[i, j, k] − μ(FC))²

where μ(FC) and σ(FC) respectively denote the mean and variance of the output of the fully connected layer, z and r are normalization factors, ε is a very small positive number, Layernorm denotes layer normalization, i, j and k respectively index the N, K and 2^λ·R′ dimensions, and N is the dimension of the extracted mixed voice features.
7. The single-channel speech separation method of multi-scale time-delay sampling according to claim 6, wherein the specific formula for the residual connection is:
Output = h_λ + Layernorm(FC).
CN202111006251.7A 2021-08-30 2021-08-30 Single-channel voice separation method for multi-scale time delay sampling Active CN113782045B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111006251.7A CN113782045B (en) 2021-08-30 2021-08-30 Single-channel voice separation method for multi-scale time delay sampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111006251.7A CN113782045B (en) 2021-08-30 2021-08-30 Single-channel voice separation method for multi-scale time delay sampling

Publications (2)

Publication Number Publication Date
CN113782045A CN113782045A (en) 2021-12-10
CN113782045B (en) 2024-01-05

Family

ID=78840162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111006251.7A Active CN113782045B (en) 2021-08-30 2021-08-30 Single-channel voice separation method for multi-scale time delay sampling

Country Status (1)

Country Link
CN (1) CN113782045B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112289304A (en) * 2019-07-24 2021-01-29 中国科学院声学研究所 Multi-speaker voice synthesis method based on variational self-encoder
CN110459240A (en) * 2019-08-12 2019-11-15 新疆大学 The more speaker's speech separating methods clustered based on convolutional neural networks and depth
CN111243579A (en) * 2020-01-19 2020-06-05 清华大学 Time domain single-channel multi-speaker voice recognition method and system
CN111429938A (en) * 2020-03-06 2020-07-17 江苏大学 Single-channel voice separation method and device and electronic equipment
CN112071325A (en) * 2020-09-04 2020-12-11 中山大学 Many-to-many voice conversion method based on double-voiceprint feature vector and sequence-to-sequence modeling
CN113053407A (en) * 2021-02-06 2021-06-29 南京蕴智科技有限公司 Single-channel voice separation method and system for multiple speakers

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Environment-assisted multi-task mixed sound event detection method (环境辅助的多任务混合声音事件检测方法); 高利剑; 毛启容; Computer Science (计算机科学) (01); full text *

Also Published As

Publication number Publication date
CN113782045A (en) 2021-12-10

Similar Documents

Publication Publication Date Title
CN109841226B (en) Single-channel real-time noise reduction method based on convolution recurrent neural network
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
CN111429938B (en) Single-channel voice separation method and device and electronic equipment
CN110246510B (en) End-to-end voice enhancement method based on RefineNet
CN109890043B (en) Wireless signal noise reduction method based on generative countermeasure network
CN110751957B (en) Speech enhancement method using stacked multi-scale modules
CN112581979B (en) Speech emotion recognition method based on spectrogram
CN113470671B (en) Audio-visual voice enhancement method and system fully utilizing vision and voice connection
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN115602152B (en) Voice enhancement method based on multi-stage attention network
CN110176250B (en) Robust acoustic scene recognition method based on local learning
CN113643723A (en) Voice emotion recognition method based on attention CNN Bi-GRU fusion visual information
JP2020071482A (en) Word sound separation method, word sound separation model training method and computer readable medium
EP4211686A1 (en) Machine learning for microphone style transfer
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
Zhang et al. Birdsoundsdenoising: Deep visual audio denoising for bird sounds
CN112183582A (en) Multi-feature fusion underwater target identification method
Lim et al. Harmonic and percussive source separation using a convolutional auto encoder
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Jin et al. Speech separation and emotion recognition for multi-speaker scenarios
CN113889099A (en) Voice recognition method and system
CN113782045B (en) Single-channel voice separation method for multi-scale time delay sampling
CN107564546A (en) A kind of sound end detecting method based on positional information
CN117174105A (en) Speech noise reduction and dereverberation method based on improved deep convolutional network
CN111916060A (en) Deep learning voice endpoint detection method and system based on spectral subtraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant