CN113782045A - Single-channel voice separation method for multi-scale time delay sampling - Google Patents
Single-channel voice separation method for multi-scale time delay sampling
- Publication number
- CN113782045A (application CN202111006251.7A)
- Authority
- CN
- China
- Prior art keywords
- time delay
- delay sampling
- neural network
- voice
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02087—Noise filtering the noise being separate speech, e.g. cocktail party
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Human Computer Interaction (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Stereophonic System (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
The invention provides a single-channel voice separation method based on multi-scale time delay sampling. Mixed voice features are extracted from the mixed voice signal of multiple speakers, segmented, and spliced into a 3D tensor. The 3D tensor is modeled iteratively by a first bidirectional recurrent neural network to capture local and global feature information. Multi-scale time delay sampling modeling then performs time delay sampling at different scales on the output of the first bidirectional recurrent neural network, and a second bidirectional recurrent neural network captures the feature information at each scale. The feature information is overlap-added and mapped into masks of the pure voices of the multiple speakers; the masks are point-multiplied with the mixed voice features to obtain each speaker's pure voice features, which are finally reconstructed into separated voice signals. By compensating for the information lost in segmented modeling through multi-scale time delay sampling modeling, the invention greatly improves voice separation performance.
Description
Technical Field
The invention belongs to the field of voice separation, and particularly relates to a single-channel voice separation method for multi-scale time delay sampling.
Background
In recent years, with the rise of deep learning, the wave of artificial intelligence it has driven has changed many aspects of daily life. Voice interaction is an indispensable part of this: clear speech lets a smart device execute the corresponding command correctly, greatly raising its degree of intelligence. In real acoustic scenes, however, the speech of the speaker of interest is usually interfered with by other speakers, which is the classic cocktail party problem. The human auditory system can easily pick out the voice content of a target speaker; voice separation therefore imitates the human auditory system to separate the pure voice of one or all speakers from the mixed speech, remove interference such as background noise and reverberation, and improve the clarity and intelligibility of the voice.
At present, most mainstream voice separation methods adopt end-to-end time-domain separation, in which a model learns latent common representations in the mixed voice waveform in a data-driven manner to achieve separation. This kind of method places extremely high demands on framing during voice preprocessing; research shows that the more frames a segment of mixed voice is cut into within the same duration, the better the final separation. However, too many frames make modeling infeasible. To solve this problem, Yi Luo et al. proposed a dual-path segmented modeling method in "Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation", effectively enabling sample-level modeling of long speech sequences and greatly improving separation performance. When capturing global information, however, this segmented modeling makes frames that are adjacent in the rearranged sequence lie far apart in the original sequence; adjacent frames are then barely correlated, many inter-frame relationships go uncaptured, and serious information loss results.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a single-channel voice separation method with multi-scale time delay sampling, which compensates for the information lost in segmented modeling through multi-scale time delay sampling modeling and further improves separation performance.
A single-channel voice separation method with multi-scale time delay sampling comprises the following steps:
S1, extracting mixed voice features from the mixed voice signal of multiple speakers;
S2, segmenting the mixed voice features and splicing them into a 3D tensor;
S3, iteratively modeling the 3D tensor with a first bidirectional recurrent neural network to capture local and global feature information;
S4, performing time delay sampling at different scales on the output of the first bidirectional recurrent neural network by multi-scale time delay sampling modeling, and capturing the feature information at each scale with a second bidirectional recurrent neural network;
S5, overlap-adding the feature information from S4, mapping the overlap-added result into masks of the pure voices of the multiple speakers, and point-multiplying the masks with the mixed voice features to obtain each speaker's pure voice features;
S6, reconstructing each speaker's pure voice features into a separated voice signal.
Further, single-scale time delay sampling is performed on the output of the first bidirectional recurrent neural network as follows:
the output of the first bidirectional recurrent neural network undergoes sequence recombination and re-segmentation; the re-segmented and spliced sequence is delay-sampled along the block length dimension 2^λK, and each delay-sampled sequence u_i is spliced along the block dimension R' to obtain the recombined, delay-sampled 3D tensor h_λ, where 2^λ is the delay sampling rate, λ = 0, ..., B−1 is the sampling index, B is the number of stacked multi-scale time delay sampling blocks, and K is the block length dimension before re-segmentation.
Furthermore, the second bidirectional recurrent neural network captures the feature information at a single scale as follows:
along the third dimension of h_λ, the second bidirectional recurrent neural network captures the interrelations between the delay-sampled sequences; the output is then dimension-transformed through a fully connected layer to give FC; FC is layer-normalized and then residual-connected with the output of the first bidirectional recurrent neural network.
Further, single-scale time delay sampling of the output of the first bidirectional recurrent neural network is repeated with different delay sampling rates, the second bidirectional recurrent neural network captures the feature information at each single scale, and the feature information at the different scales is integrated.
Further, the re-segmented and spliced sequence is delay-sampled along the block length dimension 2^λK using the formula:

u_i = u'[:, i::2^λ, :], i = 0, ..., 2^λ − 1

where u'[:, i::2^λ, :] denotes Python-style sequence slicing of u', and i is the index number.
Further, along the third dimension of h_λ, the second bidirectional recurrent neural network captures the interrelations between the delay-sampled sequences, and the output is dimension-transformed through a fully connected layer to give FC, according to:

U = [BiLSTM(h_λ[:, :, m]), m = 1, ..., 2^λR']
FC = [W·U[:, :, m] + b, m = 1, ..., 2^λR']

where U is the output of the second bidirectional recurrent neural network, H is the output dimension of its hidden layer, h_λ[:, :, m] is the sequence determined by the index m, 2^λR' is the block dimension, and W and b are respectively the weight and bias of the fully connected layer.
Further, FC is layer-normalized according to:

μ(FC) = (1 / (N·K·2^λR')) Σ_{i,j,k} FC[i, j, k]
σ(FC) = (1 / (N·K·2^λR')) Σ_{i,j,k} (FC[i, j, k] − μ(FC))²
Layernorm(FC) = z ⊙ (FC − μ(FC)) / √(σ(FC) + ε) + r

where μ(FC) and σ(FC) respectively represent the mean and variance of the fully connected layer output, z and r are normalization factors, ε is a minimal positive number, Layernorm denotes layer normalization, i, j and k respectively index the N, K and 2^λR' dimensions, and N is the dimension of the extracted mixed voice features.
Further, the specific formula for the residual connection is as follows:
Output = h_λ + Layernorm(FC).
The invention has the following beneficial effects. By adopting multi-scale time delay sampling modeling, performing time delay sampling at different scales on the output of the first bidirectional recurrent neural network, and capturing the feature information at each scale with the second bidirectional recurrent neural network, the invention effectively integrates the interrelations between sequences at different scales and compensates for the information the bidirectional recurrent neural network loses during segmented modeling. Voice separation performance is greatly improved, the voice distortion rate is effectively reduced, and the intelligibility of the separated voice is also improved.
Drawings
Fig. 1 is a flow chart of a single-channel speech separation method based on multi-scale time delay sampling according to the invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
As shown in fig. 1, the single-channel speech separation method based on multi-scale delay sampling of the present invention specifically includes the following steps:
Step one: encode the input mixed voice signal of multiple speakers with an encoder and extract the corresponding mixed voice features.
For a given mixed speech signal x of length L, a one-dimensional convolutional neural network is used as the encoder to extract high-order mixed speech features z, where N is the dimension of the extracted mixed voice features and T is the number of feature frames. The convolution kernel size of the one-dimensional convolutional neural network is W and the stride of the convolution window is W/2; a ReLU function is applied after the convolution for nonlinear transformation:

z = ReLU(Conv1D(x)) (1)
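As an illustration of step one, the following is a minimal PyTorch sketch of such an encoder; PyTorch itself and the values of N, W, and the signal length are assumptions for illustration, not values fixed by this embodiment.

```python
# Minimal sketch of the step-one encoder: a 1-D convolution with kernel W and
# stride W/2 followed by ReLU. N, W, and the input length are illustrative.
import torch
import torch.nn as nn

N, W = 64, 16
encoder = nn.Conv1d(1, N, kernel_size=W, stride=W // 2, bias=False)

x = torch.randn(1, 1, 32000)          # mixed speech signal of length L = 32000
z = torch.relu(encoder(x))            # mixed speech features z, shape (1, N, T)
```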
Step two: segment the mixed voice features output by the encoder and splice them into a 3D tensor.
The mixed voice features z output by the encoder are divided with hop size P into small blocks of length K, giving R blocks with 50% overlap between adjacent blocks; all the blocks are then spliced into a 3D tensor v of shape N × K × R.
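A minimal sketch of this segmentation, assuming a hop size P = K/2 to realize the 50% overlap; the values of N, K, and T are illustrative.

```python
# Minimal sketch of step two: cut the features (N x T) into R half-overlapping
# blocks of length K and stack them into the 3D tensor v of shape (N, K, R).
import torch
import torch.nn.functional as F

N, K = 64, 100
P = K // 2                                   # hop size for 50% overlap
feats = torch.randn(N, 3999)                 # mixed speech features z (N x T)
pad = (P - (feats.shape[1] - K) % P) % P     # pad so the last block is complete
feats = F.pad(feats, (0, pad))
v = feats.unfold(1, K, P).permute(0, 2, 1)   # 3D tensor v, shape (N, K, R)
```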
Step three: iteratively model the 3D tensor v with a bidirectional recurrent neural network, capturing local and global feature information in turn.
b bidirectional recurrent neural network (BiLSTM) modules are stacked, alternating odd-numbered and even-numbered BiLSTM modules, to model the input 3D tensor. The odd-numbered modules B_{2i−1} (i = 1, ..., b/2) model the sequence along the block length dimension K for each of the R blocks, capturing local feature information of the mixed voice feature sequence; the even-numbered modules B_{2i} model along the block count dimension R for each of the K positions, capturing global feature information. The calculation formulas are:

U_R = [BiLSTM(v[:, :, i]), i = 1, ..., R] (2)
U_K = [BiLSTM(U_R[:, i, :]), i = 1, ..., K] (3)

where U_R and U_K are respectively the outputs of the odd-numbered and even-numbered BiLSTM modules. Iterating the odd and even modules alternately effectively integrates the speaker information for the subsequent separation.
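A minimal sketch of one odd/even pair of modules, assuming PyTorch BiLSTMs; the hidden size H and the linear projections back to dimension N are illustrative assumptions not specified in the text.

```python
# Minimal sketch of step three: an intra-block (odd) and an inter-block (even)
# BiLSTM pass over v, in the spirit of equations (2) and (3).
import torch
import torch.nn as nn

N, H = 64, 128
intra = nn.LSTM(N, H, bidirectional=True, batch_first=True)     # odd module
inter = nn.LSTM(N, H, bidirectional=True, batch_first=True)     # even module
proj_intra, proj_inter = nn.Linear(2 * H, N), nn.Linear(2 * H, N)

def dual_path_block(v):                     # v: (N, K, R)
    # odd module: R sequences, each running along the block length K
    u = v.permute(2, 1, 0)                  # (R, K, N)
    u, _ = intra(u)
    u = proj_intra(u).permute(2, 1, 0)      # (N, K, R), local features
    # even module: K sequences, each running across the R blocks
    w = u.permute(1, 2, 0)                  # (K, R, N)
    w, _ = inter(w)
    return proj_inter(w).permute(2, 0, 1)   # (N, K, R), global features
```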
Using a bidirectional recurrent neural network lets the multi-scale time delay sampling model (MTDS) attend not only to past context but also to future context, which is very beneficial for the voice separation task. The calculation formulas of the bidirectional recurrent neural network are:

h_u = tanh(W_u[a^{<t−1>}; x^{<t>}] + b_u) (4)
h_f = tanh(W_f[a^{<t−1>}; x^{<t>}] + b_f) (5)
h_o = tanh(W_o[a^{<t−1>}; x^{<t>}] + b_o) (6)
c̃^{<t>} = tanh(W_c[a^{<t−1>}; x^{<t>}] + b_c) (7)
c^{<t>} = h_u ⊙ c̃^{<t>} + h_f ⊙ c^{<t−1>} (8)
a^{<t>} = h_o ⊙ tanh(c^{<t>}) (9)

where h_u, h_f and h_o are the outputs of the LSTM update gate, forget gate and output gate; W_u, W_f, W_o, W_c and b_u, b_f, b_o, b_c are the weights and biases in the bidirectional recurrent neural network; x^{<t>} and a^{<t>} are its input and output at the current time step; c^{<t>} is the corresponding memory cell; and tanh is the activation function.
Step four: adopt multi-scale time delay sampling modeling to compensate for the information lost by the bidirectional recurrent neural network and further improve separation performance.
Although the bidirectional recurrent neural network is very effective at capturing information in the mixed voice feature sequence, when an even-numbered BiLSTM module performs its modeling it is applied to discontinuous frames whose neighbors lie K//2 positions apart in the original sequence. So large an interval leaves few sequential relationships between adjacent frames, the bidirectional recurrent neural network can hardly capture effective information, and across the whole mixed voice feature sequence many inter-frame relationships go uncaptured, causing serious information loss. To compensate for this defect, multi-scale time delay sampling is adopted to re-model the mixed voice feature sequence and capture feature information at different scales, thereby improving separation performance. The specific steps are as follows:
(1) The output u of the bidirectional recurrent neural network undergoes sequence recombination and re-segmentation, still following the segmentation strategy used before the bidirectional recurrent neural network, except that the length of each segment obtained after segmentation is expanded from K to 2^λK.
(2) The re-segmented and spliced sequence u' is delay-sampled along the block length dimension 2^λK according to:

u_i = u'[:, i::2^λ, :], i = 0, ..., 2^λ − 1 (10)

where λ = 0, ..., B−1 is the sampling index and 2^λ is the delay sampling rate; in particular, λ = 0 keeps the original segmentation strategy of the bidirectional recurrent neural network without any delay sampling. B is the number of stacked multi-scale time delay sampling blocks, and R' is defined as the number of segments after re-segmentation. u'[:, i::2^λ, :] denotes Python-style sequence slicing of u': along the second dimension of u', each slice is cut into 2^λ segments of length K, and the sequence with index i is taken from each slice and spliced, giving the new delay-sampled sequence u_i.
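A short sketch of the slice in equation (10), together with the splice into h_λ described in step (3) below; the shapes are illustrative.

```python
# Minimal sketch of equation (10): gather every 2^λ-th frame along the block
# length dimension of u' into u_i, then splice the u_i along the block
# dimension to form h_λ (the step (3) operation).
import torch

lam = 2                                      # sampling index λ
rate = 2 ** lam                              # delay sampling rate 2^λ
u_prime = torch.randn(64, rate * 100, 25)    # (N, 2^λ·K, R') after re-segmentation

u = [u_prime[:, i::rate, :] for i in range(rate)]   # each u_i: (N, K, R')
h_lam = torch.cat(u, dim=2)                  # h_λ: (N, K, 2^λ·R')
```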
(3) Each delay-sampled sequence u_i is spliced along the block dimension R':

h_λ = [u_i, i = 0, ..., 2^λ − 1] (11)

where h_λ is the recombined 3D tensor after delay sampling at rate 2^λ; although its shape and size are no different from the input 3D tensor v of the bidirectional recurrent neural network, its internal sequence order is greatly changed.
(4) Along the third dimension of h_λ, a bidirectional recurrent neural network captures the interrelations between the delay-sampled sequences, and the output is then dimension-transformed through a fully connected layer. The calculation formulas are:

U = [BiLSTM(h_λ[:, :, m]), m = 1, ..., 2^λR'] (12)
FC = [W·U[:, :, m] + b, m = 1, ..., 2^λR'] (13)

where U is the output of the bidirectional recurrent neural network, H is the output dimension of its hidden layer, h_λ[:, :, m] is the sequence determined by the index m, W and b are respectively the weight and bias of the fully connected layer, and FC is the output after the dimension transformation of the fully connected layer.
Through this modeling approach, the whole multi-scale time delay sampling model enlarges its receptive field and additionally captures mixed voice feature sequence information at different scales: at a low delay rate it obtains basic information in the whole sequence such as phonemes and tones, while at a high delay rate it obtains information such as the speech content and the speaker identity.
(5) The result FC from step (4) is layer-normalized and residual-connected with the initial input (the delay-sampled recombination h_λ of the bidirectional recurrent neural network output u), which helps the multi-scale time delay sampling model converge and prevents gradient vanishing or explosion during training. The layer normalization and residual calculation formulas are:

μ(FC) = (1 / (N·K·2^λR')) Σ_{i,j,k} FC[i, j, k] (14)
σ(FC) = (1 / (N·K·2^λR')) Σ_{i,j,k} (FC[i, j, k] − μ(FC))² (15)
Layernorm(FC) = z ⊙ (FC − μ(FC)) / √(σ(FC) + ε) + r (16)
Output = h_λ + Layernorm(FC) (17)

where μ(FC) and σ(FC) represent the mean and variance of the fully connected layer output, z and r are normalization factors, ε is a very small positive number that keeps the denominator from being 0, Layernorm denotes layer normalization, and i, j, k respectively index the N, K and 2^λR' dimensions.
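A minimal sketch of steps (4) and (5) combined, assuming PyTorch; the hidden size H is illustrative, and the normalization factors z and r of equation (16) are fixed to 1 and 0 for brevity.

```python
# Minimal sketch of steps (4)-(5): a BiLSTM along the third dimension of h_λ
# (eq. (12)), a fully connected layer (eq. (13)), layer normalization over all
# elements (eqs. (14)-(16), with z = 1, r = 0), and the residual (eq. (17)).
import torch
import torch.nn as nn

N, H, eps = 64, 128, 1e-8
bilstm = nn.LSTM(N, H, bidirectional=True, batch_first=True)
fc = nn.Linear(2 * H, N)

def mtds_block(h_lam):                      # h_lam: (N, K, 2^λ·R')
    seq = h_lam.permute(2, 1, 0)            # one length-K sequence per index m
    out, _ = bilstm(seq)                    # (2^λ·R', K, 2H)
    FC = fc(out).permute(2, 1, 0)           # back to (N, K, 2^λ·R')
    mu, var = FC.mean(), FC.var(unbiased=False)
    ln = (FC - mu) / torch.sqrt(var + eps)  # layer normalization
    return h_lam + ln                       # residual connection
```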
(6) Steps (1) to (5) are repeated to stack B multi-scale time delay sampling modules, each using a different delay sampling rate of 1, 2, 4, ..., 2^{B−1}, so the rate grows exponentially. Through this stacking, the model captures phoneme-level sequence features at low delay sampling rates and, as the rate expands, attends more to semantics and to the feature information distinguishing speakers, effectively integrating the feature information at different scales, enlarging the receptive field, and fully capturing the interrelations between sequences.
Step five: overlap-add the Output of the multi-scale time delay sampling modeling (the overlap-added length equals the length of the mixed voice feature sequence extracted by the encoder), feed the result into a two-dimensional convolutional neural network that maps it into masks of the pure voices of the multiple speakers, and point-multiply the masks with the original encoder output (the extracted mixed voice feature sequence) to obtain each speaker's pure voice features.
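A minimal sketch of step five, assuming PyTorch; the 1x1 two-dimensional convolution, the ReLU on the masks, and the speaker count C = 2 are illustrative assumptions.

```python
# Minimal sketch of step five: overlap-add back to (N, T), map to C speaker
# masks with a 2-D convolution, and point-multiply with the encoder output.
import torch
import torch.nn as nn

C, N, K = 2, 64, 100
P = K // 2
mask_net = nn.Conv2d(1, C, kernel_size=1)

def overlap_add(v, P):                       # v: (N, K, R)
    Nf, K, R = v.shape
    out = v.new_zeros(Nf, P * (R - 1) + K)
    for r in range(R):                       # add each block at its hop offset
        out[:, r * P : r * P + K] += v[:, :, r]
    return out

seq = overlap_add(torch.randn(N, K, 25), P)          # (N, T)
masks = torch.relu(mask_net(seq[None, None]))        # (1, C, N, T)
z = torch.randn(1, N, seq.shape[1])                  # encoder output (illustrative)
speaker_feats = masks * z.unsqueeze(1)               # (1, C, N, T) pure voice features
```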
Step six: use a one-dimensional deconvolution (transposed convolution) neural network as the decoder to restore the masked pure voice representations of the multiple speakers into time-domain voice waveform signals, completing the separation.
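A minimal sketch of the step-six decoder, assuming a PyTorch transposed convolution mirroring the encoder settings assumed earlier; all sizes are illustrative.

```python
# Minimal sketch of step six: a 1-D transposed convolution maps each
# speaker's masked features back to a time-domain waveform.
import torch
import torch.nn as nn

N, W, C = 64, 16, 2
decoder = nn.ConvTranspose1d(N, 1, kernel_size=W, stride=W // 2, bias=False)

speaker_feats = torch.randn(1, C, N, 1000)   # per-speaker features from step five
waves = [decoder(speaker_feats[:, c]) for c in range(C)]   # one (1, 1, L') signal each
```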
The multi-scale delay sampling model in this embodiment is trained with the scale-invariant signal-to-noise ratio (SI-SNR) as the loss function:

s_target = (⟨x̂, x⟩ · x) / ‖x‖² (18)
e_noise = x̂ − s_target (19)
SI-SNR = 10 · log₁₀(‖s_target‖² / ‖e_noise‖²) (20)

where s_target and e_noise are intermediate variables, and x̂ and x represent the separated speech and the clean speech respectively.
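A minimal sketch of this loss, assuming the usual zero-mean convention; est and ref stand for the separated and clean waveforms.

```python
# Minimal sketch of the SI-SNR loss: project the estimate onto the reference
# to obtain s_target, take the residual as e_noise, train on negative SI-SNR.
import torch

def si_snr(est, ref, eps=1e-8):
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    s_target = (est * ref).sum(-1, keepdim=True) * ref \
               / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps))

loss = -si_snr(torch.randn(4, 32000), torch.randn(4, 32000)).mean()   # illustrative batch
```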
Compared with existing voice separation algorithms, the multi-scale time delay sampling single-channel voice separation method adopted in this embodiment fully mines the latent correlations in the mixed voice feature sequence, effectively improves voice separation performance, reduces the distortion rate of the separated voice, and improves its intelligibility; it offers a useful reference for both theoretical research and practical application.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention.
Claims (8)
1. A single-channel voice separation method with multi-scale time delay sampling, characterized by comprising the following steps:
S1, extracting mixed voice features from the mixed voice signal of multiple speakers;
S2, segmenting the mixed voice features and splicing them into a 3D tensor;
S3, iteratively modeling the 3D tensor with a first bidirectional recurrent neural network to capture local and global feature information;
S4, performing time delay sampling at different scales on the output of the first bidirectional recurrent neural network by multi-scale time delay sampling modeling, and capturing the feature information at each scale with a second bidirectional recurrent neural network;
S5, overlap-adding the feature information from S4, mapping the overlap-added result into masks of the pure voices of the multiple speakers, and point-multiplying the masks with the mixed voice features to obtain each speaker's pure voice features;
S6, reconstructing each speaker's pure voice features into a separated voice signal.
2. The single-channel voice separation method with multi-scale time delay sampling according to claim 1, characterized in that single-scale time delay sampling is performed on the output of the first bidirectional recurrent neural network as follows:
the output of the first bidirectional recurrent neural network undergoes sequence recombination and re-segmentation; the re-segmented and spliced sequence is delay-sampled along the block length dimension 2^λK, and each delay-sampled sequence u_i is spliced along the block dimension R' to obtain the recombined, delay-sampled 3D tensor h_λ, where 2^λ is the delay sampling rate, λ = 0, ..., B−1 is the sampling index, B is the number of stacked multi-scale time delay sampling blocks, and K is the block length dimension before re-segmentation.
3. The single-channel voice separation method with multi-scale time delay sampling according to claim 2, characterized in that the second bidirectional recurrent neural network captures the feature information at a single scale as follows:
along the third dimension of h_λ, the second bidirectional recurrent neural network captures the interrelations between the delay-sampled sequences; the output is then dimension-transformed through a fully connected layer to give FC; FC is layer-normalized and then residual-connected with the output of the first bidirectional recurrent neural network.
4. The single-channel voice separation method with multi-scale time delay sampling according to claim 3, characterized in that single-scale time delay sampling of the output of the first bidirectional recurrent neural network is repeated with different delay sampling rates, the second bidirectional recurrent neural network captures the feature information at each single scale, and the feature information at the different scales is integrated.
5. The single-channel voice separation method with multi-scale time delay sampling according to claim 2, characterized in that the re-segmented and spliced sequence is delay-sampled along the block length dimension 2^λK using the formula:
u_i = u'[:, i::2^λ, :], i = 0, ..., 2^λ − 1
where u'[:, i::2^λ, :] denotes Python-style sequence slicing of u', and i is the index number.
6. The single-channel voice separation method with multi-scale time delay sampling according to claim 3, characterized in that along the third dimension of h_λ the second bidirectional recurrent neural network captures the interrelations between the delay-sampled sequences, and the output is dimension-transformed through a fully connected layer to give FC, according to:
U = [BiLSTM(h_λ[:, :, m]), m = 1, ..., 2^λR']
FC = [W·U[:, :, m] + b, m = 1, ..., 2^λR']
where U is the output of the second bidirectional recurrent neural network, H is the output dimension of its hidden layer, h_λ[:, :, m] is the sequence determined by the index m, 2^λR' is the block dimension, and W and b are respectively the weight and bias of the fully connected layer.
7. The single-channel voice separation method with multi-scale time delay sampling according to claim 3, characterized in that FC is layer-normalized according to:
μ(FC) = (1 / (N·K·2^λR')) Σ_{i,j,k} FC[i, j, k]
σ(FC) = (1 / (N·K·2^λR')) Σ_{i,j,k} (FC[i, j, k] − μ(FC))²
Layernorm(FC) = z ⊙ (FC − μ(FC)) / √(σ(FC) + ε) + r
where μ(FC) and σ(FC) respectively represent the mean and variance of the fully connected layer output, z and r are normalization factors, ε is a minimal positive number, Layernorm denotes layer normalization, i, j and k respectively index the N, K and 2^λR' dimensions, and N is the dimension of the extracted mixed voice features.
8. The single-channel voice separation method with multi-scale time delay sampling according to claim 7, characterized in that the residual connection is computed as:
Output = h_λ + Layernorm(FC).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111006251.7A CN113782045B (en) | 2021-08-30 | 2021-08-30 | Single-channel voice separation method for multi-scale time delay sampling |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111006251.7A CN113782045B (en) | 2021-08-30 | 2021-08-30 | Single-channel voice separation method for multi-scale time delay sampling |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113782045A true CN113782045A (en) | 2021-12-10 |
CN113782045B CN113782045B (en) | 2024-01-05 |
Family ID: 78840162
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111006251.7A Active CN113782045B (en) | 2021-08-30 | 2021-08-30 | Single-channel voice separation method for multi-scale time delay sampling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113782045B (en) |
- 2021-08-30: CN application CN202111006251.7A filed, granted as CN113782045B (status: Active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112289304A (en) * | 2019-07-24 | 2021-01-29 | 中国科学院声学研究所 | Multi-speaker voice synthesis method based on variational self-encoder |
CN110459240A (en) * | 2019-08-12 | 2019-11-15 | 新疆大学 | The more speaker's speech separating methods clustered based on convolutional neural networks and depth |
CN111243579A (en) * | 2020-01-19 | 2020-06-05 | 清华大学 | Time domain single-channel multi-speaker voice recognition method and system |
CN111429938A (en) * | 2020-03-06 | 2020-07-17 | 江苏大学 | Single-channel voice separation method and device and electronic equipment |
CN112071325A (en) * | 2020-09-04 | 2020-12-11 | 中山大学 | Many-to-many voice conversion method based on double-voiceprint feature vector and sequence-to-sequence modeling |
CN113053407A (en) * | 2021-02-06 | 2021-06-29 | 南京蕴智科技有限公司 | Single-channel voice separation method and system for multiple speakers |
Non-Patent Citations (1)
Title |
---|
Gao Lijian; Mao Qirong: "Environment-Assisted Multi-Task Hybrid Sound Event Detection Method", Computer Science, no. 01 *
Also Published As
Publication number | Publication date |
---|---|
CN113782045B (en) | 2024-01-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109841226B (en) | Single-channel real-time noise reduction method based on convolution recurrent neural network | |
WO2021043015A1 (en) | Speech recognition method and apparatus, and neural network training method and apparatus | |
CN111429938B (en) | Single-channel voice separation method and device and electronic equipment | |
Chang et al. | Temporal modeling using dilated convolution and gating for voice-activity-detection | |
CN110751044B (en) | Urban noise identification method based on deep network migration characteristics and augmented self-coding | |
CN110751957B (en) | Speech enhancement method using stacked multi-scale modules | |
KR100908121B1 (en) | Speech feature vector conversion method and apparatus | |
CN106782511A (en) | Amendment linear depth autoencoder network audio recognition method | |
CN108847244A (en) | Voiceprint recognition method and system based on MFCC and improved BP neural network | |
CN113053407B (en) | Single-channel voice separation method and system for multiple speakers | |
KR102294638B1 (en) | Combined learning method and apparatus using deepening neural network based feature enhancement and modified loss function for speaker recognition robust to noisy environments | |
CN110111803A (en) | Based on the transfer learning sound enhancement method from attention multicore Largest Mean difference | |
KR101807961B1 (en) | Method and apparatus for processing speech signal based on lstm and dnn | |
CN113470671B (en) | Audio-visual voice enhancement method and system fully utilizing vision and voice connection | |
Shi et al. | Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-Parallel Convolutional Modules for End-to-End Monaural Speech Separation. | |
Zhang et al. | Birdsoundsdenoising: Deep visual audio denoising for bird sounds | |
CN109785852A (en) | A kind of method and system enhancing speaker's voice | |
CN115602152B (en) | Voice enhancement method based on multi-stage attention network | |
CN111899757A (en) | Single-channel voice separation method and system for target speaker extraction | |
CN110176250B (en) | Robust acoustic scene recognition method based on local learning | |
CN111986679A (en) | Speaker confirmation method, system and storage medium for responding to complex acoustic environment | |
CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics | |
Shi et al. | End-to-End Monaural Speech Separation with Multi-Scale Dynamic Weighted Gated Dilated Convolutional Pyramid Network. | |
CN112183582A (en) | Multi-feature fusion underwater target identification method | |
CN114333773A (en) | Industrial scene abnormal sound detection and identification method based on self-encoder |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||