CN113782045A - Single-channel voice separation method for multi-scale time delay sampling


Info

Publication number
CN113782045A
Authority
CN
China
Prior art keywords: time delay; delay sampling; neural network; voice; output
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111006251.7A
Other languages
Chinese (zh)
Other versions
CN113782045B (en)
Inventor
毛启容
钱双庆
陈静静
贾洪杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2021-08-30
Filing date: 2021-08-30
Publication date: 2021-12-10
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN202111006251.7A
Publication of CN113782045A
Application granted
Publication of CN113782045B
Legal status: Active


Classifications

    • G - PHYSICS
      • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
            • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
              • G10L21/0272 - Voice signal separating
              • G10L21/0208 - Noise filtering
                • G10L2021/02087 - Noise filtering the noise being separate speech, e.g. cocktail party
          • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/03 - characterised by the type of extracted parameters
            • G10L25/27 - characterised by the analysis technique
              • G10L25/30 - characterised by the analysis technique using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
          • Y02T10/00 - Road transport of goods or passengers
            • Y02T10/10 - Internal combustion engine [ICE] based vehicles
              • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Stereophonic System (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention provides a single-channel voice separation method based on multi-scale time delay sampling. Mixed voice features are extracted from the mixed voice signal of multiple speakers, segmented, and spliced into a 3D tensor. The 3D tensor is modeled iteratively by a first bidirectional recurrent neural network, which captures local and global feature information. Multi-scale time delay sampling modeling is then applied: the output of the first bidirectional recurrent neural network is time-delay sampled at different scales, and a second bidirectional recurrent neural network captures the feature information at each scale. The feature information is overlap-added and mapped into masks of the clean voices of the multiple speakers; the masks are point-multiplied with the mixed voice features to obtain the clean voice features of the speakers, and each person's clean voice features are reconstructed into a separated voice signal. Through multi-scale time delay sampling modeling, the invention compensates for the information loss of segmented modeling and greatly improves voice separation performance.

Description

Single-channel voice separation method for multi-scale time delay sampling
Technical Field
The invention belongs to the field of voice separation, and particularly relates to a single-channel voice separation method for multi-scale time delay sampling.
Background
In recent years, with the rise of deep learning, the resulting wave of artificial intelligence has changed every aspect of daily life. Voice interaction is an indispensable part of this: clear speech allows smart appliances to execute the corresponding commands more reliably, greatly improving their degree of intelligence. In real acoustic scenes, however, the speech of the speaker of interest is usually interfered with by other speakers, which is the classic cocktail party problem. The human auditory system can easily pick out the voice content of a target speaker, so voice separation imitates it: the clean speech of one or all speakers is separated from the speakers' mixed speech, removing interference such as background noise and reverberation and improving the clarity and intelligibility of the speech.
At present, most mainstream voice separation methods adopt end-to-end time-domain separation, in which a model learns a latent common representation of the mixed speech waveform in a data-driven manner to realize separation. Such methods place extremely high demands on framing during voice preprocessing: research shows that the more frames a segment of mixed speech is divided into over the same duration, the better the final separation effect. Too many frames, however, make modeling infeasible. To address this, Yi Luo et al. proposed a dual-path segmented modeling method in "Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation", effectively realizing sample-level modeling of long speech sequences and greatly improving separation performance. However, when capturing global information, this segmented modeling method places frames that are adjacent in the inter-segment sequence far apart in the original sequence; there is little correlation between such adjacent frames, many inter-frame relationships go uncaptured, and this seriously causes information loss.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a single-channel voice separation method with multi-scale time delay sampling, which compensates for the information loss of segmented modeling through multi-scale time delay sampling modeling and further improves separation performance.
A single-channel voice separation method with multi-scale time delay sampling comprises the following steps:
S1, extracting mixed voice features from the mixed voice signal of multiple speakers;
S2, segmenting the mixed voice features and splicing them into a 3D tensor;
S3, iteratively modeling the 3D tensor with a first bidirectional recurrent neural network, capturing local and global feature information;
S4, using multi-scale time delay sampling modeling, performing time delay sampling at different scales on the output of the first bidirectional recurrent neural network, and capturing the feature information at each scale with a second bidirectional recurrent neural network;
S5, overlap-adding the feature information from S4, mapping the overlap-added result into masks of the clean voices of the multiple speakers, and point-multiplying the masks with the mixed voice features to obtain the clean voice features of the speakers;
S6, reconstructing each person's clean voice features into a separated voice signal.
Further, single-scale time delay sampling is performed on the output of the first bidirectional recurrent neural network as follows:
the output of the bidirectional recurrent neural network is sequence-recombined and re-segmented, the re-segmented sequences are spliced and time-delay sampled along the block length dimension 2^λK, and each time-delay-sampled sequence u_i is spliced along the block dimension R′ to obtain the recombined 3D tensor h_λ after time delay sampling, where 2^λ is the delay sampling rate, λ = 0, …, B−1 is the sampling index, B is the number of stacked multi-scale delay sampling blocks, and K is the block length dimension before re-segmentation.
Furthermore, the second bidirectional recurrent neural network captures the feature information at a single scale as follows:
along h_λ, the second bidirectional recurrent neural network captures the interrelation between the time-delay-sampled sequences; the output result is then dimension-transformed through a fully connected layer to produce FC; FC is layer-normalized and then residual-connected with the output of the bidirectional recurrent neural network.
Further, different delay sampling rates are adopted: single-scale time delay sampling is repeatedly performed on the output of the first bidirectional recurrent neural network, the second bidirectional recurrent neural network captures the feature information at each single scale, and the feature information at the different scales is integrated.
Further, the spliced re-segmented sequences are time-delay sampled along the block length dimension 2^λK using the formula:

u_i = u′[:, i::2^λ, :], i = 0, …, 2^λ − 1

where u′[:, i::2^λ, :] denotes Python-style sequence slicing of u′ and i is the index number.
Further, along the second dimension of h_λ, the second bidirectional recurrent neural network captures the interrelation between the time-delay-sampled sequences, and the output result is dimension-transformed through a fully connected layer to produce FC, using the formulas:

U = [BiLSTM(h_λ[:, :, m]), m = 1, …, 2^λR′]
FC = [W·U[:, :, m] + b, m = 1, …, 2^λR′]

where U is the output of the bidirectional recurrent neural network, H is the output dimension of the hidden layer, h_λ[:, :, m] is the sequence determined by the index m, 2^λR′ is the block dimension, and W and b are respectively the weight and bias of the fully connected layer.
Further, FC is layer-normalized using the formulas:

μ(FC) = (1 / (N·K·2^λR′)) Σ_i Σ_j Σ_k FC[i, j, k]
σ(FC) = (1 / (N·K·2^λR′)) Σ_i Σ_j Σ_k (FC[i, j, k] − μ(FC))²
Layernorm(FC) = z ⊙ (FC − μ(FC)) / √(σ(FC) + ε) + r

where μ(FC) and σ(FC) respectively denote the mean and variance of the fully connected layer's output, z and r are normalization factors, ε is a very small positive number, Layernorm denotes layer normalization, i, j, k respectively index the N, K and 2^λR′ dimensions, and N is the dimension of the extracted mixed voice features.
Further, the specific formula for the residual connection is as follows:
Output=hλ+Layernorm(FC)。
the invention has the beneficial effects that: according to the invention, multi-scale time delay sampling modeling is adopted, time delay sampling of different scales is carried out on the output of the first bidirectional circulating neural network, and the second bidirectional circulating neural network is adopted to capture characteristic information under different scales, so that the interrelation among sequences under different scales is effectively integrated, the information loss of the bidirectional circulating neural network during sectional modeling is compensated, the voice separation performance is greatly improved, the voice distortion rate is effectively reduced, and the intelligibility of separated voice is also improved.
Drawings
Fig. 1 is a flow chart of a single-channel speech separation method based on multi-scale time delay sampling according to the invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
As shown in fig. 1, the single-channel speech separation method based on multi-scale delay sampling of the present invention specifically includes the following steps:
Step one, encode the input mixed voice signals of multiple speakers with an encoder and extract the corresponding mixed voice features.

For a given mixed speech signal x ∈ R^(1×L), a one-dimensional convolutional neural network is used as the encoder to extract high-order mixed speech features z ∈ R^(N×T), where R denotes the set of real numbers, L is the length of the mixed voice signal, N is the dimension of the extracted mixed voice features, and T is the number of frames of the mixed voice features. The convolution kernel size of the one-dimensional convolutional neural network is W and the stride of the convolution window is W/2; a ReLU function is appended after the convolution for nonlinear transformation. The calculation formula is:

z = ReLU(Conv1D(x))    (1)
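As a concrete illustration, a minimal PyTorch sketch of such an encoder follows; the values N = 64 and W = 16 are illustrative assumptions, not values fixed by the invention:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """1-D convolutional encoder: waveform -> mixed speech features (sketch)."""
    def __init__(self, n_features=64, kernel_size=16):  # N and W are assumed values
        super().__init__()
        # kernel size W, stride W/2, as described in step one
        self.conv = nn.Conv1d(1, n_features, kernel_size,
                              stride=kernel_size // 2, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, 1, L) mixed waveform -> z: (batch, N, T)
        return self.relu(self.conv(x))

# usage: z = Encoder()(torch.randn(1, 1, 32000))  # e.g. 2 s of 16 kHz audio
```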
Step two, segment the mixed voice features output by the encoder and splice them into a 3D tensor.

The mixed voice features z output by the encoder are divided in units of P; each divided block has length K, there are R blocks in total, and adjacent blocks overlap by 50%. All blocks are then spliced into a 3D tensor v ∈ R^(N×K×R).
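A minimal sketch of this segmentation, assuming the division unit P is the hop size K/2 implied by the 50% overlap, and ignoring edge padding for brevity:

```python
import torch

def segment(z, K):
    """Split features z of shape (N, T) into R overlapping blocks of length K
    (50% overlap, hop P = K/2) and stack them into a 3D tensor v of shape (N, K, R)."""
    N, T = z.shape
    P = K // 2                                  # hop size for 50% overlap
    blocks = [z[:, r * P : r * P + K] for r in range((T - K) // P + 1)]
    return torch.stack(blocks, dim=-1)          # v: (N, K, R)
```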
Step three, modeling the 3D tensor v by adopting a bidirectional cyclic neural network in an iterative manner, and correspondingly capturing local characteristic information and global characteristic information
B bidirectional recurrent neural network (BilsTM) modules (the bidirectional recurrent neural network modules comprise odd number BilsTM modules and even number BilsTM modules, and the odd number BilsTM modules and the even number BilsTM modules are alternately stacked) are stacked, and the input 3D tensor is modeled, wherein the odd number BilsTM module B2i-1(i 1.. b/2) sequence modeling along the R number of block length dimensions K, capturing local features of the mixed speech feature sequenceInformation; even number of BilSTM modules B2iModeling along the dimension R of the blocks with the number of K, and capturing global feature information of the mixed voice feature sequence, wherein the calculation formula is as follows:
UR=[BiLSTM(v[:,:,i]),i=1,...,R] (2)
UK=[BiLSTM(UR[:,i,:]),i=1,...,K] (3)
in the formula of UR
Figure BDA0003237184760000043
The outputs of the odd number BilSTM module and the even number BilSTM module are respectively, and the speaker information is effectively integrated through the iterative alternate execution of the odd number BilSTM module and the even number BilSTM module so as to facilitate the subsequent separation.
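One odd/even pair of modules might be sketched as follows; the hidden size and the linear projections back to N channels are assumptions for illustration:

```python
import torch
import torch.nn as nn

class DualBiLSTM(nn.Module):
    """One odd/even pair: BiLSTM along K (local), then along R (global) (sketch)."""
    def __init__(self, n_features=64, hidden=128):  # assumed sizes
        super().__init__()
        self.intra = nn.LSTM(n_features, hidden, bidirectional=True, batch_first=True)
        self.inter = nn.LSTM(n_features, hidden, bidirectional=True, batch_first=True)
        self.proj_intra = nn.Linear(2 * hidden, n_features)
        self.proj_inter = nn.Linear(2 * hidden, n_features)

    def forward(self, v):
        # v: (batch, N, K, R)
        b, n, k, r = v.shape
        # odd-numbered module: model each block along its length K
        x = v.permute(0, 3, 2, 1).reshape(b * r, k, n)      # (b*R, K, N)
        x = self.proj_intra(self.intra(x)[0])
        x = x.reshape(b, r, k, n).permute(0, 3, 2, 1)       # back to (batch, N, K, R)
        # even-numbered module: model across blocks along R
        y = x.permute(0, 2, 3, 1).reshape(b * k, r, n)      # (b*K, R, N)
        y = self.proj_inter(self.inter(y)[0])
        return y.reshape(b, k, r, n).permute(0, 3, 1, 2)    # (batch, N, K, R)
```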
The bidirectional recurrent neural network enables the multi-scale time delay sampling model (MTDS) to attend not only to past context information but also to future context information, which is very beneficial for the voice separation task. The calculation formulas of the bidirectional recurrent neural network are:
h_u = tanh(W_u[a^<t−1>; x^<t>] + b_u)    (4)
h_f = tanh(W_f[a^<t−1>; x^<t>] + b_f)    (5)
h_o = tanh(W_o[a^<t−1>; x^<t>] + b_o)    (6)
c̃^<t> = tanh(W_c[a^<t−1>; x^<t>] + b_c)    (7)
c^<t> = h_u ⊙ c̃^<t> + h_f ⊙ c^<t−1>    (8)
a^<t> = h_o ⊙ tanh(c^<t>)    (9)

where h_u, h_f, h_o are the outputs of the LSTM update gate, forget gate and output gate; W_u, W_f, W_o, W_c and b_u, b_f, b_o, b_c are the weights and biases in the bidirectional recurrent neural network; x^<t> and a^<t> denote the input and output of the bidirectional recurrent neural network at the current time; c̃^<t> is the candidate memory and c^<t> is the memory cell in the corresponding bidirectional recurrent neural network; tanh is the activation function.
Step four, adopting multi-scale time delay sampling modeling to make up for information loss in the bidirectional cyclic neural network so as to further improve the separation performance
Although the bidirectional recurrent neural network is very effective for capturing information of the mixed speech feature sequence, when an even number BilSTM module executes modeling processing, the bidirectional recurrent neural network is applied to K discontinuous frames, the interval between adjacent frames is R//2, the overlarge interval obviously causes few sequence relations between the frames, the bidirectional recurrent neural network is difficult to capture effective information, and a plurality of relationships between the frames are not captured in the whole mixed speech feature sequence, thereby seriously causing the loss of the information. In order to make up for the defect, multi-scale time delay sampling is adopted to re-model the mixed voice feature sequence, and then feature information under different scales is captured, so that the separation performance is improved. The method comprises the following specific steps:
(1) The output u ∈ R^(N×K×R) of the bidirectional recurrent neural network is sequence-recombined and re-segmented, still adopting the segmentation strategy used in the bidirectional recurrent neural network, with the difference that the length of each segment obtained after segmentation is expanded from K to 2^λK.

(2) The re-segmented sequences are spliced into u′ ∈ R^(N×2^λK×R′), and time delay sampling is performed along the block length dimension 2^λK. The calculation formula is:

u_i = u′[:, i::2^λ, :], i = 0, …, 2^λ − 1    (10)

where λ = 0, …, B−1 is the sampling index and 2^λ is the time delay sampling rate; in particular, λ = 0 means the original segmentation strategy of the bidirectional recurrent neural network is retained and no time delay sampling is added. B denotes the number of stacked multi-scale time delay sampling blocks, and R′ = R/2^λ is defined as the number of segments after re-segmentation. u′[:, i::2^λ, :] denotes Python-style sequence slicing of u′: along its second dimension, the sequence is divided into 2^λ interleaved fragments of length K each, and the fragment with index i is taken from each block and spliced, giving the new time-delay-sampled sequence u_i ∈ R^(N×K×R′).
(3) Each time-delay-sampled sequence u_i is spliced along the block dimension R′:

h_λ = [u_0, u_1, …, u_(2^λ−1)]    (11)

where h_λ ∈ R^(N×K×2^λR′) is the recombined 3D tensor after time delay sampling at delay rate 2^λ; although it does not differ from the input 3D tensor v of the bidirectional recurrent neural network in shape and size, its internal sequence order is greatly changed.
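To make steps (1)-(3) concrete, here is a minimal sketch under one plausible reading of the re-segmentation (groups of 2^λ consecutive blocks are concatenated; R is assumed divisible by 2^λ, otherwise padding would be needed):

```python
import torch

def delay_sample(u, lam):
    """Re-segment u (N, K, R) into blocks of length 2**lam * K, take every
    2**lam-th frame (Python slicing i::2**lam), and re-splice along the block
    dimension, giving h_lam of the same shape (N, K, R) but reordered (sketch)."""
    N, K, R = u.shape
    rate = 2 ** lam                          # time delay sampling rate
    R_prime = R // rate                      # number of segments after re-segmentation
    # (1)-(2): merge groups of `rate` consecutive blocks -> u' of shape (N, rate*K, R')
    u_prime = (u.reshape(N, K, R_prime, rate)
                .permute(0, 3, 1, 2)
                .reshape(N, rate * K, R_prime))
    # (2): u_i = u'[:, i::rate, :] for i = 0..rate-1, each of shape (N, K, R')
    u_slices = [u_prime[:, i::rate, :] for i in range(rate)]
    # (3): splice along the block dimension R' -> h_lam: (N, K, rate * R')
    return torch.cat(u_slices, dim=-1)
```

For λ = 0 this reduces to the identity, matching the remark after formula (10).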
(4) Along the third dimension of h_λ, a bidirectional recurrent neural network is used to capture the interrelation between the time-delay-sampled sequences, and the output result is then dimension-transformed through a fully connected layer. The specific calculation formulas are:

U = [BiLSTM(h_λ[:, :, m]), m = 1, …, 2^λR′]    (12)
FC = [W·U[:, :, m] + b, m = 1, …, 2^λR′]    (13)

where U is the output of the bidirectional recurrent neural network, H is the output dimension of the hidden layer, h_λ[:, :, m] is the sequence determined by the index m, W and b are respectively the weight and bias of the fully connected layer, and FC is the output after dimension transformation by the fully connected layer.
Through this modeling approach, the whole multi-scale time delay sampling model can expand its receptive field and additionally capture mixed voice feature sequence information at different scales: for example, basic information such as phonemes and tones across the whole sequence is obtained at a low delay rate, while information such as speech content and speaker identity is obtained at a high delay rate.
(5) The result FC of step (4) is layer-normalized and then residual-connected with the initial input (the output of the bidirectional recurrent neural network) to help the multi-scale time delay sampling model converge and to prevent gradient vanishing or gradient explosion during training. The layer normalization and residual calculation formulas are:
μ(FC) = (1 / (N·K·2^λR′)) Σ_i Σ_j Σ_k FC[i, j, k]    (14)
σ(FC) = (1 / (N·K·2^λR′)) Σ_i Σ_j Σ_k (FC[i, j, k] − μ(FC))²    (15)
Layernorm(FC) = z ⊙ (FC − μ(FC)) / √(σ(FC) + ε) + r    (16)
Output = h_λ + Layernorm(FC)    (17)

where μ(FC) and σ(FC) denote the mean and variance of the fully connected layer's output, z and r are normalization factors, ε is a very small positive number preventing the denominator from being 0, Layernorm denotes layer normalization, and i, j, k respectively index the N, K and 2^λR′ dimensions.
(6) Steps (1) to (5) are repeated, stacking B multi-scale time delay sampling modules, where each module adopts a different time delay sampling rate taken from 1, 2, 4, …, 2^(B−1); the delay sampling rate thus increases exponentially. Through this stacking, the multi-scale time delay sampling model captures phoneme-level sequence features at low delay sampling rates and, as the delay sampling rate grows, pays more attention to semantic information or feature information between speakers, thereby effectively integrating feature information at different scales, expanding the receptive field, and fully capturing the interrelations between sequences.
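Folding the delay sampling above together with the BiLSTM of formula (12), the fully connected layer of formula (13), and the layer normalization and residual of formulas (14)-(17) gives one possible module-level sketch; the class name MTDSBlock, the hidden sizes, and the use of GroupNorm(1, ·) as a stand-in for the global layer normalization are illustrative assumptions, and the exact interleaving with the step-three modules is not fixed here:

```python
import torch
import torch.nn as nn

class MTDSBlock(nn.Module):
    """One multi-scale time delay sampling block (steps (1)-(5)), sketched
    with assumed sizes; `lam` fixes the delay sampling rate 2**lam."""
    def __init__(self, lam, n_features=64, hidden=128):
        super().__init__()
        self.rate = 2 ** lam
        self.rnn = nn.LSTM(n_features, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_features)
        self.norm = nn.GroupNorm(1, n_features, eps=1e-8)  # layer norm over (N, K, R)

    def forward(self, u):
        b, n, K, R = u.shape                # (batch, N, K, R); assume R % rate == 0
        Rp = R // self.rate
        # steps (1)-(3): re-segment, slice i::rate, re-splice -> h_lam (reordered)
        up = (u.reshape(b, n, K, Rp, self.rate).permute(0, 1, 4, 2, 3)
               .reshape(b, n, self.rate * K, Rp))
        h = torch.cat([up[:, :, i::self.rate, :] for i in range(self.rate)], dim=-1)
        # step (4): BiLSTM along the block length for each of the 2**lam * R' blocks
        x = h.permute(0, 3, 2, 1).reshape(b * R, K, n)
        x = self.fc(self.rnn(x)[0]).reshape(b, R, K, n).permute(0, 3, 2, 1)
        # step (5): layer normalization and residual connection, as in formula (17)
        return h + self.norm(x)

# step (6): stack B modules with exponentially increasing rates 1, 2, 4, ..., 2**(B-1)
B = 4
mtds = nn.Sequential(*[MTDSBlock(lam) for lam in range(B)])
out = mtds(torch.randn(1, 64, 100, 64))     # R = 64 is divisible by 2**(B-1) = 8
```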
Step five, overlap-add the Output after multi-scale time delay sampling modeling (the overlap-add length is the same as the length of the mixed voice feature sequence extracted by the encoder), then feed the result into a two-dimensional convolutional neural network that maps it into masks of the clean voices of the multiple speakers; point-multiply the masks with the original encoder output (the extracted mixed voice feature sequence) to obtain the clean voice features of the speakers.
Step six, use a one-dimensional deconvolutional (transposed convolution) neural network as the decoder to restore the masked clean voice representations of the multiple speakers into time-domain voice waveform signals, realizing voice separation.
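A minimal sketch of steps five and six, with the number of speakers C, the feature dimension N, and the ReLU mask nonlinearity as assumptions (the overlap-add preceding this is omitted for brevity):

```python
import torch
import torch.nn as nn

class MaskAndDecode(nn.Module):
    """Steps five and six (sketch): map separator output to per-speaker masks,
    apply them to the encoder features, and decode back to waveforms."""
    def __init__(self, n_features=64, kernel_size=16, n_speakers=2):  # assumed values
        super().__init__()
        self.mask_conv = nn.Conv2d(1, n_speakers, kernel_size=1)   # 2-D mask estimator
        self.decoder = nn.ConvTranspose1d(n_features, 1, kernel_size,
                                          stride=kernel_size // 2, bias=False)

    def forward(self, sep_out, z):
        # sep_out: (batch, N, T) overlap-added separator output
        # z:       (batch, N, T) original encoder features
        masks = torch.relu(self.mask_conv(sep_out.unsqueeze(1)))   # (batch, C, N, T)
        clean_feats = masks * z.unsqueeze(1)                        # point multiplication
        b, C, N, T = clean_feats.shape
        wav = self.decoder(clean_feats.reshape(b * C, N, T))        # (b*C, 1, L)
        return wav.reshape(b, C, -1)                                # separated waveforms
```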
The multi-scale time delay sampling model in this embodiment is trained using the scale-invariant signal-to-noise ratio (SI-SNR) as the loss function. The calculation formulas are as follows:
s_target = (⟨x̂, x⟩ · x) / ‖x‖²    (18)
e_noise = x̂ − s_target    (19)
SI-SNR = 10·log₁₀(‖s_target‖² / ‖e_noise‖²)    (20)

where s_target and e_noise are intermediate variables, and x̂ and x denote the separated speech and the clean speech, respectively.
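A minimal sketch of this loss, negating SI-SNR so that gradient descent maximizes it; the zero-mean normalization of both signals is a common convention assumed here rather than spelled out in formulas (18)-(20):

```python
import torch

def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR between estimated and reference waveforms,
    following formulas (18)-(20); inputs have shape (..., L)."""
    est = est - est.mean(dim=-1, keepdim=True)   # assumed zero-mean normalization
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # s_target = <est, ref> * ref / ||ref||^2
    dot = torch.sum(est * ref, dim=-1, keepdim=True)
    s_target = dot * ref / (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps)
    e_noise = est - s_target
    si_snr = 10 * torch.log10(
        torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps) + eps)
    return -si_snr.mean()
```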
Compared with existing voice separation algorithms, the multi-scale time delay sampling single-channel voice separation method adopted in this embodiment fully mines the latent correlations within the mixed voice feature sequence, effectively improves voice separation performance, reduces the distortion rate of the separated voice, and improves its intelligibility; it has good reference value for both theoretical research and practical application.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention.

Claims (8)

1. A single-channel voice separation method with multi-scale time delay sampling, characterized by comprising the following steps:
S1, extracting mixed voice features from the mixed voice signal of multiple speakers;
S2, segmenting the mixed voice features and splicing them into a 3D tensor;
S3, iteratively modeling the 3D tensor with a first bidirectional recurrent neural network, capturing local and global feature information;
S4, using multi-scale time delay sampling modeling, performing time delay sampling at different scales on the output of the first bidirectional recurrent neural network, and capturing the feature information at each scale with a second bidirectional recurrent neural network;
S5, overlap-adding the feature information from S4, mapping the overlap-added result into masks of the clean voices of the multiple speakers, and point-multiplying the masks with the mixed voice features to obtain the clean voice features of the speakers;
S6, reconstructing each person's clean voice features into a separated voice signal.
2. The single-channel voice separation method with multi-scale time delay sampling according to claim 1, characterized in that single-scale time delay sampling is performed on the output of the first bidirectional recurrent neural network as follows:
the output of the bidirectional recurrent neural network is sequence-recombined and re-segmented, the re-segmented sequences are spliced and time-delay sampled along the block length dimension 2^λK, and each time-delay-sampled sequence u_i is spliced along the block dimension R′ to obtain the recombined 3D tensor h_λ after time delay sampling, where 2^λ is the delay sampling rate, λ = 0, …, B−1 is the sampling index, B is the number of stacked multi-scale delay sampling blocks, and K is the block length dimension before re-segmentation.
3. The single-channel voice separation method with multi-scale time delay sampling according to claim 2, characterized in that the second bidirectional recurrent neural network captures the feature information at a single scale as follows:
along h_λ, the second bidirectional recurrent neural network captures the interrelation between the time-delay-sampled sequences; the output result is then dimension-transformed through a fully connected layer to produce FC; FC is layer-normalized and then residual-connected with the output of the bidirectional recurrent neural network.
4. The method according to claim 3, characterized in that different delay sampling rates are adopted: single-scale time delay sampling is repeatedly performed on the output of the first bidirectional recurrent neural network, the second bidirectional recurrent neural network captures the feature information at each single scale, and the feature information at the different scales is integrated.
5. The single-channel voice separation method with multi-scale time delay sampling according to claim 2, characterized in that the spliced re-segmented sequences are time-delay sampled along the block length dimension 2^λK using the formula:

u_i = u′[:, i::2^λ, :], i = 0, …, 2^λ − 1

where u′[:, i::2^λ, :] denotes Python-style sequence slicing of u′ and i is the index number.
6. The method according to claim 3, characterized in that along the second dimension of h_λ, the second bidirectional recurrent neural network captures the interrelation between the time-delay-sampled sequences, and the output result is dimension-transformed through a fully connected layer to produce FC, using the formulas:

U = [BiLSTM(h_λ[:, :, m]), m = 1, …, 2^λR′]
FC = [W·U[:, :, m] + b, m = 1, …, 2^λR′]

where U is the output of the bidirectional recurrent neural network, H is the output dimension of the hidden layer, h_λ[:, :, m] is the sequence determined by the index m, 2^λR′ is the block dimension, and W and b are respectively the weight and bias of the fully connected layer.
7. The single-channel voice separation method with multi-scale delay sampling according to claim 3, characterized in that FC is layer-normalized using the formulas:

μ(FC) = (1 / (N·K·2^λR′)) Σ_i Σ_j Σ_k FC[i, j, k]
σ(FC) = (1 / (N·K·2^λR′)) Σ_i Σ_j Σ_k (FC[i, j, k] − μ(FC))²
Layernorm(FC) = z ⊙ (FC − μ(FC)) / √(σ(FC) + ε) + r

where μ(FC) and σ(FC) respectively denote the mean and variance of the fully connected layer's output, z and r are normalization factors, ε is a very small positive number, Layernorm denotes layer normalization, i, j, k respectively index the N, K and 2^λR′ dimensions, and N is the dimension of the extracted mixed voice features.
8. The method according to claim 7, characterized in that the residual connection uses the formula:
Output=hλ+Layernorm(FC)。
CN202111006251.7A 2021-08-30 2021-08-30 Single-channel voice separation method for multi-scale time delay sampling Active CN113782045B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111006251.7A CN113782045B (en) 2021-08-30 2021-08-30 Single-channel voice separation method for multi-scale time delay sampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111006251.7A CN113782045B (en) 2021-08-30 2021-08-30 Single-channel voice separation method for multi-scale time delay sampling

Publications (2)

Publication Number Publication Date
CN113782045A true CN113782045A (en) 2021-12-10
CN113782045B CN113782045B (en) 2024-01-05

Family

ID=78840162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111006251.7A Active CN113782045B (en) 2021-08-30 2021-08-30 Single-channel voice separation method for multi-scale time delay sampling

Country Status (1)

Country Link
CN (1) CN113782045B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112289304A (en) * 2019-07-24 2021-01-29 中国科学院声学研究所 Multi-speaker voice synthesis method based on variational self-encoder
CN110459240A (en) * 2019-08-12 2019-11-15 新疆大学 The more speaker's speech separating methods clustered based on convolutional neural networks and depth
CN111243579A (en) * 2020-01-19 2020-06-05 清华大学 Time domain single-channel multi-speaker voice recognition method and system
CN111429938A (en) * 2020-03-06 2020-07-17 江苏大学 Single-channel voice separation method and device and electronic equipment
CN112071325A (en) * 2020-09-04 2020-12-11 中山大学 Many-to-many voice conversion method based on double-voiceprint feature vector and sequence-to-sequence modeling
CN113053407A (en) * 2021-02-06 2021-06-29 南京蕴智科技有限公司 Single-channel voice separation method and system for multiple speakers

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
高利剑 (Gao Lijian); 毛启容 (Mao Qirong): "环境辅助的多任务混合声音事件检测方法" (Environment-assisted multi-task mixed sound event detection method), 计算机科学 (Computer Science), no. 01

Also Published As

Publication number Publication date
CN113782045B (en) 2024-01-05

Similar Documents

Publication Publication Date Title
CN109841226B (en) Single-channel real-time noise reduction method based on convolution recurrent neural network
WO2021043015A1 (en) Speech recognition method and apparatus, and neural network training method and apparatus
CN111429938B (en) Single-channel voice separation method and device and electronic equipment
Chang et al. Temporal modeling using dilated convolution and gating for voice-activity-detection
CN110751044B (en) Urban noise identification method based on deep network migration characteristics and augmented self-coding
CN110751957B (en) Speech enhancement method using stacked multi-scale modules
KR100908121B1 (en) Speech feature vector conversion method and apparatus
CN106782511A (en) Amendment linear depth autoencoder network audio recognition method
CN108847244A (en) Voiceprint recognition method and system based on MFCC and improved BP neural network
CN113053407B (en) Single-channel voice separation method and system for multiple speakers
KR102294638B1 (en) Combined learning method and apparatus using deepening neural network based feature enhancement and modified loss function for speaker recognition robust to noisy environments
CN110111803A (en) Based on the transfer learning sound enhancement method from attention multicore Largest Mean difference
KR101807961B1 (en) Method and apparatus for processing speech signal based on lstm and dnn
CN113470671B (en) Audio-visual voice enhancement method and system fully utilizing vision and voice connection
Shi et al. Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-Parallel Convolutional Modules for End-to-End Monaural Speech Separation.
Zhang et al. Birdsoundsdenoising: Deep visual audio denoising for bird sounds
CN109785852A (en) A kind of method and system enhancing speaker's voice
CN115602152B (en) Voice enhancement method based on multi-stage attention network
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN110176250B (en) Robust acoustic scene recognition method based on local learning
CN111986679A (en) Speaker confirmation method, system and storage medium for responding to complex acoustic environment
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Shi et al. End-to-End Monaural Speech Separation with Multi-Scale Dynamic Weighted Gated Dilated Convolutional Pyramid Network.
CN112183582A (en) Multi-feature fusion underwater target identification method
CN114333773A (en) Industrial scene abnormal sound detection and identification method based on self-encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant