CN115101085A - Multi-speaker time-domain voice separation method for enhancing external attention through convolution - Google Patents
- Publication number
- CN115101085A (application CN202210647059.4A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- voice
- module
- separation
- mask
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention relates to the technical field of speech processing, in particular to a multi-speaker time-domain speech separation method with convolution-enhanced external attention. The method comprises the following steps: S1, a convolution operation is performed on the mixed speech of multiple speakers by an encoder, converting it into a latent feature representation; S2, a speech mask is learned by a separator based on a convolution-enhanced external attention module; S3, the speech mask is multiplied by the latent feature representation output by the encoder, and the waveform is then reconstructed through a deconvolution operation of the decoder to obtain the separated speech. The method meets the requirements of a small model and high timeliness for speech separation while achieving a good separation effect through the advantage of sequence modeling; the enhanced external attention mechanism learns richer features and correlations while keeping its advantage of high separation speed; and its application in a dual-path structure better balances timeliness, model size and separation effect.
Description
Technical Field
The invention relates to the technical field of speech processing, in particular to a multi-speaker time-domain speech separation method with convolution-enhanced external attention.
Background
In practical applications, voice interaction often takes place while several people are speaking, and the voice of an interfering speaker can seriously hinder a machine from extracting speech information; speech separation technology separates the voices of multiple speakers so that the machine can effectively extract information for tasks such as speech recognition. Currently, single-channel speech separation based on deep learning mainly adopts the Time-domain Audio Separation Network (TasNet) structure. Compared with traditional time-frequency-domain methods, TasNet directly models the speech signal in the time domain with an encoder-decoder framework and performs separation on the non-negative encoder output, omitting the frequency decomposition step and reducing the separation problem to estimating a speech mask (Mask) on the encoder output, which is then synthesized back into a waveform by the decoder; it therefore has better performance and lower latency.
BLSTM-TasNet is based on an LSTM network, but deep LSTM networks have a high computational cost, which limits their applicability on low-resource, low-power platforms. Luo et al. (Luo Y, Mesgarani N. TasNet: Time-domain audio separation network for real-time, single-channel speech separation [C]. 2018) replace the LSTM with gated recurrent units (GRU), which can model long sequences and reduce gradient vanishing, but still have a large number of parameters. Luo et al. (Luo Y, Mesgarani N. Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019) propose the convolution-based Conv-TasNet, which computes the mask with a temporal convolutional network composed of one-dimensional dilated convolution blocks, allowing the network to model the long-term dependencies of the speech signal while keeping a small number of parameters and a fast separation speed. SSGAN performs speech separation by joint training based on a generative adversarial network, with both the generator and the discriminator built from fully convolutional neural networks, which can effectively extract and recover high-dimensional features of the time-domain waveform. However, one-dimensional convolutional neural networks (CNN) cannot model utterance-level sequences when the receptive field is smaller than the sequence length, while recurrent neural networks (RNN) cannot model longer sequences efficiently due to optimization difficulties. Dual-Path RNN (Luo Y, Chen Z, Yoshioka T. Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation [C]// ICASSP 2020 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 46-50, doi:10.1109/ICASSP40776.2020.9054266) therefore proposes a dual-path structure that divides the longer audio input into smaller blocks and iterates RNN operations within and between blocks in a deep model, alleviating the difficulty of RNN optimization and achieving better performance with smaller models. RNN-based separation models depend on the context only indirectly through the propagation of intermediate states, which limits separation performance. Chen et al. (Chen J, Mao Q, Liu D. Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation [J]. arXiv preprint arXiv:2007.13975, 2020) introduce an improved Transformer into the dual-path structure, adding an LSTM to the feed-forward layer so that the order information of the speech sequence can be learned without positional encoding, achieving direct context awareness of the speech sequence, but inference is very slow. As for performing speech separation with limited computing resources, the SuDoRM-RF network (Tzinis E, Wang Z, Smaragdis P. Sudo rm -rf: Efficient networks for universal audio source separation [C]// 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2020: 1-6) uses UNet-like one-dimensional convolution blocks that repeatedly downsample and upsample to enlarge the receptive field of the network, extracting information at multiple resolutions with a high inference speed; the multi-scale feature extraction combined with the enlarged receptive field makes SuDoRM-RF superior to other convolutional models, but it also has more parameters.
In view of the prior art, it remains challenging to achieve a considerable separation effect with a smaller model size and a higher separation speed, and to better meet and balance the requirements of a speech recognition front end on model size, timeliness and separation effect; the invention therefore provides the multi-speaker time-domain speech separation method with convolution-enhanced external attention.
Disclosure of Invention
The invention aims to provide a multi-speaker time-domain speech separation method with convolution-enhanced external attention.
In order to achieve the purpose, the invention adopts the following technical scheme:
a multi-speaker time-domain speech separation method with convolution to enhance external attention comprises the following steps:
s1, mixing voice of multiple speakers through a coder, performing convolution operation, and converting the voice into potential feature representation
Remembering the mixed speech of multiple speakers as x (t),wherein the content of the first and second substances,representing a real number domain; t is the voice length;
let the latent features be denoted as h,wherein, C E Is the number of encoder channels; l is a potential feature representation;
s2, learning by a separator based on a convolution enhanced external attention module to obtain a voice mask;
and S3, multiplying the voice mask by the potential feature representation output by the encoder, and reconstructing a waveform through deconvolution operation of a decoder to obtain the separated voice.
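The encoder-separator-decoder flow of steps S1 to S3 can be sketched as follows. This is an illustrative NumPy sketch, not the patented implementation: the random filter banks, the window length and stride, and the constant placeholder masks are hypothetical stand-ins for the learned encoder, separator and decoder.

```python
import numpy as np

def encode(x, basis, stride):
    """S1: strided 1-D convolution of waveform x with basis filters.
    basis: (C_E, win); returns latent representation h of shape (C_E, L)."""
    win = basis.shape[1]
    n_frames = (len(x) - win) // stride + 1
    frames = np.stack([x[i * stride: i * stride + win] for i in range(n_frames)], axis=1)
    return np.maximum(basis @ frames, 0.0)  # non-negative encoder output (ReLU)

def decode(h, basis, stride):
    """S3: transposed convolution, i.e. overlap-add of basis-weighted frames."""
    win = basis.shape[1]
    n_frames = h.shape[1]
    x = np.zeros((n_frames - 1) * stride + win)
    for i in range(n_frames):
        x[i * stride: i * stride + win] += basis.T @ h[:, i]
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal(8000)                     # 1 s of 8 kHz mixture (synthetic)
enc_basis = rng.standard_normal((64, 16)) * 0.1   # C_E = 64 filters, 2 ms window
dec_basis = rng.standard_normal((64, 16)) * 0.1
h = encode(x, enc_basis, stride=8)                # S1: latent feature representation
masks = [np.full_like(h, 0.5), np.full_like(h, 0.5)]  # S2: separator output (placeholder)
sources = [decode(m * h, dec_basis, stride=8) for m in masks]  # S3: masked decode
```

With identical placeholder masks the two decoded sources coincide; a trained separator would instead output one distinct mask per speaker.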
Further, S2 mainly includes the following steps:
Global normalization and convolution operation: the latent representation h is mapped to an intermediate representation h′ ∈ ℝ^(C×L), where C is the number of channels;
Splitting and stacking: the intermediate representation h′ is divided into S overlapping smaller blocks of length K, forming a three-dimensional tensor T ∈ ℝ^(C×K×S), where K is the block length and S is the number of overlapping blocks;
ExConformer transform: a transformation consisting of B ExConformer modules (ECBlocks) is applied iteratively to the intra-block dimension K and the inter-block dimension S of the tensor T; the output T_b of the intra-block processing becomes the input of the inter-block processing;
That is, the output of the (b−1)-th ECBlock is the input of the b-th block, b = 1, …, B, as follows:
T_b = ECBlock_intra(U_{b−1})
U_b = ECBlock_inter(T_b)
Dimension transformation: a two-dimensional convolution is applied to the output U_B of the B-th ECBlock to learn a mask for each source, producing a three-dimensional tensor Y ∈ ℝ^((I·C)×K×S):
Y = Conv2D(U_B)
where Conv2D denotes a two-dimensional convolution operation and I is the number of sources;
Aggregating multi-channel information: the tensor Y is converted via an overlap-add operation into an intermediate latent representation y_i of each source; a one-dimensional convolution and a PReLU are applied to y to aggregate information over the channels, giving the estimated speech mask m_i ∈ ℝ^(C_E×L) for the i-th source as follows:
m_i = PReLU(Conv1D(y_i))
further, the ExConformer module consists of a position convolution module, an external attention module, a convolution module and a feedforward neural network module, and residual connection is added between the modules;
if the input of the ith ExConformer module is defined as x i Then it outputs y i Expressed as follows:
x″ i =x′ i +Conv(x′ i )
y i =Layernorm(x″ i +FFN(x″ i ))。
further, the convolution module of the excelformer module and the activation function of the feedforward neural network module use Penalized _ tanh.
Further, the position convolution module is composed of a plurality of stacked one-dimensional convolutions with zero-padding, layer normalization and ReLU activation layers.
Further, the external attention module proceeds as follows:
The number of channels of the input feature F_in is adjusted by a one-dimensional convolution;
A key memory M_k is constructed using a linear layer to learn an attention map A between the query vectors:
A = Norm(F_in M_k^T)
Softmax and L1-norm normalization (L1_Norm) are applied to it;
A value memory M_v is constructed using a linear layer to generate the refined feature map, expressed as follows:
F_out = A M_v
A Dropout operation is applied to the output result.
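The external attention steps above can be sketched as follows. This is a minimal NumPy sketch: the memory size, the exact axes of the softmax/L1 double normalization, and the omitted Conv1D channel adjustment and Dropout are assumptions, not the patent's specification.

```python
import numpy as np

def external_attention(F_in, M_k, M_v):
    """External attention: queries attend over two small shared external
    memories, so the cost is linear in the sequence length N.
    F_in: (N, d) query features; M_k: (S, d) key memory; M_v: (S, d) value memory."""
    logits = F_in @ M_k.T                        # attention map A before normalization, (N, S)
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)            # softmax over the S memory slots
    A /= A.sum(axis=0, keepdims=True) + 1e-9     # L1 normalization (double-norm trick)
    return A @ M_v                               # refined feature map, (N, d)

rng = np.random.default_rng(1)
F_in = rng.standard_normal((100, 32))  # 100 frames, 32 channels
M_k = rng.standard_normal((16, 32))    # S = 16 memory slots
M_v = rng.standard_normal((16, 32))
F_out = external_attention(F_in, M_k, M_v)
```

Because M_k and M_v are shared across all inputs, they can implicitly capture correlations between different samples, which self-attention does not.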
The invention has at least the following beneficial effects:
1. Through the improved design of the Conformer, the method meets the requirements of a small speech separation model and high timeliness, while the advantage of sequence modeling yields a good separation effect.
2. By introducing the external attention mechanism into the speech separation task, the invention enhances the external attention mechanism to learn richer features and correlations while keeping its advantage of high separation speed.
3. The invention encodes the input with a convolutional neural network so that each frame contains context information, and the convolutional encoding makes the positional encoding trainable.
4. The application of the proposed convolution-enhanced external attention multi-speaker time-domain speech separation method in a dual-path structure better balances timeliness, model size and separation effect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram of the overall network framework of the present invention;
FIG. 2 is a block diagram of an ExConformer module according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, the present invention is a multi-talker time-domain speech separation method with convolution enhanced external attention, comprising the steps of:
1. The encoder performs a convolution operation on the multi-speaker mixed speech, converting it into a latent feature representation h;
2. A speech mask is learned by the separator based on the convolution-enhanced external attention module;
3. The speech mask is multiplied by the latent feature representation output by the encoder, and the waveform is reconstructed through a deconvolution operation of the decoder to obtain the separated speech.
In the above step 2, the following operations are performed:
First, the latent representation h is mapped into a new feature space by global normalization and a one-dimensional convolution, obtaining an intermediate representation h′ ∈ ℝ^(C×L). This changes the number of channels of h, deepens the network without changing the receptive field, and enhances the abstraction capability of the network's local modules.
h′ = Conv1D(GlobLN(h))
Global layer normalization GlobLN(·) defines two learnable parameters γ, β ∈ ℝ^(C×1); using global layer normalization in place of layer normalization significantly improves model convergence, since the gradient statistics are interdependent between different channels. For an input matrix F ∈ ℝ^(C×L), global normalization is defined as:
GlobLN(F) = γ ⊙ (F − E[F]) / √(Var[F] + ε) + β
where E[F] and Var[F] are the mean and variance of F computed jointly over the channel and time dimensions, and ε is a small constant for numerical stability.
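A minimal NumPy sketch of global layer normalization as described above; the value of ε is an assumption.

```python
import numpy as np

def global_layer_norm(h, gamma, beta, eps=1e-8):
    """Global layer normalization (gLN): normalize over BOTH the channel and
    time dimensions jointly, then rescale with learnable per-channel gamma/beta.
    h: (C, L); gamma, beta: (C, 1)."""
    mean = h.mean()   # statistics shared across all channels and frames
    var = h.var()
    return gamma * (h - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(2)
h = 3.0 * rng.standard_normal((8, 200)) + 1.5
out = global_layer_norm(h, gamma=np.ones((8, 1)), beta=np.zeros((8, 1)))
```

With γ = 1 and β = 0 the output has (near-)zero mean and unit variance over the whole (C, L) matrix, unlike per-frame layer normalization.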
the intermediate representation h' is then divided into S smaller blocks of overlapping length K, constituting a three-dimensional vectorThen, the transformation composed of B ExConformer modules (ECBlock) is applied iteratively to the intra-block dimension (K dimension) and the inter-block dimension (S dimension) of the three-dimensional vector T, and the output T of the intra-block processing b The output of the (B-1) th ECBlock will be the input of the B-th block, B being 1, …, B, which will be the input of the inter-block processing.
T b =ECBlock intra (U b-1 )
U b =ECBlock inter (T b )
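The splitting step and the dual-path intra-/inter-block iteration can be sketched as follows. The ECBlocks are stubbed as plain callables here, since the real modules are full neural networks; the block length K and hop size are illustrative values.

```python
import numpy as np

def segment(h, K, hop):
    """Split h of shape (C, L') into S overlapping blocks of length K,
    stacked into a tensor of shape (C, K, S)."""
    C, L = h.shape
    S = (L - K) // hop + 1
    return np.stack([h[:, s * hop: s * hop + K] for s in range(S)], axis=2)

def dual_path(T, blocks_intra, blocks_inter):
    """Apply B (intra, inter) ECBlock pairs: each intra block processes the
    K dimension, each inter block the S dimension (both stubbed here)."""
    U = T
    for ec_intra, ec_inter in zip(blocks_intra, blocks_inter):
        U = ec_inter(ec_intra(U))
    return U

rng = np.random.default_rng(3)
h_prime = rng.standard_normal((4, 100))   # (C, L')
T = segment(h_prime, K=20, hop=10)        # (C, K, S) with S = 9
identity = lambda t: t                    # stand-in for a real ECBlock
U_B = dual_path(T, [identity] * 3, [identity] * 3)
```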
The ExConformer module refers to FIG. 2.
A two-dimensional convolution is applied to the output U_B of the B-th ECBlock to learn a mask for each source, obtaining a three-dimensional tensor Y ∈ ℝ^((I·C)×K×S):
Y = Conv2D(U_B)
The tensor Y is then transformed via an overlap-add operation into an intermediate latent representation y_i of each source. A one-dimensional convolution and a PReLU are applied to y to aggregate information over the channels; for the i-th source, the estimated mask m_i ∈ ℝ^(C_E×L) is obtained as follows:
m_i = PReLU(Conv1D(y_i))
Finally, the encoded latent representation h is multiplied by the corresponding mask to obtain the estimated latent representation of each source:
z_i = h ⊙ m_i
where ⊙ denotes element-wise multiplication of two tensors of the same shape.
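The overlap-add operation that folds the blocks back into a sequence can be sketched as follows; the hop size is an assumption, and with hop = K (no overlap) the operation exactly inverts a non-overlapping split.

```python
import numpy as np

def overlap_add(T, hop):
    """Inverse of the segmentation step: fold blocks of shape (C, K, S)
    back to (C, L') by summing the overlapping regions."""
    C, K, S = T.shape
    L = (S - 1) * hop + K
    out = np.zeros((C, L))
    for s in range(S):
        out[:, s * hop: s * hop + K] += T[:, :, s]
    return out

rng = np.random.default_rng(4)
h_prime = rng.standard_normal((3, 40))
# Non-overlapping split (hop = K = 10), so overlap-add reconstructs exactly.
blocks = np.stack([h_prime[:, s * 10: s * 10 + 10] for s in range(4)], axis=2)
rebuilt = overlap_add(blocks, hop=10)
# The final masking step is then a plain element-wise product: z_i = h * m_i.
```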
The ExConformer module is the convolution-enhanced external attention module. The Conformer learns position-based local information with a convolution-enhanced self-attention mechanism and uses content-based global interaction; it performs well in sequence modeling tasks, but it does not consider relations between different samples, and the combination of self-attention and convolution gives the model a large number of parameters and slow inference, making it difficult to apply to the speech separation task. The external attention mechanism, in contrast, uses two shared learnable external memories, so features from other samples can be learned implicitly; linear complexity can be achieved by controlling the memory size, giving a fast inference speed; however, because of its simple structure, its learning ability is inferior to the self-attention mechanism. The invention therefore introduces external attention into the Conformer and models the relations between different speech segments with convolution-enhanced external attention, reducing the number of parameters while accelerating inference; Penalized_tanh, which appears more stable in sequence tasks, is used as the activation function. The resulting module is the convolution-enhanced external attention module (ExConformer). The ExConformer module comprises a position convolution module (PosConv Module), an external attention module (External Attention Module), a convolution module (Convolution Module) and a feed-forward neural network module (Feed Forward Module); similarly to the Conformer, residual connections are added between the modules. If the input of the i-th ExConformer block is defined as x_i, its output y_i is obtained by the following steps:
x̂_i = x_i + PosConv(x_i)
x′_i = x̂_i + ExAttention(x̂_i)
x″_i = x′_i + Conv(x′_i)
y_i = LayerNorm(x″_i + FFN(x″_i))
In the Conformer, two half-step feed-forward layers are used as in Macaron-Net, and sinusoidal positional encoding is used in the self-attention mechanism. Recent research on speech recognition finds that convolutional positional encoding performs better than sinusoidal positional encoding, so convolutional positional encoding is also adopted in the ExConformer, which makes the positional encoding trainable. Meanwhile, experiments showed that the gain from the two half-step feed-forward layers is not significant, so in the invention the first half-step feed-forward layer of the Conformer is replaced by the convolutional positional encoding module and the second by an ordinary feed-forward layer. Zero-padded convolutions are used to model the intra-block and inter-block dimensions separately in the separator, learning the positional context information faster and with fewer parameters.
For the experiments performed on the multi-speaker time-domain speech separation technique, the experimental procedure is described as follows:
1. data set and evaluation index
The following experiments mainly address the separation of the speech of two speakers, using Libri2Mix formed by mixing the public LibriSpeech dataset; the mixing follows the settings in the Libri2Mix dataset paper. The Libri2Mix training set mixes data from train-100 into 13,900 utterances; the validation set and the test set each contain 3,000 utterances; each utterance is mixed at a random signal-to-noise ratio between −5 dB and 5 dB; preprocessing downsamples the audio to 8 kHz. Compared with smaller datasets such as WSJ0 and TIMIT, LibriSpeech contains more speakers, with 251 speakers in the training set train-100 and 40 speakers each in the validation and test sets; experimental results on the Libri2Mix dataset mixed from LibriSpeech therefore generalize more reliably to new scenarios and indicate general modeling trends.
The evaluation index is the scale-invariant signal-to-noise ratio (SI-SNR), calculated as follows:
s_target = (⟨x̂, x⟩ / ‖x‖²) · x
e_noise = x̂ − s_target
SI-SNR = 10 log₁₀(‖s_target‖² / ‖e_noise‖²)
where x̂ and x are the estimated target speaker speech and the clean target speaker speech respectively, both zero-mean normalized before the calculation. In general, a larger SI-SNR indicates better speech separation quality.
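The SI-SNR evaluation can be sketched directly as follows (NumPy; the ε added for numerical stability is an assumption).

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB. Both signals are zero-mean normalized,
    then the estimate is decomposed into a component along the reference
    (s_target) and a residual (e_noise)."""
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (est @ ref) / (ref @ ref + eps) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps) + eps)

rng = np.random.default_rng(5)
x = rng.standard_normal(8000)
perfect = si_snr(x, x)                       # near-perfect reconstruction
scaled = si_snr(2.0 * x, x)                  # scale-invariance: also near-perfect
noisy = si_snr(x + rng.standard_normal(8000), x)  # roughly 0 dB input SNR
```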
2. Ablation experiment
Based on the TasNet separation network, we tested the Conformer and the convolution-enhanced external attention module, ExConformer, as the separator module, verifying the effect of the external attention mechanism on speech separation. Here the Conformer uses convolution-enhanced self-attention and two half-step feed-forward layers (FFN); ExConformer−PosConv denotes the structure without convolutional positional encoding; ExConformer−FFN denotes the structure without a feed-forward layer, using only the convolution and external attention modules; ExConformer (Swish) uses the Conformer's Swish activation function; ExConformer replaces Swish with Penalized_tanh and is the model proposed herein. The experimental results are shown in Table 1; all experiments were run on the same server, and the separation speed is measured as the number of speech segments of the test set processed per second. We believe that separation speed, rather than training speed, better reflects the timeliness of a model in practice.
As can be seen from the experimental results in Table 1, the Conformer has more parameters and the slowest separation speed, while the ExConformer with convolution-enhanced external attention achieves a better separation effect with fewer parameters and twice the separation speed. Meanwhile, the results in rows 2 and 3 of the table show that the convolutional positional encoding and the feed-forward layer are necessary: both parts bring a significant improvement to the ExConformer. Replacing the Swish function with Penalized_tanh also improves the separation effect.
Table 1 comparative experiments with different separator modules
3. Comparative experiment
We compare the experimental effect of the present invention on the Libri2Mix dataset with some existing time-domain speech separation methods, reproducing part of the code on our own device to test the separation speed of these models. The BLSTM-TasNet, Conv-TasNet and SuDoRM-RF++ models do not segment the latent representation of the mixed speech into smaller overlapping blocks stacked into three-dimensional tensors for intra- and inter-block operations, and therefore have faster separation speeds but also more parameters; DPRNN, DPTNet and the present invention all use overlapping segmentation with intra- and inter-block iterative operations, which reduces the number of parameters compared with the former models at the cost of separation speed. As can be seen from the table, DPTNet has the best separation effect and few parameters, but the slowest separation speed. The present invention achieves a separation effect comparable to DPTNet with the smallest model size and twice its separation speed.
Table 2 comparison of different methods on Libri2Mix dataset
In summary, the following effects are achieved:
1. The Conformer is improved. The Conformer performs well in sequence modeling tasks, but for a speech separation task serving as a speech recognition front-end technology, its large number of parameters and long inference time make it difficult to meet the requirements on model size and timeliness; the Conformer is improved here to meet the requirements of a small model and high timeliness for speech separation, while the advantage of sequence modeling yields a better separation effect.
2. An external attention mechanism is introduced into the speech separation task. The external attention mechanism replaces the self-attention mechanism with two cascaded linear layers and normalization layers; unlike self-attention, it implicitly takes features from other samples into account through shared weight units, and linear complexity can be achieved by controlling the memory size, giving a high separation speed. However, the simpler external attention mechanism cannot learn features and correlations on a par with the self-attention mechanism; the invention combines it with the Conformer, whose convolution module enhances the external attention mechanism to learn richer features and correlations while keeping the advantage of high separation speed.
3. Convolutional encoding is used. The input is encoded with a convolutional neural network, after which each frame contains context information, and the convolutional encoding makes the positional encoding trainable.
4. The application of the newly proposed convolution-enhanced external attention multi-speaker time-domain speech separation method in a dual-path structure can well balance timeliness, model size and separation effect.
The foregoing shows and describes the general principles, principal features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (6)
1. A method for separating time domain voice of multiple speakers with convolution enhancing external attention is characterized by comprising the following steps:
s1, mixing voice of multiple speakers through a coder, performing convolution operation, and converting the voice into potential feature representation
Remembering the mixed speech of multiple speakers as x (t),wherein, the first and the second end of the pipe are connected with each other,is a real number domain; t is the voice length;
let the latent features be denoted as h,wherein, C E Is the number of encoder channels; l is the length of the potential feature representation;
s2, learning by a separator based on a convolution enhanced external attention module to obtain a voice mask;
and S3, multiplying the speech mask with the latent feature representation output by the encoder, and reconstructing the waveform through the decoder's deconvolution (transposed convolution) operation to obtain the separated speech.
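Steps S1 to S3 can be sketched end to end in NumPy. This is purely illustrative: `encode`, `decode`, the filter count, window, and stride are assumed values, and the mask here is random rather than learned by the separator of S2.

```python
import numpy as np

def encode(x, basis, stride):
    """S1 sketch: strided 1-D convolution mapping waveform -> latent h."""
    win = basis.shape[1]
    frames = [x[i:i + win] for i in range(0, len(x) - win + 1, stride)]
    return basis @ np.stack(frames, axis=1)          # shape (C_E, L)

def decode(h, basis, stride):
    """S3 sketch: transposed convolution (overlap-add) latent -> waveform."""
    win = basis.shape[1]
    C, L = h.shape
    out = np.zeros((L - 1) * stride + win)
    for l in range(L):
        out[l * stride : l * stride + win] += basis.T @ h[:, l]
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal(160)            # mixture waveform, T = 160
basis = rng.standard_normal((32, 16))   # assumed C_E = 32 filters, window 16
h = encode(x, basis, stride=8)          # latent representation
mask = 1 / (1 + np.exp(-rng.standard_normal(h.shape)))  # stand-in speech mask
y = decode(mask * h, basis, stride=8)   # separated waveform estimate
```

Masking in the latent domain and decoding back to a waveform is the standard time-domain separation pattern the claim describes.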
2. The method for convolution-enhanced external attention time-domain speech separation of multiple speakers according to claim 1, wherein S2 mainly includes the steps of:
global normalization and convolution operations: mapping the latent representation h to an intermediate representation h' ∈ ℝ^(C×L), where C is the number of channels;
splitting and stacking: dividing the intermediate representation h' into S overlapping blocks of length K and stacking them into a three-dimensional tensor T ∈ ℝ^(C×K×S), where K is the block length and S is the number of overlapping blocks;
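The splitting-and-stacking step can be sketched as follows. The 50% overlap (hop of K // 2) and zero-padding of the tail are assumptions; the claim only states that the blocks overlap.

```python
import numpy as np

def split_blocks(h, K):
    """Split a (C, L) representation into S overlapping blocks of length K
    (assumed hop K // 2), zero-padding the tail -> tensor (C, K, S)."""
    C, L = h.shape
    hop = K // 2
    S = int(np.ceil(max(L - K, 0) / hop)) + 1
    padded = np.zeros((C, (S - 1) * hop + K))
    padded[:, :L] = h
    return np.stack([padded[:, s * hop : s * hop + K] for s in range(S)], axis=2)

h = np.arange(2 * 20, dtype=float).reshape(2, 20)  # toy (C=2, L=20) features
T = split_blocks(h, K=8)                            # -> (2, 8, 4)
```

Turning the long sequence into short blocks is what lets the subsequent intra-block and inter-block processing keep each attention pass short.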
ExConformer transformation: iteratively applying B ExConformer modules (ECBlocks), alternating between the intra-block dimension K and the inter-block dimension S of the three-dimensional tensor T; the output T_b of intra-block processing serves as the input for inter-block processing;
that is, the output U_{b-1} of the (b-1)-th ECBlock is the input of the b-th, b = 1, …, B, as follows:
T_b = ECBlock_intra(U_{b-1})
U_b = ECBlock_inter(T_b)
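The alternation above can be made concrete with a small data-flow sketch. The `ec_block` stand-in is ours: the real block is the ExConformer of claim 3, and a residual mean-removal is used here only to show how the two equations chain.

```python
import numpy as np

def ec_block(t, axis):
    """Stand-in for one ECBlock: a residual step along one dimension."""
    return t + (t - t.mean(axis=axis, keepdims=True))

def dual_path(t, B=3):
    """T_b = ECBlock_intra(U_{b-1}); U_b = ECBlock_inter(T_b), with U_0 = T."""
    u = t
    for _ in range(B):
        t_b = ec_block(u, axis=1)   # intra-block: along chunk length K
        u = ec_block(t_b, axis=2)   # inter-block: along chunk index S
    return u

t_in = np.random.default_rng(3).standard_normal((4, 8, 5))  # (C, K, S)
u_out = dual_path(t_in, B=3)
```

Intra-block passes model short-range structure within each chunk; inter-block passes, operating across chunk indices, recover the long-range dependencies.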
dimension transformation: applying a two-dimensional convolution to the output U_B of the B-th ECBlock to learn a mask for each source, yielding a three-dimensional tensor Y:
Y = Conv2D(U_B)
aggregating multi-channel information: converting the three-dimensional tensor Y into an intermediate latent representation y for each source via an overlap-add operation, then applying a one-dimensional convolution and PReLU to y to aggregate information across channels; for the i-th source, its estimated speech mask mask_i is as follows:
mask_i = PReLU(Conv1D(y))
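The overlap-add merge and PReLU can be sketched as follows. The half-overlap hop, the averaging of overlapped regions, and the fixed PReLU slope are assumptions for illustration; the toy subtraction stands in for the learned Conv1D.

```python
import numpy as np

def overlap_add(blocks, L):
    """Merge (C, K, S) overlapping blocks (assumed hop K // 2) back into a
    (C, L) representation; overlapped samples are averaged here."""
    C, K, S = blocks.shape
    hop = K // 2
    out = np.zeros((C, (S - 1) * hop + K))
    count = np.zeros_like(out)
    for s in range(S):
        out[:, s * hop : s * hop + K] += blocks[:, :, s]
        count[:, s * hop : s * hop + K] += 1.0
    return (out / count)[:, :L]

def prelu(x, a=0.25):
    # PReLU with a fixed (normally learned) negative slope
    return np.where(x >= 0, x, a * x)

y_i = overlap_add(np.ones((2, 8, 4)), 20)  # toy blocks for one source
mask_i = prelu(y_i - 2.0)                  # stand-in for PReLU(Conv1D(y))
```

Overlap-add is the exact inverse of the earlier splitting step, restoring the (C, L) shape the decoder-side mask multiplication expects.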
3. the method of claim 2, wherein the ExConformer module comprises a position convolution module, an external attention module, a convolution module, and a feedforward neural network module, and a residual connection is added between each module;
if the input of the i-th ExConformer module is denoted as x_i, its output y_i is expressed as follows:
x''_i = x'_i + Conv(x'_i)
y_i = Layernorm(x''_i + FFN(x''_i)).
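The data flow of one ExConformer module can be sketched as below. Only the last two residual steps are written out explicitly in the text; treating the position convolution and external attention as preceding residual steps that produce x' is our reading of claim 3, and the sub-modules here are trivial stand-ins.

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def exconformer(x, sub):
    """One ExConformer module as a chain of residual sub-modules."""
    x1 = x + sub["pos_conv"](x)            # assumed residual step
    x2 = x1 + sub["ext_attn"](x1)          # assumed residual step -> x'
    x3 = x2 + sub["conv"](x2)              # x'' = x' + Conv(x')
    return layernorm(x3 + sub["ffn"](x3))  # y = Layernorm(x'' + FFN(x''))

subs = {k: (lambda z: 0.5 * z) for k in ("pos_conv", "ext_attn", "conv", "ffn")}
y = exconformer(np.random.default_rng(2).standard_normal((6, 8)), subs)
```

The residual connections let each sub-module learn only a correction to its input, which is the standard Conformer-style design the module name suggests.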
4. The method of claim 3, wherein the activation functions of the convolution module and the feedforward neural network module of the ExConformer module use Penalized_tanh.
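Penalized_tanh can be sketched as follows. The penalty coefficient a = 0.25 is the value commonly used in the literature; the patent does not state it, so it is an assumption here.

```python
import numpy as np

def penalized_tanh(x, a=0.25):
    """Penalized tanh: tanh(x) for x > 0, a * tanh(x) otherwise."""
    t = np.tanh(x)
    return np.where(x > 0, t, a * t)

vals = penalized_tanh(np.array([-1.0, 0.0, 1.0]))
```

Attenuating the negative branch keeps tanh's bounded output while reducing saturation for negative activations.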
5. The convolution-enhanced external-attention multi-speaker time-domain speech separation method according to claim 3, wherein the position convolution module is composed of a plurality of stacked zero-padded one-dimensional convolution, layer normalization and ReLU activation layers.
6. The method of claim 3, wherein the steps of the external attention module are as follows:
adjusting the number of channels of the input features by a one-dimensional convolution;
construction of M Using one Linear layer k A memory to learn an attention map a between query vectors;
softmax and L1 Norm normalization (L1_ Norm) were performed thereon;
construction of M Using one Linear layer v The memory generates a refined feature map, represented as follows:
F out =AM v
and performing Dropout operation on the output result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210647059.4A CN115101085A (en) | 2022-06-09 | 2022-06-09 | Multi-speaker time-domain voice separation method for enhancing external attention through convolution |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115101085A true CN115101085A (en) | 2022-09-23 |
Family
ID=83289286
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210647059.4A Pending CN115101085A (en) | 2022-06-09 | 2022-06-09 | Multi-speaker time-domain voice separation method for enhancing external attention through convolution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115101085A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024087128A1 (en) * | 2022-10-24 | 2024-05-02 | 大连理工大学 | Multi-scale hybrid attention mechanism modeling method for predicting remaining useful life of aero engine |
CN116805061A (en) * | 2023-05-10 | 2023-09-26 | 杭州水务数智科技股份有限公司 | Leakage event judging method based on optical fiber sensing |
CN116805061B (en) * | 2023-05-10 | 2024-04-12 | 杭州水务数智科技股份有限公司 | Leakage event judging method based on optical fiber sensing |
CN116403599A (en) * | 2023-06-07 | 2023-07-07 | 中国海洋大学 | Efficient voice separation method and model building method thereof |
CN116403599B (en) * | 2023-06-07 | 2023-08-15 | 中国海洋大学 | Efficient voice separation method and model building method thereof |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7337953B2 (en) | Speech recognition method and device, neural network training method and device, and computer program | |
CN115101085A (en) | Multi-speaker time-domain voice separation method for enhancing external attention through convolution | |
JP2022529641A (en) | Speech processing methods, devices, electronic devices and computer programs | |
CN110600047A (en) | Perceptual STARGAN-based many-to-many speaker conversion method | |
CN110060657B (en) | SN-based many-to-many speaker conversion method | |
CN114141238A (en) | Voice enhancement method fusing Transformer and U-net network | |
CN111429893A (en) | Many-to-many speaker conversion method based on Transitive STARGAN | |
CN113823264A (en) | Speech recognition method, speech recognition device, computer-readable storage medium and computer equipment | |
Chen et al. | Distilled binary neural network for monaural speech separation | |
CN113539232B (en) | Voice synthesis method based on lesson-admiring voice data set | |
CN114373451A (en) | End-to-end Chinese speech recognition method | |
CN113823273A (en) | Audio signal processing method, audio signal processing device, electronic equipment and storage medium | |
CN114141237A (en) | Speech recognition method, speech recognition device, computer equipment and storage medium | |
CN115602165A (en) | Digital staff intelligent system based on financial system | |
Wei et al. | EdgeCRNN: an edge-computing oriented model of acoustic feature enhancement for keyword spotting | |
CN114067819B (en) | Speech enhancement method based on cross-layer similarity knowledge distillation | |
Luo et al. | Tiny-sepformer: A tiny time-domain transformer network for speech separation | |
Guo et al. | Phonetic posteriorgrams based many-to-many singing voice conversion via adversarial training | |
CN116863920B (en) | Voice recognition method, device, equipment and medium based on double-flow self-supervision network | |
US20220207321A1 (en) | Convolution-Augmented Transformer Models | |
CN116391191A (en) | Generating neural network models for processing audio samples in a filter bank domain | |
CN112257464A (en) | Machine translation decoding acceleration method based on small intelligent mobile device | |
Li et al. | A fast convolutional self-attention based speech dereverberation method for robust speech recognition | |
CN117980915A (en) | Contrast learning and masking modeling for end-to-end self-supervised pre-training | |
CN115376484A (en) | Lightweight end-to-end speech synthesis system construction method based on multi-frame prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||