CN115101085A - Multi-speaker time-domain voice separation method for enhancing external attention through convolution - Google Patents

Multi-speaker time-domain voice separation method for enhancing external attention through convolution

Info

Publication number
CN115101085A
CN115101085A CN202210647059.4A
Authority
CN
China
Prior art keywords
convolution
voice
module
separation
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210647059.4A
Other languages
Chinese (zh)
Inventor
闫河
张宇宁
李梦雪
王潇棠
刘建骐
刘宇涵
黄骏滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology filed Critical Chongqing University of Technology
Priority to CN202210647059.4A priority Critical patent/CN115101085A/en
Publication of CN115101085A publication Critical patent/CN115101085A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/0308 Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to the technical field of speech processing, and in particular to a multi-speaker time-domain speech separation method with convolution-enhanced external attention. The method comprises the following steps: S1, an encoder performs a convolution operation on the mixed speech of multiple speakers and converts it into a latent feature representation; S2, a separator based on a convolution-enhanced external attention module learns a speech mask; S3, the speech mask is multiplied by the latent feature representation output by the encoder, and the waveform is then reconstructed through a deconvolution operation of the decoder to obtain the separated speech. The method meets the requirements of a smaller model and high timeliness for speech separation while achieving a better separation effect through sequence modeling; the convolution-enhanced external attention mechanism learns more features and correlations while keeping the advantage of fast separation; and its application in a dual-path structure better balances timeliness, model size and separation effect.

Description

Multi-speaker time-domain voice separation method for enhancing external attention through convolution
Technical Field
The invention relates to the technical field of speech processing, and in particular to a multi-speaker time-domain speech separation method with convolution-enhanced external attention.
Background
In practical applications, voice interaction often takes place while multiple people are speaking, and the speech of interfering speakers can severely hinder a machine from extracting speech information. Speech separation technology separates the speech of multiple speakers so that the machine can effectively extract information for tasks such as speech recognition. At present, single-channel speech separation based on deep learning mainly adopts the Time-domain Audio Separation Network (TasNet) structure. Compared with traditional time-frequency-domain methods, TasNet directly models the speech signal in the time domain with an encoder-decoder framework and performs separation on the non-negative encoder output; this omits the frequency-decomposition step and reduces the separation problem to estimating a speech mask (Mask) on the encoder output, which is then synthesized back into a waveform by the decoder, giving better performance and lower latency.
BLSTM-TasNet is based on an LSTM network, but a deep LSTM network has a high computational cost, which limits its applicability on low-resource, low-power platforms. Luo et al. (Luo Y, Mesgarani N. TasNet: Time-domain audio separation network for real-time, single-channel speech separation [C]. 2018) replace the LSTM with gated recurrent units (GRU), which can model long sequences and reduce gradient vanishing, but the number of parameters remains large. Luo et al. (Luo Y, Mesgarani N. Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019) propose the convolution-based Conv-TasNet, which computes the mask with a temporal convolutional network composed of one-dimensional dilated convolution blocks, allowing the network to model the long-term dependencies of the speech signal while keeping a small number of parameters and a faster separation speed. SSGAN performs speech separation through joint training based on a generative adversarial network, with both the generator and the discriminator built from fully convolutional neural networks, so high-dimensional features of the time-domain waveform can be effectively extracted and recovered. However, a one-dimensional Convolutional Neural Network (CNN) cannot model utterance-level sequences when its receptive field is smaller than the sequence length, while a Recurrent Neural Network (RNN) cannot model longer sequences efficiently because it is difficult to optimize. Dual-Path RNN (Luo Y, Chen Z, Yoshioka T. Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation [C]// ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020: 46-50, doi:10.1109/ICASSP40776.2020.9054266) therefore proposes a dual-path structure that divides a long audio input into smaller blocks and iteratively applies intra-block and inter-block RNN operations in a deep model, alleviating the difficulty of RNN optimization and achieving better performance with a smaller model. The modeling of the speech sequence by RNN-based separation models depends on the context only indirectly, and the propagation of intermediate states limits further improvement of the separation performance. Chen J et al. (Chen J, Mao Q, Liu D. Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation [J]. arXiv preprint arXiv:2007.13975, 2020) introduce an improved Transformer into the dual-path structure and add an LSTM to the feed-forward layer, so that the sequence information of the speech sequence can be learned without positional encoding and direct context-aware modeling of the speech sequence is achieved, but the inference speed is very slow. Regarding speech separation with limited computing resources, the SuDoRM-RF network (Tzinis E, Wang Z, Smaragdis P. Sudo rm -rf: Efficient networks for universal audio source separation [C]// 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2020: 1-6) repeatedly down-samples and up-samples with one-dimensional convolution blocks similar to UNet to enlarge the receptive field of the network, extracts information at multiple resolutions and has a high inference speed; the extraction of multi-scale features combined with the enlarged receptive field makes SuDoRM-RF superior to other convolutional networks, but it also has a larger number of parameters.
In view of the prior art, it remains a great challenge to achieve a considerable separation effect with a smaller model size and a higher separation speed, and thereby to better meet and balance the requirements of a speech-recognition front end on model size, timeliness and separation effect. The invention therefore provides a multi-speaker time-domain speech separation method with convolution-enhanced external attention.
Disclosure of Invention
The invention aims to provide a multi-speaker time-domain speech separation method with convolution-enhanced external attention.
In order to achieve the purpose, the invention adopts the following technical scheme:
a multi-speaker time-domain speech separation method with convolution to enhance external attention comprises the following steps:
s1, mixing voice of multiple speakers through a coder, performing convolution operation, and converting the voice into potential feature representation
Remembering the mixed speech of multiple speakers as x (t),
Figure BDA0003686388670000041
wherein the content of the first and second substances,
Figure BDA0003686388670000042
representing a real number domain; t is the voice length;
let the latent features be denoted as h,
Figure BDA0003686388670000043
wherein, C E Is the number of encoder channels; l is a potential feature representation;
s2, learning by a separator based on a convolution enhanced external attention module to obtain a voice mask;
and S3, multiplying the voice mask by the potential feature representation output by the encoder, and reconstructing a waveform through deconvolution operation of a decoder to obtain the separated voice.
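For illustration, a minimal PyTorch-style sketch of this encoder-separator-decoder pipeline follows. The class name, layer sizes, kernel width, stride and the trivial point-wise separator are assumptions for readability, not the configuration claimed by the invention.

```python
import torch
import torch.nn as nn


class TasNetStylePipeline(nn.Module):
    """Illustrative encoder-separator-decoder skeleton for steps S1-S3 (sizes assumed)."""

    def __init__(self, enc_channels=256, kernel_size=16, stride=8, num_sources=2):
        super().__init__()
        self.num_sources = num_sources
        # S1: a 1-D convolution maps the waveform x(t) to the latent representation h (C_E x L).
        self.encoder = nn.Conv1d(1, enc_channels, kernel_size, stride=stride, bias=False)
        # S2: the separator estimates one mask per source.  A trivial point-wise network
        # stands in here for the convolution-enhanced external-attention (ExConformer) separator.
        self.separator = nn.Sequential(
            nn.Conv1d(enc_channels, enc_channels, 1),
            nn.PReLU(),
            nn.Conv1d(enc_channels, num_sources * enc_channels, 1),
            nn.ReLU(),
        )
        # S3: a transposed convolution reconstructs a waveform from each masked representation.
        self.decoder = nn.ConvTranspose1d(enc_channels, 1, kernel_size, stride=stride, bias=False)

    def forward(self, mixture):                          # mixture: (batch, 1, T)
        h = self.encoder(mixture)                        # (batch, C_E, L)
        masks = self.separator(h)                        # (batch, I * C_E, L)
        masks = masks.view(h.size(0), self.num_sources, h.size(1), h.size(2))
        sources = [self.decoder(h * masks[:, i]) for i in range(self.num_sources)]
        return torch.stack(sources, dim=1)               # (batch, I, 1, ~T)


if __name__ == "__main__":
    model = TasNetStylePipeline()
    mixed = torch.randn(2, 1, 8000)                      # two 1-second mixtures at 8 kHz
    print(model(mixed).shape)                            # torch.Size([2, 2, 1, 8000])
```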
Further, S2 mainly includes the following steps:
Global normalization and convolution operations: the latent feature representation h is mapped to obtain an intermediate representation h', with h' ∈ ℝ^{C×L}, where C is the number of channels;
splitting and stacking: the intermediate representation h' is divided into S overlapping smaller blocks of length K, forming a three-dimensional vector T ∈ ℝ^{C×K×S}, where K is the length of the overlapping blocks and S is the number of blocks;
ExConformer transform: a transformation consisting of B ExConformer modules (ECBlocks) is applied iteratively to the intra-block dimension K and the inter-block dimension S of the three-dimensional vector T, and the output T_b of the intra-block processing is the input of the inter-block processing;
that is, the output of the (b-1)-th ECBlock is the input of the b-th ECBlock, b = 1, …, B, as follows:
T_b = ECBlock_intra(U_{b-1})
U_b = ECBlock_inter(T_b)
dimension transformation: a two-dimensional convolution is applied to the output U_B of the B-th ECBlock to learn a mask for each source, resulting in a three-dimensional vector Y:
Y = Conv2D(U_B)
where Conv2D denotes a two-dimensional convolution operation;
aggregating multi-channel information: the three-dimensional vector Y is converted through an overlap-add operation into an intermediate latent representation y_i ∈ ℝ^{C×L} of each source; a one-dimensional convolution and a PReLU are applied to y_i to aggregate the information over the channels, and the estimated speech mask m_i ∈ ℝ^{C_E×L} of the i-th source is obtained as follows:
m_i = Conv1D(PReLU(y_i))
further, the ExConformer module consists of a position convolution module, an external attention module, a convolution module and a feedforward neural network module, and residual connection is added between the modules;
if the input of the ith ExConformer module is defined as x i Then it outputs y i Expressed as follows:
Figure BDA0003686388670000055
Figure BDA0003686388670000056
x″ i =x′ i +Conv(x′ i )
y i =Layernorm(x″ i +FFN(x″ i ))。
further, the convolution module of the excelformer module and the activation function of the feedforward neural network module use Penalized _ tanh.
Further, the position convolution module is composed of a plurality of stacked one-dimensional convolutions with zero-padding, layer normalization and ReLU activation layers.
Further, the steps of the external attention module are as follows:
adjusting the number of channels of the input features by a one-dimensional convolution;
constructing a memory M_k using a linear layer to learn an attention map A between the query vectors:
A = F · M_k^T
where F is the input feature map and M_k^T is the transpose of M_k;
performing softmax and L1-norm normalization (L1_Norm) on the attention map;
constructing a memory M_v using a linear layer to generate the refined feature map, expressed as follows:
F_out = A · M_v
and performing a Dropout operation on the output result.
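A minimal PyTorch-style sketch of such an external attention module follows. The memory size, model dimension and dropout rate are illustrative assumptions, and the normalization order (softmax over positions, then L1 normalization over the memory slots) follows the common external-attention formulation rather than any value fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExternalAttention(nn.Module):
    """Sketch of the external attention module described above (sizes assumed)."""

    def __init__(self, in_channels, d_model=64, memory_size=64, dropout=0.1):
        super().__init__()
        self.channel_proj = nn.Conv1d(in_channels, d_model, kernel_size=1)  # adjust channel count
        self.m_k = nn.Linear(d_model, memory_size, bias=False)              # external key memory M_k
        self.m_v = nn.Linear(memory_size, d_model, bias=False)              # external value memory M_v
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                 # x: (batch, in_channels, length)
        f = self.channel_proj(x).transpose(1, 2)          # F: (batch, length, d_model)
        attn = self.m_k(f)                                # A = F @ M_k^T -> (batch, length, memory_size)
        attn = F.softmax(attn, dim=1)                     # softmax over the positions
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)  # L1 normalization over the memory slots
        out = self.m_v(attn)                              # F_out = A @ M_v -> (batch, length, d_model)
        return self.dropout(out).transpose(1, 2)          # back to (batch, d_model, length)
```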
The invention has at least the following beneficial effects:
1. Through the improved design of the Conformer, the method can meet the requirements of a smaller speech-separation model and high timeliness, and achieves a better separation effect by virtue of sequence modeling.
2. By introducing the external attention mechanism into the speech separation task, the invention enables it to learn more features and correlations while keeping the advantage of fast separation.
3. The invention encodes the input with a convolutional neural network so that each frame contains context information, and the convolutional encoding makes the positional encoding trainable.
4. Through the application of the proposed convolution-enhanced external attention multi-speaker time-domain speech separation method in a dual-path structure, the invention can better balance timeliness, model size and separation effect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a diagram of the overall network framework of the present invention;
FIG. 2 is a block diagram of an ExConformer module according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, the present invention is a multi-speaker time-domain speech separation method with convolution-enhanced external attention, comprising the steps of:
1. The encoder performs a convolution operation on the multi-speaker mixed speech to convert it into the latent feature representation h;
2. A separator based on the convolution-enhanced external attention module learns the speech mask;
3. The speech mask is multiplied by the latent feature representation output by the encoder, and the waveform is reconstructed through a deconvolution operation of the decoder to obtain the separated speech.
In the above step 2, the following operations are performed:
First, the latent representation h is mapped to a new feature space by global normalization and a one-dimensional convolution to obtain an intermediate representation h' ∈ ℝ^{C×L}. This changes the number of channels of the latent representation h, increases the network depth without changing the receptive field, and strengthens the abstract representation capability of the local modules of the network:
h' = Conv1D(GlobLN(h))
Global layer normalization GlobLN(·) defines two learnable parameters γ ∈ ℝ^{C×1} and β ∈ ℝ^{C×1}. Using global layer normalization in place of ordinary layer normalization significantly improves model convergence, since the gradient statistics are interdependent between different channels. For an input matrix F ∈ ℝ^{C×L}, global layer normalization is defined as:
GlobLN(F) = γ ⊙ (F - E[F]) / √(Var[F] + ε) + β
where E[F] and Var[F] are the mean and variance of F computed over both the channel and time dimensions, and ε is a small constant for numerical stability.
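A compact PyTorch-style sketch of this global layer normalization is given below; the epsilon value and module name are illustrative.

```python
import torch
import torch.nn as nn


class GlobalLayerNorm(nn.Module):
    """Global layer normalization: statistics over both channel and time dimensions,
    with learnable per-channel gamma and beta."""

    def __init__(self, channels, eps=1e-8):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, channels, 1))   # gamma in R^{C x 1}
        self.beta = nn.Parameter(torch.zeros(1, channels, 1))   # beta in R^{C x 1}
        self.eps = eps

    def forward(self, x):                                 # x: (batch, C, L)
        mean = x.mean(dim=(1, 2), keepdim=True)
        var = x.var(dim=(1, 2), keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
```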
the intermediate representation h' is then divided into S smaller blocks of overlapping length K, constituting a three-dimensional vector
Figure BDA0003686388670000087
Then, a transformation composed of B ExConformer modules (ECBlocks) is applied iteratively to the intra-block dimension (K dimension) and the inter-block dimension (S dimension) of the three-dimensional vector T. The output T_b of the intra-block processing is the input of the inter-block processing; that is, the output of the (b-1)-th ECBlock is the input of the b-th ECBlock, b = 1, …, B:
T_b = ECBlock_intra(U_{b-1})
U_b = ECBlock_inter(T_b)
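The alternating intra-/inter-block iteration can be sketched as follows; here simple per-position linear layers stand in for the ECBlocks purely to show the tensor reshaping, and the channel count and number of repeats are assumptions.

```python
import torch
import torch.nn as nn


class DualPathIteration(nn.Module):
    """Alternating intra-block (K dimension) and inter-block (S dimension) processing.
    Per-position nn.Linear layers stand in for the ExConformer modules (ECBlocks)."""

    def __init__(self, channels=64, num_repeats=2):
        super().__init__()
        self.intra = nn.ModuleList(nn.Linear(channels, channels) for _ in range(num_repeats))
        self.inter = nn.ModuleList(nn.Linear(channels, channels) for _ in range(num_repeats))

    def forward(self, u):                                 # u: (batch, C, K, S)
        b, c, k, s = u.shape
        for intra_block, inter_block in zip(self.intra, self.inter):
            # T_b = ECBlock_intra(U_{b-1}): sequences of length K inside each block
            t = u.permute(0, 3, 2, 1).reshape(b * s, k, c)
            t = intra_block(t).reshape(b, s, k, c).permute(0, 3, 2, 1)
            # U_b = ECBlock_inter(T_b): sequences of length S across the blocks
            v = t.permute(0, 2, 3, 1).reshape(b * k, s, c)
            u = inter_block(v).reshape(b, k, s, c).permute(0, 3, 1, 2)
        return u
```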
The ExConformer module refers to FIG. 2.
A two-dimensional convolution is applied to the output U_B of the B-th ECBlock to learn a mask for each source, yielding a three-dimensional vector Y:
Y = Conv2D(U_B)
The three-dimensional vector Y is then converted through an overlap-add operation into an intermediate latent representation y_i ∈ ℝ^{C×L} for each source. A one-dimensional convolution and a PReLU are applied to y_i to aggregate the information over the channels, and for the i-th source its estimated mask m_i ∈ ℝ^{C_E×L} is obtained as follows:
m_i = Conv1D(PReLU(y_i))
Finally, the encoded latent representation h is multiplied by the corresponding mask to obtain the estimated latent representation d_i ∈ ℝ^{C_E×L} of each source:
d_i = h ⊙ m_i
where a ⊙ b denotes the element-wise multiplication of two tensors of the same shape.
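The segmentation and overlap-add operations used above can be sketched as follows; the 50% overlap, the padding scheme and the summing overlap-add are assumptions modeled on common dual-path implementations, not values fixed by the patent.

```python
import torch
import torch.nn.functional as F


def split_into_blocks(h, block_len):
    """(batch, C, L) -> (batch, C, K, S): overlapping blocks of length K with 50% overlap."""
    batch, channels, length = h.shape
    hop = block_len // 2
    # zero-pad so that an integer number of overlapping blocks covers the whole signal
    rest = block_len - (hop + length % block_len) % block_len
    h = F.pad(h, (hop, rest + hop))
    blocks = h.unfold(dimension=2, size=block_len, step=hop)      # (batch, C, S, K)
    return blocks.permute(0, 1, 3, 2).contiguous(), rest


def overlap_add(blocks, rest):
    """Inverse layout of split_into_blocks: (batch, C, K, S) -> (batch, C, L),
    summing the overlapping regions as in common dual-path implementations."""
    batch, channels, block_len, num_blocks = blocks.shape
    hop = block_len // 2
    padded_len = hop * (num_blocks + 1)
    out = torch.zeros(batch, channels, padded_len, dtype=blocks.dtype, device=blocks.device)
    for s in range(num_blocks):
        out[:, :, s * hop: s * hop + block_len] += blocks[:, :, :, s]
    return out[:, :, hop: padded_len - rest - hop]


if __name__ == "__main__":
    h_prime = torch.randn(1, 8, 999)
    t, rest = split_into_blocks(h_prime, block_len=100)
    y = overlap_add(t, rest)
    print(t.shape, y.shape)            # torch.Size([1, 8, 100, 22]) torch.Size([1, 8, 999])
```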
The ExConformer module is the convolution-enhanced external attention module. The Conformer learns position-based local information with a convolution-enhanced self-attention mechanism and uses content-based global interaction; it performs well in sequence modeling tasks, but it does not consider the relations between different samples, and the combination of self-attention and convolution gives the model a large number of parameters and a slow inference speed, making it difficult to apply to the speech separation task. The external attention mechanism, in contrast, uses two shared learnable external memories, so features from other samples can be learned implicitly; linear complexity can be achieved by controlling the size of the memories, and the inference speed is high; however, because of its simple structure, its learning ability is not as strong as that of self-attention. The invention introduces external attention into the Conformer and models the relations between different speech segments by enhancing external attention with convolution, which reduces the number of parameters while accelerating inference; Penalized_tanh, which behaves more stably in sequence tasks, is used as the activation function, and the convolution-enhanced external attention module (ExConformer) is thus proposed. The ExConformer module comprises a position convolution module (PosConv Module), an external attention module (External Attention Module), a convolution module (Convolution Module) and a feed-forward neural network module (Feed Forward Module); similarly to the Conformer, residual connections are added between the modules. If the input of the i-th ExConformer block is defined as x_i, its output y_i is obtained as follows:
x̃_i = x_i + PosConv(x_i)
x'_i = x̃_i + ExternalAttention(x̃_i)
x''_i = x'_i + Conv(x'_i)
y_i = LayerNorm(x''_i + FFN(x''_i))
In the Conformer, two half-step feed-forward layers are used, as in Macaron-Net, and sinusoidal positional encoding is used in the self-attention mechanism. Recent research on speech recognition has found that convolutional positional encoding achieves a better effect than sinusoidal positional encoding, so convolutional positional encoding is also adopted in the ExConformer, which makes the positional encoding trainable. Meanwhile, experiments show that the gain from using two half-step feed-forward layers is not obvious, so in the invention the first half-step feed-forward layer of the Conformer is replaced with a convolutional positional encoding module and the second half-step feed-forward layer is replaced with an ordinary feed-forward layer. Zero-padded convolutions are used to model the intra-block and inter-block dimensions separately in the separator, so that positional context information is learned faster and with fewer parameters.
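Putting the pieces together, a minimal PyTorch-style sketch of one ExConformer block under the structure just described is shown below. The model dimension, kernel widths, memory size, number of stacked position convolutions and the exact placement of normalization and activation are illustrative assumptions rather than the claimed configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def penalized_tanh(x, alpha=0.25):
    """Assumed Penalized_tanh: tanh(x) for x > 0, alpha * tanh(x) otherwise."""
    t = torch.tanh(x)
    return torch.where(x > 0, t, alpha * t)


class ExConformerBlock(nn.Module):
    """One ExConformer block: position convolution -> external attention -> convolution
    module -> feed-forward, each with a residual connection, then a final LayerNorm."""

    def __init__(self, d_model=64, memory_size=64, kernel_size=3, dropout=0.1):
        super().__init__()
        # position convolution module: stacked zero-padded 1-D convolutions + LayerNorm + ReLU
        self.pos_conv = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2),
            nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2),
        )
        self.pos_norm = nn.LayerNorm(d_model)
        # external attention: two shared learnable linear memories M_k and M_v
        self.m_k = nn.Linear(d_model, memory_size, bias=False)
        self.m_v = nn.Linear(memory_size, d_model, bias=False)
        # convolution module and feed-forward module
        self.conv_module = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
        self.ffn_in = nn.Linear(d_model, 4 * d_model)
        self.ffn_out = nn.Linear(4 * d_model, d_model)
        self.final_norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def external_attention(self, x):                      # x: (batch, length, d_model)
        attn = F.softmax(self.m_k(x), dim=1)              # softmax over positions
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)  # L1 normalization
        return self.dropout(self.m_v(attn))

    def forward(self, x):                                 # x: (batch, length, d_model)
        # x~_i = x_i + PosConv(x_i)
        p = F.relu(self.pos_norm(self.pos_conv(x.transpose(1, 2)).transpose(1, 2)))
        x = x + p
        # x'_i = x~_i + ExternalAttention(x~_i)
        x = x + self.external_attention(x)
        # x''_i = x'_i + Conv(x'_i)
        c = penalized_tanh(self.conv_module(x.transpose(1, 2)).transpose(1, 2))
        x = x + c
        # y_i = LayerNorm(x''_i + FFN(x''_i))
        f = self.ffn_out(penalized_tanh(self.ffn_in(x)))
        return self.final_norm(x + f)
```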
For the experiments performed on the multi-speaker time-domain speech separation technique, the experimental procedure is described as follows:
1. data set and evaluation index
The following experiments mainly address the separation of the speech of two speakers, using Libri2Mix, which is formed by mixing the public LibriSpeech dataset; the mixing follows the settings of the Libri2Mix dataset paper. The Libri2Mix training set is mixed from the data in train-100 and contains 13900 utterances; the validation set and the test set each contain 3000 utterances; each utterance is mixed at a random signal-to-noise ratio between -5 dB and 5 dB; as pre-processing, the data are down-sampled to 8 kHz. Compared with smaller datasets such as WSJ0 and TIMIT, the LibriSpeech dataset contains more speakers, with 251 speakers in the training set train-100 and 40 speakers in each of the validation and test sets; experimental results on the Libri2Mix dataset mixed from LibriSpeech can therefore be generalized more reliably to new scenarios and provide a general modeling trend.
The evaluation index is the scale-invariant signal-to-noise ratio (SI-SNR), calculated as follows:
x_target = (⟨x̂, x⟩ / ‖x‖²) · x
e_noise = x̂ - x_target
SI-SNR = 10 · log10(‖x_target‖² / ‖e_noise‖²)
where x̂ and x are the estimated target-speaker speech and the clean target-speaker speech respectively, and both are zero-mean normalized before the calculation. In general, a larger SI-SNR value indicates a better speech-separation quality.
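A direct sketch of this metric in PyTorch follows (the function name and epsilon value are illustrative):

```python
import torch


def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB; both signals are zero-mean normalized first."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # project the estimate onto the target to obtain the scaled target component
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    target_energy = torch.sum(target ** 2, dim=-1, keepdim=True) + eps
    x_target = dot / target_energy * target
    e_noise = estimate - x_target
    return 10 * torch.log10(torch.sum(x_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps))


if __name__ == "__main__":
    clean = torch.randn(4, 8000)
    estimate = clean + 0.1 * torch.randn(4, 8000)
    print(si_snr(estimate, clean))        # larger values indicate better separation quality
```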
2. Ablation experiment
Based on the TasNet separation network, we tested the Conformer and the convolution-enhanced external attention ExConformer as the separator module, to verify the effect of the external attention mechanism on speech separation. Here, the Conformer uses convolution-enhanced self-attention and two half-step feed-forward layers (FFN); ExConformer-PosConv denotes the structure without convolutional positional encoding; ExConformer-FFN denotes the structure without the feed-forward layer, using only the convolution and external attention modules; one ExConformer variant keeps the Swish activation function of the Conformer, while ExConformer replaces Swish with Penalized_tanh and is the model proposed herein. The experimental results are shown in Table 1. All experiments were run on the same server, and the separation speed is measured as the number of speech segments in the test set processed per second, which reflects how fast a model separates. We believe that the separation speed, rather than the training speed, better reflects the timeliness of a model in practice.
As can be seen from the experimental results in Table 1, the Conformer has more parameters and the slowest separation speed, while the ExConformer with convolution-enhanced external attention achieves a better separation effect with fewer parameters and a separation speed twice as fast; meanwhile, the results in rows 2 and 3 of the table show that the convolutional positional encoding and the feed-forward layer are both necessary, as each brings a notable improvement to the ExConformer; replacing the Swish function with Penalized_tanh also improves the separation effect.
Table 1 comparative experiments with different separator modules
(Table 1 is provided as an image in the original publication and is not reproduced here.)
3. Comparative experiment
We compare the experimental effect of the invention on the Libri2Mix dataset with some existing time-domain speech separation methods, reproducing part of the code on our own device to test the separation speed of these models. The BLSTM-TasNet, Conv-TasNet and SuDoRM-RF++ models do not segment the latent representation of the mixed speech into smaller blocks overlapped into three-dimensional vectors for intra- and inter-block operations, and therefore have a faster separation speed but also more parameters; DPRNN, DPTNet and the invention all use segmentation with overlap to perform intra- and inter-block iterative operations, which reduces the number of parameters but lowers the separation speed compared with the first several models. As can be seen from the table, DPTNet achieves the best effect with a small number of parameters, but its separation speed is the slowest. The invention achieves a separation effect equivalent to that of DPTNet with the smallest model size and a separation speed twice as fast.
Table 2 comparison of different methods on Libri2Mix dataset
(Table 2 is provided as an image in the original publication and is not reproduced here.)
In summary, the following effects are achieved:
1. The Conformer is improved. The Conformer performs well in sequence modeling tasks, but for the speech separation task, which serves as a front-end technology for speech recognition, its large number of parameters and long inference time make it difficult to meet the requirements on model size and timeliness; the improved Conformer meets the requirements of a smaller model and high timeliness for speech separation while achieving a better separation effect through the advantage of sequence modeling.
2. An external attention mechanism is introduced into the speech separation task. The external attention mechanism replaces the self-attention mechanism with two cascaded linear layers and normalization layers; unlike self-attention, it implicitly considers features from other samples through shared memory units; linear complexity can be achieved by controlling the size of the memories, and the separation speed is high. However, the simpler external attention mechanism cannot learn features and correlations equivalent to those of self-attention, so the invention combines it with the Conformer: the convolution module of the Conformer enhances the external attention mechanism to learn more features and correlations while keeping the advantage of fast separation.
3. Convolutional coding is used. The input is encoded using a convolutional neural network, after which each frame contains context information. And convolutional coding makes the position coding trainable.
4. The application of the newly proposed multi-speaker time-domain voice separation method with convolution enhanced external attention in a double-path structure can well balance timeliness, model size and separation effect.
The foregoing shows and describes the general principles, principal features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (6)

1. A multi-speaker time-domain speech separation method with convolution-enhanced external attention, characterized by comprising the following steps:
S1, performing a convolution operation on the mixed speech of multiple speakers through an encoder and converting it into a latent feature representation:
denoting the mixed speech of the multiple speakers as x(t), with x(t) ∈ ℝ^{1×T}, where ℝ is the real-number domain and T is the speech length;
denoting the latent features as h, with h ∈ ℝ^{C_E×L}, where C_E is the number of encoder channels and L is the length of the latent feature representation;
S2, learning a speech mask through a separator based on a convolution-enhanced external attention module;
S3, multiplying the speech mask by the latent feature representation output by the encoder, and reconstructing the waveform through a deconvolution operation of the decoder to obtain the separated speech.
2. The multi-speaker time-domain speech separation method with convolution-enhanced external attention according to claim 1, characterized in that S2 mainly comprises the following steps:
global normalization and convolution operations: mapping the latent feature representation h to obtain an intermediate representation h' ∈ ℝ^{C×L}, where C is the number of channels;
splitting and stacking: dividing the intermediate representation h' into S overlapping smaller blocks of length K, forming a three-dimensional vector T ∈ ℝ^{C×K×S}, where K is the length of the overlapping blocks and S is the number of blocks;
ExConformer transform: iteratively applying a transformation consisting of B ExConformer modules (ECBlocks) to the intra-block dimension K and the inter-block dimension S of the three-dimensional vector T, the output T_b of the intra-block processing being the input of the inter-block processing;
that is, the output of the (b-1)-th ECBlock is the input of the b-th ECBlock, b = 1, …, B, as follows:
T_b = ECBlock_intra(U_{b-1})
U_b = ECBlock_inter(T_b)
dimension transformation: applying a two-dimensional convolution to the output U_B of the B-th ECBlock to learn a mask for each source, resulting in a three-dimensional vector Y:
Y = Conv2D(U_B)
aggregating multi-channel information: converting the three-dimensional vector Y through an overlap-add operation into an intermediate latent representation y_i ∈ ℝ^{C×L} of each source; applying a one-dimensional convolution and a PReLU to y_i to aggregate the information over the channels, the estimated speech mask m_i ∈ ℝ^{C_E×L} of the i-th source being as follows:
m_i = Conv1D(PReLU(y_i))
3. the method of claim 2, wherein the ExConformer module comprises a position convolution module, an external attention module, a convolution module, and a feedforward neural network module, and a residual connection is added between each module;
if the input of the ith ExConformer module is defined as x i Then its output, yi, is expressed as follows:
Figure FDA0003686388660000026
Figure FDA0003686388660000025
x” i =x' i +Conv(x' i )
y i =Layernorm(x” i +FFN(x” i ))。
4. The multi-speaker time-domain speech separation method with convolution-enhanced external attention according to claim 3, characterized in that the activation functions of the convolution module and the feed-forward neural network module of the ExConformer module use Penalized_tanh.
5. The multi-speaker time-domain speech separation method with convolution-enhanced external attention according to claim 3, characterized in that the position convolution module is composed of a plurality of stacked zero-padded one-dimensional convolutions, layer normalization and ReLU activation layers.
6. The multi-speaker time-domain speech separation method with convolution-enhanced external attention according to claim 3, characterized in that the steps of the external attention module are as follows:
adjusting the number of channels of the input features by a one-dimensional convolution;
constructing a memory M_k using a linear layer to learn an attention map A between the query vectors:
A = F · M_k^T
wherein F is the input feature map and M_k^T is the transpose of M_k;
performing softmax and L1-norm normalization (L1_Norm) on the attention map;
constructing a memory M_v using a linear layer to generate the refined feature map, expressed as follows:
F_out = A · M_v
and performing a Dropout operation on the output result.
CN202210647059.4A 2022-06-09 2022-06-09 Multi-speaker time-domain voice separation method for enhancing external attention through convolution Pending CN115101085A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210647059.4A CN115101085A (en) 2022-06-09 2022-06-09 Multi-speaker time-domain voice separation method for enhancing external attention through convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210647059.4A CN115101085A (en) 2022-06-09 2022-06-09 Multi-speaker time-domain voice separation method for enhancing external attention through convolution

Publications (1)

Publication Number Publication Date
CN115101085A true CN115101085A (en) 2022-09-23

Family

ID=83289286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210647059.4A Pending CN115101085A (en) 2022-06-09 2022-06-09 Multi-speaker time-domain voice separation method for enhancing external attention through convolution

Country Status (1)

Country Link
CN (1) CN115101085A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024087128A1 (en) * 2022-10-24 2024-05-02 大连理工大学 Multi-scale hybrid attention mechanism modeling method for predicting remaining useful life of aero engine
CN116805061A (en) * 2023-05-10 2023-09-26 杭州水务数智科技股份有限公司 Leakage event judging method based on optical fiber sensing
CN116805061B (en) * 2023-05-10 2024-04-12 杭州水务数智科技股份有限公司 Leakage event judging method based on optical fiber sensing
CN116403599A (en) * 2023-06-07 2023-07-07 中国海洋大学 Efficient voice separation method and model building method thereof
CN116403599B (en) * 2023-06-07 2023-08-15 中国海洋大学 Efficient voice separation method and model building method thereof

Similar Documents

Publication Publication Date Title
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
CN115101085A (en) Multi-speaker time-domain voice separation method for enhancing external attention through convolution
JP2022529641A (en) Speech processing methods, devices, electronic devices and computer programs
CN110600047A (en) Perceptual STARGAN-based many-to-many speaker conversion method
CN110060657B (en) SN-based many-to-many speaker conversion method
CN114141238A (en) Voice enhancement method fusing Transformer and U-net network
CN111429893A (en) Many-to-many speaker conversion method based on Transitive STARGAN
CN113823264A (en) Speech recognition method, speech recognition device, computer-readable storage medium and computer equipment
Chen et al. Distilled binary neural network for monaural speech separation
CN113539232B (en) Voice synthesis method based on lesson-admiring voice data set
CN114373451A (en) End-to-end Chinese speech recognition method
CN113823273A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN114141237A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN115602165A (en) Digital staff intelligent system based on financial system
Wei et al. EdgeCRNN: an edge-computing oriented model of acoustic feature enhancement for keyword spotting
CN114067819B (en) Speech enhancement method based on cross-layer similarity knowledge distillation
Luo et al. Tiny-sepformer: A tiny time-domain transformer network for speech separation
Guo et al. Phonetic posteriorgrams based many-to-many singing voice conversion via adversarial training
CN116863920B (en) Voice recognition method, device, equipment and medium based on double-flow self-supervision network
US20220207321A1 (en) Convolution-Augmented Transformer Models
CN116391191A (en) Generating neural network models for processing audio samples in a filter bank domain
CN112257464A (en) Machine translation decoding acceleration method based on small intelligent mobile device
Li et al. A fast convolutional self-attention based speech dereverberation method for robust speech recognition
CN117980915A (en) Contrast learning and masking modeling for end-to-end self-supervised pre-training
CN115376484A (en) Lightweight end-to-end speech synthesis system construction method based on multi-frame prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination